Deploy a Hugging Face (PyAnnote) speaker diarization model on Amazon SageMaker as an asynchronous endpoint

Speaker diarization, an essential process in audio analysis, segments an audio file based on speaker identity. This post delves into integrating Hugging Face’s PyAnnote for speaker diarization with Amazon SageMaker asynchronous endpoints.

We provide a comprehensive guide on how to deploy speaker segmentation and clustering solutions using SageMaker on the AWS Cloud. You can use this solution for applications dealing with multi-speaker audio recordings (over 100 speakers).

Solution overview

Amazon Transcribe is the go-to service for speaker diarization in AWS. However, for non-supported languages, you can use other models (in our case, PyAnnote) that you deploy in SageMaker for inference. For short audio files where inference takes up to 60 seconds, you can use real-time inference. For audio files where inference takes longer than 60 seconds, you should use asynchronous inference. The added benefit of asynchronous inference is the cost savings from auto scaling the instance count to zero when there are no requests to process.

Hugging Face is a popular open source hub for machine learning (ML) models. AWS and Hugging Face have a partnership that allows a seamless integration through SageMaker with a set of AWS Deep Learning Containers (DLCs) for training and inference in PyTorch or TensorFlow, and Hugging Face estimators and predictors for the SageMaker Python SDK. SageMaker features and capabilities help developers and data scientists get started with natural language processing (NLP) on AWS with ease.

The integration for this solution involves using Hugging Face’s pre-trained speaker diarization model using the PyAnnote library. PyAnnote is an open source toolkit written in Python for speaker diarization. This model, trained on the sample audio dataset, enables effective speaker partitioning in audio files. The model is deployed on SageMaker as an asynchronous endpoint setup, providing efficient and scalable processing of diarization tasks.

The following diagram illustrates the solution architecture.

For this post, we use the following audio file.

Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels. Audio files sampled at a different rate are resampled to 16 kHz automatically upon loading.
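
If you want to apply the same normalization yourself before uploading audio to Amazon S3, a minimal sketch using librosa and soundfile (both appear later in the requirements.txt) could look like the following; the file names are placeholders:

import librosa
import soundfile as sf

# Downmix to mono and resample to 16 kHz, mirroring what the pipeline does on load
audio, sample_rate = librosa.load("input_audio.wav", sr=16000, mono=True)
sf.write("audio_16k_mono.wav", audio, sample_rate)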

Prerequisites

Complete the following prerequisites:

  1. Create a SageMaker domain.
  2. Make sure your AWS Identity and Access Management (IAM) user has the necessary access permissions for creating a SageMaker role.
  3. Make sure the AWS account has a service quota for hosting a SageMaker endpoint for an ml.g5.2xlarge instance.

Create a model function for accessing PyAnnote speaker diarization from Hugging Face

You can use the Hugging Face Hub to access the desired pre-trained PyAnnote speaker diarization model. You use the same script for downloading the model file when creating the SageMaker endpoint.

See the following code:

from pyannote.audio import Pipeline

def model_fn(model_dir):
    # Load the pre-trained diarization pipeline from the Hugging Face Hub
    model = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="Replace-with-the-Hugging-Face-auth-token")
    return model

Package the model code

Prepare essential files like inference.py, which contains the inference code:

%%writefile model/code/inference.py
from pyannote.audio import Pipeline
import boto3
from urllib.parse import urlparse
import pandas as pd
import torch

def model_fn(model_dir):
    # Load the pre-trained diarization pipeline from the Hugging Face Hub
    model = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="hf_oBxxxxxxxxxxxx")
    return model


def diarization_from_s3(model, s3_file, language=None):
    # Download the audio file from Amazon S3 to local storage
    s3 = boto3.client("s3")
    o = urlparse(s3_file, allow_fragments=False)
    bucket = o.netloc
    key = o.path.lstrip("/")
    s3.download_file(bucket, key, "tmp.wav")
    # Run the diarization pipeline and collect (start, end, speaker) segments
    result = model("tmp.wav")
    data = {}
    for turn, _, speaker in result.itertracks(yield_label=True):
        data[turn] = (turn.start, turn.end, speaker)
    data_df = pd.DataFrame(data.values(), columns=["start", "end", "speaker"])
    print(data_df.shape)
    result = data_df.to_json(orient="split")
    return result


def predict_fn(data, model):
    s3_file = data.pop("s3_file")
    language = data.pop("language", None)
    result = diarization_from_s3(model, s3_file, language)
    return {
        "diarization_from_s3": result
    }

Prepare a requirements.txt file, which contains the required Python libraries necessary to run the inference:

with open("model/code/requirements.txt", "w") as f:
    f.write("transformers==4.25.1n")
    f.write("boto3n")
    f.write("PyAnnote.audion")
    f.write("soundfilen")
    f.write("librosan")
    f.write("onnxruntimen")
    f.write("wgetn")
    f.write("pandas")

Lastly, compress the inference.py and requirements.txt files and save them as model.tar.gz:

!tar zcvf model.tar.gz *

Configure a SageMaker model

Define a SageMaker model resource by specifying the image URI, model data location in Amazon Simple Storage Service (S3), and SageMaker role:

import sagemaker
import boto3

sess = sagemaker.Session()

sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Upload the model to Amazon S3

Upload the zipped PyAnnote Hugging Face model file to an S3 bucket:

s3_location = f"s3://{sagemaker_session_bucket}/whisper/model/model.tar.gz"
!aws s3 cp model.tar.gz $s3_location

Create a SageMaker asynchronous endpoint

Configure an asynchronous endpoint for deploying the model on SageMaker using the provided asynchronous inference configuration:

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join
from sagemaker.utils import name_from_base

async_endpoint_name = name_from_base("custom-asyc")

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,  # path to your model and script
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version="4.17",  # transformers version used
    pytorch_version="1.10",  # pytorch version used
    py_version="py38",  # python version used
)

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join(
        "s3://", sagemaker_session_bucket, "async_inference/output"
    ),  # Where our results will be stored
    # Add notification SNS topics if needed
    notification_config={
        # "SuccessTopic": "PUT YOUR SUCCESS SNS TOPIC ARN",
        # "ErrorTopic": "PUT YOUR ERROR SNS TOPIC ARN",
    },  #  Notification configuration
)

env = {"MODEL_SERVER_WORKERS": "2"}

# deploy the endpoint
async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # the instance type from the prerequisites
    async_inference_config=async_config,
    endpoint_name=async_endpoint_name,
    env=env,
)

Test the endpoint

Evaluate the endpoint functionality by sending an audio file for diarization and retrieving the JSON output stored in the specified S3 output path:

from sagemaker.async_inference import WaiterConfig

# Replace with the S3 path to your audio object
data = {"s3_file": "s3://<your-bucket>/<path-to-audio-file>.wav"}

res = async_predictor.predict_async(data=data)
print(f"Response output path: {res.output_path}")
print("Start Polling to get response:")

config = WaiterConfig(
    max_attempts=10,  # number of attempts
    delay=10,  # time in seconds to wait between attempts
)
res.get_result(config)
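
The value returned by get_result is the JSON payload produced by predict_fn, with the diarization table serialized in the pandas "split" orientation. The following is a minimal sketch of loading it back into a DataFrame; exactly how the payload is deserialized depends on your predictor's serializer settings, so the type check below is an assumption:

import json
from io import StringIO
import pandas as pd

# Retrieve the stored result from the async output path
output = res.get_result(config)
# Depending on the serializer, the payload may arrive as raw bytes/str or an already-parsed dict
if isinstance(output, (bytes, str)):
    output = json.loads(output)

# Rebuild the DataFrame from the "split"-oriented JSON produced in inference.py
diarization_df = pd.read_json(StringIO(output["diarization_from_s3"]), orient="split")
print(diarization_df.head())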

To deploy this solution at scale, we suggest using AWS Lambda, Amazon Simple Notification Service (Amazon SNS), or Amazon Simple Queue Service (Amazon SQS). These services are designed for scalability, event-driven architectures, and efficient resource utilization. They can help decouple the asynchronous inference process from the result processing, allowing you to scale each component independently and handle bursts of inference requests more effectively.

Results

Model output is stored at s3://sagemaker-xxxx/async_inference/output/. The output shows that the audio recording has been segmented into speaker turns, each described by three columns:

  • Start (start time in seconds)
  • End (end time in seconds)
  • Speaker (speaker label)

The following code shows an example of our results:

[0.9762308998, 8.9049235993, "SPEAKER_01"]
[9.533106961, 12.1646859083, "SPEAKER_01"]
[13.1324278438, 13.9303904924, "SPEAKER_00"]
[14.3548387097, 26.1884550085, "SPEAKER_00"]
[27.2410865874, 28.2258064516, "SPEAKER_01"]
[28.3446519525, 31.298811545, "SPEAKER_01"]

Clean up

Asynchronous inference lets you auto scale the endpoint down to zero instances by setting MinCapacity to 0 in a scaling policy. You don't need to delete the endpoint; it scales back up from zero when requests arrive again, reducing costs when the endpoint is not in use. See the following code:

# Application Auto Scaling client used for SageMaker endpoint scaling
client = boto3.client('application-autoscaling')

# This is the format in which Application Auto Scaling references the endpoint
# (the SageMaker Python SDK names the production variant 'AllTraffic' by default)
resource_id = 'endpoint/' + async_endpoint_name + '/variant/' + 'AllTraffic'

# Define and register your endpoint variant
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',  # the number of instances behind the endpoint variant
    MinCapacity=0,
    MaxCapacity=5
)
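
Registering the scalable target defines the capacity bounds but not the scaling behavior. The following is a sketch (not part of the original walkthrough) of a target-tracking policy on the ApproximateBacklogSizePerInstance metric that asynchronous endpoints publish; the target value and cooldowns are illustrative and should be tuned for your workload:

# A sketch of a target-tracking policy on the async inference backlog metric
response = client.put_scaling_policy(
    PolicyName=f"backlog-scaling-{async_endpoint_name}",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # desired number of queued requests per instance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": async_endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,   # seconds to wait before scaling in
        "ScaleOutCooldown": 300,  # seconds to wait before scaling out
    },
)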

If you want to delete the endpoint, use the following code:

async_predictor.delete_endpoint()

Benefits of asynchronous endpoint deployment

This solution offers the following benefits:

  • The solution can efficiently handle multiple or large audio files.
  • This example uses a single instance for demonstration. If you want to use this solution for hundreds or thousands of audio files and use an asynchronous endpoint to process across multiple instances, you can use an auto scaling policy, which is designed for a large number of source documents. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload.
  • The solution optimizes resources and reduces system load by separating long-running tasks from real-time inference.

Conclusion

In this post, we provided a straightforward approach to deploy Hugging Face’s speaker diarization model on SageMaker using Python scripts. Using an asynchronous endpoint provides an efficient and scalable means to deliver diarization predictions as a service, accommodating concurrent requests seamlessly.

Get started today with asynchronous speaker diarization for your audio projects. Reach out in the comments if you have any questions about getting your own asynchronous diarization endpoint up and running.


About the Authors

Sanjay Tiwary is a Specialist Solutions Architect AI/ML who spends his time working with strategic customers to define business requirements, provide L300 sessions around specific use cases, and design AI/ML applications and services that are scalable, reliable, and performant. He has helped launch and scale the AI/ML powered Amazon SageMaker service and has implemented several proofs of concept using Amazon AI services. He has also developed the advanced analytics platform as a part of the digital transformation journey.

Kiran Challapalli is a deep tech business developer with the AWS public sector. He has more than 8 years of experience in AI/ML and 23 years of overall software development and sales experience. Kiran helps public sector businesses across India explore and co-create cloud-based solutions that use AI, ML, and generative AI technologies, including large language models.

Read More

Evaluate the text summarization capabilities of LLMs for enhanced decision-making on AWS

Organizations across industries are using automatic text summarization to more efficiently handle vast amounts of information and make better decisions. In the financial sector, investment banks condense earnings reports down to key takeaways to rapidly analyze quarterly performance. Media companies use summarization to monitor news and social media so journalists can quickly write stories on developing issues. Government agencies summarize lengthy policy documents and reports to help policymakers strategize and prioritize goals.

By creating condensed versions of long, complex documents, summarization technology enables users to focus on the most salient content. This leads to better comprehension and retention of critical information. The time savings allow stakeholders to review more material in less time, gaining a broader perspective. With enhanced understanding and more synthesized insights, organizations can make better informed strategic decisions, accelerate research, improve productivity, and increase their impact. The transformative power of advanced summarization capabilities will only continue growing as more industries adopt artificial intelligence (AI) to harness overflowing information streams.

In this post, we explore leading approaches for evaluating summarization accuracy objectively, including ROUGE metrics, METEOR, and BERTScore. Understanding the strengths and weaknesses of these techniques can help guide selection and improvement efforts. The overall goal of this post is to demystify summarization evaluation to help teams better benchmark performance on this critical capability as they seek to maximize value.

Types of summarization

Summarization can generally be divided into two main types: extractive summarization and abstractive summarization. Both approaches aim to condense long pieces of text into shorter forms, capturing the most critical information or essence of the original content, but they do so in fundamentally different ways.

Extractive summarization involves identifying and extracting key phrases, sentences, or segments from the original text without altering them. The system selects parts of the text deemed most informative or representative of the whole. Extractive summarization is useful if accuracy is critical and the summary needs to reflect the exact information from the original text. These could be use cases like highlighting specific legal terms, obligations, and rights outlined in the terms of use. The most common techniques used for extractive summarization are term frequency-inverse document frequency (TF-IDF), sentence scoring, text rank algorithm, and supervised machine learning (ML).

Abstractive summarization goes a step further by generating new phrases and sentences that were not in the original text, essentially paraphrasing and condensing the original content. This approach requires a deeper understanding of the text, because the AI needs to interpret the meaning and then express it in a new, concise form. Large language models (LLMs) are best suited for abstractive summarization because the transformer models use attention mechanisms to focus on relevant parts of the input text when generating summaries. The attention mechanism allows the model to assign different weights to different words or tokens in the input sequence, enabling it to capture long-range dependencies and contextually relevant information.

In addition to these two primary types, there are hybrid approaches that combine extractive and abstractive methods. These approaches might start with extractive summarization to identify the most important content and then use abstractive techniques to rewrite or condense that content into a fluent summary.

The challenge

Finding the optimal method to evaluate summary quality remains an open challenge. As organizations increasingly rely on automatic text summarization to distill key information from documents, the need grows for standardized techniques to measure summarization accuracy. Ideally, these evaluation metrics would quantify how well machine-generated summaries extract the most salient content from source texts and present coherent summaries reflecting the original meaning and context.

However, developing robust evaluation methodologies for text summarization presents difficulties:

  • Human-authored reference summaries used for comparison often exhibit high variability based on subjective determinations of importance
  • Nuanced aspects of summary quality like fluency, readability, and coherence prove difficult to quantify programmatically
  • Wide variation exists across summarization methods from statistical algorithms to neural networks, complicating direct comparisons

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

ROUGE metrics, such as ROUGE-N and ROUGE-L, play a crucial role in evaluating the quality of machine-generated summaries compared to human-written reference summaries. These metrics focus on assessing the overlap between the content of machine-generated and human-crafted summaries by analyzing n-grams, which are groups of words or tokens. For instance, ROUGE-1 evaluates the match of individual words (unigrams), whereas ROUGE-2 considers pairs of words (bigrams). Additionally, ROUGE-L assesses the longest common subsequence of words between the two texts, allowing for flexibility in word order.

To illustrate this, consider the following examples:

  • ROUGE-1 metric – ROUGE-1 evaluates the overlap of unigrams (single words) between a generated summary and a reference summary. For example, if a reference summary contains “The quick brown fox jumps,” and the generated summary is “The brown fox jumps quickly,” the ROUGE-1 metric would consider “brown,” “fox,” and “jumps” as overlapping unigrams. ROUGE-1 focuses on the presence of individual words in the summaries, measuring how well the generated summary captures the key words from the reference summary.
  • ROUGE-2 metric – ROUGE-2 assesses the overlap of bigrams (pairs of adjacent words) between a generated summary and a reference summary. For instance, if the reference summary has “The cat is sleeping,” and the generated summary reads “A cat is sleeping,” ROUGE-2 would identify “cat is” and “is sleeping” as overlapping bigrams. ROUGE-2 provides insight into how well the generated summary maintains the sequence and context of word pairs compared to the reference summary.
  • ROUGE-N metric – ROUGE-N is a generalized form where N represents any number, allowing evaluation based on n-grams (sequences of N words). Considering N=3, if the reference summary states “The sun is shining brightly,” and the generated summary is “Sun shining brightly,” ROUGE-3 would recognize “sun shining brightly” as a matching trigram. ROUGE-N offers flexibility to evaluate summaries based on different lengths of word sequences, providing a more comprehensive assessment of content overlap.

These examples illustrate how ROUGE-1, ROUGE-2, and ROUGE-N metrics function in evaluating automatic summarization or machine translation tasks by comparing generated summaries with reference summaries based on different levels of word sequences.

Calculate a ROUGE-N score

You can use the following steps to calculate a ROUGE-N score (a minimal Python sketch follows the list):

  1. Tokenize the generated summary and the reference summary into individual words or tokens using basic tokenization methods like splitting by whitespace or natural language processing (NLP) libraries.
  2. Generate n-grams (contiguous sequences of N words) from both the generated summary and the reference summary.
  3. Count the number of overlapping n-grams between the generated summary and the reference summary.
  4. Calculate precision, recall, and F1 score:
    • Precision – The number of overlapping n-grams divided by the total number of n-grams in the generated summary.
    • Recall – The number of overlapping n-grams divided by the total number of n-grams in the reference summary.
    • F1 score – The harmonic mean of precision and recall, calculated as (2 * precision * recall) / (precision + recall).
  5. The aggregate F1 score obtained from calculating precision, recall, and F1 score for each row in the dataset is considered as the ROUGE-N score.
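
The following is a minimal, illustrative Python implementation of these steps for a single summary pair. It uses simple whitespace tokenization; production evaluations typically rely on an established library such as rouge-score:

from collections import Counter

def rouge_n(generated: str, reference: str, n: int = 2):
    """Compute ROUGE-N precision, recall, and F1 for one summary pair."""
    def ngrams(text, n):
        tokens = text.lower().split()  # simple whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    gen_ngrams, ref_ngrams = ngrams(generated, n), ngrams(reference, n)
    # Clipped overlap: an n-gram counts at most as often as it appears in the reference
    overlap = sum((gen_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(gen_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# ROUGE-1 for the unigram example above
print(rouge_n("The brown fox jumps quickly", "The quick brown fox jumps", n=1))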

Limitations

ROUGE has the following limitations:

  • Narrow focus on lexical overlap – The core idea behind ROUGE is to compare the system-generated summary to a set of reference or human-created summaries, and measure the lexical overlap between them. This means ROUGE has a very narrow focus on word-level similarity. It doesn’t actually evaluate semantic meaning, coherence, or readability of the summary. A system could achieve high ROUGE scores by simply extracting sentences word-for-word from the original text, without generating a coherent or concise summary.
  • Insensitivity to paraphrasing – Because ROUGE relies on lexical matching, it can’t detect semantic equivalence between words and phrases. Therefore, paraphrasing and use of synonyms will often lead to lower ROUGE scores, even if the meaning is preserved. This disadvantages systems that paraphrase or summarize in an abstractive way.
  • Lack of semantic understanding – ROUGE doesn’t evaluate whether the system truly understood the meanings and concepts in the original text. A summary could achieve high lexical overlap with references, while missing the main ideas or containing factual inconsistencies. ROUGE would not identify these issues.

When to use ROUGE

ROUGE is simple and fast to calculate. Use it as a baseline or benchmark for summary quality related to content selection. ROUGE metrics are most effectively employed in scenarios involving abstractive summarization tasks, automatic summarization evaluation, assessments of LLMs, and comparative analyses of different summarization approaches. By using ROUGE metrics in these contexts, stakeholders can quantitatively evaluate the quality and effectiveness of summary generation processes.

Metric for Evaluation of Translation with Explicit Ordering (METEOR)

One of the major challenges in evaluating summarization systems is assessing how well the generated summary flows logically, rather than just selecting relevant words and phrases from the source text. Simply extracting relevant keywords and sentences doesn’t necessarily produce a coherent and cohesive summary. The summary should flow smoothly and connect ideas logically, even if they aren’t presented in the same order as the original document.

The flexibility of matching by reducing words to their root or base form (For example, after stemming, words like “running,” “runs,” and “ran” all become “run”) and synonyms means METEOR correlates better with human judgements of summary quality. It can identify if important content is preserved, even if the wording differs. This is a key advantage over n-gram based metrics like ROUGE, which only look for exact token matches. METEOR also gives higher scores to summaries that focus on the most salient content from the reference. Lower scores are given to repetitive or irrelevant information. This aligns well with the goal of summarization to keep the most important content only. METEOR is a semantically meaningful metric that can overcome some of the limitations of n-gram matching for evaluating text summarization. The incorporation of stemming and synonyms allows for better assessment of information overlap and content accuracy.

To illustrate this, consider the following examples:

Reference Summary: Leaves fall during autumn.

Generated Summary 1: Leaves drop in fall.

Generated Summary 2: Leaves green in summer.

The following words match between the reference and generated summary 1, either exactly, by stem, or by synonym:

Reference Summary: Leaves fall during autumn.

Generated Summary 1: Leaves drop in fall.

Even though “fall” and “autumn” are different tokens, METEOR recognizes them as synonyms through its synonym matching. “Drop” and “fall” are identified as a stemmed match. For generated summary 2, there are no matches with the reference summary besides “Leaves,” so this summary would receive a much lower METEOR score. The more semantically meaningful matches, the higher the METEOR score. This allows METEOR to better evaluate the content and accuracy of summaries compared to simple n-gram matching.

Calculate a METEOR score

Complete the following steps to calculate a METEOR score (a sketch using NLTK follows the list):

  1. Tokenize the generated summary and the reference summary into individual words or tokens using basic tokenization methods like splitting by whitespace or NLP libraries.
  2. Calculate the unigram precision, recall, and F-mean score, giving more weightage to recall than precision.
  3. Apply a penalty for exact matches to avoid overemphasizing them. The penalty is chosen based on dataset characteristics, task requirements, and the balance between precision and recall. Subtract this penalty from the F-mean score calculated in Step 2.
  4. Calculate the F-mean score for stemmed forms (reducing words to their base or root form) and synonyms for unigrams where applicable. Aggregate this with the earlier calculated F-mean score to obtain the final METEOR score. The METEOR score ranges from 0–1, where 0 indicates no similarity between the generated summary and reference summary, and 1 indicates perfect alignment. Typically, summarization scores fall between 0–0.6.
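
As a point of reference, NLTK ships a METEOR implementation. The following is a minimal sketch using the examples from this section; it assumes the WordNet data has been downloaded, and note that recent NLTK versions expect pre-tokenized input and that the exact matching stages may differ slightly from the description above:

import nltk
from nltk.translate.meteor_score import meteor_score

# One-time download of the WordNet data METEOR uses for synonym matching
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)  # required by some NLTK versions

reference = "leaves fall during autumn".split()
generated_1 = "leaves drop in fall".split()
generated_2 = "leaves green in summer".split()

# meteor_score takes a list of tokenized references and one tokenized hypothesis
print(meteor_score([reference], generated_1))  # more matches with the reference, so a higher score
print(meteor_score([reference], generated_2))  # only "leaves" matches, so a lower score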

Limitations

When employing the METEOR metric for evaluating summarization tasks, several challenges may arise:

  • Semantic complexity – METEOR’s emphasis on semantic similarity can struggle to capture the nuanced meanings and context in complex summarization tasks, potentially leading to inaccuracies in evaluation.
  • Reference variability – Variability in human-generated reference summaries can impact METEOR scores, because differences in reference content may affect the evaluation of machine-generated summaries.
  • Linguistic diversity – The effectiveness of METEOR may vary across languages due to linguistic variations, syntax differences, and semantic nuances, posing challenges in multilingual summarization evaluations.
  • Length discrepancy – Evaluating summaries of varying lengths can be challenging for METEOR, because discrepancies in length compared to the reference summary may result in penalties or inaccuracies in assessment.
  • Parameter tuning – Optimizing METEOR’s parameters for different datasets and summarization tasks can be time-consuming and require careful tuning to make sure the metric provides accurate evaluations.
  • Evaluation bias – There is a risk of evaluation bias with METEOR if not properly adjusted or calibrated for specific summarization domains or tasks. This can potentially lead to skewed results and affect the reliability of the evaluation process.

By being aware of these challenges and considering them when using METEOR as a metric for summarization tasks, researchers and practitioners can navigate potential limitations and make more informed decisions in their evaluation processes.

When to use METEOR

METEOR is commonly used to automatically evaluate the quality of text summaries. It is preferable to use METEOR as an evaluation metric when the order of ideas, concepts, or entities in the summary matters. METEOR considers the order and matches n-grams between the generated summary and reference summaries. It rewards summaries that preserve sequential information. Unlike metrics like ROUGE, which rely on overlap of n-grams with reference summaries, METEOR matches stems, synonyms, and paraphrases. METEOR works better when there can be multiple correct ways of summarizing the original text. METEOR incorporates WordNet synonyms and stemmed tokens when matching n-grams. In short, summaries that are semantically similar but use different words or phrasing will still score well. METEOR has a built-in penalty for summaries with repetitive n-grams. Therefore, it discourages word-for-word extraction or lack of abstraction. METEOR is a good choice when semantic similarity, order of ideas, and fluent phrasing are important for judging summary quality. It is less appropriate for tasks where only lexical overlap with reference summaries matters.

BERTScore

Surface-level lexical measures like ROUGE and METEOR evaluate summarization systems by comparing the word overlap between a candidate summary and a reference summary. However, they rely heavily on exact string matching between words and phrases. This means they may miss semantic similarities between words and phrases that have different surface forms but similar underlying meanings. By relying only on surface matching, these metrics may underestimate the quality of system summaries that use synonymous words or paraphrase concepts differently from reference summaries. Two summaries could convey nearly identical information but receive low surface-level scores due to vocabulary differences.

BERTScore is a way to automatically evaluate how good a summary is by comparing it to a reference summary written by a human. It uses BERT, a popular NLP technique, to understand the meaning and context of words in the candidate summary and reference summary. Specifically, it looks at each word or token in the candidate summary and finds the most similar word in the reference summary based on the BERT embeddings, which are vector representations of the meaning and context of each word. It measures the similarity using cosine similarity, which tells how close the vectors are to each other. For each word in the candidate summary, it finds the most related word in the reference summary using BERT’s understanding of language. It compares all these word similarities across the whole summary to get an overall score of how semantically similar the candidate summary is to the reference summary. The more similar the words and meanings captured by BERT, the higher the BERTScore. This allows it to automatically evaluate the quality of a generated summary by comparing it to a human reference without needing human evaluation each time.

To illustrate this, imagine you have a machine-generated summary: “The quick brown fox jumps over the lazy dog.” Now, let’s consider a human-crafted reference summary: “A fast brown fox leaps over a sleeping canine.”

Calculate a BERTScore

Complete the following steps to calculate a BERTScore:

  1. BERTScore uses contextual embeddings to represent each token in both the candidate (machine-generated) and reference (human-crafted) sentences. Contextual embeddings are a type of word representation in NLP that captures the meaning of a word based on its context within a sentence or text. Unlike traditional word embeddings that assign a fixed vector to each word regardless of its context, contextual embeddings consider the surrounding words to generate a unique representation for each word depending on how it is used in a specific sentence.
  2. The metric then computes the similarity between each token in the candidate sentence with each token in the reference sentence using cosine similarity. Cosine similarity helps us quantify how closely related two sets of data are by focusing on the direction they point in a multi-dimensional space, making it a valuable tool for tasks like search algorithms, NLP, and recommendation systems.
  3. By comparing the contextual embeddings and computing similarity scores for all tokens, BERTScore generates a comprehensive evaluation that captures the semantic relevance and context of the generated summary compared to the human-crafted reference.
  4. The final BERTScore output provides a similarity score that reflects how well the machine-generated summary aligns with the reference summary in terms of meaning and context.

In essence, BERTScore goes beyond traditional metrics by considering the semantic nuances and context of sentences, offering a more sophisticated evaluation that closely mirrors human judgment. This advanced approach enhances the accuracy and reliability of evaluating summarization tasks, making BERTScore a valuable tool in assessing text generation systems.
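
You typically don't implement these steps by hand. As an illustration (assuming the open source bert-score package, which this post doesn't prescribe), scoring the fox example above could look like the following:

from bert_score import score

candidates = ["The quick brown fox jumps over the lazy dog."]
references = ["A fast brown fox leaps over a sleeping canine."]

# Returns per-sentence precision, recall, and F1 tensors computed from contextual embeddings
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")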

Limitations

Although BERTScore offers significant advantages in evaluating summarization tasks, it also comes with certain limitations that need to be considered:

  • Computational intensity – BERTScore can be computationally intensive due to its reliance on pre-trained language models like BERT. This can lead to longer evaluation times, especially when processing large volumes of text data.
  • Dependency on pre-trained models – The effectiveness of BERTScore is highly dependent on the quality and relevance of the pre-trained language model used. In scenarios where the pre-trained model may not adequately capture the nuances of the text, the evaluation results may be affected.
  • Scalability – Scaling BERTScore for large datasets or real-time applications can be challenging due to its computational demands. Implementing BERTScore in production environments may require optimization strategies to provide efficient performance.
  • Domain specificity – BERTScore’s performance may vary across different domains or specialized text types. Adapting the metric to specific domains or tasks may require fine-tuning or adjustments to produce accurate evaluations.
  • Interpretability – Although BERTScore provides a comprehensive evaluation based on contextual embeddings, interpreting the specific reasons behind the similarity scores generated for each token can be complex and may require additional analysis.
  • Reference-free evaluation – Although BERTScore reduces the reliance on reference summaries for evaluation, this reference-free approach may not fully capture all aspects of summarization quality, particularly in scenarios where human-crafted references are essential for assessing content relevance and coherence.

Acknowledging these limitations can help you make informed decisions when using BERTScore as a metric for evaluating summarization tasks, providing a balanced understanding of its strengths and constraints.

When to use BERTScore

BERTScore can evaluate the quality of text summarization by comparing a generated summary to a reference summary. It uses neural networks like BERT to measure semantic similarity beyond just exact word or phrase matching. This makes BERTScore very useful when semantic fidelity (preserving the full meaning and content) is critical for your summarization task. BERTScore will give higher scores to summaries that convey the same information as the reference summary, even if they use different words and sentence structures. The bottom line is that BERTScore is ideal for summarization tasks where retaining the full semantic meaning, not just keywords or topics, is vital. Its advanced neural scoring allows it to compare meaning beyond surface-level word matching. This makes it suitable for cases where subtle differences in wording can substantially alter overall meaning and implications. BERTScore, in particular, excels in capturing semantic similarity, which is crucial for assessing the quality of abstractive summaries like those produced by Retrieval Augmented Generation (RAG) models.

Model evaluation frameworks

Model evaluation frameworks are essential for accurately gauging the performance of various summarization models. These frameworks are instrumental in comparing models, providing coherence between generated summaries and source content, and pinpointing deficiencies in evaluation methods. By conducting thorough assessments and consistent benchmarking, these frameworks propel text summarization research by advocating standardized evaluation practices and enabling multifaceted model comparisons.

In AWS, the FMEval library within Amazon SageMaker Clarify streamlines the evaluation and selection of foundation models (FMs) for tasks like text summarization, question answering, and classification. It empowers you to evaluate FMs based on metrics such as accuracy, robustness, creativity, bias, and toxicity, supporting both automated and human-in-the-loop evaluations for LLMs. With UI-based or programmatic evaluations, FMEval generates detailed reports with visualizations to quantify model risks like inaccuracies, toxicity, or bias, helping organizations align with their responsible generative AI guidelines. In this section, we demonstrate how to use the FMEval library.

Evaluate Claude v2 on summarization accuracy using Amazon Bedrock

The following code snippet is an example of how to interact with the Anthropic Claude model using Python code:

import json
import boto3

# Amazon Bedrock runtime client used to invoke the model
bedrock_runtime = boto3.client("bedrock-runtime")

# We use Claude v2 in this example.
# See https://docs.anthropic.com/claude/reference/claude-on-amazon-bedrock#list-available-models
# for instructions on how to list the model IDs for all available Claude model variants.
model_id = 'anthropic.claude-v2'
accept = "application/json"
contentType = "application/json"
# `prompt_data` is structured in the format that the Claude model expects, as documented here:
# https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-claude.html#model-parameters-claude-request-body
prompt_data = """Human: Who is Barack Obama?
Assistant:
"""
# For more details on parameters that can be included in `body` (such as "max_tokens_to_sample"),
# see https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-claude.html#model-parameters-claude-request-body
body = json.dumps({"prompt": prompt_data, "max_tokens_to_sample": 500})
# Invoke the model
response = bedrock_runtime.invoke_model(
    body=body, modelId=model_id, accept=accept, contentType=contentType
)
# Parse the invocation response
response_body = json.loads(response.get("body").read())
print(response_body.get("completion"))

In simple terms, this code performs the following actions:

  1. Import the necessary libraries, including json, to work with JSON data.
  2. Define the model ID as anthropic.claude-v2 and set the content type for the request.
  3. Create a prompt_data variable that structures the input data for the Claude model. In this case, it asks the question “Who is Barack Obama?” and expects a response from the model.
  4. Construct a JSON object named body that includes the prompt data, and specify additional parameters like the maximum number of tokens to generate.
  5. Invoke the Claude model using bedrock_runtime.invoke_model with the defined parameters.
  6. Parse the response from the model, extract the completion (generated text), and print it out.

Make sure the AWS Identity and Access Management (IAM) role associated with the Amazon SageMaker Studio user profile has access to the Amazon Bedrock models being invoked. Refer to Identity-based policy examples for Amazon Bedrock for guidance on best practices and examples of identity-based policies for Amazon Bedrock.

Using the FMEval library to evaluate the summarized output from Claude

We use the following code to evaluate the summarized output:

from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
config = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri="gigaword_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary"
)
bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output='completion',
    content_template='{"prompt": $prompt, "max_tokens_to_sample": 500}'
)
eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config,
    prompt_template="Human: Summarise the following text in one sentence: $feature\n\nAssistant:\n", save=True)

In the preceding code snippet, to evaluate text summarization using the FMEval library, we complete the following steps:

  1. Create a ModelRunner to perform invocation on your LLM. The FMEval library provides built-in support for Amazon SageMaker endpoints and Amazon SageMaker JumpStart LLMs. You can also extend the ModelRunner interface for any LLMs hosted anywhere.
  2. Use supported eval_algorithms like toxicity, summarization accuracy, and semantic robustness, based on your evaluation needs.
  3. Customize the evaluation configuration parameters for your specific use case.
  4. Use the evaluation algorithm with either built-in or custom datasets to evaluate your LLM model. The dataset used in this case is sourced from the following GitHub repo.

Refer to the developer guide and examples for detailed usage of evaluation algorithms.
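
The DataConfig in the preceding code expects gigaword_sample.jsonl to be a JSON Lines file whose records contain the fields named in model_input_location and target_output_location. A minimal, hypothetical sketch of preparing such a file:

import json

# Two illustrative records; a real evaluation would use the full dataset referenced above
records = [
    {"document": "Full article text to be summarized ...", "summary": "Reference summary ..."},
    {"document": "Another article ...", "summary": "Its reference summary ..."},
]

with open("gigaword_sample.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")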

The following table summarizes the results of the evaluation.

| model_input | model_output | target_output | prompt | scores | meteor_score | rouge_score | bert_score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| John Edward Bates, formerly of Spalding, Linco… | I cannot make any definitive judgments, as th… | A former Lincolnshire Police officer carried o… | Human: John Edward Bates, formerly of Spalding… | [{'name': 'meteor', 'value': 0.101010101010101… | 0.10101 | 0 | 0.557155 |
| 23 October 2015 Last updated at 17:44 BST It'… | Here are some key points about hurricane/trop… | Hurricane Patricia has been rated as a categor… | Human: 23 October 2015 Last updated at 17:44 B… | [{'name': 'meteor', 'value': 0.102339181286549… | 0.102339 | 0.018265 | 0.441421 |
| Ferrari appeared in a position to challenge un… | Here are the key points from the article:… | Lewis Hamilton stormed to pole position at the… | Human: Ferrari appeared in a position to chall… | [{'name': 'meteor', 'value': 0.322543352601156… | 0.322543 | 0.078212 | 0.606487 |
| The Bath-born player, 28, has made 36 appearan… | Okay, let me summarize the key points:… | Newport Gwent Dragons number eight Ed Jackson | Human: The Bath-born player, 28, has made 36 a… | [{'name': 'meteor', 'value': 0.105740181268882… | 0.10574 | 0.012987 | 0.539488 |
| Weaknesses in the way mice swapped data with c… | Here are the key points I gathered from the a… | Hackers could gain access to home and | Human: Weaknesses in the way mice swapped data… | [{'name': 'meteor', 'value': 0.201048289433848… | 0.201048 | 0.021858 | 0.526947 |

Check out the sample notebook for more details about the summarization evaluation that we discussed in this post.

Conclusion

ROUGE, METEOR, and BERTScore all measure the quality of machine-generated summaries, but focus on different aspects like lexical overlap, fluency, or semantic similarity. Make sure to select the metric that aligns with what defines “good” for your specific summarization use case. You can also use a combination of metrics. This provides a more well-rounded evaluation and guards against potential weaknesses of any individual metric. With the right measurements, you can iteratively improve your summarizers to meet whichever notion of accuracy matters most.

Additionally, FM and LLM evaluation is necessary to be able to productionize these models at scale. With FMEval, you get a vast set of built-in algorithms across many NLP tasks, but also a scalable and flexible tool for large-scale evaluations of your own models, datasets, and algorithms. To scale up, you can use this package in your LLMOps pipelines to evaluate multiple models. To learn more about FMEval in AWS and how to use it effectively, refer to Use SageMaker Clarify to evaluate large language models. For further understanding and insights into the capabilities of SageMaker Clarify in evaluating FMs, see Amazon SageMaker Clarify Makes It Easier to Evaluate and Select Foundation Models.


About the Authors

Dinesh Kumar Subramani is a Senior Solutions Architect based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning, and is a member of the technical field community within Amazon. Dinesh works closely with UK Central Government customers to solve their problems using AWS services. Outside of work, Dinesh enjoys spending quality time with his family, playing chess, and exploring a diverse range of music.

Pranav Sharma is an AWS leader driving technology and business transformation initiatives across Europe, the Middle East, and Africa. He has experience in designing and running artificial intelligence platforms in production that support millions of customers and deliver business outcomes. He has played technology and people leadership roles for Global Financial Services organizations. Outside of work, he likes to read, play tennis with his son, and watch movies.

Read More

Enhance conversational AI with advanced routing techniques with Amazon Bedrock

Conversational artificial intelligence (AI) assistants are engineered to provide precise, real-time responses through intelligent routing of queries to the most suitable AI functions. With AWS generative AI services like Amazon Bedrock, developers can create systems that expertly manage and respond to user requests. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon using a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

This post assesses two primary approaches for developing AI assistants: using managed services such as Agents for Amazon Bedrock, and employing open source technologies like LangChain. We explore the advantages and challenges of each, so you can choose the most suitable path for your needs.

What is an AI assistant?

An AI assistant is an intelligent system that understands natural language queries and interacts with various tools, data sources, and APIs to perform tasks or retrieve information on behalf of the user. Effective AI assistants possess the following key capabilities:

  • Natural language processing (NLP) and conversational flow
  • Knowledge base integration and semantic searches to understand and retrieve relevant information based on the nuances of conversation context
  • Running tasks, such as database queries and custom AWS Lambda functions
  • Handling specialized conversations and user requests

We demonstrate the benefits of AI assistants using Internet of Things (IoT) device management as an example. In this use case, AI can help technicians manage machinery efficiently with commands that fetch data or automate tasks, streamlining operations in manufacturing.

Agents for Amazon Bedrock approach

Agents for Amazon Bedrock allows you to build generative AI applications that can run multi-step tasks across a company’s systems and data sources. It offers the following key capabilities:

  • Automatic prompt creation from instructions, API details, and data source information, saving weeks of prompt engineering effort
  • Retrieval Augmented Generation (RAG) to securely connect agents to a company’s data sources and provide relevant responses
  • Orchestration and running of multi-step tasks by breaking down requests into logical sequences and calling necessary APIs
  • Visibility into the agent’s reasoning through a chain-of-thought (CoT) trace, allowing troubleshooting and steering of model behavior
  • Prompt engineering abilities to modify the automatically generated prompt template for enhanced control over agents

You can use Agents for Amazon Bedrock and Knowledge Bases for Amazon Bedrock to build and deploy AI assistants for complex routing use cases. They provide a strategic advantage for developers and organizations by simplifying infrastructure management, enhancing scalability, improving security, and reducing undifferentiated heavy lifting. They also allow for simpler application layer code because the routing logic, vectorization, and memory is fully managed.

Solution overview

This solution introduces a conversational AI assistant tailored for IoT device management and operations when using Anthropic’s Claude v2.1 on Amazon Bedrock. The AI assistant’s core functionality is governed by a comprehensive set of instructions, known as a system prompt, which delineates its capabilities and areas of expertise. This guidance makes sure the AI assistant can handle a wide range of tasks, from managing device information to running operational commands.

"""The following is the system prompt that outlines the full scope of the AI assistant's capabilities:
You are an IoT Ops agent that handles the following activities:
- Looking up IoT device information
- Checking IoT operating metrics (historical data)
- Performing actions on a device-by-device ID
- Answering general questions
You can check device information (Device ID, Features, Technical Specifications, Installation Guide, Maintenance and Troubleshooting, Safety Guidelines, Warranty, and Support) from the "IotDeviceSpecs" knowledge base.
Additionally, you can access device historical data or device metrics. The device metrics are stored in an Athena DB named "iot_ops_glue_db" in a table named "iot_device_metrics". 
The table schema includes fields for oil level, temperature, pressure, received_at timestamp, and device_id.
The available actions you can perform on the devices include start, shutdown, and reboot."""

Equipped with these capabilities, as detailed in the system prompt, the AI assistant follows a structured workflow to address user questions. The following figure provides a visual representation of this workflow, illustrating each step from initial user interaction to the final response.

The workflow is composed of the following steps:

  1. The process begins when a user requests the assistant to perform a task; for example, asking for the maximum data points for a specific IoT device device_xxx. This text input is captured and sent to the AI assistant.
  2. The AI assistant interprets the user’s text input. It uses the provided conversation history, action groups, and knowledge bases to understand the context and determine the necessary tasks.
  3. After the user’s intent is parsed and understood, the AI assistant defines tasks. This is based on the instructions that are interpreted by the assistant as per the system prompt and user’s input.
  4. The tasks are then run through a series of API calls. This is done using ReAct prompting, which breaks down the task into a series of steps that are processed sequentially:
    1. For device metrics checks, we use the check-device-metrics action group, which involves an API call to Lambda functions that then query Amazon Athena for the requested data.
    2. For direct device actions like start, stop, or reboot, we use the action-on-device action group, which invokes a Lambda function. This function initiates a process that sends commands to the IoT device. For this post, the Lambda function sends notifications using Amazon Simple Email Service (Amazon SES).
    3. We use Knowledge Bases for Amazon Bedrock to fetch from historical data stored as embeddings in the Amazon OpenSearch Service vector database.
  5. After the tasks are complete, the final response is generated by the Amazon Bedrock FM and conveyed back to the user.
  6. Agents for Amazon Bedrock automatically stores information using a stateful session to maintain the same conversation. The state is deleted after a configurable idle timeout elapses.

Technical overview

The following diagram illustrates the architecture to deploy an AI assistant with Agents for Amazon Bedrock.

Architecture diagram to deploy an AI assistant with Agents for Amazon Bedrock.

It consists of the following key components:

  • Conversational interface – The conversational interface uses Streamlit, an open source Python library that simplifies the creation of custom, visually appealing web apps for machine learning (ML) and data science. It is hosted on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate, and it is accessed using an Application Load Balancer. You can use Fargate with Amazon ECS to run containers without having to manage servers, clusters, or virtual machines.
  • Agents for Amazon Bedrock – Agents for Amazon Bedrock completes the user queries through a series of reasoning steps and corresponding actions based on ReAct prompting:
    • Knowledge Bases for Amazon Bedrock – Knowledge Bases for Amazon Bedrock provides fully managed RAG to supply the AI assistant with access to your data. In our use case, we uploaded device specifications into an Amazon Simple Storage Service (Amazon S3) bucket. It serves as the data source to the knowledge base.
    • Action groups – These are defined API schemas that invoke specific Lambda functions to interact with IoT devices and other AWS services.
    • Anthropic Claude v2.1 on Amazon Bedrock – This model interprets user queries and orchestrates the flow of tasks.
    • Amazon Titan Embeddings – This model serves as a text embeddings model, transforming natural language text—from single words to complex documents—into numerical vectors. This enables vector search capabilities, allowing the system to semantically match user queries with the most relevant knowledge base entries for effective search.

The solution is integrated with AWS services such as Lambda for running code in response to API calls, Athena for querying datasets, OpenSearch Service for searching through knowledge bases, and Amazon S3 for storage. These services work together to provide a seamless experience for IoT device operations management through natural language commands.
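
For completeness, the following is a minimal, hypothetical sketch of how a client such as the Streamlit interface could call the agent through the Bedrock Agents runtime API; the agent ID, alias ID, and question are placeholders:

import uuid
import boto3

# Bedrock Agents runtime client; the agent ID and alias ID below are placeholders
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.invoke_agent(
    agentId="<your-agent-id>",
    agentAliasId="<your-agent-alias-id>",
    sessionId=str(uuid.uuid4()),  # reuse the same session ID to keep conversation state
    inputText="What is the maximum temperature for device_1001 over the past week?",
)

# The response is streamed back as an event stream of completion chunks
completion = ""
for event in response["completion"]:
    chunk = event.get("chunk", {})
    completion += chunk.get("bytes", b"").decode("utf-8")
print(completion)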

Benefits

This solution offers the following benefits:

  • Implementation complexity:
    • Fewer lines of code are required, because Agents for Amazon Bedrock abstracts away much of the underlying complexity, reducing development effort
    • Managing vector databases like OpenSearch Service is simplified, because Knowledge Bases for Amazon Bedrock handles vectorization and storage
    • Integration with various AWS services is more streamlined through pre-defined action groups
  • Developer experience:
    • The Amazon Bedrock console provides a user-friendly interface for prompt development, testing, and root cause analysis (RCA), enhancing the overall developer experience
  • Agility and flexibility:
    • Agents for Amazon Bedrock allows for seamless upgrades to newer FMs (such as Claude 3.0) when they become available, so your solution stays up to date with the latest advancements
    • Service quotas and limitations are managed by AWS, reducing the overhead of monitoring and scaling infrastructure
  • Security:
    • Amazon Bedrock is a fully managed service, adhering to AWS’s stringent security and compliance standards, potentially simplifying organizational security reviews

Although Agents for Amazon Bedrock offers a streamlined and managed solution for building conversational AI applications, some organizations may prefer an open source approach. In such cases, you can use frameworks like LangChain, which we discuss in the next section.

LangChain dynamic routing approach

LangChain is an open source framework that simplifies building conversational AI by allowing the integration of large language models (LLMs) and dynamic routing capabilities. With LangChain Expression Language (LCEL), developers can define the routing, which allows you to create non-deterministic chains where the output of a previous step defines the next step. Routing helps provide structure and consistency in interactions with LLMs.

For this post, we use the same example as the AI assistant for IoT device management. However, the main difference is that we need to handle the system prompts separately and treat each chain as a separate entity. The routing chain decides the destination chain based on the user’s input. The decision is made with the support of an LLM by passing the system prompt, chat history, and user’s question.

Solution overview

The following diagram illustrates the dynamic routing solution workflow.

Dynamic routing solution workflow with LangChain

The workflow consists of the following steps:

  1. The user presents a question to the AI assistant. For example, “What are the max metrics for device 1009?”
  2. An LLM evaluates each question along with the chat history from the same session to determine its nature and which subject area it falls under (such as SQL, action, search, or SME). The LLM classifies the input and the LCEL routing chain takes that input.
  3. The router chain selects the destination chain based on the input, and the LLM is provided with the following system prompt:
"""Given the user question below, classify it as one of the candidate prompts. You may want to modify the input considering the chat history and the context of the question. 
Sometimes the user may just assume that you have the context of the conversation and may not provide a clear input. Hence, you are being provided with the chat history for more context. 
Respond with only a Markdown code snippet containing a JSON object formatted EXACTLY as specified below. 
Do not provide an explanation to your classification beside the Markdown, I just need to know your decision on which destination and next_inputs
<candidate prompt>
physics: Good for answering questions about physics
sql: Good for querying sql from AWS Athena. User input may look like: get me max or min for device x?
lambdachain: Good to execute actions with Amazon Lambda like shutting down a device or turning off an engine. User input can be like, shutdown device x, or terminate process y, etc.
rag: Good to search knowledgebase and retrieve information about devices and other related information. User question can be like: what do you know about device x?
default: if the input is not well suited for any of the candidate prompts above. this could be used to carry on the conversation and respond to queries like provide a summary of the conversation
</candidate prompt>"""

The LLM evaluates the user’s question along with the chat history to determine the nature of the query and which subject area it falls under. The LLM then classifies the input and outputs a JSON response in the following format:

<Markdown>
```json
{{
"destination": string  name of the prompt to use
"next_inputs": string  a potentially modified version of the original input
}}
```

The router chain uses this JSON response to invoke the corresponding destination chain. There are four subject-specific destination chains, each with its own system prompt:

  • SQL-related queries are sent to the SQL destination chain for database interactions. You can use LCEL to build the SQL chain.
  • Action-oriented questions invoke the custom Lambda destination chain for running operations. With LCEL, you can define your own custom function; in our case, it’s a function to run a predefined Lambda function to send an email with a device ID parsed. Example user input might be “Shut down device 1009.”
  • Search-focused inquiries proceed to the RAG destination chain for information retrieval.
  • SME-related questions go to the SME/expert destination chain for specialized insights.

The workflow then continues as follows:

  4. Each destination chain takes the input and runs the necessary models or functions:
    1. The SQL chain uses Athena for running queries.
    2. The RAG chain uses OpenSearch Service for semantic search.
    3. The custom Lambda chain runs Lambda functions for actions.
    4. The SME/expert chain provides insights using the Amazon Bedrock model.
  5. Responses from each destination chain are formulated into coherent insights by the LLM. These insights are then delivered to the user, completing the query cycle.
  6. User input and responses are stored in Amazon DynamoDB to provide context to the LLM for the current session and from past interactions. The duration of persisted information in DynamoDB is controlled by the application.
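
The following is a minimal sketch of how such an LCEL routing chain might be wired up, assuming the destination chains (sql_chain, lambda_chain, rag_chain, default_chain), the router system prompt shown earlier, and the langchain-aws Bedrock integration are already available; the names and JSON parsing here are illustrative rather than the solution’s exact code:

import json
import re
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
from langchain_aws import ChatBedrock

llm = ChatBedrock(model_id="anthropic.claude-v2:1")

router_prompt = ChatPromptTemplate.from_messages([
    ("system", ROUTER_SYSTEM_PROMPT),  # assumed: the classification prompt shown above
    ("human", "Chat history:\n{chat_history}\n\nQuestion: {question}"),
])

# The LLM returns a Markdown JSON snippet with "destination" and "next_inputs"
router_chain = router_prompt | llm | StrOutputParser()

def route(llm_output: str):
    # Extract and parse the router's JSON classification, then invoke the matching chain
    payload = json.loads(re.search(r"\{.*\}", llm_output, re.DOTALL).group(0))
    destinations = {"sql": sql_chain, "lambdachain": lambda_chain,
                    "rag": rag_chain, "default": default_chain}  # assumed chains
    chain = destinations.get(payload.get("destination", "default"), default_chain)
    return chain.invoke({"question": payload.get("next_inputs", "")})

full_chain = router_chain | RunnableLambda(route)
answer = full_chain.invoke({"question": user_question, "chat_history": chat_history})

The RunnableLambda wrapper is what makes the routing dynamic: the destination is resolved at runtime from the LLM’s JSON classification rather than being fixed in the chain definition.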

Technical overview

The following diagram illustrates the architecture of the LangChain dynamic routing solution.

Architecture diagram of the LangChain dynamic routing solution

The web application is built on Streamlit hosted on Amazon ECS with Fargate, and it is accessed using an Application Load Balancer. We use Anthropic’s Claude v2.1 on Amazon Bedrock as our LLM. The web application interacts with the model using LangChain libraries. It also interacts with a variety of other AWS services, such as OpenSearch Service, Athena, and DynamoDB, to fulfill end-users’ needs.

Benefits

This solution offers the following benefits:

  • Implementation complexity:
    • Although it requires more code and custom development, LangChain provides greater flexibility and control over the routing logic and integration with various components.
    • Managing vector databases like OpenSearch Service requires additional setup and configuration efforts. The vectorization process is implemented in code.
    • Integrating with AWS services may involve more custom code and configuration.
  • Developer experience:
    • LangChain’s Python-based approach and extensive documentation can be appealing to developers already familiar with Python and open source tools.
    • Prompt development and debugging may require more manual effort compared to using the Amazon Bedrock console.
  • Agility and flexibility:
    • LangChain supports a wide range of LLMs, allowing you to switch between different models or providers, fostering flexibility.
    • The open source nature of LangChain enables community-driven improvements and customizations.
  • Security:
    • As an open source framework, LangChain may require more rigorous security reviews and vetting within organizations, potentially adding overhead.

Conclusion

Conversational AI assistants are transformative tools for streamlining operations and enhancing user experiences. This post explored two powerful approaches using AWS services: the managed Agents for Amazon Bedrock and the flexible, open source LangChain dynamic routing. The choice between these approaches hinges on your organization’s requirements, development preferences, and desired level of customization. Regardless of the path taken, AWS empowers you to create intelligent AI assistants that revolutionize business and customer interactions.

Find the solution code and deployment assets in our GitHub repository, where you can follow the detailed steps for each conversational AI approach.


About the Authors

Ameer Hakme is an AWS Solutions Architect based in Pennsylvania. He collaborates with Independent Software Vendors (ISVs) in the Northeast region, assisting them in designing and building scalable and modern platforms on the AWS Cloud. An expert in AI/ML and generative AI, Ameer helps customers unlock the potential of these cutting-edge technologies. In his leisure time, he enjoys riding his motorcycle and spending quality time with his family.

Sharon Li is an AI/ML Solutions Architect at Amazon Web Services based in Boston, with a passion for designing and building Generative AI applications on AWS. She collaborates with customers to leverage AWS AI/ML services for innovative solutions.

Kawsar Kamal is a senior solutions architect at Amazon Web Services with over 15 years of experience in the infrastructure automation and security space. He helps clients design and build scalable DevSecOps and AI/ML solutions in the Cloud.

Read More

Improve LLM performance with human and AI feedback on Amazon SageMaker for Amazon Engineering

Improve LLM performance with human and AI feedback on Amazon SageMaker for Amazon Engineering

The Amazon EU Design and Construction (Amazon D&C) team is the engineering team designing and constructing Amazon warehouses. The team navigates a large volume of documents and locates the right information to make sure the warehouse design meets the highest standards. In the post A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction, we presented a question answering bot solution using a Retrieval Augmented Generation (RAG) pipeline with a fine-tuned large language model (LLM) for Amazon D&C to efficiently retrieve accurate information from a large volume of unorganized documents, and provide timely and high-quality services in their construction projects. The Amazon D&C team implemented the solution in a pilot for Amazon engineers and collected user feedback.

In this post, we share how we analyzed the feedback data, identified the accuracy limitations and hallucinations in the RAG-provided responses, and used the human evaluation scores to train the model through reinforcement learning. To increase the number of training samples for better learning, we also used another LLM to generate feedback scores. This method addressed the RAG limitations and further improved the bot response quality. We present the reinforcement learning process and the benchmarking results to demonstrate the LLM performance improvement. The solution uses Amazon SageMaker JumpStart as the core service for model deployment, fine-tuning, and reinforcement learning.

Collect feedback from Amazon engineers in a pilot project

After developing the solution described in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction, the Amazon D&C team deployed the solution and ran a pilot project with Amazon engineers. The engineers accessed the pilot system through a web application developed with Streamlit and connected to the RAG pipeline. In the pipeline, we used Amazon OpenSearch Service for the vector database, and deployed a fine-tuned Mistral-7B-Instruct model on Amazon SageMaker.

One of the key objectives of the pilot is to collect feedback from Amazon engineers and use the feedback to further reduce LLM hallucinations. To achieve this, we developed a feedback collection module in the UI, as shown in the following figure, and stored the web session information and user feedback in Amazon DynamoDB. Through the feedback collection UI, Amazon engineers can select from five satisfaction levels: strongly disagree, disagree, neutral, agree, and strongly agree, corresponding to the feedback scores from 1–5. They can also provide a better answer to the question or comment on why the LLM response is not satisfactory.

human feedback UI

During the pilot, users provided 118 feedback responses. The following chart shows the distribution from negative to positive. We observed that incorrect answers, marked by negative feedback (disagree and strongly disagree), account for 45% of the total feedback (53 out of 118). Some of the errors (about 10% of negative feedback and 7.5% of overall responses) can be addressed by user education and prompt engineering. We can address the remaining issues with LLM fine-tuning and reinforcement learning, described in the next section.

User feedback from pilot

Some users simply left a note, such as “Great!” for the strongly agree answers, or “Doesn’t answer the question” for the strongly disagree answers. Other users provided scores and explained in their notes how they judged the LLM answers. In the following example, the user specifically provided the document and content needed to correct the LLM hallucination.

Question: How many car parking spaces do we have?
Bot answer:

Based on the given context, we don’t have enough information to determine the number of car parking spaces. The context discusses the minimum size of parking spaces and the requirements for external lighting levels, but it does not provide a specific number of parking spaces.

**You can refer to further explanations in the following resources:**

* ARS GEN 10.0/05.01.02. Design Criteria & Appendices/Performance Package AR Sortable Design Criteria v20.1.1.pdf – page: 10

* ARS GEN 10.0/05.01.02. Design Criteria & Appendices/Performance Package AR Sortable Design Criteria v20.1.1.pdf – page: 79

User score: Strongly Disagree
User notes: This is specified on page 21 of design criteria section 01 13 10

Improve bot response with supervised fine-tuning and reinforcement learning

The solution consists of three steps of fine-tuning:

  1. Conduct supervised fine-tuning using labeled data. This method was described in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction.
  2. Collect user feedback to label the question-answer pairs for further LLM tuning.
  3. When the training data is ready, further tune the model using reinforcement learning from human feedback (RLHF).

RLHF is widely used throughout generative artificial intelligence (AI) and LLM applications. It incorporates human feedback in the reward function and trains the model with a reinforcement learning algorithm to maximize rewards, which makes the model perform tasks more aligned with human goals. The following diagram shows the pipeline of the steps.

Fine tuning workflow

We tested the methodology using the Amazon D&C documents with a Mistral-7B model on SageMaker JumpStart.

Supervised fine-tuning

In the previous post, we demonstrated how the fine-tuned Falcon-7B model outperforms the RAG pipeline and improves the quality and accuracy of the QA bot response. For this post, we performed supervised fine-tuning on the Mistral-7B model. The supervised fine-tuning used the PEFT/LoRA technique (LoRA_r = 512, LoRA_alpha = 1024) on 436,207,616 parameters (5.68% of the total 7,677,964,288 parameters). The training was conducted on a p3.8x node with 137 samples synthetically generated by an LLM and validated by humans; the process converged well after 20 epochs, as shown in the following figure.

SFT training process

The fine-tuned model was validated on 274 samples, and the inference results were compared with the reference answers using the semantic similarity score. The score is 0.8100, which is higher than the score of 0.6419 from the traditional RAG approach.
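
For reference, a hedged sketch of launching such a supervised fine-tuning job with SageMaker JumpStart follows; the model ID, dataset location, and hyperparameter names are assumptions, so retrieve and inspect the defaults for your chosen model before overriding them:

from sagemaker import hyperparameters
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id = "huggingface-llm-mistral-7b"            # assumed JumpStart model ID
train_data_s3 = "s3://<your-bucket>/dc-qa-train/"  # assumed training data location

# Start from the model's default hyperparameters, then override the LoRA settings
defaults = hyperparameters.retrieve_default(model_id=model_id, model_version="*")
defaults.update({
    "instruction_tuned": "True",   # illustrative names; check the retrieved defaults
    "epoch": "20",
    "lora_r": "512",
    "lora_alpha": "1024",
})

estimator = JumpStartEstimator(
    model_id=model_id,
    instance_type="ml.p3.8xlarge",
    hyperparameters=defaults,
)
estimator.fit({"training": train_data_s3})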

Collect human and AI feedback for reinforcement learning

For RLHF, a sufficient amount of high-quality training samples labeled by subject matter experts (SMEs) is essential. However, poor-quality human labels will likely cause worse model performance than the original model after RLHF training. SMEs’ time is a scarce resource in any organization; reviewing hundreds or thousands of LLM responses and providing feedback requires a significant time investment from SMEs that may not have a clear return on investment.

To address this challenge, we adopted the reinforcement learning from AI feedback (RLAIF) method. RLAIF employs an AI assistant (another LLM) to provide evaluation scores, rather than relying on humans. In this hybrid learning approach, the learning agent refines its actions based not only on the interaction with a human but also on feedback provided by another AI model. This approach is much more scalable for providing sufficient training data for reinforcement learning, and much less subjective because the training doesn’t depend on the particular preferences of a small group of SMEs. The SMEs may still be involved, but only to supervise the process and examine the AI feedback quality. This tremendously reduces the SMEs’ workload, because the AI-generated scores and labels are much less complicated for SMEs to analyze, filter, and sort.

In this use case, we provide the following prompt to an Anthropic Claude 2 model to generate the evaluation scores as the training samples for the reinforcement learning step:

"""

Human:

I am going to provide you with a question, a reference answer, and an machine generated answer to the question.

question: {question}

reference answer: {reference_answer}

machine_answer: {machine_answer}

Please analyze the question, the reference answer, and the machine generated answer to determine if the machine generated answer semantically matches the reference answer.

First think through your answer inside of <thinking> tags, then assign a score between 0.0 and 5.0 based on how well the machine answer matches the reference answer. Score 5.0 means a completely match, score 0.0 means a completely mismatch. Answer the score inside of <scoring> tags.

Also provide the reason to give this score inside of <reason> tags.

Assistant: <thinking>

"""

The Claude 2 model generates an evaluation score and a comment on why the score is given, as shown in the following example.

Question: What is meant by ‘Fit Out Works’ in this context?
Reference answer: Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position. These Works include Amazon’s required Fit-Outs, big-box installations, and custom
QA bot response: Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position.
AI score: 4.5
AI comment: The machine answer provides the key information from the standard answer – that Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position. The additional details about including Amazon’s required Fit-Outs, big-box installations, and customizations are relevant and help provide further context, so I’ve scored the machine answer 4.5 out of 5. It captures the essence of the standard answer and provides some useful extra details.
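
A hedged sketch of how these AI feedback scores might be collected programmatically with the Amazon Bedrock runtime API follows; the prompt template is the one shown earlier, and the variable names and tag parsing are assumptions rather than the exact production code:

import json
import re
import boto3

bedrock = boto3.client("bedrock-runtime")

def ai_feedback_score(question, reference_answer, machine_answer):
    # EVALUATION_PROMPT is assumed to hold the Human/Assistant prompt shown above
    prompt = EVALUATION_PROMPT.format(
        question=question,
        reference_answer=reference_answer,
        machine_answer=machine_answer,
    )
    body = json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": 512,
        "temperature": 0,
    })
    response = bedrock.invoke_model(modelId="anthropic.claude-v2", body=body)
    completion = json.loads(response["body"].read())["completion"]

    # Pull the score and reason out of the <scoring> and <reason> tags
    score = float(re.search(r"<scoring>\s*([\d.]+)\s*</scoring>", completion).group(1))
    reason = re.search(r"<reason>(.*?)</reason>", completion, re.DOTALL)
    return score, reason.group(1).strip() if reason else ""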

Out of the 274 validation questions, the supervised fine-tuned model generated 159 responses with AI scores greater than 4. We observed 60 answers with scores lower than 3; there is room to improve the overall response quality.

Feedback score before RLHF

The Amazon Engineering SMEs validated this AI feedback and acknowledged the benefits of using AI scores. Without AI feedback, the SMEs would need some time to review and analyze each LLM response to identify the cut-off answers and hallucinations, and to judge whether the LLM is returning correct contents and key concepts. AI feedback provides AI scores automatically and enables the SMEs to use filtering, sorting, and grouping to validate the scores and identify trends in the responses. This reduces the average SME’s review time by 80%.

Reinforcement learning from human and AI feedback

When the training samples are ready, we use the proximal policy optimization (PPO) algorithm to perform reinforcement learning. PPO uses a policy gradient method, which takes small steps to update the policy in the learning process, so that the learning agents can reliably reach the optimal policy network. This makes the training process more stable and reduces the possibility of divergence.

During the training, we first use the human- and AI-labeled data to build a reward model, which is used to guide the weight updates in the learning process. For this use case, we select a distilroberta-base reward model and train it with samples in the following format:

[Instruction, Chosen_response, Rejected_response]

The following is an example of a training record.

Instruction: According to the context, what is specified for inclusive and accessible design?
Chosen_response: BREEAM Credit HEA06 – inclusive and accessible design – The building is designed to be fit for purpose, appropriate and accessible by all potential users. An access strategy is developed in line with the BREEAM Check list A3
Rejected_response: The context states that

The reward model is trained with a learning rate of 1e-5. As shown in the following chart, the training converges well after 10 epochs.

RLHF training process
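
The following is a hedged sketch of the pairwise reward-model training described above; the dataset iterator and field names are assumptions based on the [Instruction, Chosen_response, Rejected_response] format, not the exact training script:

import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
optimizer = AdamW(model.parameters(), lr=1e-5)

def score(instruction, response):
    # Return a scalar reward for an (instruction, response) pair
    inputs = tokenizer(instruction, response, return_tensors="pt",
                       truncation=True, max_length=512)
    return model(**inputs).logits.squeeze(-1)

model.train()
for epoch in range(10):                      # the training converges after about 10 epochs
    for record in training_records:          # assumed iterable of labeled records
        chosen = score(record["Instruction"], record["Chosen_response"])
        rejected = score(record["Instruction"], record["Rejected_response"])
        # Pairwise ranking loss: chosen responses should score higher than rejected ones
        loss = -torch.nn.functional.logsigmoid(chosen - rejected).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()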

Then the reward model is used for reinforcement learning. For this use case, we use PEFT/LoRA to accelerate the learning. The LoRA_r is 32 and LoRA_alpha is 64. The PPO-based learning uses a learning rate of 1.41e-5 over three epochs. Reinforcement training is highly volatile and very sensitive to the label and feedback data. We use the reward_mean metric, the average reward obtained from the reward model, to observe the learning process.
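
A hedged sketch of the PPO step using the TRL library’s classic PPOTrainer API follows; the model paths, batch sizes, and prompt iterator are assumptions, and the API shown may differ in newer TRL releases:

import torch
from transformers import AutoTokenizer, pipeline
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

policy_name = "sft-mistral-7b"                 # assumed path to the supervised fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(policy_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)

config = PPOConfig(learning_rate=1.41e-5, batch_size=8, mini_batch_size=2)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# The reward model trained above, served as a text-classification pipeline
reward_pipe = pipeline("text-classification", model="reward-distilroberta")   # assumed path

for prompts in prompt_batches:                 # assumed iterable of lists of batch_size prompts
    queries = [tokenizer.encode(p, return_tensors="pt").squeeze(0) for p in prompts]
    responses = ppo_trainer.generate(queries, return_prompt=False, max_new_tokens=128)
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    # The reward_mean over these scores is the metric we monitor during training
    rewards = [torch.tensor(reward_pipe(p + t)[0]["score"]) for p, t in zip(prompts, texts)]
    stats = ppo_trainer.step(queries, responses, rewards)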

After RLAIF, we can see the number of bot responses with positive scores (above 3) increased from 78.1% to 85.5%, and those with negative scores (below 3) decreased from 21.9% to 14.5%. In total, the average AI feedback score increased by approximately 8%, from 3.9 to 4.2.

Improvement by RLAIF

You can see the improvement in the following example.

Question: For floor areas being tiled, what screed is needed?
Reference: Concrete Screed flooring is applicable only in floor-tiled areas. Refer to drawing series A-32 to see where floor-tiles are applicable.
Bot response before RLAIF: Concrete Screed flooring is applicable only in
AI score before RLAIF: 3
Bot response after RLAIF: Concrete Screed flooring is applicable only in areas being tiled. Refer to drawing series A-32 to see where tile flooring is applicable.
AI score after RLAIF: 4.5

Conclusion

This post is a continuation of our work with the Amazon EU Design and Construction team as discussed in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction. In this post, we showed how we generated human and AI feedback data to fine-tune the Mistral-7B model with reinforcement learning. The model after RLAIF provided better performance for Amazon Engineering’s question answering bot, improving the AI feedback score by 8%. In the Amazon D&C team’s pilot project, using RLAIF reduced the validation workload for SMEs by an estimated 80%. As the next step, we will scale up this solution by connecting it with Amazon Engineering’s data infrastructure, and design a framework to automate the continuous learning process with a human in the loop. We will also further improve the AI feedback quality by tuning the prompt template.

Through this process, we learned the following lessons about further improving the quality and performance of question answering tasks with RLHF and RLAIF:

  • Human validation and augmentation are essential to provide accurate and responsible outputs from LLMs. The human feedback can be used in RLHF to further improve the model response.
  • RLAIF automates the evaluation and learning cycle. The AI-generated feedback is less subjective because it doesn’t depend on a particular preference from a small pool of SMEs.
  • RLAIF is more scalable to improve the bot quality through continued reinforcement learning while minimizing the efforts required from SMEs. It is especially useful for developing domain-specific generative AI solutions within large organizations.
  • This process should be done on a regular basis, especially when new domain data is available to be covered by the solution.

In this use case, we used SageMaker JumpStart to test multiple LLMs and experiment with multiple LLM training approaches. It significantly accelerates the AI feedback and learning cycle with maximized efficiency and quality. For your own project, you can introduce the human-in-the-loop approach to collect your users’ feedback, or generate AI feedback using another LLM. Then you can follow the three-step process defined in this post to fine-tune your models using RLHF and RLAIF. We recommend experimenting with the methods using SageMaker JumpStart to speed up the process.


About the Author

Yunfei Bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Elad Dwek is a Construction Technology Manager at Amazon. With a background in construction and project management, Elad helps teams adopt new technologies and data-based processes to deliver construction projects. He identifies needs and solutions, and facilitates the development of the bespoke attributes. Elad has an MBA and a BSc in Structural Engineering. Outside of work, Elad enjoys yoga, woodworking, and traveling with his family.

Luca Cerabone is a Business Intelligence Engineer at Amazon. Drawing from his background in data science and analytics, Luca crafts tailored technical solutions to meet the unique needs of his customers, driving them towards more sustainable and scalable processes. Armed with an MSc in Data Science, Luca enjoys engaging in DIY projects, gardening and experimenting with culinary delights in his leisure moments.

Read More

Improve accuracy of Amazon Rekognition Face Search with user vectors

Improve accuracy of Amazon Rekognition Face Search with user vectors

In various industries, such as financial services, telecommunications, and healthcare, customers use a digital identity process, which usually involves several steps to verify end-users during online onboarding or step-up authentication. An example of one step that can be used is face search, which can help determine whether a new end-user’s face matches those associated with an existing account.

Building an accurate face search system involves several steps. The system must be able to detect human faces in images, extract the faces into vector representations, store face vectors in a database, and compare new faces against existing entries. Amazon Rekognition makes this effortless by giving you pre-trained models that are invoked via simple API calls.

Amazon Rekognition enables you to achieve very high face search accuracy with a single face image. In some cases, you can use multiple images of the same person’s face to create user vectors and improve accuracy even further. This is especially helpful when images have variations in lighting, poses, and appearances.

In this post, we demonstrate how to use the Amazon Rekognition Face Search APIs with user vectors to increase the similarity score for true matches and decrease the similarity score for true non-matches.

We compare the results of performing face matching with and without user vectors.

Amazon Rekognition face matching

Amazon Rekognition face matching enables measuring the similarity of a face vector extracted from one image to a face vector extracted from another image. A pair of face images is said to be a true match if both images contain the face of the same person, and a true non-match otherwise. Amazon Rekognition returns a score for the similarity of the source and target faces. The minimum similarity score is 0, implying very little similarity, and the maximum is 100.

For comparing a source face with a collection of target faces (1:N matching), Amazon Rekognition allows you to create a Collection object and populate it with faces from images using API calls.

When adding a face to a collection, Amazon Rekognition doesn’t store the actual image of the face but rather the face vector, a mathematical representation of the face. With the SearchFaces API, you can compare a source face with one or several collections of target faces.

In June 2023, AWS launched user vectors, a new capability that significantly improves face search accuracy by using multiple face images of a user. Now, you can create user vectors, which aggregate multiple face vectors of the same user. User vectors offer higher face search accuracy with more robust depictions, because they contain varying degrees of lighting, sharpness, pose, appearance, and more. This improves the accuracy compared to searching against individual face vectors.

In the following sections, we outline the process of using Amazon Rekognition user vectors. We guide you through creating a collection, storing face vectors in that collection, aggregating those face vectors into user vectors, and then comparing the results of searching against those individual face vectors and user vectors.

Solution overview

For this solution, we use an Amazon Rekognition collection of users, each with their associated indexed face vectors created from a number of different face images.

Let’s look at the workflow to build a collection with users and faces:

  1. Create an Amazon Rekognition collection.
  2. For each user, create a user in the collection.
  3. For each image of the user, add the face to the collection (IndexFaces, which returns face ID corresponding to each face vector).
  4. Associate all indexed face IDs with the user (this is necessary for user vectors).

Then, we will compare the following workflows:

Searching with a new given input image against individual face vectors in our collection:

  1. Get all faces from an image (DetectFaces).
  2. For each face, compare against individual faces in our collection (SearchFacesByImage).

Searching with a new given input image against user vectors in our collection:

  1. Get all faces from an image (DetectFaces).
  2. For each face, compare to the user vector (SearchUsersByImage).

Now let’s describe the solution in detail.

Prerequisites

Add the following policy to your AWS Identity and Access Management (IAM) user or role. The policy grants you permission to the relevant Amazon Rekognition APIs and allows access to an Amazon Simple Storage Service (Amazon S3) bucket to store the images:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RekognitionPermissions",
            "Effect": "Allow",
            "Action": [
                "rekognition:CreateCollection",
                "rekognition:DeleteCollection",
                "rekognition:CreateUser",
                "rekognition:IndexFaces",
                "rekognition:DetectFaces",
                "rekognition:AssociateFaces",
                "rekognition:SearchUsersByImage",
                "rekognition:SearchFacesByImage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3BucketPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<replace_with_your_bucket>/*",
                "arn:aws:s3:::<replace_with_your_bucket>"
            ]
        }
    ]
}

Create an Amazon Rekognition collection and add users and faces

First, we create an S3 bucket to store users’ images. We organize the bucket by creating a folder for each user that contains their personal images. Our images folder looks like the following structure:

── images
│   ├── photo.jpeg
│   ├── Swami
│   │   ├── Swami1.jpeg
│   │   └── Swami2.jpeg
│   └── Werner
│       ├── Werner1.jpeg
│       ├── Werner2.jpeg
│       └── Werner3.jpeg

Our S3 bucket has a directory for each user that stores their images. There are currently two folders, and each contains several images. You can add more folders for your users, each containing one or more images to be indexed.

Next, we create our Amazon Rekognition collection. We have supplied helpers.py, which contains different methods that we use:

  • create_collection – Create a new collection
  • delete_collection – Delete a collection
  • create_user – Create a new user in a collection
  • add_faces_to_collection – Add faces to collection
  • associate_faces – Associate face_ids to a user in a collection
  • get_subdirs – Get all subdirectories under the S3 prefix
  • get_files – Get all files under the S3 prefix
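
Two of these helpers are thin wrappers around the Amazon S3 list API. The following is a hedged sketch of how they might be implemented; defer to the repository’s helpers.py for the authoritative version:

import boto3

s3 = boto3.client("s3")

def get_subdirs(bucket, prefix):
    # Return the immediate "folder" names under an S3 prefix (one per user)
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    return [cp["Prefix"][len(prefix):].rstrip("/")
            for cp in response.get("CommonPrefixes", [])]

def get_files(bucket, prefix):
    # Return the object keys (image files) under an S3 prefix
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])
            if not obj["Key"].endswith("/")]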

The following is an example method for creating an Amazon Rekognition collection:

import boto3
session = boto3.Session()
client = session.client('rekognition')

def create_collection(collection_id):
    try:
        # Create a collection
        print('Creating collection:' + collection_id)
        response = client.create_collection(CollectionId=collection_id)
        print('Collection ARN: ' + response['CollectionArn'])
        print('Status code: ' + str(response['StatusCode']))
        print('Done...')
    except client.exceptions.ResourceAlreadyExistsException:
        print('Resource already exists...')

Create the collection with the following code:

import helpers
collection_id = "faces-collection"
helpers.create_collection(collection_id)

Next, let’s add the face vectors into our collection and aggregate them into user vectors.

For each user in the S3 directory, we create a user vector in the collection. Then we index the face images for each user into the collection as individual face vectors, which generates face IDs. Lastly, we associate the face IDs to the appropriate user vector.

This creates two types of vectors in our collection:

  • Individual face vectors
  • User vectors, which are built based on the face vector IDs supplied using the method associate_faces

See the following code:

bucket = '<replace_with_your_bucket>'
prefix = 'images/'

# Get all the users directories from s3 containing the images
folder_list = helpers.get_subdirs(bucket, prefix)
print(f"Found users folders: {folder_list}")
print()

for user_id in folder_list:
    face_ids = []
    helpers.create_user(collection_id, user_id)
    # Get all files per user under the s3 user directory
    images = helpers.get_files(bucket, prefix + user_id + "/")
    print (f"Found images={images} for {user_id}")
    for image in images:
        face_id = helpers.add_faces_to_collection(bucket, image, collection_id)
        face_ids.append(face_id)
    helpers.associate_faces(collection_id, user_id, face_ids)
    print()

We use the following methods and variables:

  • get_subdirs – Returns a list of all the users’ directories. In our example, the value is [Swami,Werner].
  • get_files – Returns all the images files under the S3 prefix for the user.
  • face_ids – This is a list containing all the face IDs belonging to a user. We use this list when calling the AssociateFaces API.

As explained earlier, you can add more users by adding folders for them (the folder dictates the user ID) and add your images in that folder (no ordering is required for the files).
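
The remaining helpers wrap individual Amazon Rekognition API calls. The following is a hedged sketch of how create_user, add_faces_to_collection, and associate_faces might look; again, the repository’s helpers.py is the authoritative implementation:

import boto3

client = boto3.Session().client("rekognition")

def create_user(collection_id, user_id):
    # Create a user entry (user vector container) in the collection
    client.create_user(CollectionId=collection_id, UserId=user_id)
    print(f"Created user {user_id}")

def add_faces_to_collection(bucket, key, collection_id):
    # Index the largest face in an S3 image and return its face ID
    response = client.index_faces(
        CollectionId=collection_id,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxFaces=1,
        QualityFilter="AUTO",
    )
    face_id = response["FaceRecords"][0]["Face"]["FaceId"]
    print(f"Indexed {key} as face {face_id}")
    return face_id

def associate_faces(collection_id, user_id, face_ids):
    # Associate the indexed face IDs with the user to build the user vector
    client.associate_faces(
        CollectionId=collection_id, UserId=user_id, FaceIds=face_ids
    )
    print(f"Associated {len(face_ids)} faces with user {user_id}")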

Now that our environment is set up and we have both individual face vectors and user vectors, let’s compare our search quality against each of them. To do that, we use a new photo with multiple people and attempt to match their faces against our collection, first against the individual face vectors and then against the user vectors.

Face search of image against a collection of individual face vectors

To search against our individual face vectors, we use the Amazon Rekognition SearchFacesByImage API. This function uses a source face image to search against individual face vectors in our collection and returns faces that match our defined similarity score threshold.

An important consideration is that the SearchFacesByImage API will only operate on the largest face detected in the image. If multiple faces are present, you need to crop each individual face and pass it separately to the method for identification.

For extracting faces details from an image (such as their location on the image), we use the Amazon Rekognition DetectFaces API.

The following detect_faces_in_image method detects faces in an image. For each face, it performs the following actions:

  • Print its bounding box location
    • Crop the face from the image, check whether the face exists in the collection, and print the matched user or ‘Unknown’
  • Print the similarity score

The example Python code uses the Pillow library for doing the image manipulations (such as printing, drawing, and cropping).

We use a similarity score threshold of 99%, which is a common setting for identity verification use cases.
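
The full detect_faces_in_image implementation is in the repository; the per-face search call it relies on might look like the following minimal sketch, in which the cropped face image and the Rekognition client setup are assumptions:

import io
import boto3

client = boto3.Session().client('rekognition')

# Assumes `face` is a PIL.Image crop of a single detected face
buf = io.BytesIO()
face.save(buf, format="JPEG")

response = client.search_faces_by_image(
    CollectionId=collection_id,
    Image={'Bytes': buf.getvalue()},
    FaceMatchThreshold=99,
    MaxFaces=1
)
if response['FaceMatches']:
    match = response['FaceMatches'][0]
    print(f"Face {match['Face']['FaceId']} matched with similarity {match['Similarity']:.2f}%")
else:
    print("Unknown")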

Run the following code:

import detect_users
from PIL import Image

# The image we would like to match faces against our collection.
file_key= "images/photo.jpeg"

img = detect_users.detect_faces_in_image(
    bucket, 
    file_key, 
    collection_id, 
    threshold=99
)
img.show() # or in Jupyter use display(img)

file_key is the S3 object key we want to match against our collection. We have supplied an example image (photo.jpeg) under the images folder.

The following image shows our results.

Using a threshold of 99%, only one person was identified. Dr. Werner Vogels was flagged as Unknown. If we run the same code using a lower threshold of 90 (set threshold=90), we get the following results.

Now we see Dr. Werner Vogels’ face has a similarity score of 96.86%. Next, let’s check if we can get the similarity score above our defined threshold by using user vectors.

Face search of image against a collection of user vectors

To search against our user vectors, we use the Amazon Rekognition SearchUsersByImage API. This function uses a source face image to search against user vectors in our collection and returns users that match our defined similarity score threshold.

The same consideration is relevant here – the SearchUsersByImage API will only operate on the largest face detected in the image. If there are multiple faces present, you need to crop each individual face and pass it separately to the method for identification.

For extracting faces details from an image (such as their location on the image), we use the Amazon Rekognition DetectFaces API.

The following detect_users_in_image method detects faces in an image. For each face, it performs the following actions:

  • Print its bounding box location
    • Crop the face from the image, check whether the user’s face exists in our collection, and print the user or ‘Unknown’
  • Print the similarity score

See the following code:

import boto3
import io
import math
from PIL import Image, ImageDraw, ImageFont

def detect_users_in_image(bucket, key, collection_id, threshold=80):

    session = boto3.Session()
    client = session.client('rekognition')

    # Load image from S3 bucket
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, key)
    s3_response = s3_object.get()

    stream = io.BytesIO(s3_response['Body'].read())
    image = Image.open(stream)

    # Call DetectFaces to find faces in image
    response = client.detect_faces(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        Attributes=['ALL']
    )

    imgWidth, imgHeight = image.size
    draw = ImageDraw.Draw(image)

    # Calculate and display bounding boxes for each detected face
    for faceDetail in response['FaceDetails']:
        print('The detected face is between ' + str(faceDetail['AgeRange']['Low'])
              + ' and ' + str(faceDetail['AgeRange']['High']) + ' years old')

        box = faceDetail['BoundingBox']
        left = imgWidth * box['Left']
        top = imgHeight * box['Top']
        width = imgWidth * box['Width']
        height = imgHeight * box['Height']

        print('Left: ' + '{0:.0f}'.format(left))
        print('Top: ' + '{0:.0f}'.format(top))
        print('Face Width: ' + "{0:.0f}".format(width))
        print('Face Height: ' + "{0:.0f}".format(height))

        points = (
            (left, top),
            (left + width, top),
            (left + width, top + height),
            (left, top + height),
            (left, top)
        )

        # Crop the face box and convert it to byte array
        face = image.crop((left, top, left + width, top + height))
        imgByteArr = image_to_byte_array(face, image.format)

        # Search for a user in our collection using the cropped image
        user_response = client.search_users_by_image(
            CollectionId=collection_id,
            Image={'Bytes': imgByteArr},
            UserMatchThreshold=threshold
        )
        # print (user_response)

        # Extract user id and the similarity from the response
        if (user_response['UserMatches']):
            similarity = user_response['UserMatches'][0]['Similarity']
            similarity = (math.trunc(similarity * 100) / 100) if isinstance(similarity, float) else similarity
            user_id = user_response['UserMatches'][0]['User']['UserId']
            print(f"User {user_id} was found, similarity of {similarity}%")
            print("")
        else:
            user_id = "Unknown"
            similarity = 0

        draw.line(points, fill='#00d400', width=4)
        font = ImageFont.load_default(size=25)
        draw.text((left, top - 30), user_id, fill='#00d400', font=font)
        if similarity > 0:
            draw.text((left, top + 1), str(similarity), fill='#00d400', font=font)

    return image

The function returns a modified image with the results that can be saved to Amazon S3 or printed. The function also outputs statistics about the estimated ages of the faces to the terminal.

Run the following code:

import detect_users
from PIL import Image

# The image we would like to match faces against our collection.
file_key= "images/photo.jpeg"

img = detect_users.detect_users_in_image(
    bucket, 
    file_key, 
    collection_id, 
    threshold=99
)
img.show() # or in Jupyter use display(img)

The following image shows our results.

The users that exist in our collection were identified correctly with high similarity (over 99%).

We were able to increase the similarity score by using three face vectors per user vector. As we increase the number of face vectors used, we expect the similarity score for true matches to also increase. You can use up to 100 face vectors per user vector.

End-to-end example code can be found in the GitHub repository. It includes a detailed Jupyter notebook that you can run on Amazon SageMaker Studio (or other alternatives).

Clean up

To delete the collection, use the following code:

helpers.delete_collection(collection_id)

Conclusion

In this post, we presented how to use Amazon Rekognition user vectors to implement face search against a collection of users’ faces. We demonstrated how to improve face search accuracy by using multiple face images per user and compared it against individual face vectors. Additionally, we described how you can use the different Amazon Rekognition APIs to detect faces. The provided example code serves as a solid foundation for constructing a functional face search system.

For more information about Amazon Rekognition user vectors, refer to Searching faces in a collection. If you’re new to Amazon Rekognition, you can use our Free Tier, which lasts 12 months and includes processing 5,000 images per month and storing 1,000 user vector objects per month.


About the Authors

Arik Porat is a Senior Startups Solutions Architect at Amazon Web Services. He works with startups to help them build and design their solutions in the cloud, and is passionate about machine learning and container-based solutions. In his spare time, Arik likes to play chess and video games.

Eliran Efron is a Startups Solutions Architect at Amazon Web Services. Eliran is a data and compute enthusiast, assisting startups designing their system architectures. In his spare time, Eliran likes to build and race cars in Touring races and build IoT devices.

Read More

Accelerate ML workflows with Amazon SageMaker Studio Local Mode and Docker support

Accelerate ML workflows with Amazon SageMaker Studio Local Mode and Docker support

We are excited to announce two new capabilities in Amazon SageMaker Studio that will accelerate iterative development for machine learning (ML) practitioners: Local Mode and Docker support. ML model development often involves slow iteration cycles as developers switch between coding, training, and deployment. Each step requires waiting for remote compute resources to start up, which delays validating implementations and getting feedback on changes.

With Local Mode, developers can now train and test models, debug code, and validate end-to-end pipelines directly on their SageMaker Studio notebook instance without the need for spinning up remote compute resources. This reduces the iteration cycle from minutes down to seconds, boosting developer productivity. Docker support in SageMaker Studio notebooks enables developers to effortlessly build Docker containers and access pre-built containers, providing a consistent development environment across the team and avoiding time-consuming setup and dependency management.

Local Mode and Docker support offer a streamlined workflow for validating code changes and prototyping models using local containers running on a SageMaker Studio notebook instance. In this post, we guide you through setting up Local Mode in SageMaker Studio, running a sample training job, and deploying the model on an Amazon SageMaker endpoint from a SageMaker Studio notebook.

SageMaker Studio Local Mode

SageMaker Studio introduces Local Mode, enabling you to run SageMaker training, inference, batch transform, and processing jobs directly on your JupyterLab, Code Editor, or SageMaker Studio Classic notebook instances without requiring remote compute resources. Benefits of using Local Mode include:

  • Instant validation and testing of workflows right within integrated development environments (IDEs)
  • Faster iteration through local runs for smaller-scale jobs to inspect outputs and identify issues early
  • Improved development and debugging efficiency by eliminating the wait for remote training jobs
  • Immediate feedback on code changes before running full jobs in the cloud

The following figure illustrates the workflow using Local Mode on SageMaker.

Workflow using Local Mode on SageMaker

To use Local Mode, set instance_type='local' when running SageMaker Python SDK jobs such as training and inference. This will run them on the instances used by your SageMaker Studio IDEs instead of provisioning cloud resources.

Although certain capabilities such as distributed training are only available in the cloud, Local Mode removes the need to switch contexts for quick iterations. When you’re ready to take advantage of the full power and scale of SageMaker, you can seamlessly run your workflow in the cloud.
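
For example, a Local Mode training and deployment flow with the SageMaker Python SDK might look like the following hedged sketch; the entry point script, framework version, and sample input are assumptions:

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

# instance_type='local' runs the job in a container on the Studio instance itself
estimator = PyTorch(
    entry_point="train.py",          # assumed training script
    role=role,
    framework_version="2.1",
    py_version="py310",
    instance_count=1,
    instance_type="local",
)
estimator.fit({"training": "file://./data"})   # local data; an S3 URI also works

# Deploy to a local endpoint backed by a container on the same instance
predictor = estimator.deploy(initial_instance_count=1, instance_type="local")
print(predictor.predict(sample_input))          # sample_input is an assumed test payload
predictor.delete_endpoint()                     # stop the local container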

Docker support in SageMaker Studio

SageMaker Studio now also enables building and running Docker containers locally on your SageMaker Studio notebook instance. This new feature allows you to build and validate Docker images in SageMaker Studio before using them for SageMaker training and inference.

The following diagram illustrates the high-level Docker orchestration architecture within SageMaker Studio.

high-level Docker orchestration architecture within SageMaker Studio

With Docker support in SageMaker Studio, you can:

  • Build Docker containers with integrated models and dependencies directly within SageMaker Studio
  • Eliminate the need for external Docker build processes to simplify image creation
  • Run containers locally to validate functionality before deploying models to production
  • Reuse local containers when deploying to SageMaker for training and hosting

Although some advanced Docker capabilities like multi-container and custom networks are not supported as of this writing, the core build and run functionality is available to accelerate developing containers for bring your own container (BYOC) workflows.

Prerequisites

To use Local Mode in SageMaker Studio applications, you must complete the following prerequisites:

  • For pulling images from Amazon Elastic Container Registry (Amazon ECR), the account hosting the ECR image must provide access permission to the user’s Identity and Access Management (IAM) role. The domain’s role must also allow Amazon ECR access.
  • To enable Local Mode and Docker capabilities, you must set the EnableDockerAccess parameter to ENABLED for the domain’s DockerSettings using the AWS Command Line Interface (AWS CLI). This allows users in the domain to use Local Mode and Docker features. By default, Local Mode and Docker are disabled in SageMaker Studio. Any existing SageMaker Studio apps will need to be restarted for the Docker service update to take effect. The following is an example AWS CLI command for updating a SageMaker Studio domain:
aws sagemaker --region <REGION> \
update-domain --domain-id <DOMAIN-ID> \
--domain-settings-for-update '{"DockerSettings": {"EnableDockerAccess": "ENABLED"}}'
  • You need to update the SageMaker IAM role in order to be able to push Docker images to Amazon ECR:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:CompleteLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:InitiateLayerUpload",
        "ecr:BatchCheckLayerAvailability",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:us-east-2:123456789012:repository/<repositoryname>"
    },
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    }
  ]
}

Run Python files in SageMaker Studio spaces using Local Mode

SageMaker Studio JupyterLab and Code Editor (based on Code-OSS, Visual Studio Code – Open Source) extend SageMaker Studio so you can write, test, debug, and run your analytics and ML code using these popular lightweight IDEs. For more details on how to get started with SageMaker Studio IDEs, refer to Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools and New – Code Editor, based on Code-OSS VS Code Open Source now available in Amazon SageMaker Studio. Complete the following steps:

  • Create a new Code Editor or JupyterLab space called my-sm-code-editor-space or my-sm-jupyterlab-space, respectively.
  • Choose Create space.
  • Choose the ml.m5.large instance and set storage to 32 GB.
  • Choose Run space.
  • Open the JupyterLab or Code Editor space.
  • Clone the GitHub repo, with /home/sagemaker-user/ as the target folder.


  • Create a new terminal.
  • Install the Docker CLI and Docker Compose plugin following the instructions in the following GitHub repo. If chained commands fail, run the commands one at a time.

You must update the SageMaker SDK to the latest version.

  • Run pip install sagemaker -Uq in the terminal.

For Code Editor only, you need to set the Python environment to run in the current terminal.

  • In Code Editor, on the File menu, choose Preferences and Settings.


  • Search for and select Terminal: Execute in File Dir.


  • In Code Editor or JupyterLab, open the scikit_learn_script_mode_local_training_and_serving folder and run the scikit_learn_script_mode_local_training_and_serving.py file.

You can run the script by choosing Run in Code Editor or using the CLI in a JupyterLab terminal. You will be able to see how the model is trained locally. Then you deploy the model to a SageMaker endpoint locally and calculate the root mean square error (RMSE).

Simulate training and inference in SageMaker Studio Classic using Local Mode

You can also use a notebook in SageMaker Studio Classic to run a small-scale training job on CIFAR10 using Local Mode, deploy the model locally, and perform inference.

Set up your notebook

To set up the notebook, complete the following steps:

  • Open SageMaker Studio Classic and clone the following GitHub repo.


  • Open the pytorch_local_mode_cifar10.ipynb notebook in blog/pytorch_cnn_cifar10.


  • For Image, choose PyTorch 2.1.0 Python 3.10 CPU Optimized.

Confirm that your notebook shows the correct instance and kernel selection.

  • Open a terminal by choosing Launch Terminal in the current SageMaker image.


  • Install the Docker CLI and Docker Compose plugin following the instructions in the following GitHub repo.

Because you’re using Docker from SageMaker Studio Classic, remove sudo when running commands; the terminal already runs under the superuser. For SageMaker Studio Classic, the installation commands depend on the SageMaker Studio app image OS. For example, DLC-based framework images are Ubuntu based, in which the following instructions would work. However, for a Debian-based image like DataScience Images, you must follow the instructions in the following GitHub repo. If chained commands fail, run the commands one at a time. You should see the Docker version displayed.

  • Leave the terminal window open, go back to the notebook, and start running it cell by cell.

Make sure to run the cell with pip install -U sagemaker so you’re using the latest version of the SageMaker Python SDK.

Local training

When you start running the local SageMaker training job, you will see the following log lines:

INFO:sagemaker.local.image:'Docker Compose' found using Docker CLI.
INFO:sagemaker.local.local_session:Starting training job

This indicates that the training was running locally using Docker.


Be patient while the pytorch-training:2.1-cpu-py310 Docker image is pulled. Due to its large size (5.2 GB), it could take a few minutes.

Docker images will be stored in the SageMaker Studio app instance’s root volume, which is not accessible to end-users. The only way to access and interact with Docker images is via the exposed Docker API operations.

From a user confidentiality standpoint, the SageMaker Studio platform never accesses or stores user-specific images.

When the training is complete, you’ll be able to see the following success log lines:

8zlz1zbfta-sagemaker-local exited with code 0
Aborting on container exit...
Container 8zlz1zbfta-sagemaker-local  Stopping
Container 8zlz1zbfta-sagemaker-local  Stopped
INFO:sagemaker.local.image:===== Job Complete =====


Local inference

Complete the following steps:

  • Deploy the SageMaker endpoint using SageMaker Local Mode.

Be patient while the pytorch-inference:2.1-cpu-py310 Docker image is pulled. Due to its large size (4.32 GB), it could take a few minutes.


  • Invoke the SageMaker endpoint deployed locally using the test images.


You will be able to see the predicted classes: frog, ship, car, and plane:

Predicted:  frog ship  car plane


  • Because the SageMaker Local endpoint is still up, navigate back to the open terminal window and list the running containers:

docker ps

You’ll be able to see the running pytorch-inference:2.1-cpu-py310 container backing the SageMaker endpoint.


  • Because you can only run one local endpoint at a time, run the cleanup code to shut down the SageMaker local endpoint and stop the running container.


  • To make sure the Docker container is down, you can navigate to the opened terminal window, run docker ps, and make sure there are no running containers.
  • If you see a container running, run docker stop <CONTAINER_ID> to stop it.

Tips for using SageMaker Local Mode

If you’re using SageMaker for the first time, refer to Train machine learning models. To learn more about deploying models for inference with SageMaker, refer to Deploy models for inference.

Keep in mind the following recommendations:

  • Print input and output files and folders to understand dataset and model loading
  • Use 1–2 epochs and small datasets for quick testing
  • Pre-install dependencies in a Dockerfile to optimize environment setup
  • Isolate serialization code in endpoints for debugging

Configure Docker installation as a Lifecycle Configuration

You can define the Docker install process as a Lifecycle Configuration (LCC) script to simplify setup each time a new SageMaker Studio space starts. LCCs are scripts that SageMaker runs during events like space creation. Refer to the JupyterLab, Code Editor, or SageMaker Studio Classic LCC setup (using docker install cli as reference) to learn more.
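
As a hedged sketch, you could register such a script as a JupyterLab lifecycle configuration with boto3 as follows; the script path and configuration name are assumptions:

import base64
import boto3

sm = boto3.client("sagemaker")

# install-docker.sh is an assumed local copy of the Docker install script
with open("install-docker.sh", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

sm.create_studio_lifecycle_config(
    StudioLifecycleConfigName="install-docker-cli",
    StudioLifecycleConfigContent=content,
    StudioLifecycleConfigAppType="JupyterLab",
)

You then attach the lifecycle configuration to your domain or user profile settings so that it runs each time the space starts.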

Build and test custom Docker images in SageMaker Studio spaces

In this step, you install Docker inside the JupyterLab (or Code Editor) app space and use Docker to build, test, and publish custom Docker images with SageMaker Studio spaces. Spaces are used to manage the storage and resource needs of some SageMaker Studio applications. Each space has a 1:1 relationship with an instance of an application. Every supported application that is created gets its own space. To learn more about SageMaker spaces, refer to Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools. Make sure you provision a new space with at least 30 GB of storage to allow sufficient storage for Docker images and artifacts.

Install Docker inside a space

To install the Docker CLI and Docker Compose plugin inside a JupyterLab space, run the commands in the following GitHub repo. SageMaker Studio only supports Docker version 20.10.X.

Build Docker images

To confirm that Docker is installed and working inside your JupyterLab space, run the following code:

# to verify docker service
sagemaker-user@default:~$ docker version
Client: Docker Engine - Community
Version:           24.0.7
API version:       1.41 (downgraded from 1.43)
Go version:        go1.20.10
Git commit:        afdd53b
Built:             Thu Oct 26 09:07:41 2023
OS/Arch:           linux/amd64
Context:           default

Server:
Engine:
Version:          20.10.25
API version:      1.41 (minimum version 1.12)
Go version:       go1.20.10
Git commit:       5df983c
Built:            Fri Oct 13 22:46:59 2023
OS/Arch:          linux/amd64
Experimental:     false
containerd:
Version:          1.7.2
GitCommit:        0cae528dd6cb557f7201036e9f43420650207b58
runc:
Version:          1.1.7
GitCommit:        f19387a6bec4944c770f7668ab51c4348d9c2f38
docker-init:
Version:          0.19.0
GitCommit:        de40ad0

To build a custom Docker image inside a JupyterLab (or Code Editor) space, complete the following steps:

  • Create an empty Dockerfile:

touch Dockerfile

  • Edit the Dockerfile with the following commands, which create a simple Flask web server image from the base python:3.10.13-bullseye image hosted on Docker Hub:

# Use the specified Python base image
FROM python:3.10.13-bullseye

# Create a code dir
RUN mkdir /code/

# Set the working directory in the container
WORKDIR /code

# Upgrade pip and install required packages
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install flask

# Copy the app.py file to the container
COPY app.py /code/

# Set the command to run the app
ENTRYPOINT ["python", "app.py"]

The following code shows the contents of an example flask application file app.py:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/')
def hello():
    return jsonify({"response": "Hello"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=6006)

Additionally, you can update the reference Dockerfile commands to include packages and artifacts of your choice.

  • Build a Docker image using the reference Dockerfile:

docker build --network sagemaker --tag myflaskapp:v1 --file ./Dockerfile .

Include --network sagemaker in your docker build command; otherwise, the build will fail. Containers can’t run on Docker’s default bridge network or on custom Docker networks. They run on the same network as the SageMaker Studio application container, and sagemaker is the only network name you can use.

  • When your build is complete, validate that the image exists, re-tag it as an Amazon ECR image, and push it. If you run into permission issues, run the aws ecr get-login-password… command and retry the Docker push/pull:
sagemaker-user@default:~$ docker image list
REPOSITORY      TAG       IMAGE ID       CREATED          SIZE
myflaskapp      v1        d623f1538f20   27 minutes ago   489MB

sagemaker-user@default:~$ docker tag myflaskapp:v1 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1

sagemaker-user@default:~$ docker image list
REPOSITORY                                                  TAG       IMAGE ID       CREATED          SIZE
123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp     v1        d623f1538f20   27 minutes ago   489MB
myflaskapp                                                  v1        d623f1538f20   27 minutes ago   489MB

sagemaker-user@default:~$ aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com

sagemaker-user@default:~$ docker push 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1

Test Docker images

Having Docker installed inside a JupyterLab (or Code Editor) SageMaker Studio space allows you to test pre-built or custom Docker images as containers (or containerized applications). In this section, we use the docker run command to provision Docker containers inside a SageMaker Studio space to test containerized workloads like REST web services and Python scripts. Complete the following steps:

  • List the local images to check whether the test image exists:

sagemaker-user@default:~$ docker image list
REPOSITORY                                                  TAG       IMAGE ID       CREATED       SIZE

  • If the test image doesn’t exist, run docker pull to pull the image into your local machine:

sagemaker-user@default:~$ docker pull 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1

  • If you encounter authentication issues, run the following command:

aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com

  • Create a container to test your workload:

docker run --network sagemaker 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1

This spins up a new container instance and runs the application defined using Docker’s ENTRYPOINT:

sagemaker-user@default:~$ docker run --network sagemaker 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1
* Serving Flask app 'app'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:6006
* Running on http://169.255.255.2:6006
  • To test if your web endpoint is active, navigate to the URL https://<sagemaker-space-id>.studio.us-east-2.sagemaker.aws/jupyterlab/default/proxy/6006/.

You should see a JSON response similar to the following screenshot.
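
If you want to check the response from a notebook or terminal inside the same space instead, a request against the container address printed in the docker run output may work; this is an optional check and depends on the space’s Docker networking:

import requests

# The address below is the one reported by the docker run output in this example
response = requests.get("http://169.255.255.2:6006/")
print(response.json())  # expected: {"response": "Hello"}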

Clean up

To avoid incurring unnecessary charges, delete the resources that you created while running the examples in this post:

  1. In your SageMaker Studio domain, choose Studio Classic in the navigation pane, then choose Stop.
  2. In your SageMaker Studio domain, choose JupyterLab or Code Editor in the navigation pane, choose your app, and then choose Stop.

Conclusion

SageMaker Studio Local Mode and Docker support empower developers to build, test, and iterate on ML implementations faster without leaving their workspace. By providing instant access to test environments and outputs, these capabilities optimize workflows and improve productivity. Try out SageMaker Studio Local Mode and Docker support using our quick onboard feature, which allows you to spin up a new domain for single users within minutes. Share your thoughts in the comments section!


About the Authors

Shweta Singh is a Senior Product Manager on the Amazon SageMaker Machine Learning (ML) platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles at Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Master of Science in Financial Engineering, both from New York University.

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect at AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate generative AI and machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy, and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state-of-the-art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

Mufaddal Rohawala is a Software Engineer at AWS. He works on the SageMaker Python SDK library for Amazon SageMaker. In his spare time, he enjoys travel and outdoor activities, and is a soccer fan.

Read More

Significant new capabilities make it easier to use Amazon Bedrock to build and scale generative AI applications – and achieve impressive results

Significant new capabilities make it easier to use Amazon Bedrock to build and scale generative AI applications – and achieve impressive results

We introduced Amazon Bedrock to the world a little over a year ago, delivering an entirely new way to build generative artificial intelligence (AI) applications. With the broadest selection of first- and third-party foundation models (FMs) as well as user-friendly capabilities, Amazon Bedrock is the fastest and easiest way to build and scale secure generative AI applications. Now tens of thousands of customers are using Amazon Bedrock to build and scale impressive applications. They are innovating quickly, easily, and securely to advance their AI strategies. And we’re supporting their efforts by enhancing Amazon Bedrock with exciting new capabilities including even more model choice and features that make it easier to select the right model, customize the model for a specific use case, and safeguard and scale generative AI applications.

Customers across diverse industries from finance to travel and hospitality to healthcare to consumer technology are making remarkable progress. They are realizing real business value by quickly moving generative AI applications into production to improve customer experiences and increase operational efficiency. Consider the New York Stock Exchange (NYSE), the world’s largest capital market processing billions of transactions each day. NYSE is leveraging Amazon Bedrock’s choice of FMs and cutting-edge generative AI capabilities across several use cases, including the processing of thousands of pages of regulations to provide answers in easy-to-understand language.

Global airline United Airlines modernized their Passenger Service System to translate legacy passenger reservation codes into plain English so that agents can provide swift and efficient customer support. LexisNexis Legal & Professional, a leading global provider of information and analytics, developed a personalized legal generative AI assistant on Lexis+ AI. LexisNexis customers receive trusted results two times faster than the nearest competing product and can save up to five hours per week for legal research and summarization. And HappyFox, an online help desk software, selected Amazon Bedrock for its security and performance, boosting the efficiency of its AI-powered automated ticket system in its customer support solution by 40% and agent productivity by 30%.

And across Amazon, we are continuing to innovate with generative AI to deliver more immersive, engaging experiences for our customers. Just last week Amazon Music announced Maestro. Maestro is an AI playlist generator powered by Amazon Bedrock that gives Amazon Music subscribers an easier, more fun way to create playlists based on prompts. Maestro is now rolling out in beta to a small number of U.S. customers on all tiers of Amazon Music.

With Amazon Bedrock, we’re focused on the key areas that customers need to build production-ready, enterprise-grade generative AI applications at the right cost and speed. Today I’m excited to share new features that we’re announcing across the areas of model choice, tools for building generative AI applications, and privacy and security.

1. Amazon Bedrock expands model choice with Llama 3 models and helps you find the best model for your needs

In these early days, customers are still learning and experimenting with different models to determine which ones to use for various purposes. They want to be able to easily try the latest models, and test which capabilities and features will give them the best results and cost characteristics for their use cases. The majority of Amazon Bedrock customers use more than one model, and Amazon Bedrock provides the broadest selection of first- and third-party large language models (LLMs) and other FMs. This includes models from AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, and Stability AI, as well as our own Amazon Titan models. In fact, Joel Hron, head of AI and Thomson Reuters Labs at Thomson Reuters recently said this about their adoption of Amazon Bedrock, “Having the ability to use a diverse range of models as they come out was a key driver for us, especially given how quickly this space is evolving.” The cutting-edge models of the Mistral AI model family including Mistral 7B, Mixtral 8x7B, and Mistral Large have customers excited about their high performance in text generation, summarization, Q&A, and code generation. Since we introduced the Anthropic Claude 3 model family, thousands of customers have experienced how Claude 3 Haiku, Sonnet, and Opus have established new benchmarks across cognitive tasks with unrivaled intelligence, speed, and cost-efficiency. After the initial evaluation using Claude 3 Haiku and Opus in Amazon Bedrock, BlueOcean.ai, a brand intelligence platform, saw a cost reduction of over 50% when they were able to consolidate four separate API calls into a single, more efficient call.

Masahiro Oba, General Manager, Group Federated Governance of DX Platform at Sony Group Corporation shared,

“While there are many challenges with applying generative AI to the business, Amazon Bedrock’s diverse capabilities help us to tailor generative AI applications to Sony’s business. We are able to take advantage of not only the powerful LLM capabilities of Claude 3, but also capabilities that help us safeguard applications at the enterprise-level. I’m really proud to be working with the Bedrock team to further democratize generative AI within the Sony Group.”

I recently sat down with Aaron Linsky, CTO of Artificial Investment Associate Labs at Bridgewater Associates, a premier asset management firm, where they are using generative AI to enhance their “Artificial Investment Associate,” a major leap forward for their customers. It builds on their experience of giving rules-based expert advice for investment decision-making. With Amazon Bedrock, they can use the best available FMs, such as Claude 3, for different tasks, combining fundamental market understanding with the flexible reasoning capabilities of AI. Amazon Bedrock allows for seamless model experimentation, enabling Bridgewater to build a powerful, self-improving investment system that marries systematic advice with cutting-edge capabilities, creating an evolving, AI-first process.

To bring even more model choice to customers, today, we are making Meta Llama 3 models available in Amazon Bedrock. The Llama 3 8B and Llama 3 70B models are designed for building, experimenting, and responsibly scaling generative AI applications. These models incorporate significant improvements over the previous model architecture, including scaled-up pretraining and refined instruction fine-tuning approaches. Llama 3 8B excels in text summarization, classification, sentiment analysis, and translation, ideal for limited resources and edge devices. Llama 3 70B shines in content creation, conversational AI, language understanding, R&D, enterprises, accurate summarization, nuanced classification/sentiment analysis, language modeling, dialogue systems, code generation, and instruction following. Read more about Meta Llama 3 now available in Amazon Bedrock.
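
To experiment with Llama 3 from code, you can call it through the Bedrock runtime API. The following minimal boto3 sketch assumes model access has been granted in your account and Region; the model ID and request fields follow Meta’s request schema on Amazon Bedrock and should be confirmed in the documentation:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "prompt": "Summarize the benefits of managed foundation models in two sentences.",
    "max_gen_len": 256,
    "temperature": 0.5,
    "top_p": 0.9,
}

response = bedrock_runtime.invoke_model(
    modelId="meta.llama3-8b-instruct-v1:0",  # verify the exact model ID in your Region
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["generation"])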

We are also announcing support coming soon for Cohere’s Command R and Command R+ enterprise FMs. These models are highly scalable and optimized for long-context tasks like retrieval-augmented generation (RAG) with citations to mitigate hallucinations, multi-step tool use for automating complex business tasks, and support for 10 languages for global operations. Command R+ is Cohere’s most powerful model optimized for long-context tasks, while Command R is optimized for large-scale production workloads. With the Cohere models coming soon in Amazon Bedrock, businesses can build enterprise-grade generative AI applications that balance strong accuracy and efficiency for day-to-day AI operations beyond proof-of-concept.

Amazon Titan Image Generator now generally available and Amazon Titan Text Embeddings V2 coming soon

In addition to adding the most capable 3P models, Amazon Titan Image Generator is generally available today. With Amazon Titan Image Generator, customers in industries like advertising, e-commerce, media, and entertainment can efficiently generate realistic, studio-quality images in large volumes and at low cost, utilizing natural language prompts. They can edit generated or existing images using text prompts, configure image dimensions, or specify the number of image variations to guide the model. By default, every image produced by Amazon Titan Image Generator contains an invisible watermark, which aligns with AWS’s commitment to promoting responsible and ethical AI by reducing the spread of misinformation. The Watermark Detection feature identifies images created by Image Generator, and is designed to be tamper-resistant, helping increase transparency around AI-generated content. Watermark Detection helps mitigate intellectual property risks and enables content creators, news organizations, risk analysts, fraud-detection teams, and others, to better identify and mitigate dissemination of misleading AI-generated content. Read more about Watermark Detection for Titan Image Generator.

Coming soon, Amazon Titan Text Embeddings V2 efficiently delivers more relevant responses for critical enterprise use cases like search. Efficient embeddings models are crucial to performance when leveraging RAG to enrich responses with additional information. Embeddings V2 is optimized for RAG workflows and provides seamless integration with Knowledge Bases for Amazon Bedrock to deliver more informative and relevant responses efficiently. Embeddings V2 enables a deeper understanding of data relationships for complex tasks like retrieval, classification, semantic similarity search, and enhancing search relevance. Offering flexible embedding sizes of 256, 512, and 1024 dimensions, Embeddings V2 prioritizes cost reduction while retaining 97% of the accuracy for RAG use cases, out-performing other leading models. Additionally, the flexible embedding sizes cater to diverse application needs, from low-latency mobile deployments to high-accuracy asynchronous workflows.
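
Once Embeddings V2 is available, generating an embedding with a chosen dimension should look similar to the following boto3 sketch; because the model had not yet launched at the time of writing, treat the model ID and the dimensions parameter as assumptions to confirm against the release documentation:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID and "dimensions" are assumptions pending the Embeddings V2 launch documentation
body = {"inputText": "What is Retrieval Augmented Generation?", "dimensions": 512}

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps(body),
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # 512 when the flexible embedding size above is honored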

New Model Evaluation simplifies the process of accessing, comparing, and selecting LLMs and FMs

Choosing the appropriate model is a critical first step toward building any generative AI application. LLMs can vary drastically in performance based on the task, domain, data modalities, and other factors. For example, a biomedical model is likely to outperform general healthcare models in specific medical contexts, whereas a coding model may face challenges with natural language processing tasks. Using an excessively powerful model could lead to inefficient resource usage, while an underpowered model might fail to meet minimum performance standards – potentially providing incorrect results. And selecting an unsuitable FM at a project’s onset could undermine stakeholder confidence and trust.

With so many models to choose from, we want to make it easier for customers to pick the right one for their use case.

Amazon Bedrock’s Model Evaluation tool, now generally available, simplifies the selection process by enabling benchmarking and comparison against specific datasets and evaluation metrics, ensuring developers select the model that best aligns with their project goals. This guided experience allows developers to evaluate models across criteria tailored to each use case. Through Model Evaluation, developers select candidate models to assess – public options, imported custom models, or fine-tuned versions. They define relevant test tasks, datasets, and evaluation metrics, such as accuracy, latency, cost projections, and qualitative factors. Read more about Model Evaluation in Amazon Bedrock.

The ability to select from the top-performing FMs in Amazon Bedrock has been extremely beneficial for Elastic Security. James Spiteri, Director of Product Management at Elastic shared,

“With just a few clicks, we can assess a single prompt across multiple models simultaneously. This model evaluation functionality enables us to compare the outputs, metrics, and associated costs across different models, allowing us to make an informed decision on which model would be most suitable for what we are trying to accomplish. This has significantly streamlined our process, saving us a considerable amount of time in deploying our applications to production.”

2. Amazon Bedrock offers capabilities to tailor generative AI to your business needs

While models are incredibly important, it takes more than a model to build an application that is useful for an organization. That’s why Amazon Bedrock has capabilities to help you easily tailor generative AI solutions to specific use cases. Customers can use their own data to privately customize applications through fine-tuning or by using Knowledge Bases for a fully managed RAG experience to deliver more relevant, accurate, and customized responses. Agents for Amazon Bedrock allows developers to define specific tasks, workflows, or decision-making processes, enhancing control and automation while ensuring consistent alignment with an intended use case. Starting today, you can now use Agents with Anthropic Claude 3 Haiku and Sonnet models. We are also introducing an updated AWS console experience, supporting a simplified schema and return of control to make it easy for developers to get started. Read more about Agents for Amazon Bedrock, now faster and easier to use.

With new Custom Model Import, customers can leverage the full capabilities of Amazon Bedrock with their own models

All these features are essential to building generative AI applications, which is why we wanted to make them available to even more customers including those who have already invested significant resources in fine-tuning LLMs with their own data on different services or in training custom models from scratch. Many customers have customized models available on Amazon SageMaker, which provides the broadest array of over 250 pre-trained FMs. These FMs include cutting-edge models such as Mistral, Llama 2, Code Llama, Jurassic-2, Jamba, pplx-7B, pplx-70B, and the impressive Falcon 180B. Amazon SageMaker helps with getting data organized and fine-tuned, building scalable and efficient training infrastructure, and then deploying models at scale in a low latency, cost-efficient manner. It has been a game changer for developers in preparing their data for AI, managing experiments, training models faster (e.g. Perplexity AI trains models 40% faster in Amazon SageMaker), lowering inference latency (e.g. Workday has reduced inference latency by 80% with Amazon SageMaker), and improving developer productivity (e.g. NatWest reduced its time-to-value for AI from 12-18 months to under seven months using Amazon SageMaker). However, operationalizing these customized models securely and integrating them into applications for specific business use cases still has challenges.

That is why today we’re introducing Amazon Bedrock Custom Model Import, which enables organizations to leverage their existing AI investments along with Amazon Bedrock’s capabilities. With Custom Model Import, customers can now import and access their own custom models built on popular open model architectures including Flan-T5, Llama, and Mistral, as a fully managed application programming interface (API) in Amazon Bedrock. Customers can take models that they customized on Amazon SageMaker, or other tools, and easily add them to Amazon Bedrock. After an automated validation, they can seamlessly access their custom model, as with any other model in Amazon Bedrock. They get all the same benefits, including seamless scalability and powerful capabilities to safeguard their applications, adherence to responsible AI principles – as well as the ability to expand a model’s knowledge base with RAG, easily create agents to complete multi-step tasks, and carry out fine tuning to keep teaching and refining models. All without needing to manage the underlying infrastructure.

With this new capability, we’re making it easy for organizations to choose a combination of Amazon Bedrock models and their own custom models while maintaining the same streamlined development experience. Today, Amazon Bedrock Custom Model Import is available in preview and supports three of the most popular open model architectures, with plans for more in the future. Read more about Custom Model Import for Amazon Bedrock.

ASAPP is a generative AI company with a 10-year history of building ML models.

“Our conversational generative AI voice and chat agent leverages these models to redefine the customer service experience. To give our customers end to end automation, we need LLM agents, knowledge base, and model selection flexibility. With Custom Model Import, we will be able to use our existing custom models in Amazon Bedrock. Bedrock will allow us to onboard our customers faster, increase our pace of innovation, and accelerate time to market for new product capabilities.”

– Priya Vijayarajendran, President, Technology.

3. Amazon Bedrock provides a secure and responsible foundation to implement safeguards easily

As generative AI capabilities progress and expand, building trust and addressing ethical concerns becomes even more important. Amazon Bedrock addresses these concerns by leveraging AWS’s secure and trustworthy infrastructure with industry-leading security measures, robust data encryption, and strict access controls.

Guardrails for Amazon Bedrock, now generally available, helps customers prevent harmful content and manage sensitive information within an application.

We also offer Guardrails for Amazon Bedrock, which is now generally available. Guardrails offers industry-leading safety protection, giving customers the ability to define content policies, set application behavior boundaries, and implement safeguards against potential risks. Guardrails for Amazon Bedrock is the only solution offered by a major cloud provider that enables customers to build and customize safety and privacy protections for their generative AI applications in a single solution. It helps customers block as much as 85% more harmful content than protection natively provided by FMs on Amazon Bedrock. Guardrails provides comprehensive support for harmful content filtering and robust personally identifiable information (PII) detection capabilities. Guardrails works with all LLMs in Amazon Bedrock as well as fine-tuned models, driving consistency in how models respond to undesirable and harmful content. You can configure thresholds to filter content across six categories – hate, insults, sexual, violence, misconduct (including criminal activity), and prompt attack (jailbreak and prompt injection). You can also define a set of topics or words that need to be blocked in your generative AI application, including harmful words, profanity, competitor names, and products. For example, a banking application can configure a guardrail to detect and block topics related to investment advice. A contact center application summarizing call center transcripts can use PII redaction to remove PII from call summaries, or a conversational chatbot can use content filters to block harmful content. Read more about Guardrails for Amazon Bedrock.
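
As an illustration of the banking example above, the following minimal boto3 sketch creates a guardrail with a single denied topic; field names follow the CreateGuardrail API and the values are placeholders:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_guardrail(
    name="banking-app-guardrail",
    description="Block investment advice in a banking assistant",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "Investment advice",
                "definition": "Guidance or recommendations about investing money or financial products.",
                "type": "DENY",
            }
        ]
    },
    blockedInputMessaging="Sorry, I can't help with investment advice.",
    blockedOutputsMessaging="Sorry, I can't help with investment advice.",
)
print(response["guardrailId"], response["version"])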

Aha!, a software company that helps more than 1 million people bring their product strategy to life, uses Amazon Bedrock to power many of its generative AI capabilities.

“We have full control over our information through Amazon Bedrock’s data protection and privacy policies, and can block harmful content through Guardrails for Amazon Bedrock. We just built on it to help product managers discover insights by analyzing feedback submitted by their customers. This is just the beginning. We will continue to build on advanced AWS technology to help product development teams everywhere prioritize what to build next with confidence.”

With even more choice of leading FMs and features that help you evaluate models and safeguard applications as well as leverage your prior investments in AI along with the capabilities of Amazon Bedrock, today’s launches make it even easier and faster for customers to build and scale generative AI applications. This blog post highlights only a subset of the new features. You can learn more about everything we’ve launched in the resources of this post, including asking questions and summarizing data from a single document without setting up a vector database in Knowledge Bases and the general availability of support for multiple data sources with Knowledge Bases.

Early adopters leveraging Amazon Bedrock’s capabilities are gaining a crucial head start – driving productivity gains, fueling ground-breaking discoveries across domains, and delivering enhanced customer experiences that foster loyalty and engagement. I’m excited to see what our customers will do next with these new capabilities.

As my mentor Werner Vogels always says “Now Go Build” and I’ll add “…with Amazon Bedrock!”

Resources

Check out the following resources to learn more about this announcement:


About the author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Read More

Building scalable, secure, and reliable RAG applications using Knowledge Bases for Amazon Bedrock

Building scalable, secure, and reliable RAG applications using Knowledge Bases for Amazon Bedrock

Generative artificial intelligence (AI) has gained significant momentum with organizations actively exploring its potential applications. As successful proof-of-concepts transition into production, organizations are increasingly in need of enterprise scalable solutions. However, to unlock the long-term success and viability of these AI-powered solutions, it is crucial to align them with well-established architectural principles.

The AWS Well-Architected Framework provides best practices and guidelines for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. Aligning generative AI applications with this framework is essential for several reasons, including providing scalability, maintaining security and privacy, achieving reliability, optimizing costs, and streamlining operations. Embracing these principles is critical for organizations seeking to use the power of generative AI and drive innovation.

This post explores the new enterprise-grade features for Knowledge Bases on Amazon Bedrock and how they align with the AWS Well-Architected Framework. With Knowledge Bases for Amazon Bedrock, you can quickly build applications using Retrieval Augmented Generation (RAG) for use cases like question answering, contextual chatbots, and personalized search.

Here are some of the features we will cover:

  1. AWS CloudFormation support
  2. Private network policies for Amazon OpenSearch Serverless
  3. Multiple S3 buckets as data sources
  4. Service Quotas support
  5. Hybrid search, metadata filters, custom prompts for the RetrieveAndGenerate API, and maximum number of retrievals

AWS Well-Architected design principles

RAG-based applications built using Knowledge Bases for Amazon Bedrock can greatly benefit from following the AWS Well-Architected Framework. This framework has six pillars that help organizations make sure their applications are secure, high-performing, resilient, efficient, cost-effective, and sustainable:

  • Operational Excellence – Well-Architected principles streamline operations, automate processes, and enable continuous monitoring and improvement of generative AI app performance.
  • Security – Implementing strong access controls, encryption, and monitoring helps secure sensitive data used in your organization’s knowledge base and prevent misuse of generative AI.
  • Reliability – Well-Architected principles guide the design of resilient and fault-tolerant systems, providing consistent value delivery to users.
  • Performance Efficiency – Choosing the appropriate resources, implementing caching strategies, and proactively monitoring performance metrics ensure that applications deliver fast and accurate responses, leading to optimal performance and an enhanced user experience.
  • Cost Optimization – Well-Architected guidelines assist in optimizing resource usage, using cost-saving services, and monitoring expenses, resulting in long-term viability of generative AI projects.
  • Sustainability – Well-Architected principles promote efficient resource utilization and minimizing carbon footprints, addressing the environmental impact of growing generative AI usage.

By aligning with the Well-Architected Framework, organizations can effectively build and manage enterprise-grade RAG applications using Knowledge Bases for Amazon Bedrock. Now, let’s dive deep into the new features launched within Knowledge Bases for Amazon Bedrock.

AWS CloudFormation support

For organizations building RAG applications, it’s important to provide efficient and effective operations and consistent infrastructure across different environments. This can be achieved by implementing practices such as automating deployment processes. To accomplish this, Knowledge Bases for Amazon Bedrock now offers support for AWS CloudFormation.

With AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK), you can now create, update, and delete knowledge bases and associated data sources. Adopting AWS CloudFormation and the AWS CDK for managing knowledge bases and associated data sources not only streamlines the deployment process, but also promotes adherence to the Well-Architected principles. By performing operations (applications, infrastructure) as code, you can provide consistent and reliable deployments in multiple AWS accounts and AWS Regions, and maintain versioned and auditable infrastructure configurations.

The following is a sample CloudFormation script in JSON format for creating and updating a knowledge base in Amazon Bedrock:

{
    "Type": "AWS::Bedrock::KnowledgeBase",
    "Properties": {
        "Name": String,
        "RoleArn": String,
        "Description": String,
        "KnowledgeBaseConfiguration": {
            "Type": String,
            "VectorKnowledgeBaseConfiguration": VectorKnowledgeBaseConfiguration
        },
        "StorageConfiguration": StorageConfiguration
    }
}

Type specifies a knowledge base as a resource in a top-level template. Minimally, you must specify the following properties:

  • Name – Specify a name for the knowledge base.
  • RoleArn – Specify the Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role with permissions to invoke API operations on the knowledge base. For more information, see Create a service role for Knowledge bases for Amazon Bedrock.
  • KnowledgeBaseConfiguration – Specify the embeddings configuration of the knowledge base. The following sub-properties are required:
    • Type – Specify the value VECTOR.
    • VectorKnowledgeBaseConfiguration – Contains details about the model used to create vector embeddings for the knowledge base.
  • StorageConfiguration – Specify information about the vector store in which the data source is stored. The following sub-properties are required:
    • Type – Specify the vector store service that you are using.
    • You also need to select one of the vector stores supported by Knowledge Bases, such as OpenSearch Serverless, Pinecone, or Amazon Aurora PostgreSQL, and provide the configuration for the selected vector store.

For details on all the fields and providing configuration of various vector stores supported by Knowledge Bases for Amazon Bedrock, refer to AWS::Bedrock::KnowledgeBase.

As of this writing, Redis Enterprise Cloud vector stores are not supported in AWS CloudFormation. For the latest information, refer to the documentation above.

After you create a knowledge base, you need to create a data source from the Amazon Simple Storage Service (Amazon S3) bucket containing the files for your knowledge base. Under the hood, this resource calls the CreateDataSource and DeleteDataSource APIs.

The following is the sample CloudFormation script in JSON format:

{
    "Type": "AWS::Bedrock::DataSource",
    "Properties": {
        "KnowledgeBaseId": String,
        "Name": String,
        "RoleArn": String,
        "Description": String,
        "DataSourceConfiguration": {
            "S3Configuration": S3DataSourceConfiguration,
            "Type": String
        },
        "ServerSideEncryptionConfiguration": ServerSideEncryptionConfiguration,
        "VectorIngestionConfiguration": VectorIngestionConfiguration
    }
}

Type specifies a data source as a resource in a top-level template. Minimally, you must specify the following properties:

  • Name – Specify a name for the data source.
  • KnowledgeBaseId – Specify the ID of the knowledge base for the data source to belong to.
  • DataSourceConfiguration – Specify information about the S3 bucket containing the data source. The following sub-properties are required:
    • Type – Specify the value S3.
    • S3Configuration – Contains details about the configuration of the S3 object containing the data source.
  • VectorIngestionConfiguration – Contains details about how to ingest the documents in a data source. You need to provide “ChunkingConfiguration” where you can define your chunking strategy.
  • ServerSideEncryptionConfiguration – Contains the configuration for server-side encryption, where you can provide the Amazon Resource Name (ARN) of the AWS KMS key used to encrypt the resource.

For more information about setting up data sources in Amazon Bedrock, see Set up a data source for your knowledge base.

Note: You cannot change the chunking configuration after you create the data source.

The CloudFormation template allows you to define and manage your knowledge base resources using infrastructure as code (IaC). By automating the setup and management of the knowledge base, you can provide a consistent infrastructure across different environments. This approach aligns with the Operational Excellence pillar, which emphasizes performing operations as code. By treating your entire workload as code, you can automate processes, create consistent responses to events, and ultimately reduce human errors.
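
If you prefer the AWS CDK mentioned earlier, the same resource can be defined in Python. The following minimal sketch uses the generated L1 construct; the role ARN, embedding model ARN, collection ARN, and index name are placeholders, and the construct and property names should be verified against your aws-cdk-lib version.

from aws_cdk import App, Stack, aws_bedrock as bedrock
from constructs import Construct

class KnowledgeBaseStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # All ARNs below are placeholders for resources created elsewhere
        bedrock.CfnKnowledgeBase(
            self,
            "KnowledgeBase",
            name="my-knowledge-base",
            role_arn="arn:aws:iam::111122223333:role/KnowledgeBaseRole",
            knowledge_base_configuration=bedrock.CfnKnowledgeBase.KnowledgeBaseConfigurationProperty(
                type="VECTOR",
                vector_knowledge_base_configuration=bedrock.CfnKnowledgeBase.VectorKnowledgeBaseConfigurationProperty(
                    embedding_model_arn="arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1"
                ),
            ),
            storage_configuration=bedrock.CfnKnowledgeBase.StorageConfigurationProperty(
                type="OPENSEARCH_SERVERLESS",
                opensearch_serverless_configuration=bedrock.CfnKnowledgeBase.OpenSearchServerlessConfigurationProperty(
                    collection_arn="arn:aws:aoss:us-east-1:111122223333:collection/abc123",
                    vector_index_name="bedrock-kb-index",
                    field_mapping=bedrock.CfnKnowledgeBase.OpenSearchServerlessFieldMappingProperty(
                        vector_field="embedding", text_field="text", metadata_field="metadata"
                    ),
                ),
            ),
        )

app = App()
KnowledgeBaseStack(app, "KnowledgeBaseStack")
app.synth()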

Private network policies for Amazon OpenSearch Serverless

For companies building RAG applications, it’s critical that the data remains secure and the network traffic does not go to public internet. To support this, Knowledge Bases for Amazon Bedrock now supports private network policies for Amazon OpenSearch Serverless.

Knowledge Bases for Amazon Bedrock provides an option for using OpenSearch Serverless as a vector store. You can now access OpenSearch Serverless collections that have a private network policy, which further enhances the security posture for your RAG application. To achieve this, you need to create an OpenSearch Serverless collection and configure it for private network access. First, create a vector index within the collection to store the embeddings. Then, while creating the collection, set Network access settings to Private and specify the VPC endpoint for access. Importantly, you can now provide private network access to OpenSearch Serverless collections specifically for Amazon Bedrock. To do this, select AWS service private access and specify bedrock.amazonaws.com as the service.
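
A minimal boto3 sketch of such a network policy follows; the collection name and VPC endpoint ID are placeholders, and the policy keys (including SourceServices for AWS service private access) should be checked against the OpenSearch Serverless documentation:

import json
import boto3

aoss = boto3.client("opensearchserverless", region_name="us-east-1")

# Grant access only to your VPC endpoint and to Amazon Bedrock as an AWS service
network_policy = [
    {
        "Rules": [
            {"ResourceType": "collection", "Resource": ["collection/my-kb-collection"]}
        ],
        "AllowFromPublic": False,
        "SourceVPCEs": ["vpce-0123456789abcdef0"],
        "SourceServices": ["bedrock.amazonaws.com"],
    }
]

aoss.create_security_policy(
    name="my-kb-network-policy",
    type="network",
    policy=json.dumps(network_policy),
    description="Private access for Knowledge Bases for Amazon Bedrock",
)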

This private network configuration makes sure that your embeddings are stored securely and are only accessible by Amazon Bedrock, enhancing the overall security and privacy of your knowledge bases. It aligns closely with the Security Pillar of controlling traffic at all layers, because all network traffic is kept within the AWS backbone with these settings.

So far, we have explored the automation of creating, deleting, and updating knowledge base resources and the enhanced security through private network policies for OpenSearch Serverless to store vector embeddings securely. Now, let’s understand how to build more reliable, comprehensive, and cost-optimized RAG applications.

Multiple S3 buckets as data sources

Knowledge Bases for Amazon Bedrock now supports adding multiple S3 buckets as data sources within a single knowledge base, including cross-account access. This enhancement increases the knowledge base’s comprehensiveness and accuracy by allowing users to aggregate and use information from various sources seamlessly.

The following are key features:

  • Multiple S3 buckets – Knowledge Bases for Amazon Bedrock can now incorporate data from multiple S3 buckets, enabling users to combine and use information from different sources effortlessly. This feature promotes data diversity and makes sure that relevant information is readily available for RAG-based applications.
  • Cross-account data access – Knowledge Bases for Amazon Bedrock supports the configuration of S3 buckets as data sources across different accounts. You can provide the necessary credentials to access these data sources, expanding the range of information that can be incorporated into their knowledge bases.
  • Efficient data management – When a data source or knowledge base is deleted, the related or existing items in the vector stores are automatically removed. This feature makes sure that the knowledge base remains up to date and free from obsolete or irrelevant data, maintaining the integrity and accuracy of the RAG process.

By supporting multiple S3 buckets as data sources, the need for creating multiple knowledge bases or redundant data copies is eliminated, thereby optimizing cost and promoting cloud financial management. Furthermore, the cross-account access capabilities enable the development of resilient architectures, aligning with the Reliability pillar of the AWS Well-Architected Framework, providing high availability and fault tolerance.
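
As an illustration, adding an additional bucket (including one owned by another account) is a single CreateDataSource call per bucket; the following boto3 sketch uses placeholder IDs and ARNs, and the cross-account parameter name should be verified against the current API reference:

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB123456",  # placeholder knowledge base ID
    name="finance-docs-cross-account",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::finance-docs-bucket",
            "inclusionPrefixes": ["reports/"],
            "bucketOwnerAccountId": "444455556666",  # assumed cross-account parameter; verify the name
        },
    },
)
print(response["dataSource"]["dataSourceId"])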

Other recently announced features for Knowledge Bases

To further enhance the reliability of your RAG application, Knowledge Bases for Amazon Bedrock now extends support for Service Quotas. This feature provides a single pane of glass to view applied AWS quota values and usage. For example, you now have quick access to information such as the allowed number of RetrieveAndGenerate API requests per second.

This feature allows you to effectively manage resource quotas, prevent overprovisioning, and limit API request rates to safeguard services from potential abuse.
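
For example, you can read the applied Amazon Bedrock quotas programmatically; the following sketch filters on the quota name, which may differ slightly from what is shown here:

import boto3

quotas = boto3.client("service-quotas")

# List applied quotas for Amazon Bedrock and print the knowledge base related ones
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if "RetrieveAndGenerate" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])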

You can also enhance your application’s performance by using recently announced features like hybrid search, filtering based on metadata, custom prompts for the RetrieveAndGenerate API, and maximum number of retrievals. These features collectively improve the accuracy, relevance, and consistency of generated responses, and align with the Performance Efficiency pillar of the AWS Well-Architected Framework.
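
The following minimal boto3 sketch combines several of these options in a single RetrieveAndGenerate call; the knowledge base ID, model ARN, and metadata key are placeholders, and the request fields should be verified against the current API reference:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB123456",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 10,           # maximum number of retrievals
                    "overrideSearchType": "HYBRID",  # hybrid search
                    "filter": {                      # metadata filter (placeholder key/value)
                        "equals": {"key": "department", "value": "finance"}
                    },
                }
            },
        },
    },
)
print(response["output"]["text"])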

Knowledge Bases for Amazon Bedrock aligns with the Sustainability pillar of the AWS Well-Architected Framework by using managed services and optimizing resource utilization. As a fully managed service, Knowledge Bases for Amazon Bedrock removes the burden of provisioning, managing, and scaling the underlying infrastructure, thereby reducing the environmental impact associated with operating and maintaining these resources.

Additionally, by aligning with the AWS Well-Architected principles, organizations can design and operate their RAG applications in a sustainable manner. Practices such as automating deployments through AWS CloudFormation, implementing private network policies for secure data access, and using efficient services like OpenSearch Serverless contribute to minimizing the environmental impact of these workloads.

Overall, Knowledge Bases for Amazon Bedrock, combined with the AWS Well-Architected Framework, empowers organizations to build scalable, secure, and reliable RAG applications while prioritizing environmental sustainability through efficient resource utilization and the adoption of managed services.

Conclusion

The new enterprise-grade features, such as AWS CloudFormation support, private network policies, the ability to use multiple S3 buckets as data sources, and support for Service Quotas, make it straightforward to build scalable, secure, and reliable RAG applications with Knowledge Bases for Amazon Bedrock. Using AWS managed services and following Well-Architected best practices allows organizations to focus on delivering innovative generative AI solutions while providing operational excellence, robust security, and efficient resource utilization. As you build applications on AWS, aligning RAG applications with the AWS Well-Architected Framework provides a solid foundation for building enterprise-grade solutions that drive business value while adhering to industry standards.

For additional resources, refer to the following:


About the authors

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and give prescriptive guidance to achieve their objective with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work she enjoys volunteering, gardening, cycling and hiking.

Read More

Integrate HyperPod clusters with Active Directory for seamless multi-user login

Integrate HyperPod clusters with Active Directory for seamless multi-user login

Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.

Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other’s work. To achieve this multi-user environment, you can take advantage of Linux’s user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.

To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users and groups, and their permissions.

In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.

Solution overview

The solution uses the following AWS services and resources:

We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.

The following diagram illustrates the high-level solution architecture.

Architecture diagram for HyperPod and Active Directory integration

In this solution, HyperPod cluster instances use the LDAPS protocol to connect to the AWS Managed Microsoft AD via an NLB. We terminate TLS at the NLB by installing a certificate on it. To configure LDAPS in HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD), an open source client for LDAP/LDAPS.

Prerequisites

This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.

Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don’t have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.

Create a VPC, subnets, and a security group

Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets with different Availability Zones.

In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and directory service. If you need to use different networks between the cluster and directory service, make sure security groups and route tables are configured so that they can communicate with each other.

Create AWS Managed Microsoft AD on Directory Service

Complete the following steps to set up your directory:

  1. On the Directory Service console, choose Directories in the navigation pane.
  2. Choose Set up directory.
  3. For Directory type, select AWS Managed Microsoft AD.
  4. Choose Next.
  5. For Edition, select Standard Edition.
  6. For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
  7. For Admin password, set a password and save it for later use.
  8. Choose Next.
  9. In the Networking section, specify the VPC and two private subnets you created.
  10. Choose Next.
  11. Review the configuration and pricing, then choose Create directory.
    The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
  12. When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.

Create an NLB in front of Directory Service

To create the NLB, complete the following steps (a scripted alternative using boto3 is sketched after these steps):

  1. On the Amazon EC2 console, choose Target groups in the navigation pane.
  2. Choose Create target groups.
  3. Create a target group with the following parameters:
    1. For Choose a target type, select IP addresses.
    2. For Target group name, enter LDAP.
    3. For Protocol: Port, choose TCP and enter 389.
    4. For IP address type, select IPv4.
    5. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    6. For Health check protocol, choose TCP.
  4. Choose Next.
  5. In the Register targets section, register the directory service’s DNS addresses as the targets.
  6. For Ports, choose Include as pending below. The addresses are added in the Review targets section with Pending status.
  7. Choose Create target group.
  8. On the Load Balancers console, choose Create load balancer.
  9. Under Network Load Balancer, choose Create.
  10. Configure an NLB with the following parameters:
    1. For Load balancer name, enter a name (for example, nlb-ds).
    2. For Scheme, select Internal.
    3. For IP address type, select IPv4.
    4. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    5. Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
    6. For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  11. In the Listeners and routing section, specify the following parameters:
    1. For Protocol, choose TCP.
    2. For Port, enter 389.
    3. For Default action, choose the target group named LDAP.

    Here, we are adding a listener for LDAP. We will add LDAPS later.

  12. Choose Create load balancer. Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
  13. When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.
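
If you prefer to script the target group and NLB instead of using the console, the following boto3 sketch mirrors the steps above; the VPC, subnets, security group, and directory DNS addresses are placeholders:

import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Placeholders: resources from the CloudFormation stack and the DNS addresses noted earlier
vpc_id = "vpc-0123456789abcdef0"
subnet_ids = ["subnet-aaaa1111", "subnet-bbbb2222"]
security_group_id = "sg-0123456789abcdef0"
ad_dns_ips = ["10.0.1.10", "10.0.2.10"]

# Target group for LDAP (TCP 389) with IP targets and a TCP health check
tg = elbv2.create_target_group(
    Name="LDAP",
    Protocol="TCP",
    Port=389,
    VpcId=vpc_id,
    TargetType="ip",
    HealthCheckProtocol="TCP",
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=tg["TargetGroupArn"],
    Targets=[{"Id": ip, "Port": 389} for ip in ad_dns_ips],
)

# Internal NLB in the two private subnets, with a TCP 389 listener forwarding to the target group
nlb = elbv2.create_load_balancer(
    Name="nlb-ds",
    Type="network",
    Scheme="internal",
    Subnets=subnet_ids,
    SecurityGroups=[security_group_id],
)["LoadBalancers"][0]

elbv2.create_listener(
    LoadBalancerArn=nlb["LoadBalancerArn"],
    Protocol="TCP",
    Port=389,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)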

Create a self-signed certificate and import it to Certificate Manager

To create a self-signed certificate, complete the following steps:

  1. On your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run the following OpenSSL commands to create a self-signed certificate and private key:
    $ openssl genrsa 2048 > ldaps.key
    
    $ openssl req -new -key ldaps.key -out ldaps_server.csr
    
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:US
    State or Province Name (full name) [Some-State]:Washington
    Locality Name (eg, city) []:Bellevue
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:CorpName
    Organizational Unit Name (eg, section) []:OrgName
    Common Name (e.g., server FQDN or YOUR name) []:nlb-ds-abcd1234.elb.region.amazonaws.com
    Email Address []:your@email.address.com
    
    Please enter the following 'extra' attributes
    to be sent with your certificate request
    A challenge password []:
    An optional company name []:
    
    $ openssl x509 -req -sha256 -days 365 -in ldaps_server.csr -signkey ldaps.key -out ldaps.crt
    
    Certificate request self-signature ok
    subject=C = US, ST = Washington, L = Bellevue, O = CorpName, OU = OrgName, CN = nlb-ds-abcd1234.elb.region.amazonaws.com, emailAddress = your@email.address.com
    
    $ chmod 600 ldaps.key

  2. On the Certificate Manager console, choose Import.
  3. Enter the certificate body and private key using the contents of ldaps.crt and ldaps.key, respectively.
  4. Choose Next.
  5. Add any optional tags, then choose Next.
  6. Review the configuration and choose Import.
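
Alternatively, you can import the certificate with the AWS CLI from the directory that contains ldaps.crt and ldaps.key. The following is a minimal sketch; the captured certificate ARN is used later when you add the LDAPS listener.

# Import the self-signed certificate and private key into ACM,
# and capture the certificate ARN for the LDAPS listener
CERT_ARN=$(aws acm import-certificate \
  --certificate fileb://ldaps.crt \
  --private-key fileb://ldaps.key \
  --query 'CertificateArn' --output text)
echo "$CERT_ARN"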

Add an LDAPS listener

We already added an LDAP listener to the NLB. Now we add an LDAPS listener that uses the imported certificate. Complete the following steps:

  1. On the Load Balancers console, navigate to the NLB details page.
  2. On the Listeners tab, choose Add listener.
  3. Configure the listener with the following parameters:
    1. For Protocol, choose TLS.
    2. For Port, enter 636.
    3. For Default action, choose LDAP.
    4. For Certificate source, select From ACM.
    5. For Certificate, choose the certificate you imported into ACM.
  4. Choose Add. Now the NLB listens for both LDAP and LDAPS traffic. We recommend deleting the LDAP listener because, unlike LDAPS, it transmits data without encryption.
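
If you created the NLB and imported the certificate with the CLI sketches shown earlier, you can add the LDAPS listener the same way. The $NLB_ARN, $CERT_ARN, and $TG_ARN variables come from those sketches.

# Add a TLS:636 (LDAPS) listener that terminates TLS with the imported
# certificate and forwards to the same LDAP target group
aws elbv2 create-listener --load-balancer-arn "$NLB_ARN" \
  --protocol TLS --port 636 \
  --certificates CertificateArn="$CERT_ARN" \
  --default-actions Type=forward,TargetGroupArn="$TG_ARN"

# Optionally, delete the unencrypted TCP:389 listener (replace the ARN placeholder)
# aws elbv2 delete-listener --listener-arn <ldap-listener-arn>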

Create an EC2 Windows instance to administer users and groups in the AD

To create and maintain users and groups in the AD, complete the following steps:

  1. On the Amazon EC2 console, choose Instances in the navigation pane.
  2. Choose Launch instances.
  3. For Name, enter a name for your instance.
  4. For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
  5. For Instance type, choose t2.micro.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose either of the two subnets you created with the CloudFormation template.
    3. For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  7. For Configure storage, set storage to 30 GB gp2.
  8. In the Advanced details section, for Domain join directory, choose the AD you created.
  9. For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
  10. Review the summary and choose Launch instance.

Create users and groups in AD using the EC2 Windows instance

With Remote Desktop, connect to the EC2 Windows instance you created in the previous step. Using an RDP client is recommended over using a browser-based Remote Desktop so that you can exchange the contents of the clipboard with your local machine using copy-paste operations. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.

If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the Admin password you set for the directory service.

  1. When the Windows desktop screen opens, choose Server Manager from the Start menu.
  2. Choose Local Server in the navigation pane, and confirm that the domain is the one you specified for the directory service.
  3. On the Manage menu, choose Add Roles and Features.
  4. Choose Next until you reach the Features page.
  5. Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
  6. Choose Next and Install. Feature installation starts.
  7. When the installation is complete, choose Close.
  8. Open Active Directory Users and Computers from the Start menu.
  9. Under hyperpod.abc123.com, expand hyperpod.
  10. Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
  11. Create an organizational unit called Groups.
  12. Choose (right-click) Groups, choose New, and choose Group.
  13. Create a group called ClusterAdmin.
  14. Create a second group called ClusterDev.
  15. Choose (right-click) Users, choose New, and choose User.
  16. Create a new user.
  17. Choose (right-click) the user and choose Add to a group.
  18. Add your users to the ClusterAdmin or ClusterDev group. Users added to the ClusterAdmin group will have sudo privileges on the cluster.

Create a ReadOnly user in AD

Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.

Take note of the password for later use.

(For SSH public key authentication) Add SSH public keys to users

By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH’s ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.

  1. In Active Directory Users and Computers, on the View menu, enable Advanced Features.
  2. Open the Properties dialog of the user.
  3. On the Attribute Editor tab, choose altSecurityIdentities, then choose Edit.
  4. For Value to add, choose Add.
  5. For Values, add an SSH public key.
  6. Choose OK. Confirm that the SSH public key appears as an attribute.

Get an obfuscated password for the ReadOnly user

To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).

Install the sssd-tools package on the Linux machine; it provides the Python module pysss, which is used for obfuscation:

# Ubuntu
$ sudo apt install sssd-tools

# Amazon Linux
$ sudo yum install sssd-tools

Run the following one-line Python script and enter the password of the ReadOnly user when prompted. The script prints the obfuscated password.

$ python3 -c "import getpass,pysss; print(pysss.password().encrypt(getpass.getpass('AD reader user password: ').strip(), pysss.password().AES_256))"
AD reader user password: (Enter ReadOnly user password) 
AAAQACK2....

Create a HyperPod cluster with an SSSD-enabled lifecycle script

Next, you create a HyperPod cluster with LDAPS/Active Directory integration.

  1. Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
    1. Set enable_sssd to True to enable setting up SSSD.
    2. The SssdConfig class contains configuration parameters for SSSD.
    3. Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
    # Basic configuration parameters
    class Config:
             :
        # Set true if you want to install SSSD for ActiveDirectory/LDAP integration.
        # You need to configure parameters in SssdConfig as well.
        enable_sssd = True
    # Configuration parameters for ActiveDirectory/LDAP/SSSD
    class SssdConfig:
    
        # Name of domain. Can be default if you are not sure.
        domain = "default"
    
        # Comma separated list of LDAP server URIs
        ldap_uri = "ldaps://nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com"
    
        # The default base DN to use for performing LDAP user operations
        ldap_search_base = "dc=hyperpod,dc=abc123,dc=com"
    
        # The default bind DN to use for performing LDAP operations
        ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com"
    
        # "password" or "obfuscated_password". Obfuscated password is recommended.
        ldap_default_authtok_type = "obfuscated_password"
    
        # You need to modify this parameter with the obfuscated password, not plain text password
        ldap_default_authtok = "placeholder"
    
        # SSH authentication method - "password" or "publickey"
        ssh_auth_method = "publickey"
    
        # Home directory. You can change it to "/home/%u" if your cluster doesn't use FSx volume.
        override_homedir = "/fsx/%u"
    
        # Group names to accept SSH login
        ssh_allow_groups = {
            "controller" : ["ClusterAdmin", "ubuntu"],
            "compute" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
            "login" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
        }
    
        # Group names for sudoers
        sudoers_groups = {
            "controller" : ["ClusterAdmin", "ClusterDev"],
            "compute" : ["ClusterAdmin", "ClusterDev"],
            "login" : ["ClusterAdmin", "ClusterDev"],
        }
    

  2. Copy the certificate file ldaps.crt to the same directory (where config.py exists).
  3. Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster that uses them. (A CLI sketch for these steps follows this list.)
  4. Wait until the status changes to InService.
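
Steps 3 and 4 can also be performed with the AWS CLI. The following is a minimal sketch; the local script path, S3 bucket and prefix, cluster configuration file, and cluster name are placeholders that you should replace with your own values (refer to the HyperPod workshop for a complete cluster configuration example).

# Upload the lifecycle scripts (including the edited config.py and ldaps.crt) to S3
aws s3 cp --recursive ./LifecycleScripts/base-config \
  s3://amzn-s3-demo-bucket/LifecycleScripts/base-config

# Create the HyperPod cluster from a configuration file whose instance groups
# point their LifeCycleConfig SourceS3Uri at the prefix above
aws sagemaker create-cluster --cli-input-json file://cluster-config.json

# Check the cluster status until it reaches InService
aws sagemaker describe-cluster --cluster-name ml-cluster --query 'ClusterStatus'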

Verification

Let’s verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can’t directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.

Option 1: SSH login through AWS Systems Manager

You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config using the following example. For the HostName field, specify the Systems Manager target name in the format of sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user’s SSH private key. This field is not required if you chose password authentication.

Host MyCluster-LoginNode
    HostName sagemaker-cluster:abcd1234_LoginGroup-i-01234567890abcdef
    User user1
    IdentityFile ~/keys/my-cluster-ssh-key.pem
    ProxyCommand aws --profile default --region us-west-2 ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p

Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.

$ ssh MyCluster-LoginNode
   :
   :
   ____              __  ___     __             __ __                  ___          __
  / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
 _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
/___/_,_/_, /__/_/  /_/_,_/_/_\__/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
         /___/                                    /___/_/
You're on the controller
Instance Type: ml.m5.xlarge
user1@ip-10-1-111-222:~$
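
After you log in as an AD user, you can optionally confirm that SSSD is resolving users and groups from AD. The following commands are a quick check; user1, ClusterAdmin, and ClusterDev are the example names used in this post.

# Confirm the AD user is resolved through SSSD
$ id user1

# Confirm the AD groups are visible on the node
$ getent group ClusterAdmin ClusterDev

# Confirm the home directory matches the override_homedir setting (/fsx/%u)
$ echo $HOME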

At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy by referring to the following example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-west-2:123456789012:cluster/abcd1234efgh",
                "arn:aws:ssm:us-west-2:123456789012:document/AWS-StartSSHSession"
            ],
            "Condition": {
                "BoolIfExists": {
                    "ssm:SessionDocumentAccessCheck": "true"
                }
            }
        }
    ]
}

For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.

Option 2: SSH login through bastion host

Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn’t have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.

  1. Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
  2. Update the security group for the cluster to allow inbound SSH access from the bastion security group. (You can also perform steps 1 and 2 with the AWS CLI, as shown in the sketch after this procedure.)
  3. Create an EC2 Linux instance.
  4. For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
  5. For Instance type, choose t3.small.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose the public subnet you created with the CloudFormation template.
    3. For Common security groups, choose the bastion security group you created.
  7. For Configure storage, set storage to 8 GB.
  8. Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config by referring to the following example:
    Host Bastion
        HostName 11.22.33.44
        User ubuntu
        IdentityFile ~/keys/my-bastion-ssh-key.pem
    
    Host MyCluster-LoginNode-with-Proxy
        HostName 10.1.111.222
        User user1
        IdentityFile ~/keys/my-cluster-ssh-key.pem
        ProxyCommand ssh -q -W %h:%p Bastion

  9. Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user:
    $ ssh MyCluster-LoginNode-with-Proxy
       :
       :
       ____              __  ___     __             __ __                  ___          __
      / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
     _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
    /___/_,_/_, /__/_/  /_/_,_/_/_\__/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
             /___/                                    /___/_/
    You're on the controller
    Instance Type: ml.m5.xlarge
    user1@ip-10-1-111-222:~$
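
The following AWS CLI sketch shows one way to perform steps 1 and 2. The VPC ID, your local IP address, and the cluster security group ID are placeholders; replace them with your own values.

# Create the bastion security group and allow SSH (TCP 22) from your local IP address
BASTION_SG=$(aws ec2 create-security-group \
  --group-name bastion-ssh \
  --description "SSH access to bastion host" \
  --vpc-id vpc-0123456789abcdef0 \
  --query 'GroupId' --output text)

aws ec2 authorize-security-group-ingress --group-id "$BASTION_SG" \
  --protocol tcp --port 22 --cidr 203.0.113.10/32

# Allow SSH from the bastion security group to the cluster security group
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef1 \
  --protocol tcp --port 22 --source-group "$BASTION_SG"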

Clean up

Clean up the resources in the following order:

  1. Delete the HyperPod cluster.
  2. Delete the Network Load Balancer.
  3. Delete the load balancing target group.
  4. Delete the certificate imported to Certificate Manager.
  5. Delete the EC2 Windows instance.
  6. Delete the EC2 Linux instance for the bastion host.
  7. Delete the AWS Managed Microsoft AD.
  8. Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.

Conclusion

This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.

For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.


About the Authors

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud technology. In his free time, he enjoys playing video games, reading books, and writing software.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing nearly a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.

Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.

The executive’s guide to generative AI for sustainability

The executive’s guide to generative AI for sustainability

Organizations are facing ever-increasing requirements for sustainability goals alongside environmental, social, and governance (ESG) practices. A Gartner, Inc. survey revealed that 87 percent of business leaders expect to increase their organization’s investment in sustainability over the coming years. This post serves as a starting point for any executive seeking to navigate the intersection of generative artificial intelligence (generative AI) and sustainability. It provides examples of use cases and best practices for using generative AI’s potential to accelerate sustainability and ESG initiatives, as well as insights into the main operational challenges of generative AI for sustainability. This guide can be used as a roadmap for integrating generative AI effectively within sustainability strategies while ensuring alignment with organizational objectives.

A roadmap to generative AI for sustainability

In the sections that follow, we provide a roadmap for integrating generative AI into sustainability initiatives.

1. Understand the potential of generative AI for sustainability

Generative AI has the power to transform every part of a business with its wide range of capabilities. These include the ability to analyze massive amounts of data, identify patterns, summarize documents, perform translations, correct errors, or answer questions. These capabilities can be used to add value throughout the entire value chain of your organization. Figure 1 illustrates selected examples of use cases of generative AI for sustainability across the value chain.

Figure 1: Examples of generative AI for sustainability use cases across the value chain

According to KPMG’s 2024 ESG Organization Survey, investment in ESG capabilities is another top priority for executives as organizations face increasing regulatory pressure to disclose information about ESG impacts, risks, and opportunities. Within this context, you can use generative AI to advance your organization’s ESG goals.

The typical ESG workflow consists of multiple phases, each presenting unique pain points. Generative AI offers solutions that can address these pain points throughout the process and contribute to sustainability efforts. Figure 2 provides examples illustrating how generative AI can support each phase of the ESG workflow within your organization. These examples include speeding up market trend analysis, ensuring accurate risk management and compliance, and facilitating data collection or report generation. Note that ESG workflows may vary across different verticals, organizational maturities, and legislative frameworks. Factors such as industry-specific regulations, company size, and regional policies can influence the ESG workflow steps. Therefore, prioritizing use cases according to your specific needs and context and defining a clear plan to measure success is essential for optimal effectiveness.

Figure 2: Mapping generative AI benefits across the ESG workflow

2. Recognize the operational challenges of generative AI for sustainability

Understanding and appropriately addressing the challenges of implementing generative AI is crucial for organizations aiming to use its potential to advance their sustainability goals and ESG initiatives. These challenges include collecting and managing high-quality data, integrating generative AI into existing IT systems, navigating ethical concerns, filling skills gaps, and setting the organization up for success by bringing in key stakeholders such as the chief information security officer (CISO) or chief financial officer (CFO) early so you build responsibly. Legal challenges are a huge blocker for transitioning from proof of concept (POC) to production, so it’s essential to involve legal teams early in the process to build with compliance in mind. Figure 3 provides an overview of the main operational challenges of generative AI for sustainability.

Figure 3: Operational challenges of generative AI for sustainability

3. Set the right data foundations

As a CEO aiming to use generative AI to achieve sustainability goals, remember that data is your differentiator. Companies that lack ready access to high-quality data will not be able to customize generative AI models with their own data, thus missing out on realizing the full scaling potential of generative AI and creating a competitive advantage. Invest in acquiring diverse and high-quality datasets to enrich and accelerate your ESG initiatives. You can use resources such as the Amazon Sustainability Data Initiative or the AWS Data Exchange to simplify and expedite the acquisition and analysis of comprehensive datasets. Alongside external data acquisition, prioritize internal data management to maximize the potential of generative AI and use its capabilities in analyzing your organizational data and uncovering new insights.

From an operational standpoint, you can embrace foundation model ops (FMOps) and large language model ops (LLMOps) to make sure your sustainability efforts are data-driven and scalable. This involves documenting data lineage, data versioning, automating data processing, and monitoring data management costs.

4. Identify high-impact opportunities

You can use Amazon’s working backwards principle to pinpoint opportunities within your sustainability strategy where generative AI can make a significant impact. Prioritize projects that promise immediate enhancements in key areas within your organization. While ESG remains a key aspect of sustainability, tapping into industry-specific expertise across sectors such as energy, supply chain, manufacturing, transportation, or agriculture can uncover diverse generative AI for sustainability use cases tailored to your business’s applications. Moreover, exploring alternative avenues, such as using generative AI to improve research and development, enable customer self-service, optimize energy usage in buildings, or slow down deforestation, can also provide impactful opportunities for sustainable innovation.

5. Use the right tools

Failing to use the appropriate tools can add complexity, compromise security, and reduce effectiveness in using generative AI for sustainability. The right tool should offer you choice and flexibility and enable you to customize your solutions to specific needs and requirements.

Figure 4 illustrates the AWS generative AI stack as of 2023, which offers a set of capabilities that encompass choice, breadth, and depth across all layers. Moreover, it is built on a data-first approach, ensuring that every aspect of its offerings is designed with security and privacy in mind.

Examples of tools you can use to advance sustainability initiatives are:

Amazon Bedrock – A fully managed service that provides access to high-performing FMs from leading AI companies through a single API, enabling you to choose the right model for your sustainability use cases.

AWS Trainium2 – Purpose-built for high-performance training of FMs and LLMs, Trainium2 provides up to 2x better energy efficiency (performance/watt) compared to first-generation Trainium chips.

Inferentia2-based Amazon EC2 Inf2 instances – These instances offer up to 50 percent better performance/watt over comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. Purpose-built to handle deep learning models at scale, Inf2 instances are indispensable for deploying ultra-large models while meeting sustainability goals through improved energy efficiency.

Figure 4: AWS generative AI stack

6. Use the right approach

Generative AI isn’t a one-size-fits-all solution. Tailoring your approach by choosing the right modality and optimization strategy is crucial for maximizing its impact on sustainability initiatives. Figure 5 offers an overview of generative AI modalities.

Figure 5: Generative AI modalities

In addition, Figure 6 outlines the main generative AI optimization strategies, including prompt engineering, Retrieval Augmented Generation, and fine-tuning or continued pre-training.

Figure 6: Generative AI optimization strategies

7. Simplify the development of your applications by using generative AI agents

Generative AI agents offer a unique opportunity to drive sustainability initiatives forward with their advanced capabilities of automating a wide range of routine and repetitive tasks, such as data entry, customer support inquiries, and content generation. Moreover, they can orchestrate complex, multistep workflows by breaking down tasks into smaller, manageable steps, coordinating various actions, and ensuring the efficient execution of processes within your organization. For example, you can use Agents for Amazon Bedrock to configure an agent that monitors and analyzes energy usage patterns across your operations and identifies opportunities for energy savings. Alternatively, you can create a specialized agent that monitors compliance with sustainability regulations in real time.

8. Build robust feedback mechanisms for evaluation

Take advantage of feedback insights for strategic improvements, whether adjusting generative AI models or redefining objectives to ensure agility and alignment with sustainability challenges. Consider the following guidelines:

Implement real-time monitoring – Set up monitoring systems to track generative AI performance against sustainability benchmarks, focusing on efficiency and environmental impact. Establish a metrics pipeline to provide insights into the sustainability contributions of your generative AI initiatives.

Engage stakeholders for human-in-the-loop evaluation – Rely on human-in-the-loop auditing and regularly collect feedback from internal teams, customers, and partners to gauge the impact of generative AI–driven processes on the organization’s sustainability benchmarks. This enhances transparency and promotes trust in your commitment to sustainability.

Use automated testing for continuous improvement – With tools such as RAGAS and LangSmith, you can use LLM-based evaluation to identify and correct inaccuracies or hallucinations, facilitating rapid optimization of generative AI models in line with sustainability goals.

9. Measure impact and maximize ROI from generative AI for sustainability

Establish clear key performance indicators (KPIs) that capture the environmental impact, such as carbon footprint reduction, alongside economic benefits, such as cost savings or enhanced business agility. This dual focus ensures that your investments not only contribute to programs focused on environmental sustainability but also reinforce the business case for sustainability, while empowering you to drive innovation and competitive advantage in sustainable practices. Share success stories internally and externally to inspire others and demonstrate your organization’s commitment to sustainability leadership.

10. Minimize resource usage throughout the generative AI lifecycle

In some cases, generative AI itself can have a high energy cost. To achieve maximum impact, consider the trade-off between the benefits of using generative AI for sustainability initiatives and the energy efficiency of the technology itself. Make sure to gain a deep understanding of the iterative generative AI lifecycle and optimize each phase for environmental sustainability. Typically, the journey into generative AI begins with identifying specific application requirements. From there, you have the option to either train your model from scratch or use an existing one. In most cases, opting for an existing model and customizing it is preferred. After this step, evaluating your system thoroughly is essential before deployment. Lastly, continuous monitoring enables ongoing refinement and adjustments. Throughout this lifecycle, implementing AWS Well-Architected Framework best practices is recommended. Refer to Figure 7 for an overview of the generative AI lifecycle.

Figure 7: The generative AI lifecycle

11. Manage risks and implement responsibly

While generative AI holds significant promise for working towards your organization’s sustainability goals, it also poses challenges such as toxicity and hallucinations. Striking the right balance between innovation and the responsible use of generative AI is fundamental for mitigating risks and enabling responsible AI innovation. This balance must account for the assessment of risk in terms of several factors such as quality, disclosures, or reporting. To achieve this, adopting specific tools and capabilities and working with your security team experts to adopt security best practices is necessary. Scaling generative AI in a safe and secure manner requires putting in place guardrails that are customized to your use cases and aligned with responsible AI policies.

12. Invest in educating and training your teams

Continuously upskill your team and empower them with the right skills to innovate and actively contribute to achieving your organization’s sustainability goals. Identify relevant resources for sustainability and generative AI to ensure your teams stay updated with the essential skills required in both areas.

Conclusion

In this post, we provided a guide for executives to integrate generative AI into their sustainability strategies, focusing on both sustainability and ESG goals. The adoption of generative AI in sustainability efforts is not just about technological innovation. It is about fostering a culture of responsibility, innovation, and continuous improvement. By prioritizing high-quality data, identifying impactful opportunities, and fostering stakeholders’ engagement, companies can harness the transformative power of generative AI to not only achieve but surpass their sustainability goals.

How can AWS help?

Explore the AWS Solutions Library to discover ways to build sustainability solutions on AWS.

The AWS Generative AI Innovation Center can assist you in the process with expert guidance on ideation, strategic use case identification, execution, and scaling to production.

To learn more about how Amazon is using AI to reach our climate pledge commitment of net-zero carbon by 2040, explore the 7 ways AI is helping Amazon build a more sustainable future and business.


About the Authors

Dr. Wafae Bakkali is a Data Scientist at AWS. As a generative AI expert, Wafae is driven by the mission to empower customers in solving their business challenges through the use of generative AI techniques, ensuring they do so with maximum efficiency and sustainability.

Dr. Mehdi Noori is a Senior Scientist at AWS Generative AI Innovation Center. With a passion for bridging technology and innovation in the sustainability field, he assists AWS customers in unlocking the potential of Generative AI, turning potential challenges into opportunities for rapid experimentation and innovation. By focusing on scalable, measurable, and impactful uses of advanced AI technologies and streamlining the path to production, he helps customers achieve their sustainability goals.

Rahul Sareen is the GM for Sustainability Solutions and GTM at AWS. Rahul leads a team of high-performing individuals, consisting of sustainability strategists, GTM specialists, and technology architects, who create great business outcomes for customers’ sustainability goals (everything from carbon emission tracking, sustainable packaging and operations, and the circular economy to renewable energy). Rahul’s team provides technical expertise (ML, generative AI, IoT) to solve sustainability use cases.
