Integrate dynamic web content in your generative AI application using a web search API and Amazon Bedrock Agents

Amazon Bedrock Agents offers developers the ability to build and configure autonomous agents in their applications. These agents help users complete actions based on organizational data and user input, orchestrating interactions between foundation models (FMs), data sources, software applications, and user conversations.

Amazon Bedrock agents use the power of large language models (LLMs) to perform complex reasoning and action generation. This approach is inspired by the ReAct (reasoning and acting) paradigm, which combines reasoning traces and task-specific actions in an interleaved manner.

Amazon Bedrock agents use LLMs to break down tasks, interact dynamically with users, run actions through API calls, and augment knowledge using Amazon Bedrock Knowledge Bases. The ReAct approach enables agents to generate reasoning traces and actions while seamlessly integrating with company systems through action groups. By offering accelerated development, simplified infrastructure, enhanced capabilities through chain-of-thought (CoT) prompting, and improved accuracy, Amazon Bedrock Agents allows developers to rapidly build sophisticated AI solutions that combine the power of LLMs with custom actions and knowledge bases, all without managing underlying complexity.

Web search APIs empower developers to seamlessly integrate powerful search capabilities into their applications, providing access to vast troves of internet data with just a few lines of code. These APIs act as gateways to sophisticated search engines, allowing applications to programmatically query the web and retrieve relevant results including webpages, images, news articles, and more.

By using web search APIs, developers can enhance their applications with up-to-date information from across the internet, enabling features like content discovery, trend analysis, and intelligent recommendations. With customizable parameters for refining searches and structured response formats for parsing, web search APIs offer a flexible and efficient solution for harnessing the wealth of information available on the web.

Amazon Bedrock Agents offers a powerful solution for enhancing chatbot capabilities, and when combined with web search APIs, they address a critical customer pain point. In this post, we demonstrate how to use Amazon Bedrock Agents with a web search API to integrate dynamic web content in your generative AI application.

Benefits of integrating a web search API with Amazon Bedrock Agents

Let’s explore how this integration can revolutionize your chatbot experience:

  • Seamless in-chat web search – By incorporating web search APIs into your Amazon Bedrock agents, you can empower your chatbot to perform real-time web searches without forcing users to leave the chat interface. This keeps users engaged within your application, improving overall user experience and retention.
  • Dynamic information retrieval – Amazon Bedrock agents can use web search APIs to fetch up-to-date information on a wide range of topics. This makes sure that your chatbot provides the most current and relevant responses, enhancing its utility and user trust.
  • Contextual responses – Amazon Bedrock agents use CoT prompting, enabling FMs to plan and run actions dynamically. Through this approach, agents can analyze user queries and determine when a web search is necessary or—if enabled—gather more information from the user to complete the task. This allows your chatbot to blend information from APIs, knowledge bases, and up-to-date web-sourced content, creating a more natural and informative conversation flow. With these capabilities, agents can provide responses that are better tailored to the user’s needs and the current context of the interaction.
  • Enhanced problem solving – By integrating web search APIs, your Amazon Bedrock agent can tackle a broader range of user inquiries. Whether it’s troubleshooting a technical issue or providing industry insights, your chatbot becomes a more versatile and valuable resource for users.
  • Minimal setup, maximum impact – Amazon Bedrock agents simplify the process of adding web search functionality to your chatbot. With just a few configuration steps, you can dramatically expand your chatbot’s knowledge base and capabilities, all while maintaining a streamlined UI.
  • Infrastructure as code – You can use AWS CloudFormation or the AWS Cloud Development Kit (AWS CDK) to deploy and manage Amazon Bedrock agents.

By addressing the customer challenge of expanding chatbot functionality without complicating the user experience, the combination of web search APIs and Amazon Bedrock agents offers a compelling solution. This integration allows businesses to create more capable, informative, and user-friendly chatbots that keep users engaged and satisfied within a single interface.

Solution overview

This solution uses Amazon Bedrock Agents with a web search capability that integrates external search APIs (SerpAPI and Tavily AI) with the agent. The architecture consists of the following key components:

Visual representation of the system

  • An Amazon Bedrock agent orchestrates the interaction between the user and search APIs, handling the chat sessions and optionally long-term memory
  • An AWS Lambda function implements the logic for calling external search APIs and processing results
  • External search APIs (SerpAPI and Tavily AI) provide web search capabilities
  • Amazon Bedrock FMs generate natural language responses based on search results
  • AWS Secrets Manager securely stores API keys for external services

The solution flow is as follows:

  1. User input is received by the Amazon Bedrock agent, powered by Anthropic Claude 3 Sonnet on Amazon Bedrock.
  2. The agent determines whether a web search is necessary or responds to the user with clarifying questions.
  3. If required, the agent invokes one of the two action group functions to perform a web search: SerpAPI for up-to-date events or Tavily AI for research-heavy questions.
  4. The Lambda function retrieves the API secrets securely from Secrets Manager, calls the appropriate search API, and processes the results.
  5. The agent generates the final response based on the search results.
  6. The response is returned to the user after final output guardrails are applied.

The preceding figure is a visual representation of the system we are going to implement.

We demonstrate two methods to build this solution. To set up the agent on the AWS Management Console, we use the new agent builder. The accompanying GitHub repository (aws-samples/websearch_agent) contains the Python AWS CDK code to deploy the same example.

Prerequisites

Make sure you have the following prerequisites:

Amazon Bedrock agents support models like Amazon Titan Text and Anthropic Claude models. Each model has different capabilities and pricing. For the full list of supported models, see Supported regions and models for Amazon Bedrock Agents.

For this post, we use the Anthropic Claude 3 Sonnet model.

Configure the web search APIs

Both Serper (SerpAPI) and Tavily AI provide web search APIs that can be integrated with Amazon Bedrock agents by calling their REST-based API endpoints from a Lambda function. However, they have some key differences that can influence when you would use each one:

  • SerpAPI provides access to multiple search engines, including Google, Bing, Yahoo, and others. It offers granular control over search parameters and result types (for example, organic results, featured snippets, images, and videos). SerpAPI might be better suited for tasks requiring specific search engine features or when you need results from multiple search engines.
  • Tavily AI is specifically designed for AI agents and LLMs, focusing on delivering relevant and factual results. It offers features like including answers, raw content, and images in search results. It provides customization options such as search depth (basic or advanced) and the ability to include or exclude specific domains. It’s optimized for speed and efficiency in delivering real-time results.

You would use SerpAPI if you need results from specific search engines or multiple engines, and Tavily AI when relevance and factual accuracy are crucial.

Ultimately, the choice between SerpAPI and Tavily AI depends on your specific research requirements, the level of control you need over search parameters, and whether you prioritize general search engine capabilities or AI-optimized results.

For the example in this post, we chose to use both and let the agent decide which API is the more appropriate one, depending on the question or prompt. The agent can also opt to call both if one doesn’t provide a good enough answer. Both SerpAPI and Tavily AI provide a free tier that can be used for the example in this post.

For both APIs, API keys are required and are available from Serper and Tavily.

We securely store the obtained API keys in Secrets Manager. The following examples create secrets for the API keys:

aws secretsmanager create-secret \
    --name SERPER_API_KEY \
    --description "The API secret key for Serper." \
    --secret-string "$SERPER_API_KEY"

aws secretsmanager create-secret \
    --name TAVILY_API_KEY \
    --description "The API secret key for Tavily AI." \
    --secret-string "$TAVILY_API_KEY"

When you enter commands in a shell, there is a risk of the command history being accessed or utilities having access to your command parameters. For more information, see Mitigate the risks of using the AWS CLI to store your AWS Secrets Manager secrets.

Now that the APIs are configured, you can start building the web search Amazon Bedrock agent.

In the following section, we present two methods to create your agent: through the console and using the AWS CDK. Although the console path offers a more visual approach, we strongly recommend using the AWS CDK for deploying the agent. This method not only provides a more robust deployment process, but also allows you to examine the underlying code. Let’s explore both options to help you choose the best approach for your needs.

Build a web search Amazon Bedrock agent using the console

In the first example, you build a web search agent using the Amazon Bedrock console to create and configure the agent, and then the Lambda console to configure and deploy a Lambda function.

Create a web search agent

To create a web search agent using the console, complete the following steps:

  1. On the Amazon Bedrock console, choose Agents in the navigation pane.
  2. Choose Create agent.
  3. Enter a name for the agent (such as websearch-agent) and an optional description, then choose Create.

Create Agent Dialogue

You are now in the new agent builder, where you can access and edit the configuration of an agent.

  4. For Agent resource role, leave the default Create and use a new service role.

This option automatically creates the AWS Identity and Access Management (IAM) role assumed by the agent.

  5. For the model, choose Anthropic and Claude 3 Sonnet.

Instructions for the Agent

  6. For Instructions for the Agent, provide clear and specific instructions to tell the agent what it should do. For the web search agent, enter:
You are an agent that can handle various tasks as described below:
1/ Helping users do research and finding up-to-date information. For up-to-date information, always use web search. Web search has two flavors:
a/ Google Search - this is great for looking up up-to-date information and current events
b/ Tavily AI Search - this is used to do deep research on topics your user is interested in. Not good for being used on news because it does not order search results by date.

As you can see from the instructions, we decided to name the SerpAPI option Google Search. In our tests with the Anthropic Claude 3 Sonnet model, the model treats Google Search as synonymous with web search. Because the instructions are natural language directed at the model, we want to stay as close as possible to the expected usage of words in everyday language; therefore, we use Google Search instead of SerpAPI. However, this could vary from model to model, so we encourage you to test new instructions when changing the model.
  7. Choose Add in the Action groups section.

Action groups are how agents can interact with external systems or APIs to get more information or perform actions.

  8. For Enter action group name, enter action-group-web-search for the action group.
  9. For Action group type, select Define with function details so you can specify functions and their parameters as JSON instead of providing an OpenAPI schema.
  10. For Action group invocation, set up what the agent does after this action group is identified by the model. Because we want to call the web search APIs, select Quick create a new Lambda function.

With this option, Amazon Bedrock creates a basic Lambda function for your agent that you can later modify on the Lambda console for the use case of calling the web search APIs. The agent will predict the function and function parameters needed to fulfill its goal and pass the parameters to the Lambda function.

Create Action group

  11. Now, configure the two functions of the action group—one for the SerpAPI Google search, and one for the Tavily AI search.
  12. For each of the two functions, for Parameters, add search_query with a description.

This is a parameter of type String and is required by each of the functions.

  13. Choose Create to complete the creation of the action group.

Action group functions

We use the following parameter descriptions:

“The search query for the Google web search.”
“The search query for the Tavily web search.”

We encourage you to try adding a target website as an extra parameter to the action group functions. Take a look at the Lambda function code and infer the required settings.

You will be redirected to the agent builder console.

  14. Choose Save to save your agent configuration.

Configure and deploy a Lambda function

Complete the following steps to update the action group Lambda function:

  1. On the Lambda console, locate the new Lambda function whose name starts with action-group-web-search-.
  2. Edit the provided starting code and implement the web search use case:
import http.client
import json
… 
def lambda_handler(event, _):
    action_group = event["actionGroup"]
    function = event["function"]
    parameters = event.get("parameters", [])
    search_query, target_website = extract_search_params(action_group, function, parameters)
    search_results: str = ""
    if function == "tavily-ai-search":
        search_results = tavily_ai_search(search_query, target_website)
    elif function == "google-search":
        search_results = google_search(search_query, target_website)
    # Prepare the response
    function_response_body = {"TEXT": {"body": f"Here are the top search results for the query '{search_query}': {search_results} "}}
    action_response = {
        "actionGroup": action_group,
        "function": function,
        "functionResponse": {"responseBody": function_response_body},
    }
    response = {"response": action_response, "messageVersion": event["messageVersion"]}
    return response

The code is truncated for brevity; the full code is available on GitHub.
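
For orientation, the following is a minimal sketch of what the two search helpers called by the handler (tavily_ai_search and google_search) could look like. This is not the repository implementation: the endpoints and payload fields shown are based on the publicly documented Serper and Tavily REST APIs and may differ from the full code on GitHub.

import http.client
import json

import boto3

secrets_client = boto3.client("secretsmanager")

def get_api_key(secret_name: str) -> str:
    # Read the API key that was stored in Secrets Manager earlier.
    return secrets_client.get_secret_value(SecretId=secret_name)["SecretString"]

def tavily_ai_search(search_query: str, target_website: str = "") -> str:
    # Tavily accepts the query and the API key in a JSON body sent to its /search endpoint.
    payload = {"api_key": get_api_key("TAVILY_API_KEY"), "query": search_query}
    if target_website:
        payload["include_domains"] = [target_website]
    conn = http.client.HTTPSConnection("api.tavily.com")
    conn.request("POST", "/search", json.dumps(payload), {"Content-Type": "application/json"})
    return conn.getresponse().read().decode("utf-8")

def google_search(search_query: str, target_website: str = "") -> str:
    # Serper expects the API key in the X-API-KEY header and the query in the JSON body.
    query = f"site:{target_website} {search_query}" if target_website else search_query
    headers = {"X-API-KEY": get_api_key("SERPER_API_KEY"), "Content-Type": "application/json"}
    conn = http.client.HTTPSConnection("google.serper.dev")
    conn.request("POST", "/search", json.dumps({"q": query}), headers)
    return conn.getresponse().read().decode("utf-8")

Both helpers return the raw search response as a string, which the handler then wraps into the agent response body shown above.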

  3. Choose Deploy.

As part of the Quick create a new Lambda function option selected earlier, the agent builder configured the function with a resource-based policy that allows the Amazon Bedrock service principal to invoke it, so there is no need to update the IAM role used by the agent. However, the function needs permission to read the API keys saved in Secrets Manager.

  4. On the function details page, choose the Configuration tab, then choose Permissions.
  5. Choose the link for Role name to open the role on the IAM console.

Execution role

  6. Open the JSON view of the IAM policy under Policy name and choose Edit to edit the policy.

Permissions policies

  7. Add the following statement, which gives the Lambda function the required access to read the API keys from Secrets Manager. Adjust the Region as needed, and provide your AWS account ID.
{
  "Action": "secretsmanager:GetSecretValue",
  "Resource": [
    "arn:aws:secretsmanager:us-west-2:<account_id>:secret:SERPER_API_KEY*",
    "arn:aws:secretsmanager:<region_name>:<account_id>:secret:TAVILY_API_KEY*"
  ],
  "Effect": "Allow",
  "Sid": "GetSecretsManagerSecret"
}

Test the agent

You’re now ready to test the agent.

  1. On the Amazon Bedrock console, on the websearch-agent details page, choose Test.
  2. Choose Prepare to prepare the agent and test it with the latest changes.
  3. As test input, you can ask a question such as “What are the latest news from AWS?”

Test the agent

  4. To see the details of each step of the agent orchestration, including the reasoning steps, choose Show trace (shown expanded in the preceding screenshot).

This helps you understand the agent decisions and debug the agent configuration if the result isn’t as expected. We encourage you to investigate how the instructions for the agent and the tool instructions are handed to the agent by inspecting the traces of the agent.

In the next section, we walk through deploying the web search agent with the AWS CDK.

Build a web search Amazon Bedrock agent with the AWS CDK

Both AWS CloudFormation and AWS CDK support have been released for Amazon Bedrock Agents, so you can develop and deploy the preceding agent completely in code.

The AWS CDK example in this post uses Python. The following are the required steps to deploy this solution:

  1. Install the AWS CDK version 2.174.3 or later and set up your AWS CDK Python environment with Python 3.11 or later.
  2. Clone the GitHub repository and install the dependencies.
  3. Run AWS CDK bootstrapping on your AWS account.

The structure of the sample AWS CDK application repository is:

  • /app.py file – Contains the top-level definition of the AWS CDK app
  • /cdk folder – Contains the stack definition for the web search agent stack
  • /lambda folder – Contains the Lambda function runtime code that handles the calls to the Serper and Tavily AI APIs
  • /test folder – Contains a Python script to test the deployed agent

To create an Amazon Bedrock agent, the key resources required are:

  • An action group that defines the functions available to the agent
  • A Lambda function that implements these functions
  • The agent itself, which orchestrates the interactions between the FMs, functions, and user conversations

AWS CDK code to define an action group

The following Python code defines an action group as a Level 1 (L1) construct. L1 constructs, also known as AWS CloudFormation resources, are the lowest-level constructs available in the AWS CDK and offer no abstraction. Currently, the available Amazon Bedrock AWS CDK constructs are L1. With the action_group_executor parameter of AgentActionGroupProperty, you define the Lambda function containing the business logic that is carried out when the action is invoked.

action_group = bedrock.CfnAgent.AgentActionGroupProperty(
    action_group_name=f"{ACTION_GROUP_NAME}",
    description="Action that will trigger the lambda",
    action_group_executor=bedrock.CfnAgent.ActionGroupExecutorProperty(lambda_=lambda_function.function_arn),
    function_schema=bedrock.CfnAgent.FunctionSchemaProperty(
        functions=[
            bedrock.CfnAgent.FunctionProperty(
                name="tavily-ai-search",
                description="""
                    To retrieve information via the internet
                    or for topics that the LLM does not know about and
                    intense research is needed.
                """,
                parameters={
                    "search_query": bedrock.CfnAgent.ParameterDetailProperty(
                        type="string",
                        description="The search query for the Tavily web search.",
                        required=True,
                    )
                },
            ),
            bedrock.CfnAgent.FunctionProperty(
                name="google-search",
                description="For targeted news, like 'what are the latest news in Austria' or similar.",
                parameters={
                    "search_query": bedrock.CfnAgent.ParameterDetailProperty(
                        type="string",
                        description="The search query for the Google web search.",
                        required=True,
                    )
                },
            ),
        ],
    ),
)

After the Amazon Bedrock agent determines the API operation that it needs to invoke in an action group, it sends information alongside relevant metadata as an input event to the Lambda function.
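
For an action group defined with function details, the input event that the Lambda function receives has a shape similar to the following (a simplified, illustrative example; the values are placeholders):

{
    "messageVersion": "1.0",
    "agent": {"name": "websearch_agent", "id": "<agent_id>", "alias": "<agent_alias_id>", "version": "DRAFT"},
    "sessionId": "1234567890",
    "inputText": "What are the latest AWS news?",
    "actionGroup": "action-group-web-search",
    "function": "google-search",
    "parameters": [
        {"name": "search_query", "type": "string", "value": "latest AWS news"}
    ],
    "sessionAttributes": {},
    "promptSessionAttributes": {}
}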

The following code shows the Lambda handler function, which extracts the relevant metadata and populated fields from the request body to determine which function (Serper or Tavily AI) to call. The extracted parameter is search_query, as defined in the preceding action group function schema. The complete Lambda Python code is available in the GitHub repository.

def lambda_handler(event, _):  # type: ignore
    action_group = event["actionGroup"]
    function = event["function"]
    parameters = event.get("parameters", [])
    search_query, target_website = extract_search_params(action_group, function, parameters)
    search_results: str = ""
    if function == "tavily-ai-search":
        search_results = tavily_ai_search(search_query, target_website)
    elif function == "google-search":
        search_results = google_search(search_query, target_website)

Lastly, with the CfnAgent AWS CDK construct, specify an agent as a resource. The auto_prepare=True parameter creates a DRAFT version of the agent that can be used for testing.

  agent_instruction = """
      You are an agent that can handle various tasks as described below:
      1/ Helping users do research and finding up-to-date information. For up-to-date information, always
         use web search. Web search has two flavors:
         1a/ Google Search - this is great for looking up up-to-date information and current events
         1b/ Tavily AI Search - this is used to do deep research on topics your user is interested in. Not good for news because it does not order search results by date.
      2/ Retrieving knowledge from the vast knowledge bases that you are connected to.
  """

  agent = bedrock.CfnAgent(
      self,
      "WebSearchAgent",
      agent_name="websearch_agent",
      foundation_model="anthropic.claude-3-sonnet-20240229-v1:0",
      action_groups=[action_group],
      auto_prepare=True,
      instruction=agent_instruction,
      agent_resource_role_arn=agent_role.role_arn,
   )
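
The agent_role referenced in the preceding snippet is not shown. The following is a minimal sketch of how such a role could be defined in the same stack; the role name prefix and the scoped-down model permission are assumptions you should adapt to your own requirements:

from aws_cdk import aws_iam as iam

agent_role = iam.Role(
    self,
    "WebSearchAgentRole",
    # Service role that the Amazon Bedrock agent assumes.
    role_name="AmazonBedrockExecutionRoleForAgents_websearch",
    assumed_by=iam.ServicePrincipal("bedrock.amazonaws.com"),
)
agent_role.add_to_policy(
    iam.PolicyStatement(
        actions=["bedrock:InvokeModel"],
        resources=[
            f"arn:aws:bedrock:{self.region}::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
        ],
    )
)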

Deploy the AWS CDK application

Complete the following steps to deploy the agent using the AWS CDK:

  1. Clone the example AWS CDK code:
git clone https://github.com/aws-samples/websearch_agent
  2. Create a Python virtual environment, activate it, and install Python dependencies (make sure that you’re using Python 3.11 or later):
python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
  3. To deploy the agent AWS CDK example, run the cdk deploy command:
cdk deploy

When the AWS CDK deployment is finished, it will output values for agent_id and agent_alias_id:

Outputs:
WebSearchAgentStack.agentaliasid = <agent_alias_id>
WebSearchAgentStack.agentid = <agent_id>
WebSearchAgentStack.agentversion = DRAFT

For example:

WebSearchAgentStack.agentaliasid = XP3JHPEDMK
WebSearchAgentStack.agentid = WFRPT9IMBO
WebSearchAgentStack.agentversion = DRAFT

Make a note of the outputs; you need them to test the agent in the next step.

Test the agent

To test the deployed agent, use the Python script available in the test/ folder. You must be authenticated with an AWS account and have the AWS_REGION environment variable set. For details, see Configure the AWS CLI.
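
The script handles argument parsing and response streaming; its core is a call to the Amazon Bedrock Agents runtime API. The following is a minimal sketch of such a call using boto3 (simplified, without error handling):

import uuid

import boto3

def invoke_agent(agent_id: str, agent_alias_id: str, prompt: str) -> str:
    client = boto3.client("bedrock-agent-runtime")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),  # reuse the same session ID to continue a conversation
        inputText=prompt,
    )
    # The agent response arrives as an event stream of completion chunks.
    completion = ""
    for event in response["completion"]:
        chunk = event.get("chunk", {})
        completion += chunk.get("bytes", b"").decode("utf-8")
    return completion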

To run the script, you need the output values from the deployment and a question to pass in using the --prompt parameter:

python invoke-agent.py --agent_id <agent_id> --agent_alias_id <agent_alias_id> --prompt "What are the latest AWS news?"

For example, with the outputs we received from the preceding cdk deploy command, you would run the following:

python invoke-agent.py --agent_id WFRPT9IMBO --agent_alias_id XP3JHPEDMK --prompt "What are the latest AWS news?"

You would receive the following response (output is truncated for brevity):

Here are some of the latest major AWS news and announcements:
At the recent AWS Summit in New York, AWS announced several new services and capabilities across areas like generative AI, machine learning, databases, and more.
Amazon Q, AWS's generative AI assistant, has been integrated with Smartsheet to provide AI-powered assistance to employees. Amazon Q Developer has also reached general availability with new features for developers.
AWS plans to launch a new Region in Mexico called the AWS Mexico (Central) Region, which will be the second AWS Region in Mexico ....

Clean up

To delete the resources deployed with the agent AWS CDK example, run the following command:

cdk destroy

Use the following commands to delete the API keys created in Secrets Manager:

aws secretsmanager delete-secret --secret-id SERPER_API_KEY
aws secretsmanager delete-secret --secret-id TAVILY_API_KEY

Key considerations

Let’s dive into some key considerations when integrating web search into your AI systems.

API usage and cost management

When working with external APIs, it’s crucial to make sure that your rate limits and quotas don’t become bottlenecks for your workload. Regularly check and identify limiting factors in your system and validate that it can handle the load as it scales. This might involve implementing a robust monitoring system to track API usage, setting up alerts for when you’re approaching limits, and developing strategies to gracefully handle rate-limiting scenarios.

Additionally, carefully consider the cost implications of external APIs. The amount of content returned by these services directly translates into token usage in your language models, which can significantly impact your overall costs. Analyze the trade-offs between comprehensive search results and the associated token consumption to optimize your system’s efficiency and cost-effectiveness. Consider implementing caching mechanisms for frequently requested information to reduce API calls and associated costs.
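
As an illustration of the caching idea, a small time-bounded, in-memory cache can be wrapped around the search call in the Lambda function. This is a sketch only; for production workloads, a shared cache such as Amazon ElastiCache or Amazon DynamoDB keeps results consistent across function instances:

import time

_CACHE: dict = {}
_TTL_SECONDS = 300  # how long a cached search result stays valid

def cached_search(search_query: str, search_fn) -> str:
    # Return a cached result for repeated queries to reduce external API calls and token usage.
    now = time.time()
    hit = _CACHE.get(search_query)
    if hit and now - hit[0] < _TTL_SECONDS:
        return hit[1]
    result = search_fn(search_query)
    _CACHE[search_query] = (now, result)
    return result

# Example usage inside the handler:
# search_results = cached_search(search_query, google_search)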

Privacy and security considerations

It’s essential to thoroughly review the pricing and privacy agreements of your chosen web search provider. The agentic systems you’re building can potentially leak sensitive information to these providers through the search queries sent. To mitigate this risk, consider implementing data sanitization techniques to remove or mask sensitive information before it reaches the search provider. This becomes especially crucial when building or enhancing secure chatbots and internally facing systems—educating your users about these privacy considerations is therefore of utmost importance.

To add an extra layer of security, you can implement guardrails, such as those provided by Amazon Bedrock Guardrails, in the Lambda functions that call the web search. This additional safeguard can help protect against inadvertent information leakage to web search providers. These guardrails could include pattern matching to detect potential personally identifiable information (PII), allow and deny lists for certain types of queries, or AI-powered content classifiers to flag potentially sensitive information.
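
As a simple illustration of the pattern matching idea, the Lambda function can mask obvious PII patterns before the query leaves your environment. Amazon Bedrock Guardrails or a dedicated PII detection service provides far more robust coverage; the regular expressions below are only examples:

import re

_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def sanitize_query(search_query: str) -> str:
    # Mask email addresses and phone-number-like strings before calling the search API.
    masked = _EMAIL.sub("[EMAIL]", search_query)
    masked = _PHONE.sub("[PHONE]", masked)
    return masked

# Example usage inside the handler:
# search_results = google_search(sanitize_query(search_query))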

Localization and contextual search

When designing your web search agent, it’s crucial to consider that end-users are accustomed to the search experience provided by standard web browsers, especially on mobile devices. These browsers often supply additional context as part of a web search, significantly enhancing the relevance of results. Key aspects of localization and contextual search include language considerations, geolocation, search history and personalization, and time and date context. For language considerations, you can implement language detection to automatically identify the user’s preferred language or provide it through the agent’s session context.

Refer to Control agent session context for more details on how to provide session context in Amazon Bedrock Agents.

It’s important to support multilingual queries and results, using a model that supports your specific language needs. Geolocation is another critical factor; utilizing the user’s approximate location (with permission) can provide geographically relevant results. Search history and personalization can greatly enhance the user experience. Consider implementing a system (with user consent) to remember recent searches and use this context for result ranking. You can customize an Amazon Bedrock agent with the session state feature. Adding a user’s location attributes to the session state is a potential implementation option.

Additionally, allow users to set persistent preferences for result types, such as preferring videos over text articles. Time and date context is also vital; use the user’s local time zone for time-sensitive queries like “latest news on quarterly numbers of company XYZ, now,” and consider seasonal context for queries that might have different meanings depending on the time of year.

For instance, without providing such extra information, a query like “What is the current weather in Zurich?” could yield results for any Zurich globally, be it in Switzerland or various locations in the US. By incorporating these contextual elements, your search agent can distinguish that a user in Europe is likely asking about Zurich, Switzerland, whereas a user in Illinois might be interested in the weather at Lake Zurich.

To implement these features, consider creating a system that safely collects and utilizes relevant user context. However, always prioritize user privacy and provide clear opt-in mechanisms for data collection. Clearly communicate what data is being used and how it enhances the search experience. Offer users granular control over their data and the ability to opt out of personalized features. By carefully balancing these localization and contextual search elements, you can create a more intuitive and effective web search agent that provides highly relevant results while respecting user privacy.
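
As a sketch of the implementation idea, such context can be passed through the session state when invoking the agent. The attribute names below are illustrative; your agent instructions or Lambda function must reference them for the context to take effect:

import uuid

import boto3

client = boto3.client("bedrock-agent-runtime")
response = client.invoke_agent(
    agentId="<agent_id>",
    agentAliasId="<agent_alias_id>",
    sessionId=str(uuid.uuid4()),
    inputText="What is the current weather in Zurich?",
    sessionState={
        "sessionAttributes": {
            # Illustrative attribute names supplied by your application.
            "userTimezone": "Europe/Zurich",
            "userCountry": "CH",
        },
        "promptSessionAttributes": {
            "currentDate": "2025-01-15",
        },
    },
)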

Performance optimization and testing

Performance optimization and testing are critical aspects of building a robust web search agent. Implement comprehensive latency testing to measure response times for various query types and content lengths across different geographical regions. Conduct load testing to simulate concurrent users and identify system limits if applicable to your application. Optimize your Lambda functions for cold starts and runtime, and consider using Amazon CloudFront to reduce latency for global users. Implement error handling and resilience measures, including fallback mechanisms and retry logic. Set up Amazon CloudWatch alarms for key metrics such as API latency and error rates to enable proactive monitoring and quick response to performance issues.

To test the solution end to end, create a dataset of questions and correct answers to test if changes to your system improve or deteriorate the information retrieval capabilities of your app.

Migration strategies

For organizations considering a migration from open source frameworks like LangChain to Amazon Bedrock Agents, it’s important to approach the transition strategically. Begin by mapping your current ReAct agent’s logic to the Amazon Bedrock agents’ action groups and Lambda functions. Identify any gaps in functionality and plan for alternative solutions or custom development where necessary. Adapt your existing API calls to work with the Amazon Bedrock API and update authentication methods to use IAM roles and policies.

Develop comprehensive test suites to make sure functionalities are correctly replicated in the new environment. One significant advantage of Amazon Bedrock agents is the ability to implement a gradual rollout. By using the agent alias ID, you can quickly direct traffic between different versions of your agent, allowing for a smooth and controlled migration process. This approach enables you to test and validate your new implementation with a subset of users or queries before fully transitioning your entire system.

By carefully balancing these considerations—from API usage and costs to privacy concerns, localization, performance optimization, and migration strategies—you can create a more intelligent, efficient, and user-friendly search experience that respects individual preferences and data protection regulations. As you build and refine your web search agent with Amazon Bedrock, keep these factors in mind to provide a robust, scalable, and responsible AI system.

Expanding the solution

With this post, you’ve taken the first step towards revolutionizing your applications with Amazon Bedrock Agents and the power of agentic workflows with LLMs. You’ve not only learned how to integrate dynamic web content, but also gained insights into the intricate relationship between AI agents and external information sources.

Transitioning your existing systems to Amazon Bedrock agents is a seamless process, and with the AWS CDK, you can manage your agentic AI infrastructure as code, providing scalability, reliability, and maintainability. This approach not only streamlines your development process, but also paves the way for more sophisticated AI-driven applications that can adapt and grow with your business needs.

Expand your horizons and unlock even more capabilities:

  • Connect to an Amazon Bedrock knowledge base – Augment your agents’ knowledge by integrating them with a centralized knowledge repository, enabling your AI to draw upon a vast, curated pool of information tailored to your specific domain.
  • Embrace streaming – Use the power of streaming responses to provide an enhanced user experience and foster a more natural and interactive conversation flow, mimicking the real-time nature of human dialogue and keeping users engaged throughout the interaction.
  • Expose ReAct prompting and tool use – Parse the streaming output on your frontend to visualize the agent’s reasoning process and tool usage, providing invaluable transparency and interpretability for your users, building trust, and allowing users to understand and verify the AI’s decision-making process.
  • Utilize memory for Amazon Bedrock Agents – Amazon Bedrock agents can retain a summary of their conversations with each user and are able to provide a smooth, adaptive experience if enabled. This allows you to give extra context for tasks like web search and topics of interest, creating a more personalized and contextually aware interaction over time.
  • Give extra context – As outlined earlier, context matters. Try to implement additional user context through the session attributes that you can provide through the session state. Refer to Control agent session context for the technical implementations, and consider how this context can be used responsibly to enhance the relevance and accuracy of your agent’s responses.
  • Add agentic web research – Agents allow you to build very sophisticated workflows. Our system is not limited to a simple web search. The Lambda function can also serve as an environment to implement an agentic web research with multi-agent collaboration, enabling more comprehensive and nuanced information gathering and analysis.

What other tools would you use to complement your agent? Refer to the aws-samples GitHub repo for Amazon Bedrock Agents to see what others have built and consider how these tools might be integrated into your own unique AI solutions.

Conclusion

The future of generative AI is here, and Amazon Bedrock Agents is your gateway to unlocking its full potential. Embrace the power of agentic LLMs and experience the transformative impact they can have on your applications and user experiences. As you embark on this journey, remember that the true power of AI lies not just in its capabilities, but in how we thoughtfully and responsibly integrate it into our systems to solve real-world problems and enhance human experiences.

If you would like us to follow up with a second post tackling any points discussed here, feel free to leave a comment. Your engagement helps shape the direction of our content and makes sure we’re addressing the topics that matter most to you and the broader AI community.

In this post, you have seen the steps needed to integrate dynamic web content and harness the full potential of generative AI, but don’t stop here.


About the Authors

Philipp Kaindl is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. With a background in data science and mechanical engineering, his focus is on empowering customers to create lasting business impact with the help of AI. Connect with Philipp on LinkedIn.

Markus Rollwagen is a Senior Solutions Architect at AWS, based in Switzerland. He enjoys deep dive technical discussions, while keeping an eye on the big picture and the customer goals. With a software engineering background, he embraces infrastructure as code and is passionate about all things security. Connect with Markus on LinkedIn.

Read More

Build a generative AI assistant to enhance employee experience using Amazon Q Business

In today’s fast-paced business environment, organizations are constantly seeking innovative ways to enhance employee experience and productivity. There are many challenges that can impact employee productivity, such as cumbersome search experiences or difficulty finding specific information across an organization’s vast knowledge bases. Additionally, with the rise of remote and hybrid work models, traditional support systems such as IT Helpdesks and HR might struggle to keep up with the increased demand for assistance. Productivity loss because of these challenges can lead to lengthy onboarding times for new employees, extended task completion times, and increased call volumes for undifferentiated IT and HR support, to name a few.

Amazon Q Business is a fully managed, generative artificial intelligence (AI)-powered assistant that can address the challenges mentioned above by providing 24/7 support tailored to individual needs. It can handle a wide range of tasks such as answering questions, providing summaries, generating content, and completing tasks based on data in your organization. Additionally, Amazon Q Business offers enterprise-grade data security and privacy and has built-in guardrails that are configurable by an admin. Customers like Deriv were able to reduce new employee onboarding time by up to 45% and overall recruiting efforts by as much as 50% by making generative AI available to all of their employees in a safe way.

In this blog post, we will talk about Amazon Q Business use cases, walk through an example application, and discuss approaches for measuring productivity gains.

Use cases overview

Some key use cases for Amazon Q Business for organizations include:

  • Providing grounded responses to employees: An organization can deploy Amazon Q Business on their internal data, documents, products, and services. This allows Amazon Q Business to understand the business context and provide tailored assistance to employees on common questions, tasks, and issues.
  • Improving employee experience: By deploying Amazon Q Business across various environments like websites, apps, and chatbots, organizations can provide unified, engaging and personalized experiences. Employees will have a consistent experience wherever they choose to interact with the generative AI assistant.
  • Knowledge management: Amazon Q Business helps organizations use their institutional knowledge more effectively. It can be integrated with internal knowledge bases, manuals, best practices, and more, to provide a centralized source of information to employees.
  • Project management and issue tracking: With Amazon Q Business plugins, users can use natural language to open tickets without leaving the chat interface. Previously resolved tickets can also be used to help reduce overall ticket volumes and get employees the information they need faster to resolve an issue.

Amazon Q Business features

The Amazon Q Business-powered chatbot aims to provide comprehensive support to users with a multifaceted approach. It offers multiple data source connectors that help you connect to your data sources and create your generative AI solution with minimal configuration. Amazon Q Business supports over 40 connectors at the time of writing. Additionally, Amazon Q Business supports plugins to enable users to take action from within the conversation. There are four native plugins offered, and a custom plugin option to integrate with any third-party application.

Using the Business User Store feature, users see chat responses generated only from the documents that they have access to within an Amazon Q Business application. You can also customize your application environment to your organizational needs by using application environment guardrails or chat controls such as global controls and topic-level controls that you can configure to manage the user chat experience.

Features like document enrichment and relevance tuning together play a key role in further customizing and enhancing your applications. The document enrichment feature helps you control both what documents and document attributes are ingested into your index and also how they’re ingested. Using document enrichment, you can create, modify, or delete document attributes and document content when you ingest them into your Amazon Q Business index. You can then assign weights to document attributes after mapping them to index fields using the relevance tuning feature. You can use these assigned weights to fine-tune the underlying ranking of Retrieval-Augmented Generation (RAG)-retrieved passages within your application environment to optimize the relevance of chat responses.

Amazon Q Business offers robust security features to protect customer data and promote responsible use of the AI assistant. It uses pre-trained machine learning models and does not use customer data to train or improve the models. The service supports encryption at rest and in transit, and administrators can configure various security controls such as restricting responses to enterprise content only, specifying blocked words or phrases, and defining special topics with customized guardrails. Additionally, Amazon Q Business uses the security capabilities of Amazon Bedrock, the underlying AWS service, to enforce safety, security, and responsible use of AI.

Sample application architecture

The following figure shows a sample application architecture.

Sample Architecture Diagram

Application architecture walkthrough

Before you begin to create an Amazon Q Business application environment, make sure that you complete the setting up tasks and review the Before you begin section. This includes tasks like setting up required AWS Identity and Access Management (IAM) roles and enabling and pre-configuring an AWS IAM Identity Center instance.

As the next step towards creating a generative AI assistant, you can create the Amazon Q Business web experience. The web experience can be created using either the AWS Management Console or the Amazon Q Business APIs.

After creating your Amazon Q Business application environment, you create and select the retriever and provision the index that will power your generative AI web experience. The retriever pulls data from the index in real time during a conversation. After you select a retriever for your Amazon Q Business application environment, you connect data sources to it.

This sample application connects to repositories like Amazon Simple Storage Service (Amazon S3) and SharePoint, and to public facing websites or internal company websites using Amazon Q Web Crawler. The application also integrates with service and project management tools such as ServiceNow and Jira and enterprise communication tools such as Slack and Microsoft Teams. The application uses built-in plugins for Jira and ServiceNow to enable users to perform specific tasks related to supported third-party services from within their web experience chat, such as creating a Jira ticket or opening an incident in ServiceNow.

After the data sources are configured, data is integrated and synchronized into container indexes that are maintained by the Amazon Q Business service. Authorized users interact with the application environment through the web experience URL after successfully authenticating. You could also use Amazon Q Business APIs to build a custom UI to implement special features such as handling feedback, using company brand colors and templates, and using a custom sign-in. It also enables conversing with Amazon Q through an interface personalized to your use case.
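
For example, a custom UI backend can call the Amazon Q Business ChatSync API through the AWS SDK for Python. The following is a rough sketch; the application ID is a placeholder, and the exact identity-related parameters depend on how your application environment and identity source are configured:

import boto3

qbusiness = boto3.client("qbusiness")

response = qbusiness.chat_sync(
    applicationId="<application_id>",  # placeholder: your Amazon Q Business application ID
    userMessage="How do I reach IT support?",
    # Pass conversationId and parentMessageId to continue an existing conversation.
)

print(response["systemMessage"])
for source in response.get("sourceAttributions", []):
    print("Source:", source.get("title"), source.get("url"))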

Application demo

Here are a few screenshots demonstrating an AI assistant application using Amazon Q Business. These screenshots illustrate a scenario where an employee interacts with the Amazon Q Business chatbot to get summaries, address common queries related to IT support, and open tickets or incidents using IT service management (ITSM) tools such as ServiceNow.

  1. Employee A interacts with the application to get help when wireless access was down and receives suggested actions to take:
    Screenshot showing employee interacting with the application to get help when wireless access was down
  2. Employee B interacts with the application to report an incident of wireless access down and receives a form to fill out to create a ticket:
    Screenshot showing employee interacting with the form presented by the application to create an incident in ServiceNow
    Screenshot showing the created incident in the application
    An incident is created in ServiceNow based on Employee B’s interaction:
    Screenshot of the created incident in ServiceNow
  3. A new employee in the organization interacts with the application to ask several questions about company policies and receives reliable answers:
    Screenshot showing employee interacting with the application to ask several questions about company policies
  4. A new employee in the organization asks the application how to reach IT support and receives detailed IT support contact information:
    Screenshot showing employee interacting with the application on how to reach IT support

Approaches for measuring productivity gains

There are several approaches to measure productivity gains achieved by using a generative AI assistant. Here are some common metrics and methods:

Average search time reduction: Measure the time employees spend searching for information or solutions before and after implementing the AI assistant. A reduction in average search time indicates faster access to information, which can lead to shorter task completion times and improved efficiency.

    • Units: Percentage reduction in search time or absolute time saved (for example, hours or minutes)
    • Example: 40% reduction in average search time or 1 hour saved per employee per day

Task completion time: Measure the time taken to complete specific tasks or processes with and without the AI assistant. Shorter completion times suggest productivity gains.

    • Units: Percentage reduction in task completion time or absolute time saved (for example, hours or minutes)
    • Example: 30% reduction in task completion time or 2 hours saved per task

Recurring issues: Monitor the number of tickets raised for recurring issues and issues related to tasks or processes that the AI assistant can handle. A decrease in these tickets indicates improved productivity and reduced workload for employees.

    • Units: Percentage reduction in recurring issue frequency or absolute reduction in occurrences
    • Example: 40% reduction in the frequency of recurring issue X or 50 fewer occurrences per quarter

Overall ticket volume: Track the total number of tickets or issues raised related to tasks or processes that the AI assistant can handle.

    • Units: Percentage reduction in ticket volume or absolute number of tickets reduced
    • Example: 30% reduction in relevant ticket volume or 200 fewer tickets per month

Employee onboarding duration: Evaluate the time required for new employees to become fully productive with and without the AI assistant. Shorter onboarding times can indicate that the AI assistant is providing effective support, which translates to cost savings and faster time-to-productivity.

    • Units: Percentage reduction in onboarding time or absolute time saved (for example, days or weeks)
    • Example: 20% reduction in onboarding duration or 2 weeks saved per new employee

Employee productivity metrics: Track metrics such as output per employee or output quality before and after implementing the AI assistant. Improvements in these metrics can indicate productivity gains.

    • Units: Percentage improvement in output quality or reduction in rework or corrections
    • Example: 15% improvement in output quality or 30% reduction in rework required

Cost savings: Calculate the cost savings achieved through reduced labor hours, improved efficiency, and faster turnaround times enabled by the AI assistant.

    • Units: Monetary value (for example, dollars or euros) saved
    • Example: $100,000 in cost savings due to increased productivity

Knowledge base utilization: Measure the increase in utilization or effectiveness of knowledge bases or self-service resources because of the AI assistant’s ability to surface relevant information.

    • Units: Percentage increase in knowledge base utilization
    • Example: 20% increase in knowledge base utilization

Employee satisfaction surveys: Gather feedback from employees on their perceived productivity gains, time savings, and overall satisfaction with the AI assistant. Positive feedback can lead to increased retention, better performance, and a more positive work environment.

    • Units: Employee satisfaction score or percentage of employees reporting positive impact
    • Example: 80% of employees report increased productivity and satisfaction with the AI assistant

It’s important to establish baseline measurements before introducing the AI assistant and then consistently track the relevant metrics over time. Additionally, conducting controlled experiments or pilot programs can help isolate the impact of the AI assistant from other factors affecting productivity.

Conclusion

In this blog post, we explored how you can use Amazon Q Business to build generative AI assistants that enhance employee experience and boost productivity. By seamlessly integrating with internal data sources, knowledge bases, and productivity tools, Amazon Q Business equips your workforce with instant access to information, automated tasks, and personalized support. Using its robust capabilities, including multi-source connectors, document enrichment, relevance tuning, and enterprise-grade security, you can create tailored AI solutions that streamline workflows, optimize processes, and drive tangible gains in areas like task completion times, issue resolution, onboarding efficiency, and cost savings.

Unlock the transformative potential of Amazon Q Business and future-proof your organization—contact your AWS account team today.

Read more about Amazon Q


About the Authors

Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about Monitoring and Observability, Cloud Financial Management, and Generative Artificial Intelligence (Gen-AI) domains. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.

Krishna Pramod is a Senior Solutions Architect at AWS. He works as a trusted advisor for customers, helping customers innovate and build well-architected applications in AWS cloud. Outside of work, Krishna enjoys reading, music and traveling.

Tim McLaughlin is a Senior Product Manager for Amazon Q Business at Amazon Web Services (AWS). He is passionate about helping customers adopt generative AI services to meet evolving business challenges. Outside of work, Tim enjoys spending time with his family, hiking, and watching sports.

Read More

Introducing document-level sync reports: Enhanced data sync visibility in Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer.

Amazon Kendra securely connects to over 40 data sources. When using your data source, you might want better visibility into the document processing lifecycle during data source sync jobs. This visibility includes knowing the status of each document you attempted to crawl and index, as well as being able to troubleshoot why certain documents were not returned with the expected answers. Additionally, you might need access to metadata, timestamps, and access control lists (ACLs) for the indexed documents.

We are pleased to announce a new feature now available in Amazon Kendra that significantly improves visibility into data source sync operations. The latest release introduces a comprehensive document-level report incorporated into the sync history, providing administrators with granular indexing status, metadata, and ACL details for every document processed during a data source sync job. This enhancement to sync job observability enables administrators to quickly investigate and resolve ingestion or access issues encountered while setting up Amazon Kendra indexes. The detailed document reports are persisted in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Kendra index log group, so critical sync job details are available on-demand when troubleshooting.

In this post, we discuss the benefits of this new feature and how it offers enhanced data sync visibility in Amazon Kendra.

Lifecycle of a document in a data source sync run job

In this section, we examine the lifecycle of a document within a data source sync in Amazon Kendra. This provides valuable insight into the sync process. The data source sync comprises three key stages: crawling, syncing, and indexing. Crawling involves the connector connecting to the data source and extracting documents meeting the defined sync scope according to the data source configuration. These documents are then synced to the Amazon Kendra index during the syncing phase. Finally, indexing makes the synced documents searchable within the Amazon Kendra environment.

The following diagram shows a flowchart of a sync run job.

Crawling stage

The first stage is the crawling stage, where the connector crawls all documents and their metadata from the data source. During this stage, the connector also compares the checksum of the document against the Amazon Kendra index to determine if a particular document needs to be added, modified, or deleted from the index. This operation corresponds to the CrawlAction field in the sync run history report.

If the document is unmodified, it’s marked as UNMODIFIED and skipped in the rest of the stages. If any document fails in the crawling stage, for example due to throttling errors, broken content, or if the document size is too big, that document is marked in the sync run history report with the CrawlStatus as FAILED. If the document was skipped due to any validation errors, its CrawlStatus is marked as SKIPPED. These documents are not sent to the next stage. All successful documents are marked as SUCCESS and are sent forward.

We also capture the ACLs and metadata on each document in this stage to be able to add it to the sync run history report.

Syncing stage

During the syncing stage, the document is sent to Amazon Kendra ingestion service APIs like BatchPutDocument and BatchDeleteDocument. After a document is submitted to these APIs, Amazon Kendra runs validation checks on the submitted documents. If any document fails these checks, its SyncStatus is marked as FAILED. If there is an irrecoverable error for a particular document, it is marked as SKIPPED and other documents are sent forward.

Indexing stage

In this step, Amazon Kendra parses the document, processes it according to its content type, and persists it in the index. If the document fails to be persisted, its IndexStatus is marked as FAILED; otherwise, it is marked as SUCCESS.

After the statuses of all the stages have been captured, we emit these statuses as an Amazon CloudWatch event to the customer’s AWS account.
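
Because these per-document statuses are delivered as CloudWatch log events, you can also read them programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the log group name is a placeholder for your Amazon Kendra index log group.

import boto3

logs = boto3.client("logs")

# Placeholder: the CloudWatch log group of your Amazon Kendra index
LOG_GROUP = "<your-kendra-index-log-group>"

# Page through the document-level report events emitted during sync runs
paginator = logs.get_paginator("filter_log_events")
for page in paginator.paginate(
    logGroupName=LOG_GROUP,
    logStreamNamePrefix="SYNC_RUN_HISTORY_REPORT/",
):
    for event in page["events"]:
        print(event["message"])  # each message is one document-level report entry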

Key features and benefits of document-level reports

The following are the key features and benefits of the new document-level report in Amazon Kendra indexes:

  • Enhanced sync run history page – A new Actions column has been added to the sync run history page, providing access to the document-level report for each sync run.

  • Dedicated log stream – A new log stream named SYNC_RUN_HISTORY_REPORT has been created in the Amazon Kendra CloudWatch log group, containing the document-level report.

  • Comprehensive document information – The document-level report includes the following information for each document (an illustrative example entry follows this list):
    • Document ID – The document ID, inherited directly from the data source or mapped by the customer in the data source field mappings.
    • Document title – The title of the document, taken from the data source or mapped by the customer in the data source field mappings.
    • Consolidated document status – The final consolidated status of the document: SUCCESS, FAILED, or SKIPPED. If the document was successfully processed in all stages, the value is SUCCESS. If the document failed or was skipped in any stage, the value is FAILED or SKIPPED, respectively.
    • Error message (if the document failed) – The error message with which the document failed. If a document was skipped due to throttling errors or any internal errors, that is also shown in the error message field.
    • Crawl status – Whether the document was crawled successfully from the data source. This status correlates to the syncing-crawling state in the data source sync.
    • Sync status – Whether the document was sent for syncing successfully. This correlates to the syncing-indexing state in the data source sync.
    • Index status – Whether the document was successfully persisted in the index.
    • ACLs – A list of document-level permissions that were crawled from the data source. Each element in the list contains the following details:
      • Global name – The email or user name of the user. This field is mapped across multiple data sources. For example, if a user has three data sources (Confluence, SharePoint, and Gmail) with the local user IDs confluence_user, sharepoint_user, and gmail_user, respectively, and the email address user@email.com is the globalName in the ACL for all of them, then Amazon Kendra understands that all of these local user IDs map to the same global name.
      • Name – The local unique ID of the user, assigned by the data source.
      • Type – The principal type, either USER or GROUP.
      • Is Federated – A Boolean flag that indicates whether the group is defined at the INDEX level (true) or the DATASOURCE level (false).
      • Access – Whether the user is explicitly allowed or denied access. Values can be either ALLOWED or DENIED.
      • Data source ID – The data source ID. For federated (INDEX level) groups, this field is null.
    • Metadata – The metadata fields (other than ACL) that were pulled from the data source. This list also includes the metadata fields mapped by the customer in the data source field mappings as well as extra metadata fields added by the connector.
    • Hashed document ID (for troubleshooting assistance) – To safeguard your data privacy, the report includes a secure, one-way hash of the document identifier. This value enables the Amazon Kendra team to efficiently locate and analyze the specific document within the service logs, should you encounter an issue that requires further investigation and resolution.
    • Timestamp – The timestamp indicates when the document status was logged in CloudWatch.
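
To make this list concrete, the following is an illustrative sketch, written as a Python dictionary, of what a single document-level report entry might contain. It is not an exact schema: the field names that also appear in the queries later in this post (DocumentId, DocumentTitle, SourceUri, ConnectorDocumentStatus, ErrorMsg, Acl, Metadata) are real, whereas the remaining names, casing, and sample values are assumptions for illustration only.

# Illustrative example only - not the exact log schema
report_entry = {
    "DocumentId": "doc-12345",                         # ID inherited from the data source
    "DocumentTitle": "Machine Learning Whitepaper",
    "SourceUri": "https://example.com/ml-whitepaper",
    "ConnectorDocumentStatus": {"Status": "SUCCESS"},  # consolidated status: SUCCESS, FAILED, or SKIPPED
    "ErrorMsg": None,                                  # populated when a document fails
    "Acl": [
        {
            "globalName": "user@email.com",            # assumption: key names and casing may differ
            "name": "confluence_user",
            "type": "USER",
            "isFederated": False,
            "access": "ALLOWED",
            "dataSourceId": "ds-abc123",
        }
    ],
    "Metadata": [
        {"key": "_last_updated_at", "value": {"dateValue": "2024-08-26T18:34:00Z"}}
    ],
}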

In the following sections, we explore different use cases for the logging feature.

Determine the optimal boosting duration for recent documents using document-level reporting

When it comes to generating accurate answers, you may want to fine-tune the way Amazon Kendra prioritizes its content. For instance, you may prefer to boost recent documents over older ones to make sure the most up-to-date passages are used to generate an answer. To achieve this, you can use the relevance tuning feature in Amazon Kendra to boost documents based on the last update date attribute, with a specified boosting duration. However, determining the optimal boosting period can be challenging when dealing with a large number of frequently changing documents.

You can now use the document-level report to obtain the _last_updated_at metadata field information for your documents, which can help you determine the appropriate boosting period. For this, you can use the following CloudWatch Logs Insights query to retrieve the _last_updated_at metadata attribute for machine learning documents from the SYNC_RUN_HISTORY_REPORT log stream:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Metadata like 'Machine Learning'
| parse Metadata '{"key":"_last_updated_at","value":{"dateValue":"*"}}' as @last_updated_at
| sort @last_updated_at desc, @timestamp desc
| dedup DocumentTitle

With the preceding query, you can gain insights into the last updated timestamps of your documents, enabling you to make informed decisions about the optimal boosting period. This approach makes sure your chat responses are generated using the most recent and relevant information, enhancing the overall accuracy and effectiveness of your Amazon Kendra implementation.
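
If you want to run this query programmatically, for example to feed the results into a relevance tuning workflow, the following is a minimal sketch using the CloudWatch Logs Insights APIs in the AWS SDK for Python (Boto3). The log group name is a placeholder for your Amazon Kendra index log group, and the one-week time window is an arbitrary choice for the example.

import time
import boto3

logs = boto3.client("logs")

QUERY = """
filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Metadata like 'Machine Learning'
| parse Metadata '{"key":"_last_updated_at","value":{"dateValue":"*"}}' as @last_updated_at
| sort @last_updated_at desc, @timestamp desc
| dedup DocumentTitle
"""

# Placeholder: the CloudWatch log group of your Amazon Kendra index
response = logs.start_query(
    logGroupName="<your-kendra-index-log-group>",
    startTime=int(time.time()) - 7 * 24 * 3600,  # past week
    endTime=int(time.time()),
    queryString=QUERY,
)

# Poll until the query finishes, then print each result row as a dictionary
while True:
    results = logs.get_query_results(queryId=response["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in results.get("results", []):
    print({field["field"]: field["value"] for field in row})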

The following screenshot shows an example result.

Common document indexing observability and troubleshooting methods

In this section, we explore some common admin tasks for observing and troubleshooting document indexing using the new document-level reporting feature.

List all successfully indexed documents from a data source

To retrieve a list of all documents that have been successfully indexed from a specific data source, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort @timestamp desc | dedup DocumentTitle, DocumentId

The following screenshot shows an example result.

List all successfully indexed documents from a data source sync job

To retrieve a list of all documents that have been successfully indexed during a specific sync job, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort DocumentTitle

The following screenshot shows an example result.

List all failed indexed documents from a data source sync job

To retrieve a list of all documents that failed to index during a specific sync job, along with the error messages, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "FAILED"
| sort @timestamp desc

The following screenshot shows an example result.

List all documents that contain a user’s ACL permission from an Amazon Kendra index

To retrieve a list of documents that have a specific user's ACL permission, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Acl like 'aneesh@mydemoaws.onmicrosoft.com'
| display DocumentTitle, SourceUri

The following screenshot shows an example result.

List the ACL of an indexed document from a data source sync job

To retrieve the ACL information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Acl

The following screenshot shows an example result.

List metadata of an indexed document from a data source sync job

To retrieve the metadata information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Metadata

The following screenshot shows an example result.

Conclusion

The newly introduced document-level report in Amazon Kendra provides enhanced visibility and observability into the document processing lifecycle during data source sync jobs. This feature addresses a critical need expressed by customers for better troubleshooting capabilities and access to detailed information about the indexing status, metadata, and ACLs of individual documents.

The document-level report is stored in a log stream named SYNC_RUN_HISTORY_REPORT within the Amazon Kendra index CloudWatch log group. This report contains comprehensive information for each document, including the document ID, title, overall sync status, and error messages (if any), along with the ACLs and metadata retrieved from the data sources. The data source sync run history page now includes an Actions column, providing access to the document-level report for each sync run. This feature significantly improves the ability to troubleshoot issues related to document ingestion, access control, and metadata relevance, and provides better visibility into the documents synced with an Amazon Kendra index.

To get started with Amazon Kendra, explore the Getting started guide. To learn more about data source connectors and best practices, see Creating a data source connector.


About the Authors

Aneesh Mohan is a Senior Solutions Architect at Amazon Web Services (AWS), with over 20 years of experience in architecting and delivering high-impact solutions for mission-critical workloads. His expertise spans across the financial services industry, AI/ML, security, and data technologies. Driven by a deep passion for technology, Aneesh is dedicated to partnering with customers to design and implement well-architected, innovative solutions that address their unique business needs.

Ashwin Shukla is a Software Development Engineer II on the Amazon Q for Business and Amazon Kendra engineering team, with 6 years of experience in developing enterprise software. In this role, he works on designing and developing foundational features for Amazon Q for Business.

Read More

Medical Centers Tap AI, Federated Learning for Better Cancer Detection

A committee of experts from top U.S. medical centers and research institutes is harnessing NVIDIA-powered federated learning to evaluate the impact of federated learning and AI-assisted annotation to train AI models for tumor segmentation.

Federated learning is a technique for developing more accurate, generalizable AI models trained on data across diverse data sources without compromising data security or privacy. It allows several organizations to collaborate on the development of an AI model without sensitive data ever leaving their servers.

“Due to privacy and data management constraints, it’s growing more and more complicated to share data from site to site and aggregate it in one place — and imaging AI is developing faster than research institutes can set up data-sharing contracts,” said John Garrett, associate professor of radiology at the University of Wisconsin–Madison. “Adopting federated learning to build and test models at multiple sites at once is the only way, practically speaking, to keep up. It’s an indispensable tool.”

Garrett is part of the Society for Imaging Informatics and Medicine (SIIM) Machine Learning Tools and Research Subcommittee, a group of clinicians, researchers and engineers that aims to advance the development and application of AI for medical imaging. NVIDIA is a member of SIIM, and has been collaborating with the committee on federated learning projects since 2019.

“Federated learning techniques allow enhanced data privacy and security in compliance with privacy regulations like GDPR, HIPAA and others,” said committee chair Khaled Younis. “In addition, we see improved model accuracy and generalization.”

To support their latest project, the team — including collaborators from Case Western, Georgetown University, the Mayo Clinic, the University of California, San Diego, the University of Florida and Vanderbilt University — tapped NVIDIA FLARE (NVFlare), an open-source framework that includes robust security features, advanced privacy protection techniques and a flexible system architecture.

Through the NVIDIA Academic Grant Program, the committee received four NVIDIA RTX A5000 GPUs, which were distributed across participating research institutes to set up their workstations for federated learning. Additional collaborators used NVIDIA GPUs in the cloud and in on-premises servers, highlighting the flexibility of NVFlare.

Cracking the Code for Federated Learning

Each of the six participating medical centers provided data from around 50 medical imaging studies for the project, which focused on renal cell carcinoma, a kind of kidney cancer.

“The idea with federated learning is that during training we exchange the model rather than exchange the data,” said Yuankai Huo, assistant professor of computer science and director of the Biomedical Data Representation and Learning Lab at Vanderbilt University.

In a federated learning framework, an initial global model broadcasts model parameters to client servers. Each server uses those parameters to set up a local version of the model that’s trained on the organization’s proprietary data. Then, updated parameters from each of the local models are sent back to the global model, where they’re aggregated to produce a new global model. The cycle repeats until the model’s predictions no longer improve with each training round.
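
As a conceptual illustration of the aggregation step described above, and not the project's actual NVFlare code, the following minimal sketch averages locally updated parameters into a new global model. Production frameworks add details such as weighting by local dataset size, secure aggregation, and communication handling.

import numpy as np

def federated_average(client_weights):
    """Average per-layer parameters from several clients (plain FedAvg sketch)."""
    # client_weights: list of dicts mapping layer name -> NumPy array
    layers = client_weights[0].keys()
    return {
        layer: np.mean([weights[layer] for weights in client_weights], axis=0)
        for layer in layers
    }

# Toy round: three sites return locally trained parameters for one layer
site_updates = [{"conv1.weight": np.random.randn(8, 1, 3, 3)} for _ in range(3)]
new_global = federated_average(site_updates)
print(new_global["conv1.weight"].shape)  # aggregated parameters, broadcast in the next round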

The group experimented with model architectures and hyperparameters to optimize for training speed, accuracy and the number of imaging studies required to train the model to the desired level of precision.

AI-Assisted Annotation With NVIDIA MONAI 

In the first phase of the project, the training data used for the model was labeled manually. For the next phase, the team is using NVIDIA MONAI for AI-assisted annotation to evaluate how model performance differs with training data segmented with the help of AI compared to traditional annotation methods.

“The biggest struggle with federated learning activities is typically that the data at different sites is not tremendously uniform. People use different imaging equipment, have different protocols and just label their data differently,” said Garrett. “By training the federated learning model a second time with the addition of MONAI, we aim to find if that improves overall annotation accuracy.”

The team is using MONAI Label, an image-labeling tool that enables users to develop custom AI annotation apps, reducing the time and effort needed to create new datasets. Experts will validate and refine the AI-generated segmentations before they’re used for model training.

Data for both the manual and AI-assisted annotation phases is hosted on Flywheel, a leading medical imaging data and AI platform that has integrated NVIDIA MONAI into its offerings.

Once the project is complete, the team plans to publish their methodology, annotated datasets and pretrained model to support future work.

“We’re interested in not just exploring these tools,” Garrett said, “but also publishing our work so others can learn and use these tools throughout the medical field.”

Apply for an NVIDIA Academic Grant

The NVIDIA Academic Grant Program advances academic research by providing world-class computing access and resources to researchers. Applications are now open for full-time faculty members at accredited academic institutions who are using NVIDIA technology to accelerate projects in simulation and modeling, generative AI and large language models.

Future application cycles will focus on projects in data science, graphics and vision, and edge AI — including federated learning.

Read More

Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination



Figure: Sample language model responses to different varieties of English and native speaker reactions.

ChatGPT does amazingly well at communicating with people in English. But whose English?

Only 15% of ChatGPT users are from the US, where Standard American English is the default. But the model is also commonly used in countries and communities where people speak other varieties of English. Over 1 billion people around the world speak varieties such as Indian English, Nigerian English, Irish English, and African-American English.

Speakers of these non-“standard” varieties often face discrimination in the real world. They’ve been told that the way they speak is unprofessional or incorrect, discredited as witnesses, and denied housing–despite extensive research indicating that all language varieties are equally complex and legitimate. Discriminating against the way someone speaks is often a proxy for discriminating against their race, ethnicity, or nationality. What if ChatGPT exacerbates this discrimination?

To answer this question, our recent paper examines how ChatGPT’s behavior changes in response to text in different varieties of English. We found that ChatGPT responses exhibit consistent and pervasive biases against non-“standard” varieties, including increased stereotyping and demeaning content, poorer comprehension, and condescending responses.

‘We’ve Fused Signal Processing and AI’: NVIDIA CEO Outlines Future of Telecom at T-Mobile’s Capital Markets Day

In a surprise appearance at T-Mobile’s Capital Markets Day, NVIDIA founder and CEO Jensen Huang shared a bold vision for the future of telecommunications.

“We’ve fused signal processing and AI,” Huang declared during a fireside chat with T-Mobile CEO Mike Sievert, speaking to an audience of press, analysts and investors. “This is going to be a great new growth opportunity for the telecommunications industry.”

Huang’s remarks came alongside NVIDIA’s announcement of its groundbreaking AI Aerial platform, which promises to reshape wireless networks by integrating AI and radio access networks (AI-RAN).

The platform is designed to optimize network performance, efficiency and new revenue potential, such as AI-computing-as-a-service during periods when network infrastructure is underutilized, maximizing the return on assets.

During the conversation, Huang emphasized the importance of AI in shaping the future of telecommunications, particularly highlighting the role of AI-RAN in optimizing and scaling network performance.

Fusing radio computing and AI computing into one architecture allows companies to apply AI models to optimize signal quality across diverse environments, Huang explained.

He emphasized that this fusion would lead to improved network efficiency and new growth opportunities for the telecommunications industry.

“We could teach these AI models how to optimize signal quality in hundreds of thousands of virtual cities,” Huang said.

AI-RAN aligns with NVIDIA’s broader vision to make AI an integral part of network infrastructure, enabling telecommunications providers to unlock new revenue streams and deliver enhanced experiences through generative AI, robotics and autonomous technologies.

Huang underscored the synergies between NVIDIA and T-Mobile as co-authors of transformation, particularly their collaboration on the newly announced AI-RAN Innovation Center. Developed with T-Mobile, Ericsson and Nokia, the AI-RAN Innovation Center is set to accelerate the commercialization of AI-RAN technologies.

Every radio operates in a unique and constantly changing environment. This is where deep reinforcement learning algorithms embedded into radio signal processing can simplify complex computations and help deliver a customer-centric network experience.

Sievert emphasized how virtualizing RAN into the cloud will create new business opportunities. He explained that AI workloads will increasingly require compute power located close to the customer, leveraging underutilized network resources.

Huang also highlighted the crucial role AI will play in making networks more energy-efficient, emphasizing the need for sustainable technology as the demand for data and connectivity grows.

“We have to use AI to reduce energy consumption,” Huang said. “Everything that we accelerate, everything that we teach an AI model to do [we] will do a lot more energy efficiently.”

As Huang explained, by simulating AI models in virtual environments with accurate physics and then emulating them in the real world, NVIDIA maximizes energy efficiency. This approach underpins the NVIDIA AI Aerial suite of platforms for designing, training and deploying AI-driven cellular networks.

With NVIDIA AI Aerial now supporting a growing ecosystem of partners, this collaboration marks a milestone in the telecom industry’s journey toward a future powered by AI.

Read More

Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker

This post is co-written with Meta’s PyTorch team.

In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. Although foundation models (FMs) offer impressive out-of-the-box capabilities, true competitive advantage often lies in deep model customization through fine-tuning. However, fine-tuning LLMs for complex tasks typically requires advanced AI expertise to align and optimize them effectively. Recognizing this challenge, Meta developed torchtune, a PyTorch-native library that simplifies authoring, fine-tuning, and experimenting with LLMs, making it more accessible to a broader range of users and applications.

In this post, AWS collaborates with Meta’s PyTorch team to showcase how you can use Meta’s torchtune library to fine-tune Meta Llama-like architectures while using a fully managed environment provided by Amazon SageMaker Training. We demonstrate this through a step-by-step implementation of model fine-tuning, inference, quantization, and evaluation. We perform the steps on a Meta Llama 3.1 8B model using the LoRA fine-tuning strategy on a single p4d.24xlarge worker node (providing 8 NVIDIA A100 GPUs).

Before we dive into the step-by-step guide, we first explored the performance of our technical stack by fine-tuning a Meta Llama 3.1 8B model across various configurations and instance types.

As can be seen in the following chart, we found that a single p4d.24xlarge instance delivers 70% higher performance than two g5.48xlarge instances (each with 8 NVIDIA A10G GPUs) at an almost 47% lower price. We have therefore optimized the example in this post for a p4d.24xlarge configuration. However, you could use the same code to run single-node or multi-node training on different instance configurations by changing the parameters passed to the SageMaker estimator. You could further reduce the training time shown in the following chart by using a SageMaker managed warm pool and accessing pre-downloaded models using Amazon Elastic File System (Amazon EFS).

Challenges with fine-tuning LLMs

Generative AI models offer many promising business use cases. However, to maintain factual accuracy and relevance of these LLMs to specific business domains, fine-tuning is required. Due to the growing number of model parameters and the increasing context length of modern LLMs, this process is memory intensive. To address these challenges, fine-tuning strategies like LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) limit the number of trainable parameters by adding low-rank parallel structures to the transformer layers. This enables you to train LLMs even on systems with low memory availability like commodity GPUs. However, this leads to an increased complexity because new dependencies have to be handled and training recipes and hyperparameters need to be adapted to the new techniques.
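
To make the idea of low-rank parallel structures concrete, the following is a minimal conceptual sketch of a LoRA-style linear layer in plain PyTorch. This is illustrative only; torchtune’s actual LoRA implementation, which we use later in this post, handles this for you and differs in its details.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (conceptual sketch)."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                    # base weights stay frozen
        self.lora_a = nn.Linear(in_features, rank, bias=False)    # low-rank down-projection
        self.lora_b = nn.Linear(rank, out_features, bias=False)   # low-rank up-projection
        nn.init.zeros_(self.lora_b.weight)                        # the update starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinear(4096, 4096, rank=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # only the two small LoRA matrices

With rank 8 on a 4,096-dimensional projection, only about 65,000 parameters are trainable instead of the roughly 16.8 million in the full weight matrix, which is what makes fine-tuning feasible on memory-constrained GPUs.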

What businesses need today is user-friendly training recipes for these popular fine-tuning techniques, which provide abstractions to the end-to-end tuning process, addressing the common pitfalls in the most opinionated way.

How does torchtune help?

torchtune is a PyTorch-native library that aims to democratize and streamline the fine-tuning process for LLMs. By doing so, it makes it straightforward for researchers, developers, and organizations to adapt these powerful LLMs to their specific needs and constraints. It provides training recipes for a variety of fine-tuning techniques, which can be configured through YAML files. The recipes implement common fine-tuning methods (full-weight, LoRA, QLoRA) as well as other common tasks like inference and evaluation. They automatically apply a set of important features (FSDP, activation checkpointing, gradient accumulation, mixed precision) and are specific to a given model family (such as Meta Llama 3/3.1 or Mistral) as well as compute environment (single-node vs. multi-node).

Additionally, torchtune integrates with major libraries and frameworks like Hugging Face datasets, EleutherAI’s Eval Harness, and Weights & Biases. This helps address the requirements of the generative AI fine-tuning lifecycle, from data ingestion and multi-node fine-tuning to inference and evaluation. The following diagram shows a visualization of the steps we describe in this post.

Refer to the installation instructions and PyTorch documentation to learn more about torchtune and its concepts.

Solution overview

This post demonstrates the use of SageMaker Training for running torchtune recipes through task-specific training jobs on separate compute clusters. SageMaker Training is a comprehensive, fully managed ML service that enables scalable model training. It provides flexible compute resource selection, support for custom libraries, a pay-as-you-go pricing model, and self-healing capabilities. By managing workload orchestration, health checks, and infrastructure, SageMaker helps reduce training time and total cost of ownership.

The solution architecture incorporates the following key components to enhance security and efficiency in fine-tuning workflows:

  • Security enhancement – Training jobs are run within private subnets of your virtual private cloud (VPC), significantly improving the security posture of machine learning (ML) workflows.
  • Efficient storage solution – Amazon EFS is used to accelerate model storage and access across various phases of the ML workflow.
  • Customizable environment – We use custom containers in training jobs. The support in SageMaker for custom containers allows you to package all necessary dependencies, specialized frameworks, and libraries into a single artifact, providing full control over your ML environment.

The following diagram illustrates the solution architecture. Users initiate the process by calling the SageMaker control plane through APIs or command line interface (CLI) or using the SageMaker SDK for each individual step. In response, SageMaker spins up training jobs with the requested number and type of compute instances to run specific tasks. Each step defined in the diagram accesses torchtune recipes from an Amazon Simple Storage Service (Amazon S3) bucket and uses Amazon EFS to save and access model artifacts across different stages of the workflow.

By decoupling every torchtune step, we achieve a balance between flexibility and integration, allowing for both independent execution of steps and the potential for automating this process using seamless pipeline integration.

In this use case, we fine-tune a Meta Llama 3.1 8B model with LoRA. Subsequently, we run model inference, and optionally quantize and evaluate the model using torchtune and SageMaker Training.

Recipes, configs, datasets, and prompt templates are completely configurable and allow you to align torchtune to your requirements. To demonstrate this, we use a custom prompt template in this use case and combine it with the open source dataset Samsung/samsum from the Hugging Face hub.
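
If you want to inspect the dataset before fine-tuning, the following is a quick sketch using the Hugging Face datasets library. The split and field names (dialogue, summary) reflect how the samsum dataset is commonly structured; verify them against the dataset card, and note that depending on your datasets version, loading this dataset may require additional options described there.

from datasets import load_dataset

# Load the summarization dataset used in this post from the Hugging Face hub
samsum = load_dataset("Samsung/samsum")

example = samsum["train"][0]
print(example["dialogue"])  # chat-style conversation to be summarized
print(example["summary"])   # reference summary used as the training target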

We fine-tune the model using torchtune’s multi-device LoRA recipe (lora_finetune_distributed) and use the SageMaker customized version of the Meta Llama 3.1 8B default config (llama3_1/8B_lora).

Prerequisites

You need to complete the following prerequisites before you can run the SageMaker Jupyter notebooks:

  1. Create a Hugging Face access token to get access to the gated repo meta-llama/Meta-Llama-3.1-8B on Hugging Face.
  2. Create a Weights & Biases API key to access the Weights & Biases dashboard for logging and monitoring.
  3. Request a SageMaker service quota for 1x ml.p4d.24xlarge and 1x ml.g5.2xlarge.
  4. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess, AmazonEC2FullAccess, AmazonElasticFileSystemFullAccess, and AWSCloudFormationFullAccess to give required access to SageMaker to run the examples. (This is for demonstration purposes. You should adjust this to your specific security requirements for production.)
  5. Create an Amazon SageMaker Studio domain (see Quick setup to Amazon SageMaker) to access Jupyter notebooks with the preceding role. Refer to the instructions to set permissions for Docker build.
  6. Log in to the notebook console and clone the GitHub repo:
$ git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
$ cd sagemaker-distributed-training-workshop/13-torchtune
  7. Run the provided setup notebook to set up the VPC and Amazon EFS using an AWS CloudFormation stack.

Review torchtune configs

The following figure illustrates the steps in our workflow.

You can look up the torchtune configs for your use case directly using the tune CLI. For this post, we provide modified config files aligned with the SageMaker directory path structure:

sh-4.2$ cd config/
sh-4.2$ ls -ltr
-rw-rw-r-- 1 ec2-user ec2-user 1151 Aug 26 18:34 config_l3.1_8b_gen_orig.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1172 Aug 26 18:34 config_l3.1_8b_gen_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user  644 Aug 26 18:49 config_l3.1_8b_quant.yaml
-rw-rw-r-- 1 ec2-user ec2-user 2223 Aug 28 14:53 config_l3.1_8b_lora.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1223 Sep  4 14:28 config_l3.1_8b_eval_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1213 Sep  4 14:29 config_l3.1_8b_eval_original.yaml

torchtune uses these config files to select and configure the components (think models and tokenizers) during the execution of the recipes.

Build the container

As part of our example, we create a custom container to provide custom libraries like newer PyTorch versions and torchtune. The following Dockerfile shows the container definition:

sh-4.2$ cat Dockerfile
# Set default values for the build arguments (ACCOUNTID must point to the account hosting the SageMaker training image)
ARG ACCOUNTID
ARG REGION=us-west-2
# SageMaker PyTorch image for TRAINING
FROM ${ACCOUNTID}.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
# Uninstall existing PyTorch packages
RUN pip uninstall torch torchvision transformer-engine -y
# Install newer releases of PyTorch, torchao, and torchvision
RUN pip install --force-reinstall torch==2.4.1 torchao==0.4.0 torchvision==0.19.1

Run the 1_build_container.ipynb notebook up to the following command to build this container image and push it to your Amazon ECR repository:

!sm-docker build . --repository accelerate:latest

sm-docker is a CLI tool designed for building Docker images in SageMaker Studio using AWS CodeBuild. We install the library as part of the notebook.

Next, we will run the 2_torchtune-llama3_1.ipynb notebook for all fine-tuning workflow tasks.

For every task, we review three artifacts:

  • torchtune configuration file
  • SageMaker task config with compute and torchtune recipe details
  • SageMaker task output

Run the fine-tuning task

In this section, we walk through the steps to run and monitor the fine-tuning task.

Run the fine-tuning job

The following code shows a shortened torchtune recipe configuration highlighting a few key components of the file for a fine-tuning job:

  • Model component including LoRA rank configuration
  • Meta Llama 3 tokenizer to tokenize the data
  • Checkpointer to read and write checkpoints
  • Dataset component to load the dataset
sh-4.2$ cat config_l3.1_8b_lora.yaml
# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /opt/ml/input/data/model/hf-model/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_files: [
    consolidated.00.pth
  ]
  …

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.samsum_dataset
  train_on_input: True
batch_size: 13

# Training
epochs: 1
gradient_accumulation_steps: 2

... and more ...

We use Weights & Biases for logging and monitoring our training jobs, which helps us track our model’s performance:

metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger
  …

Next, we define a SageMaker task that will be passed to our utility function create_pytorch_estimator. This function creates the PyTorch estimator with all the defined parameters.

In the task, we use the lora_finetune_distributed torchtune recipe with the config config-l3.1-8b-lora.yaml on an ml.p4d.24xlarge instance. The use_downloaded_model parameter controls whether the base model first needs to be downloaded from Hugging Face before fine-tuning. The image_uri parameter defines the URI of the custom container.

sagemaker_tasks={
    "fine-tune":{
        "hyperparameters":{
            "tune_config_name":"config-l3.1-8b-lora.yaml",
            "tune_action":"fine-tune",
            "use_downloaded_model":"false",
            "tune_recipe":"lora_finetune_distributed"
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",        
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
    ... and more ...
}

To create and run the task, run the following code:

Task="fine-tune"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)
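
The create_pytorch_estimator and execute_task utilities come from the workshop repository. As a rough sketch of what such helpers could look like with the SageMaker Python SDK, consider the following; the entry point, source directory, and role shown here are assumptions you would replace with the repository’s actual values.

from sagemaker.pytorch import PyTorch

def create_pytorch_estimator(hyperparameters, instance_count, instance_type, image_uri):
    """Build a PyTorch estimator for one torchtune task (illustrative sketch)."""
    return PyTorch(
        entry_point="train.py",        # assumption: launcher script provided by the repository
        source_dir="scripts",          # assumption: directory containing the launcher and helpers
        role="<your-sagemaker-execution-role-arn>",
        image_uri=image_uri,           # the custom container pushed earlier
        instance_count=instance_count,
        instance_type=instance_type,
        hyperparameters=hyperparameters,
    )

def execute_task(estimator, inputs=None):
    """Start the training job; SageMaker streams the logs until the job completes."""
    estimator.fit(inputs=inputs)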

The following code shows the task output and reported status:

# Refer-Output

2024-08-16 17:45:32 Starting - Starting the training job...
...
...

1|140|Loss: 1.4883038997650146:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]
1|141|Loss: 1.4621509313583374:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]

Training completed with code: 0
2024-08-26 14:19:09,760 sagemaker-training-toolkit INFO     Reporting training SUCCESS

The final model is saved to Amazon EFS, which makes it available without download time penalties.

Monitor the fine-tuning job

You can monitor various metrics such as loss and learning rate for your training run through the Weights & Biases dashboard. The following figures show the results of the training run where we tracked GPU utilization, GPU memory utilization, and loss curve.

As shown in the following graph, to optimize memory usage, torchtune uses only rank 0 to initially load the model into CPU memory; rank 0 is therefore responsible for loading the model weights from the checkpoint.

The example is optimized to use GPU memory to its maximum capacity. Increasing the batch size further will lead to CUDA out-of-memory (OOM) errors.

The run took about 13 minutes to complete for one epoch, resulting in the loss curve shown in the following graph.

Run the model generation task

In the next step, we use the previously fine-tuned model weights to generate the answer to a sample prompt and compare it to the base model.

The following code shows the configuration of the generate recipe config_l3.1_8b_gen_trained.yaml. The following are key parameters:

  • FullModelMetaCheckpointer – We use this to load the trained model checkpoint meta_model_0.pt from Amazon EFS
  • CustomTemplate.SummarizeTemplate – We use this to format the prompt for inference
# torchtune - trained model generation config - config_l3.1_8b_gen_trained.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b
  
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/
  checkpoint_files: [
    meta_model_0.pt
  ]
  …

# Generation arguments; defaults taken from gpt-fast
instruct_template: CustomTemplate.SummarizeTemplate

... and more ...

Next, we configure the SageMaker task to run on a single ml.g5.2xlarge instance:

prompt=r'{"dialogue":"Amanda: I baked cookies. Do you want some?\r\nJerry: Sure\r\nAmanda: I will bring you tomorrow :-)"}'

sagemaker_tasks={
    "generate_inference_on_trained":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_gen_trained.yaml",
            "tune_action":"generate-trained",
            "use_downloaded_model":"true",
            "prompt":json.dumps(prompt)
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output of the SageMaker task, we see the model summary output and some stats like tokens per second:

#Refer- Output
...
Amanda: I baked cookies. Do you want some?\r\nJerry: Sure\r\nAmanda: I will bring you tomorrow :-)

Summary:
Amanda baked cookies. She will bring some to Jerry tomorrow.

INFO:torchtune.utils.logging:Time for inference: 1.71 sec total, 7.61 tokens/sec
INFO:torchtune.utils.logging:Memory used: 18.32 GB

... and more ...

We can run inference on the original base model using the original model artifact consolidated.00.pth:

# torchtune - trained original generation config - config_l3.1_8b_gen_orig.yaml
…  
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/hf-model/original/
  checkpoint_files: [
    consolidated.00.pth
  ]
  
... and more ...

The following code shows the comparison output from the base model run with the SageMaker task (generate_inference_on_original). We can see that the fine-tuned model is performing subjectively better than the base model by also mentioning that Amanda baked the cookies.

# Refer-Output 
---
Summary:
Jerry tells Amanda he wants some cookies. Amanda says she will bring him some cookies tomorrow.

... and more ...

Run the model quantization task

To speed up the inference and decrease the model artifact size, we can apply post-training quantization. torchtune relies on torchao for post-training quantization.

We configure the recipe to use Int8DynActInt4WeightQuantizer, which refers to int8 dynamic per token activation quantization combined with int4 grouped per axis weight quantization. For more details, refer to the torchao implementation.

# torchtune model quantization config - config_l3.1_8b_quant.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  …

quantizer:
  _component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256

We again use a single ml.g5.2xlarge instance and use SageMaker warm pool configuration to speed up the spin-up time for the compute nodes:

sagemaker_tasks={
"quantize_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_quant.yaml",
            "tune_action":"run-quant",
            "use_downloaded_model":"true"
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output, we see the location of the quantized model and how much memory we saved due to the process:

#Refer-Output
...

linear: layers.31.mlp.w1, in=4096, out=14336
linear: layers.31.mlp.w2, in=14336, out=4096
linear: layers.31.mlp.w3, in=4096, out=14336
linear: output, in=4096, out=128256
INFO:torchtune.utils.logging:Time for quantization: 7.40 sec
INFO:torchtune.utils.logging:Memory used: 22.97 GB
INFO:torchtune.utils.logging:Model checkpoint of size 8.79 GB saved to /opt/ml/input/data/model/quantized/meta_model_0-8da4w.pt

... and more ...

You can run model inference on the quantized model meta_model_0-8da4w.pt by updating the inference-specific configurations.
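
A task definition for generating from the quantized model can follow the same pattern as the earlier generation task. The following is a sketch only: the config file name and tune_action value shown here are hypothetical placeholders, and you would create the config by pointing a generation config at the quantized checkpoint and the quantizer settings used above.

sagemaker_tasks={
    "generate_inference_on_quantized":{
        "hyperparameters":{
            "tune_config_name":"<your-quantized-generation-config>.yaml",  # hypothetical config name
            "tune_action":"generate-quant",  # assumption: align with the action names used by the repository scripts
            "use_downloaded_model":"true",
            "prompt":json.dumps(prompt)
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}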

Run the model evaluation task

Finally, let’s evaluate our fine-tuned model in an objective manner by running an evaluation on the validation portion of our dataset.

torchtune integrates with EleutherAI’s evaluation harness and provides the eleuther_eval recipe.

For our evaluation, we use a custom task for the evaluation harness to evaluate the dialogue summarizations using the ROUGE metrics.
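
To see what these metrics measure, the following is a small standalone sketch using the rouge_score package rather than the evaluation harness itself. It compares a generated summary against a reference summary in the same spirit as the ROUGE-1/2/L values reported in the tables below.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Amanda baked cookies and will bring Jerry some tomorrow."
prediction = "Amanda baked cookies. She will bring some to Jerry tomorrow."

# Each score reports precision, recall, and F-measure for n-gram or LCS overlap
scores = scorer.score(reference, prediction)
for metric, value in scores.items():
    print(metric, round(value.fmeasure, 4))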

The recipe configuration points the evaluation harness to our custom evaluation task:

# torchtune trained model evaluation config - config_l3.1_8b_eval_trained.yaml

model:
...

include_path: "/opt/ml/input/data/config/tasks"
tasks: ["samsum"]
...

The following code is the SageMaker task that we run on a single ml.p4d.24xlarge instance:

sagemaker_tasks={
"evaluate_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_eval_trained.yaml",
            "tune_action":"run-eval",
            "use_downloaded_model":"true",
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",
    }
}

Run the model evaluation on ml.p4d.24xlarge:

Task="evaluate_trained_model"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)

The following tables show the task output for the fine-tuned model as well as the base model.

The following output is for the fine-tuned model.

 

Tasks    Version  Filter  n-shot  Metric  Direction  Value    ±  Stderr
samsum   2        none    None    rouge1             45.8661  ±  N/A
                  none    None    rouge2             23.6071  ±  N/A
                  none    None    rougeL             37.1828  ±  N/A

The following output is for the base model.

Tasks    Version  Filter  n-shot  Metric  Direction  Value    ±  Stderr
samsum   2        none    None    rouge1             33.6109  ±  N/A
                  none    None    rouge2             13.0929  ±  N/A
                  none    None    rougeL             26.2371  ±  N/A

Our fine-tuned model achieves a ROUGE-1 score of approximately 46 on the summarization task, which is approximately 12 points better than the base model.

Clean up

Complete the following steps to clean up your resources:

  1. Delete any unused SageMaker Studio resources.
  2. Optionally, delete the SageMaker Studio domain.
  3. Delete the CloudFormation stack to delete the VPC and Amazon EFS resources.

Conclusion

In this post, we discussed how you can fine-tune Meta Llama-like architectures using various fine-tuning strategies on your preferred compute and libraries, using custom dataset prompt templates with torchtune and SageMaker. This architecture gives you a flexible way of running fine-tuning jobs that are optimized for GPU memory and performance. We demonstrated this by fine-tuning a Meta Llama 3.1 model using P4 and G5 instances on SageMaker, and used observability tools like Weights & Biases to monitor the loss curve as well as CPU and GPU utilization.

We encourage you to use SageMaker training capabilities and Meta’s torchtune library to fine-tune Meta Llama-like architectures for your specific business use cases. To stay informed about upcoming releases and new features, refer to the torchtune GitHub repo and the official Amazon SageMaker training documentation.

Special thanks to Kartikay Khandelwal (Software Engineer at Meta), Eli Uriegas (Engineering Manager at Meta), Raj Devnath (Sr. Product Manager Technical at AWS), and Arun Kumar Lokanatha (Sr. ML Solution Architect at AWS) for their support in the launch of this post.


About the Authors

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS.

Matthias Reso is a Partner Engineer at PyTorch working on open source, high-performance model optimization, distributed training (FSDP), and inference. He is a co-maintainer of llama-recipes and TorchServe.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. He serves as a voting member of the PyTorch Foundation Governing Board, where he contributes to the strategic advancement of open-source deep learning frameworks. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Read More