Transforming home ownership with Amazon Transcribe Call Analytics, Amazon Comprehend, and Amazon Bedrock: Rocket Mortgage’s journey with AWS

This post is co-written with Josh Zook and Alex Hamilton from Rocket Mortgage.

Rocket Mortgage, America’s largest retail mortgage lender, revolutionizes homeownership with Rocket Logic – Synopsis, an AI tool built on AWS. This innovation has transformed client interactions and operational efficiency through the use of Amazon Transcribe Call Analytics, Amazon Comprehend, and Amazon Bedrock. With Rocket Logic – Synopsis, Rocket achieved remarkable results: automating post-call interaction wrap-up is projected to save 40,000 team hours annually, and a 10% increase in first-call resolutions saves another 20,000 hours annually. In addition, 70% of servicing clients choose to self-serve over generative AI-powered mediums such as interactive voice response (IVR). Rocket’s “start small, launch and learn, scale fast” approach, paired with AWS enablement, proved effective: the team deployed the solution for 30,000 servicing calls in 10 days, then scaled four times greater for operations and six times greater for banking.

This post offers insights for businesses aiming to use artificial intelligence (AI) and cloud technologies to enhance customer service and streamline operations. We share how Rocket Mortgage’s use of AWS services set a new industry standard and demonstrate how to apply these principles to transform your client interactions and processes with speed and scalability.

Opportunities for innovation

Rocket services over 2.6 million clients, handling 65 million voice interactions and 147 million voice minutes across banking, operations, and servicing, and generates and processes over 10 PB of data. By focusing on three key personas—clients, client advocates, and business leaders or senior leadership—Rocket aims to create a solution that enhances experiences across the board.

At the heart of this transformation is the recognition that clients value their time, but also benefit from hyper-personalized support in ultra complex moments. With call volumes on the rise, solving this problem at scale was essential. Rocket tapped into a crucial insight: 81% of consumers prefer self-service options. This preference opens exciting possibilities for swift, efficient problem-solving. Imagine a world where answers are available at your fingertips, 24/7, without the need to wait in a queue. By implementing enhanced self-service tools, Rocket is poised to offer faster resolution times, greater client autonomy, and a more satisfying overall experience.

Client advocates, the face of the company, stand to benefit significantly from this transformation. Currently, client advocates spend about 30% of their time on administrative tasks. By streamlining processes, client advocates can focus on what they do best: providing exceptional customer service and nurturing client relationships. This shift promises more engaging work, increased job satisfaction, and opportunities for skill development. Rocket envisions their client advocates evolving into trusted advisors, handling complex inquiries that truly take advantage of their expertise and interpersonal skills.

For business leaders, this wealth of data on trends, sentiment, and performance opens up a treasure trove of opportunities. Decision-makers can now drive significant improvements across the board, employing data-driven strategies to enhance customer satisfaction, optimize operations, and boost overall business performance. Business leaders can look forward to leading more efficient teams, and senior leadership can anticipate improved client loyalty and a stronger bottom line.

Strategic requirements

To further elevate their client interactions, Rocket identified key requirements for their solution. These requirements were essential to make sure the solution could handle the demands of their extensive client base and provide actionable insights to enhance client experiences:

  • Sentiment analysis – Tracking client sentiment and preferences was necessary to offer personalized experiences. The solution needed to accurately gauge client emotions and preferences to tailor responses and services effectively.
  • Automation – Automating routine tasks, such as call summaries, was essential to free up team members for more meaningful client interactions. This automation would help reduce the manual workload, allowing the team to focus on building stronger client relationships.
  • AI integration – Using generative AI to analyze calls was crucial for providing actionable insights and enhancing client interactions. The AI integration needed to be robust enough to process vast amounts of data and deliver precise, meaningful results.
  • Data security – Protecting sensitive client information throughout the process was a non-negotiable requirement. Rocket needed to uphold the highest standards of data security, maintaining regulatory compliance, data privacy, and the integrity of client information.
  • Compliance and data privacy – Rocket required a solution that met strict compliance and data privacy standards. Given the sensitive nature of the information handled, the solution needed to provide complete data protection and adhere to industry regulations.
  • Scalability – Rocket needed a solution capable of handling millions of calls annually and scaling efficiently with growing demand. This requirement was vital to make sure the system could support their expansive and continuously increasing volume of voice interactions.

Solution overview

To meet these requirements, Rocket partnered with the AWS team to deploy the AWS Contact Center Intelligence (CCI) solution Post-Call Analytics, branded internally as Rocket Logic – Synopsis. This solution seamlessly integrates into Rocket’s existing operations, using AI technologies to transcribe and analyze client calls. By utilizing services like Amazon Transcribe Call Analytics, Amazon Comprehend, and Amazon Bedrock, the solution extracts valuable insights such as sentiment, call drivers, and client preferences, enhancing client interactions and providing actionable data for continuous improvement.

At the heart of Rocket are their philosophies, known as their -ISMs, which guide their growth and innovation.  One of these guiding principles is “launch and learn.”

Embracing the mantra of “think big but start small,” Rocket adopted a rapid, iterative approach to achieve a remarkable time to market of just 10 days, compared to the months it would have traditionally taken. This agile methodology allowed them to create space for exploration and innovation. The team initially focused on a few key use cases, starting simple and rapidly iterating based on feedback and results.

To accelerate development and make sure data was quickly put into the hands of the business, they utilized mechanisms such as a hackathon with targeted goals. By using existing solutions and AWS technical teams, Rocket significantly reduced the time to market, allowing for swift deployment. Additionally, they looked to industry tactics to find solutions to common problems, so their approach was both innovative and practical.

During this “launch and learn” process, Rocket anticipated and managed challenges such as scaling issues and burst volume management using Drip Hopper and serverless technologies through AWS. They also fine-tuned Anthropic’s Claude 3 Haiku large language model (LLM) on Amazon Bedrock for call classification and data extraction.

The following diagram illustrates the solution architecture.

Post-Call Analytics provides an entire architecture for ingesting audio files in a fully automated workflow with AWS Step Functions, which is initiated when an audio file is delivered to a configured Amazon Simple Storage Service (Amazon S3) bucket. After a few minutes, a transcript is produced with Amazon Transcribe Call Analytics and saved to another S3 bucket for further processing by business intelligence (BI) tools, with stringent security measures making sure personally identifiable information (PII) is redacted and data is encrypted. The PII is redacted throughout, but the client ID and interaction ID are retained to correlate and trace across the data sets. Downstream applications use those IDs to pull from client data services in the UI presentation layer.

Enhancing the analysis, Amazon Comprehend is used for sentiment analysis and entity extraction, providing deeper insights into client interactions. Generative AI is integrated to generate concise call summaries and actionable insights, significantly reducing the manual workload and allowing team members to focus on building stronger client relationships. This generative AI capability, powered by Amazon Bedrock, Anthropic’s Claude 3 Sonnet, and customizable prompts, enables Rocket to deliver real-time, contextually relevant information. Data is securely stored and managed within AWS, using Amazon S3 and Amazon DynamoDB, with robust encryption and access controls provided by AWS Key Management Service (AWS KMS) and AWS Identity and Access Management (IAM) policies. This comprehensive setup enables Rocket to efficiently manage, analyze, and act on client interaction data, thereby enhancing both client experience and operational efficiency.
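The post describes the pipeline at a high level rather than sharing Rocket's code. As a rough illustration only, the following sketch shows how the core service calls in such a workflow might look with boto3; the bucket names, IAM role ARN, model ID, and prompt are placeholder assumptions, not Rocket's configuration.

import boto3

# Placeholder values for illustration only, not Rocket's configuration
AUDIO_URI = "s3://example-call-audio/call-1234.wav"
OUTPUT_BUCKET = "s3://example-call-transcripts/"
DATA_ACCESS_ROLE_ARN = "arn:aws:iam::123456789012:role/example-transcribe-access-role"

transcribe = boto3.client("transcribe")
comprehend = boto3.client("comprehend")
bedrock_runtime = boto3.client("bedrock-runtime")

# Start a Transcribe Call Analytics job with PII redaction enabled
transcribe.start_call_analytics_job(
    CallAnalyticsJobName="call-1234",
    Media={"MediaFileUri": AUDIO_URI},
    OutputLocation=OUTPUT_BUCKET,
    DataAccessRoleArn=DATA_ACCESS_ROLE_ARN,
    ChannelDefinitions=[
        {"ChannelId": 0, "ParticipantRole": "AGENT"},
        {"ChannelId": 1, "ParticipantRole": "CUSTOMER"},
    ],
    Settings={"ContentRedaction": {"RedactionType": "PII", "RedactionOutput": "redacted"}},
)

# Once the redacted transcript lands in the output bucket, analyze sentiment
# (Comprehend has input size limits, so long transcripts are chunked in practice)
transcript_text = "..."  # loaded from the call analytics job output in Amazon S3
sentiment = comprehend.detect_sentiment(Text=transcript_text, LanguageCode="en")

# Generate a call summary with a Claude model on Amazon Bedrock
summary = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": f"Summarize this call:\n{transcript_text}"}]}],
)
print(sentiment["Sentiment"], summary["output"]["message"]["content"][0]["text"])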

Achieving excellence

The implementation of Rocket Logic – Synopsis has yielded remarkable results for Rocket:

  • Efficiency gains – Automating call transcription and sentiment analysis is projected to save the servicing team nearly 40,000 hours annually
  • Enhanced client experience – Approximately 70% of servicing clients fully self-serve over generative AI-powered mediums such as IVR, resolving inquiries without needing team member intervention
  • Increased first-call resolutions – There has been a nearly 10% increase in first-call resolutions, saving approximately 20,000 team member hours annually
  • Proactive client solutions – The tool’s predictive capabilities have improved, allowing Rocket to proactively address client needs before they even make a call
  • Start small, launch and learn, scale fast – Rocket started with 30,000 servicing calls with a 10-day time to market, and then scaled four times greater for operations, followed by six times greater for banking

Roadmap

Looking ahead, Rocket plans to continue enhancing Rocket Logic – Synopsis by using the vast amount of data gathered from call transcripts. Future developments will include:

  • Advanced predictive analytics – Further improving the tool’s ability to anticipate client needs and offer solutions proactively
  • Omnichannel integration – Expanding the AI capabilities to other communication channels such as emails and chats
  • Client preference tracking – Refining the technology to better understand and adapt to individual client preferences, providing more personalized interactions
  • Enhanced personalization – Utilizing data to create even more tailored client experiences, including understanding preferences for communication channels and timing

Conclusion

The collaboration between Rocket Mortgage and AWS has revolutionized the homeownership process by integrating advanced AI solutions into client interactions. Rocket Logic – Synopsis enhances operational efficiency significantly and improves the client experience. As Rocket continues to innovate and expand its AI capabilities, they remain committed to providing personalized, efficient, and seamless homeownership experiences for their clients. The success of Rocket Logic – Synopsis demonstrates the transformative power of technology in creating more efficient, responsive, and personalized client experiences. To learn more, visit Amazon Transcribe Call Analytics, Amazon Comprehend, and Amazon Bedrock.


About the authors

Josh Zook is the Chief Technology Officer of Rocket Mortgage, working alongside the teams that are shipping the products that clients and partners are using every day to make home ownership a reality. He started in Technology in 1984 by writing a program in BASIC to calculate his weight on the moon using an Apple IIe. Since then, he has been on a relentless pursuit in using technology to make life easier by solving slightly more complex problems. Josh believes the key to success is curiosity combined with the grit and grind to make ideas reality. This has led to a steady paycheck since he was 10 years old, with jobs in landscaping, sandwich artistry, sporting goods sales, satellite installation, firefighter, and bookstore aficionado… just to name a few.

Alex Hamilton is a Director of Engineering at Rocket Mortgage, spearheading the AI driven digital strategy to help everyone home. He’s been shaping the tech scene at Rocket for over 11 years, including launching one of the company’s first models to boost trading revenue and bring modern event streaming and containerization to Rocket. Alex is passionate about solving novel engineering problems and bringing magical client experiences to life. Outside of work Alex enjoys traveling, weekend brunch, and firing up the grill!

Ritesh Shah is a Senior Worldwide GenAI Specialist at AWS. He partners with customers like Rocket to drive AI adoption, resulting in millions of dollars in top and bottom line impact for these customers. Outside work, Ritesh tries to be a dad to his AWSome daughter.  Connect with him on LinkedIn.

Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services, where he partners with North American FinTech companies like Rocket to drive cloud strategy and accelerate AI adoption. His expertise in AI & ML, and cloud native architecture has helped organizations unlock new revenue streams, enhance operational efficiency, and achieve substantial business transformation. By modernizing financial institutions with secure, scalable infrastructures, Sajjan enables them to stay competitive in a rapidly evolving, data-driven landscape. Outside of work, he enjoys spending time with his family and is a proud father to his daughter.

Integrate dynamic web content in your generative AI application using a web search API and Amazon Bedrock Agents

Amazon Bedrock Agents offers developers the ability to build and configure autonomous agents in their applications. These agents help users complete actions based on organizational data and user input, orchestrating interactions between foundation models (FMs), data sources, software applications, and user conversations.

Amazon Bedrock agents use the power of large language models (LLMs) to perform complex reasoning and action generation. This approach is inspired by the ReAct (reasoning and acting) paradigm, which combines reasoning traces and task-specific actions in an interleaved manner.

Amazon Bedrock agents use LLMs to break down tasks, interact dynamically with users, run actions through API calls, and augment knowledge using Amazon Bedrock Knowledge Bases. The ReAct approach enables agents to generate reasoning traces and actions while seamlessly integrating with company systems through action groups. By offering accelerated development, simplified infrastructure, enhanced capabilities through chain-of-thought (CoT) prompting, and improved accuracy, Amazon Bedrock Agents allows developers to rapidly build sophisticated AI solutions that combine the power of LLMs with custom actions and knowledge bases, all without managing underlying complexity.

Web search APIs empower developers to seamlessly integrate powerful search capabilities into their applications, providing access to vast troves of internet data with just a few lines of code. These APIs act as gateways to sophisticated search engines, allowing applications to programmatically query the web and retrieve relevant results including webpages, images, news articles, and more.

By using web search APIs, developers can enhance their applications with up-to-date information from across the internet, enabling features like content discovery, trend analysis, and intelligent recommendations. With customizable parameters for refining searches and structured response formats for parsing, web search APIs offer a flexible and efficient solution for harnessing the wealth of information available on the web.

Amazon Bedrock Agents offers a powerful solution for enhancing chatbot capabilities, and when combined with web search APIs, they address a critical customer pain point. In this post, we demonstrate how to use Amazon Bedrock Agents with a web search API to integrate dynamic web content in your generative AI application.

Benefits of integrating a web search API with Amazon Bedrock Agents

Let’s explore how this integration can revolutionize your chatbot experience:

  • Seamless in-chat web search – By incorporating web search APIs into your Amazon Bedrock agents, you can empower your chatbot to perform real-time web searches without forcing users to leave the chat interface. This keeps users engaged within your application, improving overall user experience and retention.
  • Dynamic information retrieval – Amazon Bedrock agents can use web search APIs to fetch up-to-date information on a wide range of topics. This makes sure that your chatbot provides the most current and relevant responses, enhancing its utility and user trust.
  • Contextual responses – Amazon Bedrock agents use CoT prompting, enabling FMs to plan and run actions dynamically. Through this approach, agents can analyze user queries and determine when a web search is necessary or—if enabled—gather more information from the user to complete the task. This allows your chatbot to blend information from APIs, knowledge bases, and up-to-date web-sourced content, creating a more natural and informative conversation flow. With these capabilities, agents can provide responses that are better tailored to the user’s needs and the current context of the interaction.
  • Enhanced problem solving – By integrating web search APIs, your Amazon Bedrock agent can tackle a broader range of user inquiries. Whether it’s troubleshooting a technical issue or providing industry insights, your chatbot becomes a more versatile and valuable resource for users.
  • Minimal setup, maximum impact – Amazon Bedrock agents simplify the process of adding web search functionality to your chatbot. With just a few configuration steps, you can dramatically expand your chatbot’s knowledge base and capabilities, all while maintaining a streamlined UI.
  • Infrastructure as code – You can use AWS CloudFormation or the AWS Cloud Development Kit (AWS CDK) to deploy and manage Amazon Bedrock agents.

By addressing the customer challenge of expanding chatbot functionality without complicating the user experience, the combination of web search APIs and Amazon Bedrock agents offers a compelling solution. This integration allows businesses to create more capable, informative, and user-friendly chatbots that keep users engaged and satisfied within a single interface.

Solution overview

This solution uses Amazon Bedrock Agents with a web search capability that integrates external search APIs (SerpAPI and Tavily AI) with the agent. The architecture consists of the following key components:


  • An Amazon Bedrock agent orchestrates the interaction between the user and search APIs, handling the chat sessions and optionally long-term memory
  • An AWS Lambda function implements the logic for calling external search APIs and processing results
  • External search APIs (SerpAPI and Tavily AI) provide web search capabilities
  • Amazon Bedrock FMs generate natural language responses based on search results
  • AWS Secrets Manager securely stores API keys for external services

The solution flow is as follows:

  1. User input is received by the Amazon Bedrock agent, powered by Anthropic Claude 3 Sonnet on Amazon Bedrock.
  2. The agent determines if a web search is necessary, or comes back to the user with clarifying questions.
  3. If required, the agent invokes one of two Lambda functions to perform a web search: SerpAPI for up-to-date events or Tavily AI for web research-heavy questions.
  4. The Lambda function retrieves the API secrets securely from Secrets Manager, calls the appropriate search API, and processes the results.
  5. The agent generates the final response based on the search results.
  6. The response is returned to the user after final output guardrails are applied.

The following figure is a visual representation of the system we are going to implement.

We demonstrate two methods to build this solution. To set up the agent on the AWS Management Console, we use the new agent builder. The following GitHub repository contains the Python AWS CDK code to deploy the same example.

Prerequisites

Make sure you have the following prerequisites:

Amazon Bedrock agents support models like Amazon Titan Text and Anthropic Claude models. Each model has different capabilities and pricing. For the full list of supported models, see Supported regions and models for Amazon Bedrock Agents.

For this post, we use the Anthropic Claude 3 Sonnet model.

Configure the web search APIs

Both SERPER (SerpAPI) and Tavily AI provide web search APIs that can be integrated with Amazon Bedrock agents by calling their REST-based API endpoints from a Lambda function. However, they have some key differences that can influence when you would use each one:

  • SerpAPI provides access to multiple search engines, including Google, Bing, Yahoo, and others. It offers granular control over search parameters and result types (for example, organic results, featured snippets, images, and videos). SerpAPI might be better suited for tasks requiring specific search engine features or when you need results from multiple search engines.
  • Tavily AI is specifically designed for AI agents and LLMs, focusing on delivering relevant and factual results. It offers features like including answers, raw content, and images in search results. It provides customization options such as search depth (basic or advanced) and the ability to include or exclude specific domains. It’s optimized for speed and efficiency in delivering real-time results.

You would use SerpAPI if you need results from specific search engines or multiple engines, and Tavily AI when relevance and factual accuracy are crucial.

Ultimately, the choice between SerpAPI and Tavily AI depends on your specific research requirements, the level of control you need over search parameters, and whether you prioritize general search engine capabilities or AI-optimized results.

For the example in this post, we chose to use both and let the agent decide which API is the more appropriate one, depending on the question or prompt. The agent can also opt to call both if one doesn’t provide a good enough answer. Both SerpAPI and Tavily AI provide a free tier that can be used for the example in this post.

For both APIs, API keys are required and are available from Serper and Tavily.

We securely store the obtained API keys in Secrets Manager. The following examples create secrets for the API keys:

aws secretsmanager create-secret \
--name SERPER_API_KEY \
--description "The API secret key for Serper." \
--secret-string "$SERPER_API_KEY"

aws secretsmanager create-secret \
--name TAVILY_API_KEY \
--description "The API secret key for Tavily AI." \
--secret-string "$TAVILY_API_KEY"

When you enter commands in a shell, there is a risk of the command history being accessed or utilities having access to your command parameters. For more information, see Mitigate the risks of using the AWS CLI to store your AWS Secrets Manager secrets.

Now that the APIs are configured, you can start building the web search Amazon Bedrock agent.

In the following section, we present two methods to create your agent: through the console and using the AWS CDK. Although the console path offers a more visual approach, we strongly recommend using the AWS CDK for deploying the agent. This method not only provides a more robust deployment process, but also allows you to examine the underlying code. Let’s explore both options to help you choose the best approach for your needs.

Build a web search Amazon Bedrock agent using the console

In the first example, you build a web search agent using the Amazon Bedrock console to create and configure the agent, and then the Lambda console to configure and deploy a Lambda function.

Create a web search agent

To create a web search agent using the console, complete the following steps:

  1. On the Amazon Bedrock console, choose Agents in the navigation pane.
  2. Choose Create agent.
  3. Enter a name for the agent (such as websearch-agent) and an optional description, then choose Create.

Create Agent Dialogue

You are now in the new agent builder, where you can access and edit the configuration of an agent.

  4. For Agent resource role, leave the default Create and use a new service role.

This option automatically creates the AWS Identity and Access Management (IAM) role assumed by the agent.

  5. For the model, choose Anthropic and Claude 3 Sonnet.

Instructions for the Agent

  6. For Instructions for the Agent, provide clear and specific instructions to tell the agent what it should do. For the web search agent, enter:
You are an agent that can handle various tasks as described below:
1/ Helping users do research and finding up-to-date information. For up-to-date information always uses web search. Web search has two flavors:
a/ Google Search - this is great for looking up up-to-date information and current events
b/ Tavily AI Search - this is used to do deep research on topics your user is interested in. Not good for being used on news because it does not order search results by date.

As you can see from the instruction, we decided to name the SerpAPI option Google Search. In our tests with the Anthropic Claude 3 Sonnet model, Google Search is synonymous with web search. Because the instruction is a natural language instruction to the model, we want to stay as close as possible to the natural usage of words in the language; therefore, we use Google Search instead of SerpAPI. However, this could vary from model to model. We encourage you to test new instructions when changing the model.
  7. In the Action groups section, choose Add.

Action groups are how agents can interact with external systems or APIs to get more information or perform actions.

  8. For Enter action group name, enter action-group-web-search for the action group.
  9. For Action group type, select Define with function details so you can specify functions and their parameters as JSON instead of providing an Open API schema.
  10. For Action group invocation, set up what the agent does after this action group is identified by the model. Because we want to call the web search APIs, select Quick create a new Lambda function.

With this option, Amazon Bedrock creates a basic Lambda function for your agent that you can later modify on the Lambda console for the use case of calling the web search APIs. The agent will predict the function and function parameters needed to fulfil its goal and pass the parameters to the Lambda function.

Create Action group

  11. Now, configure the two functions of the action group—one for the SerpAPI Google search, and one for the Tavily AI search.
  12. For each of the two functions, for Parameters, add search_query with a description.

This is a parameter of type String and is required by each of the functions.

  13. Choose Create to complete the creation of the action group.

Action group functions

We use the following parameter descriptions:

“The search query for the Google web search.”
“The search query for the Tavily web search.”

We encourage you to try to add a target website as an extra parameter to the action group functions. Take a look at the Lambda function code and infer the settings.

You will be redirected to the agent builder console.

  14. Choose Save to save your agent configuration.

Configure and deploy a Lambda function

Complete the following steps to update the action group Lambda function:

  1. On the Lambda console, locate the new Lambda function with a name that starts with action-group-web-search-.
  2. Edit the provided starting code and implement the web search use case:
import http.client
import json
… 
def lambda_handler(event, _):
    action_group = event["actionGroup"]
    function = event["function"]
    parameters = event.get("parameters", [])
    search_query, target_website = extract_search_params(action_group, function, parameters)
    search_results: str = ""
    if function == "tavily-ai-search":
        search_results = tavily_ai_search(search_query, target_website)
    elif function == "google-search":
        search_results = google_search(search_query, target_website)
    # Prepare the response
    function_response_body = {"TEXT": {"body": f"Here are the top search results for the query '{search_query}': {search_results} "}}
    action_response = {
        "actionGroup": action_group,
        "function": function,
        "functionResponse": {"responseBody": function_response_body},
    }
    response = {"response": action_response, "messageVersion": event["messageVersion"]}
    return response

The code is truncated for brevity. The full code is available on GitHub.
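For illustration, the truncated helper logic might look roughly like the following sketch. The Secrets Manager call is standard boto3, but the search request shape is an assumption on our part; refer to the search providers' documentation and the repository code for the exact endpoints, fields, and authentication.

import http.client
import json

import boto3

secrets_client = boto3.client("secretsmanager")

def get_api_key(secret_name: str) -> str:
    # Read an API key that was stored in AWS Secrets Manager earlier
    return secrets_client.get_secret_value(SecretId=secret_name)["SecretString"]

def tavily_ai_search(search_query: str, target_website: str = "") -> str:
    # Illustrative request shape; check the provider documentation and the
    # GitHub repository for the exact endpoint, fields, and authentication
    payload = {"api_key": get_api_key("TAVILY_API_KEY"), "query": search_query}
    if target_website:
        payload["include_domains"] = [target_website]
    conn = http.client.HTTPSConnection("api.tavily.com")
    conn.request("POST", "/search", json.dumps(payload), {"Content-Type": "application/json"})
    return conn.getresponse().read().decode("utf-8")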

  3. Choose Deploy.

As part of the Quick create a new Lambda function option selected earlier, the agent builder configured the function with a resource-based policy that allows the Amazon Bedrock service principal to invoke the function, so there is no need to update the IAM role used by the agent. However, the function needs permission to access the API keys saved in Secrets Manager.

  4. On the function details page, choose the Configuration tab, then choose Permissions.
  5. Choose the link for Role name to open the role on the IAM console.

Execution role

  6. Open the JSON view of the IAM policy under Policy name and choose Edit to edit the policy.

Permissions policies

  7. Add the following statement, which gives the Lambda function the required access to read the API keys from Secrets Manager. Adjust the Region code as needed, and provide your AWS account ID.
{
  "Action": "secretsmanager:GetSecretValue",
  "Resource": [
    "arn:aws:secretsmanager:us-west-2:<account_id>:secret:SERPER_API_KEY*",
    "arn:aws:secretsmanager:<region_name>:<account_id>:secret:TAVILY_API_KEY*"
  ],
  "Effect": "Allow",
  "Sid": "GetSecretsManagerSecret"
}

Test the agent

You’re now ready to test the agent.

  1. On the Amazon Bedrock console, on the websearch-agent details page, choose Test.
  2. Choose Prepare to prepare the agent and test it with the latest changes.
  3. As test input, you can ask a question such as “What are the latest news from AWS?”

Test the agent

  4. To see the details of each step of the agent orchestration, including the reasoning steps, choose Show trace (already opened in the preceding screenshot).

This helps you understand the agent decisions and debug the agent configuration if the result isn’t as expected. We encourage you to investigate how the instructions for the agent and the tool instructions are handed to the agent by inspecting the traces of the agent.

In the next section, we walk through deploying the web search agent with the AWS CDK.

Build a web search Amazon Bedrock agent with the AWS CDK

Both AWS CloudFormation and AWS CDK support have been released for Amazon Bedrock Agents, so you can develop and deploy the preceding agent completely in code.

The AWS CDK example in this post uses Python. The following are the required steps to deploy this solution:

  1. Install the AWS CDK version 2.174.3 or later and set up your AWS CDK Python environment with Python 3.11 or later.
  2. Clone the GitHub repository and install the dependencies.
  3. Run AWS CDK bootstrapping on your AWS account.

The structure of the sample AWS CDK application repository is:

  • /app.py file – Contains the top-level definition of the AWS CDK app
  • /cdk folder – Contains the stack definition for the web search agent stack
  • /lambda folder – Contains the Lambda function runtime code that handles the calls to the Serper and Tavily AI APIs
  • /test folder – Contains a Python script to test the deployed agent

To create an Amazon Bedrock agent, the key resources required are:

  • An action group that defines the functions available to the agent
  • A Lambda function that implements these functions
  • The agent itself, which orchestrates the interactions between the FMs, functions, and user conversations

AWS CDK code to define an action group

The following Python code defines an action group as a Level 1 (L1) construct. L1 constructs, also known as AWS CloudFormation resources, are the lowest-level constructs available in the AWS CDK and offer no abstraction. Currently, the available Amazon Bedrock AWS CDK constructs are L1. With the action_group_executor parameter of AgentActionGroupProperty, you define the Lambda function containing the business logic that is carried out when the action is invoked.

action_group = bedrock.CfnAgent.AgentActionGroupProperty(
    action_group_name=f"{ACTION_GROUP_NAME}",
    description="Action that will trigger the lambda",
    action_group_executor=bedrock.CfnAgent.ActionGroupExecutorProperty(lambda_=lambda_function.function_arn),
    function_schema=bedrock.CfnAgent.FunctionSchemaProperty(
        functions=[
            bedrock.CfnAgent.FunctionProperty(
                name="tavily-ai-search",
                description="""
                    To retrieve information via the internet
                    or for topics that the LLM does not know about and
                    intense research is needed.
                """,
                parameters={
                    "search_query": bedrock.CfnAgent.ParameterDetailProperty(
                        type="string",
                        description="The search query for the Tavily web search.",
                        required=True,
                    )
                },
            ),
            bedrock.CfnAgent.FunctionProperty(
                name="google-search",
                description="For targeted news, like 'what are the latest news in Austria' or similar.",
                parameters={
                    "search_query": bedrock.CfnAgent.ParameterDetailProperty(
                        type="string",
                        description="The search query for the Google web search.",
                        required=True,
                    )
                },
            ),
        ]
    ),
)

After the Amazon Bedrock agent determines the API operation that it needs to invoke in an action group, it sends information alongside relevant metadata as an input event to the Lambda function.

The following code shows the Lambda handler function that extracts the relevant metadata and populated fields from the request body parameters to determine which function (Serper or Tavily AI) to call. The extracted parameter is search_query, as defined in the preceding action group function. The complete Lambda Python code is available in the GitHub repository.

def lambda_handler(event, _):  # type: ignore
    action_group = event["actionGroup"]
    function = event["function"]
    parameters = event.get("parameters", [])
    search_query, target_website = extract_search_params(action_group, function, parameters)
    search_results: str = ""
    if function == "tavily-ai-search":
        search_results = tavily_ai_search(search_query, target_website)
    elif function == "google-search":
        search_results = google_search(search_query, target_website)

Lastly, with the CfnAgent AWS CDK construct, specify an agent as a resource. The auto_prepare=True parameter creates a DRAFT version of the agent that can be used for testing.

  agent_instruction = """
      You are an agent that can handle various tasks as described below:
      1/ Helping users do research and finding up to date information. For up to date information always
         uses web search. Web search has two flavours:
         1a/ Google Search - this is great for looking up up to date information and current events
         1b/ Tavily AI Search - this is used to do deep research on topics your user is interested in. Not good on being used on news as it does not order search results by date.
      2/ Retrieving knowledge from the vast knowledge bases that you are connected to.
  """

  agent = bedrock.CfnAgent(
      self,
      "WebSearchAgent",
      agent_name="websearch_agent",
      foundation_model="anthropic.claude-3-sonnet-20240229-v1:0",
      action_groups=[action_group],
      auto_prepare=True,
      instruction=agent_instruction,
      agent_resource_role_arn=agent_role.role_arn,
   )

Deploy the AWS CDK application

Complete the following steps to deploy the agent using the AWS CDK:

  1. Clone the example AWS CDK code:
git clone https://github.com/aws-samples/websearch_agent
  2. Create a Python virtual environment, activate it, and install Python dependencies (make sure that you’re using Python 3.11 or later):
python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
  3. To deploy the agent AWS CDK example, run the cdk deploy command:
cdk deploy

When the AWS CDK deployment is finished, it will output values for agent_id and agent_alias_id:

Outputs:
WebSearchAgentStack.agentaliasid = <agent_alias_id>
WebSearchAgentStack.agentid = <agent_id>
WebSearchAgentStack.agentversion = DRAFT

For example:

WebSearchAgentStack.agentaliasid = XP3JHPEDMK
WebSearchAgentStack.agentid = WFRPT9IMBO
WebSearchAgentStack.agentversion = DRAFT

Make a note of the outputs; you need them to test the agent in the next step.

Test the agent

To test the deployed agent, a Python script is available in the test/ folder. You must be authenticated using an AWS account and an AWS_REGION environment variable set. For details, see Configure the AWS CLI.

To run the script, you need the output values and to pass in a question using the --prompt parameter:

python invoke-agent.py --agent_id <agent_id> --agent_alias_id <agent_alias_id> --prompt "What are the latest AWS news?"

For example, with the outputs we received from the preceding cdk deploy command, you would run the following:

python invoke-agent.py --agent_id WFRPT9IMBO --agent_alias_id XP3JHPEDMK --prompt "What are the latest AWS news?"

You would receive the following response (output is truncated for brevity):

Here are some of the latest major AWS news and announcements:
At the recent AWS Summit in New York, AWS announced several new services and capabilities across areas like generative AI, machine learning, databases, and more.
Amazon Q, AWS's generative AI assistant, has been integrated with Smartsheet to provide AI-powered assistance to employees. Amazon Q Developer has also reached general availability with new features for developers.
AWS plans to launch a new Region in Mexico called the AWS Mexico (Central) Region, which will be the second AWS Region in Mexico ....
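Under the hood, the test script calls the Amazon Bedrock agent runtime. The following is a minimal sketch of such an invocation using boto3 (not the repository's exact script): it starts a session, streams the completion chunks, and enables traces so you can inspect the agent's reasoning steps programmatically.

import uuid

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def invoke_agent(agent_id: str, agent_alias_id: str, prompt: str) -> str:
    # Send the prompt to the agent and assemble the streamed completion chunks
    response = agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),
        inputText=prompt,
        enableTrace=True,  # trace events show the agent's reasoning steps
    )
    completion = ""
    for event in response["completion"]:
        if "chunk" in event:
            completion += event["chunk"]["bytes"].decode("utf-8")
    return completion

print(invoke_agent("<agent_id>", "<agent_alias_id>", "What are the latest AWS news?"))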

Clean up

To delete the resources deployed with the agent AWS CDK example, run the following command:

cdk destroy

Use the following commands to delete the API keys created in Secrets Manager:

aws secretsmanager delete-secret --secret-id SERPER_API_KEY
aws secretsmanager delete-secret --secret-id TAVILY_API_KEY

Key considerations

Let’s dive into some key considerations when integrating web search into your AI systems.

API usage and cost management

When working with external APIs, it’s crucial to make sure that your rate limits and quotas don’t become bottlenecks for your workload. Regularly check and identify limiting factors in your system and validate that it can handle the load as it scales. This might involve implementing a robust monitoring system to track API usage, setting up alerts for when you’re approaching limits, and developing strategies to gracefully handle rate-limiting scenarios.

Additionally, carefully consider the cost implications of external APIs. The amount of content returned by these services directly translates into token usage in your language models, which can significantly impact your overall costs. Analyze the trade-offs between comprehensive search results and the associated token consumption to optimize your system’s efficiency and cost-effectiveness. Consider implementing caching mechanisms for frequently requested information to reduce API calls and associated costs.
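As a simple illustration of these ideas, the following sketch wraps a hypothetical search_api_call helper with an in-memory cache and exponential backoff. A production system would likely use a shared cache with a TTL and the provider's documented rate-limit signals instead.

import time
from functools import lru_cache

class RateLimitError(Exception):
    pass

def search_api_call(query: str) -> str:
    # Placeholder for your SerpAPI or Tavily AI request
    raise NotImplementedError

@lru_cache(maxsize=1024)
def cached_search(query: str) -> str:
    # Serve repeated queries from memory and back off when rate limited
    for attempt in range(5):
        try:
            return search_api_call(query)
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Search still rate limited after retries: {query}")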

Privacy and security considerations

It’s essential to thoroughly review the pricing and privacy agreements of your chosen web search provider. The agentic systems you’re building can potentially leak sensitive information to these providers through the search queries sent. To mitigate this risk, consider implementing data sanitization techniques to remove or mask sensitive information before it reaches the search provider. This becomes especially crucial when building or enhancing secure chatbots and internally facing systems—educating your users about these privacy considerations is therefore of utmost importance.

To add an extra layer of security, you can implement guardrails, such as those provided by Amazon Bedrock Guardrails, in the Lambda functions that call the web search. This additional safeguard can help protect against inadvertent information leakage to web search providers. These guardrails could include pattern matching to detect potential personally identifiable information (PII), allow and deny lists for certain types of queries, or AI-powered content classifiers to flag potentially sensitive information.
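A lightweight example of the pattern-matching approach is sketched below. It masks common PII shapes before a query leaves your system; it is illustrative only and not a substitute for purpose-built detection such as Amazon Comprehend PII detection or Amazon Bedrock Guardrails.

import re

# Illustrative patterns only; extend or replace with purpose-built PII detection
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def sanitize_query(query: str) -> str:
    # Mask anything that looks like PII before sending the query to a search provider
    for label, pattern in PII_PATTERNS.items():
        query = pattern.sub(f"[{label} removed]", query)
    return query

print(sanitize_query("Payoff quote for jane.doe@example.com, call me at +1 555 123 4567"))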

Localization and contextual search

When designing your web search agent, it’s crucial to consider that end-users are accustomed to the search experience provided by standard web browsers, especially on mobile devices. These browsers often supply additional context as part of a web search, significantly enhancing the relevance of results. Key aspects of localization and contextual search include language considerations, geolocation, search history and personalization, and time and date context. For language considerations, you can implement language detection to automatically identify the user’s preferred language or provide it through the agent’s session context.

Refer to Control agent session context for details on how to provide session context in Amazon Bedrock Agents.

It’s important to support multilingual queries and results, using a model that supports your specific language needs. Geolocation is another critical factor; utilizing the user’s approximate location (with permission) can provide geographically relevant results. Search history and personalization can greatly enhance the user experience. Consider implementing a system (with user consent) to remember recent searches and use this context for result ranking. You can customize an Amazon Bedrock agent with the session state feature. Adding a user’s location attributes to the session state is a potential implementation option.

Additionally, allow users to set persistent preferences for result types, such as preferring videos over text articles. Time and date context is also vital; use the user’s local time zone for time-sensitive queries like “latest news on quarterly numbers of company XYZ, now,” and consider seasonal context for queries that might have different meanings depending on the time of year.

For instance, without providing such extra information, a query like “What is the current weather in Zurich?” could yield results for any Zurich globally, be it in Switzerland or various locations in the US. By incorporating these contextual elements, your search agent can distinguish that a user in Europe is likely asking about Zurich, Switzerland, whereas a user in Illinois might be interested in the weather at Lake Zurich. To implement these features, consider creating a system that safely collects and utilizes relevant user context. However, always prioritize user privacy and provide clear opt-in mechanisms for data collection. Clearly communicate what data is being used and how it enhances the search experience. Offer users granular control over their data and the ability to opt out of personalized features. By carefully balancing these localization and contextual search elements, you can create a more intuitive and effective web search agent that provides highly relevant results while respecting user privacy.
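As a hedged example of passing such context, the following sketch supplies locale, location, and time zone as prompt session attributes when invoking the agent. The attribute names and values are hypothetical, and they should only be collected and forwarded with the user's consent.

import uuid

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Attribute names and values are hypothetical and should only be collected
# and passed along with the user's consent
response = agent_runtime.invoke_agent(
    agentId="<agent_id>",
    agentAliasId="<agent_alias_id>",
    sessionId=str(uuid.uuid4()),
    inputText="What is the current weather in Zurich?",
    sessionState={
        "promptSessionAttributes": {
            "preferredLanguage": "en-US",
            "approximateLocation": "Zurich, Switzerland",
            "localTimezone": "Europe/Zurich",
        }
    },
)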

Performance optimization and testing

Performance optimization and testing are critical aspects of building a robust web search agent. Implement comprehensive latency testing to measure response times for various query types and content lengths across different geographical regions. Conduct load testing to simulate concurrent users and identify system limits if applicable to your application. Optimize your Lambda functions for cold starts and runtime, and consider using Amazon CloudFront to reduce latency for global users. Implement error handling and resilience measures, including fallback mechanisms and retry logic. Set up Amazon CloudWatch alarms for key metrics such as API latency and error rates to enable proactive monitoring and quick response to performance issues.

To test the solution end to end, create a dataset of questions and correct answers to test if changes to your system improve or deteriorate the information retrieval capabilities of your app.
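A minimal version of such a regression check might look like the following sketch, where ask_agent is whatever function you use to query the deployed agent and the questions and expected keywords are placeholders for your own ground truth.

from typing import Callable

# Placeholder ground truth; replace with questions and facts from your domain
EVAL_SET = [
    {"question": "What are the latest AWS news?", "expected_keyword": "AWS"},
    {"question": "Which new AWS Region was announced in Mexico?", "expected_keyword": "Mexico"},
]

def evaluate(ask_agent: Callable[[str], str]) -> float:
    # ask_agent is any function that sends a prompt to the deployed agent
    # and returns its final answer as text
    passed = sum(
        1 for case in EVAL_SET
        if case["expected_keyword"].lower() in ask_agent(case["question"]).lower()
    )
    return passed / len(EVAL_SET)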

Migration strategies

For organizations considering a migration from open source frameworks like LangChain to Amazon Bedrock Agents, it’s important to approach the transition strategically. Begin by mapping your current ReAct agent’s logic to the Amazon Bedrock agents’ action groups and Lambda functions. Identify any gaps in functionality and plan for alternative solutions or custom development where necessary. Adapt your existing API calls to work with the Amazon Bedrock API and update authentication methods to use IAM roles and policies.

Develop comprehensive test suites to make sure functionalities are correctly replicated in the new environment. One significant advantage of Amazon Bedrock agents is the ability to implement a gradual rollout. By using the agent alias ID, you can quickly direct traffic between different versions of your agent, allowing for a smooth and controlled migration process. This approach enables you to test and validate your new implementation with a subset of users or queries before fully transitioning your entire system.

By carefully balancing these considerations—from API usage and costs to privacy concerns, localization, performance optimization, and migration strategies—you can create a more intelligent, efficient, and user-friendly search experience that respects individual preferences and data protection regulations. As you build and refine your web search agent with Amazon Bedrock, keep these factors in mind to provide a robust, scalable, and responsible AI system.

Expanding the solution

With this post, you’ve taken the first step towards revolutionizing your applications with Amazon Bedrock Agents and the power of agentic workflows with LLMs. You’ve not only learned how to integrate dynamic web content, but also gained insights into the intricate relationship between AI agents and external information sources.

Transitioning your existing systems to Amazon Bedrock agents is a seamless process, and with the AWS CDK, you can manage your agentic AI infrastructure as code, providing scalability, reliability, and maintainability. This approach not only streamlines your development process, but also paves the way for more sophisticated AI-driven applications that can adapt and grow with your business needs.

Expand your horizons and unlock even more capabilities:

  • Connect to an Amazon Bedrock knowledge base – Augment your agents’ knowledge by integrating them with a centralized knowledge repository, enabling your AI to draw upon a vast, curated pool of information tailored to your specific domain.
  • Embrace streaming – Use the power of streaming responses to provide an enhanced user experience and foster a more natural and interactive conversation flow, mimicking the real-time nature of human dialogue and keeping users engaged throughout the interaction.
  • Expose ReAct prompting and tool use – Parse the streaming output on your frontend to visualize the agent’s reasoning process and tool usage, providing invaluable transparency and interpretability for your users, building trust, and allowing users to understand and verify the AI’s decision-making process.
  • Utilize memory for Amazon Bedrock Agents – Amazon Bedrock agents can retain a summary of their conversations with each user and are able to provide a smooth, adaptive experience if enabled. This allows you to give extra context for tasks like web search and topics of interest, creating a more personalized and contextually aware interaction over time.
  • Give extra context – As outlined earlier, context matters. Try to implement additional user context through the session attributes that you can provide through the session state. Refer to Control agent session context for the technical implementations, and consider how this context can be used responsibly to enhance the relevance and accuracy of your agent’s responses.
  • Add agentic web research – Agents allow you to build very sophisticated workflows. Our system is not limited to a simple web search. The Lambda function can also serve as an environment to implement an agentic web research with multi-agent collaboration, enabling more comprehensive and nuanced information gathering and analysis.

What other tools would you use to complement your agent? Refer to the aws-samples GitHub repo for Amazon Bedrock Agents to see what others have built and consider how these tools might be integrated into your own unique AI solutions.

Conclusion

The future of generative AI is here, and Amazon Bedrock Agents is your gateway to unlocking its full potential. Embrace the power of agentic LLMs and experience the transformative impact they can have on your applications and user experiences. As you embark on this journey, remember that the true power of AI lies not just in its capabilities, but in how we thoughtfully and responsibly integrate it into our systems to solve real-world problems and enhance human experiences.

If you would like us to follow up with a second post tackling any points discussed here, feel free to leave a comment. Your engagement helps shape the direction of our content and makes sure we’re addressing the topics that matter most to you and the broader AI community.

In this post, you have seen the steps needed to integrate dynamic web content and harness the full potential of generative AI, but don’t stop here.


About the Authors

Philipp Kaindl is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. With a background in data science and mechanical engineering, his focus is on empowering customers to create lasting business impact with the help of AI. Connect with Philipp on LinkedIn.

Markus Rollwagen is a Senior Solutions Architect at AWS, based in Switzerland. He enjoys deep dive technical discussions, while keeping an eye on the big picture and the customer goals. With a software engineering background, he embraces infrastructure as code and is passionate about all things security. Connect with Markus on LinkedIn.

Build a generative AI assistant to enhance employee experience using Amazon Q Business

In today’s fast-paced business environment, organizations are constantly seeking innovative ways to enhance employee experience and productivity. Many challenges can impact employee productivity, such as cumbersome search experiences or difficulty finding specific information across an organization’s vast knowledge bases. Additionally, with the rise of remote and hybrid work models, traditional support systems such as IT helpdesks and HR might struggle to keep up with the increased demand for assistance. The resulting productivity loss can lead to lengthy onboarding times for new employees, extended task completion times, and increased call volumes for undifferentiated IT and HR support, to name a few.

Amazon Q Business is a fully managed, generative artificial intelligence (AI) powered assistant that can address the challenges mentioned above by providing 24/7 support tailored to individual needs. It can handle a wide range of tasks such as answering questions, providing summaries, generating content, and completing tasks based on data in your organization. Additionally, Amazon Q Business offers enterprise-grade data security and privacy and has built-in guardrails that are configurable by an admin. Customers like Deriv were able to reduce new employee onboarding time by up to 45% and overall recruiting efforts by as much as 50% by making generative AI available to all of their employees in a safe way.

In this blog post, we will talk about Amazon Q Business use cases, walk through an example application, and discuss approaches for measuring productivity gains.

Use cases overview

Some key use cases for Amazon Q Business for organizations include:

  • Providing grounded responses to employees: An organization can deploy Amazon Q Business on their internal data, documents, products, and services. This allows Amazon Q Business to understand the business context and provide tailored assistance to employees on common questions, tasks, and issues.
  • Improving employee experience: By deploying Amazon Q Business across various environments like websites, apps, and chatbots, organizations can provide unified, engaging and personalized experiences. Employees will have a consistent experience wherever they choose to interact with the generative AI assistant.
  • Knowledge management: Amazon Q Business helps organizations use their institutional knowledge more effectively. It can be integrated with internal knowledge bases, manuals, best practices, and more, to provide a centralized source of information to employees.
  • Project management and issue tracking: With Amazon Q Business plugins, users can use natural language to open tickets without leaving the chat interface. Previously resolved tickets can also be used to help reduce overall ticket volumes and get employees the information they need faster to resolve an issue.

Amazon Q Business features

The Amazon Q Business-powered chatbot aims to provide comprehensive support to users with a multifaceted approach. It offers multiple data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. Amazon Q Business supports over 40 connectors at the time of writing. Additionally, Amazon Q Business also supports plugins to enable users to take action from within the conversation. There are four native plugins offered, and a custom plugin option to integrate with any third-party application.

Using the Business User Store feature, users see chat responses generated only from the documents that they have access to within an Amazon Q Business application. You can also customize your application environment to your organizational needs by using application environment guardrails or chat controls such as global controls and topic-level controls that you can configure to manage the user chat experience.

Features like document enrichment and relevance tuning together play a key role in further customizing and enhancing your applications. The document enrichment feature helps you control both what documents and document attributes are ingested into your index and also how they’re ingested. Using document enrichment, you can create, modify, or delete document attributes and document content when you ingest them into your Amazon Q Business index. You can then assign weights to document attributes after mapping them to index fields using the relevance tuning feature. You can use these assigned weights to fine-tune the underlying ranking of Retrieval-Augmented Generation (RAG)-retrieved passages within your application environment to optimize the relevance of chat responses.

Amazon Q Business offers robust security features to protect customer data and promote responsible use of the AI assistant. It uses pre-trained machine learning models and does not use customer data to train or improve the models. The service supports encryption at rest and in transit, and administrators can configure various security controls such as restricting responses to enterprise content only, specifying blocked words or phrases, and defining special topics with customized guardrails. Additionally, Amazon Q Business uses the security capabilities of Amazon Bedrock, the underlying AWS service, to enforce safety, security, and responsible use of AI.

Sample application architecture

The following figure shows a sample application architecture.

Sample Architecture Diagram

Application architecture walkthrough

Before you begin to create an Amazon Q Business application environment, make sure that you complete the setting up tasks and review the Before you begin section. This includes tasks like setting up required AWS Identity and Access Management (IAM) roles and enabling and pre-configuring an AWS IAM Identity Center instance.

As the next step towards creating a generative AI assistant, you can create the Amazon Q Business web experience. The web experience can be created using either the AWS Management Console or the Amazon Q Business APIs.

After creating your Amazon Q Business application environment, you create and select the retriever and provision the index that will power your generative AI web experience. The retriever pulls data from the index in real time during a conversation. After you select a retriever for your Amazon Q Business application environment, you connect data sources to it.

This sample application connects to repositories like Amazon Simple Storage Service (Amazon S3) and SharePoint, and to public facing websites or internal company websites using Amazon Q Web Crawler. The application also integrates with service and project management tools such as ServiceNow and Jira and enterprise communication tools such as Slack and Microsoft Teams. The application uses built-in plugins for Jira and ServiceNow to enable users to perform specific tasks related to supported third-party services from within their web experience chat, such as creating a Jira ticket or opening an incident in ServiceNow.

After the data sources are configured, data is ingested and synchronized into indexes maintained by the Amazon Q Business service. Authorized users interact with the application environment through the web experience URL after successfully authenticating. You can also use the Amazon Q Business APIs to build a custom UI that implements special features such as handling feedback, using company brand colors and templates, and using a custom sign-in, and that lets users converse with Amazon Q through an interface personalized to your use case.
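
For example, a custom UI backend can call the Amazon Q Business ChatSync API to send a user message and receive a grounded answer with source attributions. The following is a minimal sketch, not a production implementation; the application ID is a placeholder, and identity and error handling depend on how your application environment is configured.

# Minimal sketch: calling the Amazon Q Business ChatSync API from a custom UI backend.
# The application ID below is a placeholder -- substitute your own value.
import boto3

qbusiness = boto3.client("qbusiness")

def ask_q_business(question, application_id, conversation_id=None):
    """Send a user message to an Amazon Q Business application and return the grounded reply."""
    request = {
        "applicationId": application_id,   # placeholder: your Amazon Q Business application ID
        "userMessage": question,
    }
    if conversation_id:
        # Continue an existing conversation instead of starting a new one
        request["conversationId"] = conversation_id

    response = qbusiness.chat_sync(**request)
    return {
        "answer": response.get("systemMessage"),
        "conversationId": response.get("conversationId"),
        "sources": response.get("sourceAttributions", []),
    }

# Example usage (hypothetical application ID):
# print(ask_q_business("How do I reset my VPN password?", "a1b2c3d4-example-app-id")["answer"])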

Application demo

Here are a few screenshots demonstrating an AI assistant application using Amazon Q Business. These screenshots illustrate a scenario where an employee interacts with the Amazon Q Business chatbot to get summaries, address common queries related to IT support, and open tickets or incidents using IT service management (ITSM) tools such as ServiceNow.

  1. Employee A interacts with the application to get help when wireless access was down and receives suggested actions to take:
    Screenshot showing employee interacting with the application to get help when wireless access was down
  2. Employee B interacts with the application to report an incident of wireless access down and receives a form to fill out to create a ticket:
    Screenshot showing employee interacting with the form presented by the application to create an incident in ServiceNow
    Screenshot showing the created incident in the application
    An incident is created in ServiceNow based on Employee B’s interaction:
    Screenshot of the created incident in ServiceNow
  3. A new employee in the organization interacts with the application to ask several questions about company policies and receives reliable answers:
    Screenshot showing employee interacting with the application to ask several questions about company policies
  4. A new employee in the organization asks the application how to reach IT support and receives detailed IT support contact information:
    Screenshot showing employee interacting with the application on how to reach IT support

Approaches for measuring productivity gains

There are several approaches to measure productivity gains achieved by using a generative AI assistant. Here are some common metrics and methods:

Average search time reduction: Measure the time employees spend searching for information or solutions before and after implementing the AI assistant. A reduction in average search time indicates faster access to information, which can lead to shorter task completion times and improved efficiency.

    • Units: Percentage reduction in search time or absolute time saved (for example, hours or minutes)
    • Example: 40% reduction in average search time or 1 hour saved per employee per day

Task completion time: Measure the time taken to complete specific tasks or processes with and without the AI assistant. Shorter completion times suggest productivity gains.

    • Units: Percentage reduction in task completion time or absolute time saved (for example, hours or minutes)
    • Example: 30% reduction in task completion time or 2 hours saved per task

Recurring issues: Monitor the number of tickets raised for recurring issues and issues related to tasks or processes that the AI assistant can handle. A decrease in these tickets indicates improved productivity and reduced workload for employees.

    • Units: Percentage reduction in recurring issue frequency or absolute reduction in occurrences
    • Example: 40% reduction in the frequency of recurring issue X or 50 fewer occurrences per quarter

Overall ticket volume: Track the total number of tickets or issues raised related to tasks or processes that the AI assistant can handle.

    • Units: Percentage reduction in ticket volume or absolute number of tickets reduced
    • Example: 30% reduction in relevant ticket volume or 200 fewer tickets per month

Employee onboarding duration: Evaluate the time required for new employees to become fully productive with and without the AI assistant. Shorter onboarding times can indicate that the AI assistant is providing effective support, which translates to cost savings and faster time-to-productivity.

    • Units: Percentage reduction in onboarding time or absolute time saved (for example, days or weeks)
    • Example: 20% reduction in onboarding duration or 2 weeks saved per new employee

Employee productivity metrics: Track metrics such as output per employee or output quality before and after implementing the AI assistant. Improvements in these metrics can indicate productivity gains.

    • Units: Percentage improvement in output quality or reduction in rework or corrections
    • Example: 15% improvement in output quality or 30% reduction in rework required

Cost savings: Calculate the cost savings achieved through reduced labor hours, improved efficiency, and faster turnaround times enabled by the AI assistant.

    • Units: Monetary value (for example, dollars or euros) saved
    • Example: $100,000 in cost savings due to increased productivity

Knowledge base utilization: Measure the increase in utilization or effectiveness of knowledge bases or self-service resources because of the AI assistant’s ability to surface relevant information.

    • Units: Percentage increase in knowledge base utilization
    • Example: 20% increase in knowledge base utilization

Employee satisfaction surveys: Gather feedback from employees on their perceived productivity gains, time savings, and overall satisfaction with the AI assistant. Positive feedback can lead to increased retention, better performance, and a more positive work environment.

    • Units: Employee satisfaction score or percentage of employees reporting positive impact
    • Example: 80% of employees report increased productivity and satisfaction with the AI assistant

It’s important to establish baseline measurements before introducing the AI assistant and then consistently track the relevant metrics over time. Additionally, conducting controlled experiments or pilot programs can help isolate the impact of the AI assistant from other factors affecting productivity.
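
As a simple illustration of how these metrics can be computed once you have baseline and post-deployment measurements, the following Python sketch uses hypothetical numbers; the figures are placeholders, not benchmarks.

# Illustrative only: computing the productivity metrics described above from
# hypothetical baseline and post-deployment measurements.

def pct_reduction(baseline, current):
    """Percentage reduction relative to the baseline measurement."""
    return (baseline - current) / baseline * 100

# Hypothetical measurements (minutes spent searching per employee per day)
baseline_search_min = 45.0
current_search_min = 27.0

reduction = pct_reduction(baseline_search_min, current_search_min)   # 40.0 percent
minutes_saved_per_day = baseline_search_min - current_search_min     # 18 minutes

# Annualized hours saved across the workforce (assuming 250 working days and 500 employees)
employees = 500
hours_saved_per_year = minutes_saved_per_day / 60 * 250 * employees  # 37,500 hours

print(f"Search time reduction: {reduction:.0f}%")
print(f"Hours saved per year across {employees} employees: {hours_saved_per_year:,.0f}")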

Conclusion

In this blog post, we explored how you can use Amazon Q Business to build generative AI assistants that enhance employee experience and boost productivity. By seamlessly integrating with internal data sources, knowledge bases, and productivity tools, Amazon Q Business equips your workforce with instant access to information, automated tasks, and personalized support. Using its robust capabilities, including multi-source connectors, document enrichment, relevance tuning, and enterprise-grade security, you can create tailored AI solutions that streamline workflows, optimize processes, and drive tangible gains in areas like task completion times, issue resolution, onboarding efficiency, and cost savings.

Unlock the transformative potential of Amazon Q Business and future-proof your organization—contact your AWS account team today.

Read more about Amazon Q


About the Authors

Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about Monitoring and Observability, Cloud Financial Management, and Generative Artificial Intelligence (Gen-AI) domains. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.

Krishna Pramod is a Senior Solutions Architect at AWS. He works as a trusted advisor for customers, helping customers innovate and build well-architected applications in AWS cloud. Outside of work, Krishna enjoys reading, music and traveling.

Tim McLaughlin is a Senior Product Manager for Amazon Q Business at Amazon Web Services (AWS). He is passionate about helping customers adopt generative AI services to meet evolving business challenges. Outside of work, Tim enjoys spending time with his family, hiking, and watching sports.

Read More

Introducing document-level sync reports: Enhanced data sync visibility in Amazon Kendra

Introducing document-level sync reports: Enhanced data sync visibility in Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer.

Amazon Kendra securely connects to over 40 data sources. When using your data source, you might want better visibility into the document processing lifecycle during data source sync jobs. This includes knowing the status of each document you attempted to crawl and index, and being able to troubleshoot why certain documents were not returned with the expected answers. Additionally, you might need access to the metadata, timestamps, and access control lists (ACLs) for the indexed documents.

We are pleased to announce a new feature now available in Amazon Kendra that significantly improves visibility into data source sync operations. The latest release introduces a comprehensive document-level report incorporated into the sync history, providing administrators with granular indexing status, metadata, and ACL details for every document processed during a data source sync job. This enhancement to sync job observability enables administrators to quickly investigate and resolve ingestion or access issues encountered while setting up Amazon Kendra indexes. The detailed document reports are persisted in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Kendra index log group, so critical sync job details are available on-demand when troubleshooting.

In this post, we discuss the benefits of this new feature and how it offers enhanced data sync visibility in Amazon Kendra.

Lifecycle of a document in a data source sync run job

In this section, we examine the lifecycle of a document within a data source sync in Amazon Kendra. This provides valuable insight into the sync process. The data source sync comprises three key stages: crawling, syncing, and indexing. Crawling involves the connector connecting to the data source and extracting documents meeting the defined sync scope according to the data source configuration. These documents are then synced to the Amazon Kendra index during the syncing phase. Finally, indexing makes the synced documents searchable within the Amazon Kendra environment.

The following diagram shows a flowchart of a sync run job.

Crawling stage

The first stage is the crawling stage, where the connector crawls all documents and their metadata from the data source. During this stage, the connector also compares the checksum of the document against the Amazon Kendra index to determine if a particular document needs to be added, modified, or deleted from the index. This operation corresponds to the CrawlAction field in the sync run history report.

If the document is unmodified, it’s marked as UNMODIFIED and skipped in the rest of the stages. If a document fails in the crawling stage (for example, due to throttling errors, broken content, or a document size that exceeds the limit), it is marked in the sync run history report with a CrawlStatus of FAILED. If the document was skipped due to any validation errors, its CrawlStatus is marked as SKIPPED. These documents are not sent to the next stage. All successful documents are marked as SUCCESS and are sent forward.

We also capture the ACLs and metadata on each document in this stage to be able to add it to the sync run history report.

Syncing stage

During the syncing stage, the document is sent to Amazon Kendra ingestion service APIs like BatchPutDocument and BatchDeleteDocument. After a document is submitted to these APIs, Amazon Kendra runs validation checks on the submitted documents. If any document fails these checks, its SyncStatus is marked as FAILED. If there is an irrecoverable error for a particular document, it is marked as SKIPPED and other documents are sent forward.

Indexing stage

In this step, Amazon Kendra parses the document, processes it according to its content type, and persists it in the index. If the document fails to be persisted, its IndexStatus is marked as FAILED; otherwise, it is marked as SUCCESS.

After the statuses of all the stages have been captured, we emit these statuses as an Amazon CloudWatch event to the customer’s AWS account.

Key features and benefits of document-level reports

The following are the key features and benefits of the new document-level report in Amazon Kendra indexes:

  • Enhanced sync run history page – A new Actions column has been added to the sync run history page, providing access to the document-level report for each sync run.

  • Dedicated log stream – A new log stream named SYNC_RUN_HISTORY_REPORT has been created in the Amazon Kendra CloudWatch log group, containing the document-level report.

  • Comprehensive document information – The document-level report includes the following information for each document:
    • Document ID – This is the document ID that is inherited directly from the data source or mapped by the customer in the data source field mappings.
    • Document title – The title of the document is taken from the data source or mapped by the customer in the data source field mappings.
    • Consolidated document status (SUCCESS, FAILED, or SKIPPED) – This is the final consolidated status of the document. It can have a value of SUCCESS, FAILED, or SKIPPED. If the document was successfully processed in all stages, then the value is SUCCESS. If the document failed or was skipped in any of the stages, then the value of this field will be FAILED or SKIPPED, respectively.
    • Error message (if the document failed) – This field contains the error message with which a document failed. If a document was skipped due to throttling errors, or any internal errors, this will be shown in the error message field.
    • Crawl status – This field denotes whether the document was crawled successfully from the data source. This status correlates to the syncing-crawling state in the data source sync.
    • Sync status – This field denotes whether the document was sent for syncing successfully. This correlates to the syncing-indexing state in the data source sync.
    • Index status – This field denotes whether the document was successfully persisted in the index.
    • ACLs – This field contains a list of document-level permissions that were crawled from the data source. The details of each element in the list are:
      • Global name – This is the email or user name of the user. This field is mapped across multiple data sources. For example, suppose a user has three data sources, Confluence, SharePoint, and Gmail, with the local user IDs confluence_user, sharepoint_user, and gmail_user, respectively. If the email address user@email.com is the globalName in the ACL for all of them, Amazon Kendra understands that these local user IDs all map to the same global name.
      • Name – This is the local unique ID of the user, which is assigned by the data source.
      • Type – This field indicates the principal type. This can be either USER or GROUP.
      • Is Federated – This is a boolean flag that indicates whether the group is of INDEX level (true) or DATASOURCE level (false).
      • Access – This field indicates whether the user has access allowed or denied explicitly. Values can be either ALLOWED or DENIED.
      • Data source ID – This is the data source ID. For federated groups (INDEX level), this field will be null.
    • Metadata – This field contains the metadata fields (other than ACL) that were pulled from the data source. This list also includes the metadata fields mapped by the customer in the data source field mappings as well as extra metadata fields added by the connector.
    • Hashed document ID (for troubleshooting assistance) – To safeguard your data privacy, we present a secure, one-way hash of the document identifier. This encrypted value enables the Amazon Kendra team to efficiently locate and analyze the specific document within our logs, should you encounter any issue that requires further investigation and resolution.
    • Timestamp – The timestamp indicates when the document status was logged in CloudWatch.

In the following sections, we explore different use cases for the logging feature.

Determine the optimal boosting duration for recent documents using document-level reporting

When it comes to generating accurate answers, you may want to fine-tune the way Amazon Kendra prioritizes its content. For instance, you may prefer to boost recent documents over older ones to make sure the most up-to-date passages are used to generate an answer. To achieve this, you can use the relevance tuning feature in Amazon Kendra to boost documents based on the last update date attribute, with a specified boosting duration. However, determining the optimal boosting period can be challenging when dealing with a large number of frequently changing documents.

You can now use the per-document-level report to obtain the _last_updated_at metadata field information for your documents, which can help you determine the appropriate boosting period. For this, you use the following CloudWatch Logs Insights query to retrieve the _last_updated_at metadata attribute for machine learning documents from the SYNC_RUN_HISTORY_REPORT log stream.

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Metadata like 'Machine Learning'
| parse Metadata '{"key":"_last_updated_at","value":{"dateValue":"*"}}' as @last_updated_at
| sort @last_updated_at desc, @timestamp desc
| dedup DocumentTitle

With the preceding query, you can gain insights into the last updated timestamps of your documents, enabling you to make informed decisions about the optimal boosting period. This approach makes sure your chat responses are generated using the most recent and relevant information, enhancing the overall accuracy and effectiveness of your Amazon Kendra implementation.

The following screenshot shows an example result.

Common document indexing observability and troubleshooting methods

In this section, we explore some common admin tasks for observing and troubleshooting document indexing using the new document-level reporting feature.

List all successfully indexed documents from a data source

To retrieve a list of all documents that have been successfully indexed from a specific data source, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort @timestamp desc | dedup DocumentTitle, DocumentId

The following screenshot shows an example result.

List all successfully indexed documents from a data source sync job

To retrieve a list of all documents that have been successfully indexed during a specific sync job, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort DocumentTitle

The following screenshot shows an example result.

List all failed indexed documents from a data source sync job

To retrieve a list of all documents that failed to index during a specific sync job, along with the error messages, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "FAILED"
| sort @timestamp desc

The following screenshot shows an example result.

List all documents that contain a user’s ACL permission from an Amazon Kendra index

To retrieve a list of documents that have a specific user’s ACL permission, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Acl like 'aneesh@mydemoaws.onmicrosoft.com'
| display DocumentTitle, SourceUri

The following screenshot shows an example result.

List the ACL of an indexed document from a data source sync job

To retrieve the ACL information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Acl

The following screenshot shows an example result.

List metadata of an indexed document from a data source sync job

To retrieve the metadata information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Metadata

The following screenshot shows an example result.
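
You can also run these queries programmatically, for example as part of a scheduled sync audit, by calling the CloudWatch Logs Insights API. The following boto3 sketch runs the earlier failed-documents query; the log group name is a placeholder, so substitute the CloudWatch log group of your Amazon Kendra index.

# Minimal sketch: running one of the preceding CloudWatch Logs Insights queries with boto3.
# The log group name below is a placeholder -- use your Amazon Kendra index log group.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and ConnectorDocumentStatus.Status = "FAILED"
| sort @timestamp desc
"""

def run_insights_query(log_group, query, hours=24):
    """Start a Logs Insights query and poll until it completes, returning the result rows."""
    end = int(time.time())
    start = end - hours * 3600
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result.get("results", [])
        time.sleep(2)  # wait before polling again

# Example usage (placeholder log group name):
# failed_docs = run_insights_query("/aws/kendra/your-index-id", QUERY)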

Conclusion

The newly introduced document-level report in Amazon Kendra provides enhanced visibility and observability into the document processing lifecycle during data source sync jobs. This feature addresses a critical need expressed by customers for better troubleshooting capabilities and access to detailed information about the indexing status, metadata, and ACLs of individual documents.

The document-level report is stored in a log stream named SYNC_RUN_HISTORY_REPORT within the Amazon Kendra index CloudWatch log group. This report contains comprehensive information for each document, including the document ID, title, overall document sync status, error messages (if any), along with its ACLs and metadata information retrieved from the data sources. The data source sync run history page now includes an Actions column, providing access to the document-level report for each sync run. This feature significantly improves the ability to troubleshoot issues related to document ingestion and access control, and issues related to metadata relevance, and provides better visibility about the documents synced with an Amazon Kendra index.

To get started with Amazon Kendra, explore the Getting started guide. To learn more about data source connectors and best practices, see Creating a data source connector.


About the Authors

Aneesh Mohan is a Senior Solutions Architect at Amazon Web Services (AWS), with over 20 years of experience in architecting and delivering high-impact solutions for mission-critical workloads. His expertise spans across the financial services industry, AI/ML, security, and data technologies. Driven by a deep passion for technology, Aneesh is dedicated to partnering with customers to design and implement well-architected, innovative solutions that address their unique business needs.

Ashwin Shukla is a Software Development Engineer II on the Amazon Q for Business and Amazon Kendra engineering team, with 6 years of experience in developing enterprise software. In this role, he works on designing and developing foundational features for Amazon Q for Business.

Read More

Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker

Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker

This post is co-written with Meta’s PyTorch team.

In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. Although foundation models (FMs) offer impressive out-of-the-box capabilities, true competitive advantage often lies in deep model customization through fine-tuning. However, fine-tuning LLMs for complex tasks typically requires advanced AI expertise to align and optimize them effectively. Recognizing this challenge, Meta developed torchtune, a PyTorch-native library that simplifies authoring, fine-tuning, and experimenting with LLMs, making it more accessible to a broader range of users and applications.

In this post, AWS collaborates with Meta’s PyTorch team to showcase how you can use Meta’s torchtune library to fine-tune Meta Llama-like architectures while using a fully-managed environment provided by Amazon SageMaker Training. We demonstrate this through a step-by-step implementation of model fine-tuning, inference, quantization, and evaluation. We perform the steps on a Meta Llama 3.1 8B model utilizing the LoRA fine-tuning strategy on a single p4d.24xlarge worker node (providing 8 Nvidia A100 GPUs).

Before diving into the step-by-step guide, we first explored the performance of our technical stack by fine-tuning a Meta Llama 3.1 8B model across various configurations and instance types.

As shown in the following chart, we found that a single p4d.24xlarge instance delivers 70% higher performance than two g5.48xlarge instances (each with 8 NVIDIA A10 GPUs) at an almost 47% lower price. We therefore optimized the example in this post for a p4d.24xlarge configuration. However, you could use the same code to run single-node or multi-node training on different instance configurations by changing the parameters passed to the SageMaker estimator. You could further reduce the training time shown in the following graph by using a SageMaker managed warm pool and accessing pre-downloaded models using Amazon Elastic File System (Amazon EFS).

Challenges with fine-tuning LLMs

Generative AI models offer many promising business use cases. However, to maintain factual accuracy and relevance of these LLMs to specific business domains, fine-tuning is required. Due to the growing number of model parameters and the increasing context length of modern LLMs, this process is memory intensive. To address these challenges, fine-tuning strategies like LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) limit the number of trainable parameters by adding low-rank parallel structures to the transformer layers. This enables you to train LLMs even on systems with low memory availability like commodity GPUs. However, this leads to an increased complexity because new dependencies have to be handled and training recipes and hyperparameters need to be adapted to the new techniques.

What businesses need today is user-friendly training recipes for these popular fine-tuning techniques, which provide abstractions over the end-to-end tuning process and address the common pitfalls in an opinionated way.

How does torchtune help?

torchtune is a PyTorch-native library that aims to democratize and streamline the fine-tuning process for LLMs. By doing so, it makes it straightforward for researchers, developers, and organizations to adapt these powerful LLMs to their specific needs and constraints. It provides training recipes for a variety of fine-tuning techniques, which can be configured through YAML files. The recipes implement common fine-tuning methods (full-weight, LoRA, QLoRA) as well as other common tasks like inference and evaluation. They automatically apply a set of important features (FSDP, activation checkpointing, gradient accumulation, mixed precision) and are specific to a given model family (such as Meta Llama 3/3.1 or Mistral) as well as compute environment (single-node vs. multi-node).

Additionally, torchtune integrates with major libraries and frameworks like Hugging Face datasets, EleutherAI’s Eval Harness, and Weights & Biases. This helps address the requirements of the generative AI fine-tuning lifecycle, from data ingestion and multi-node fine-tuning to inference and evaluation. The following diagram shows a visualization of the steps we describe in this post.

Refer to the installation instructions and PyTorch documentation to learn more about torchtune and its concepts.

Solution overview

This post demonstrates the use of SageMaker Training for running torchtune recipes through task-specific training jobs on separate compute clusters. SageMaker Training is a comprehensive, fully managed ML service that enables scalable model training. It provides flexible compute resource selection, support for custom libraries, a pay-as-you-go pricing model, and self-healing capabilities. By managing workload orchestration, health checks, and infrastructure, SageMaker helps reduce training time and total cost of ownership.

The solution architecture incorporates the following key components to enhance security and efficiency in fine-tuning workflows:

  • Security enhancement – Training jobs are run within private subnets of your virtual private cloud (VPC), significantly improving the security posture of machine learning (ML) workflows.
  • Efficient storage solution – Amazon EFS is used to accelerate model storage and access across various phases of the ML workflow.
  • Customizable environment – We use custom containers in training jobs. The support in SageMaker for custom containers allows you to package all necessary dependencies, specialized frameworks, and libraries into a single artifact, providing full control over your ML environment.

The following diagram illustrates the solution architecture. Users initiate the process by calling the SageMaker control plane through APIs or command line interface (CLI) or using the SageMaker SDK for each individual step. In response, SageMaker spins up training jobs with the requested number and type of compute instances to run specific tasks. Each step defined in the diagram accesses torchtune recipes from an Amazon Simple Storage Service (Amazon S3) bucket and uses Amazon EFS to save and access model artifacts across different stages of the workflow.

By decoupling every torchtune step, we achieve a balance between flexibility and integration, allowing for both independent execution of steps and the potential for automating this process using seamless pipeline integration.

In this use case, we fine-tune a Meta Llama 3.1 8B model with LoRA. Subsequently, we run model inference, and optionally quantize and evaluate the model using torchtune and SageMaker Training.

Recipes, configs, datasets, and prompt templates are completely configurable and allow you to align torchtune to your requirements. To demonstrate this, we use a custom prompt template in this use case and combine it with the open source dataset Samsung/samsum from the Hugging Face hub.

We fine-tune the model using torchtune’s multi-device LoRA recipe (lora_finetune_distributed) and use the SageMaker customized version of the Meta Llama 3.1 8B default config (llama3_1/8B_lora).

Prerequisites

You need to complete the following prerequisites before you can run the SageMaker Jupyter notebooks:

  1. Create a Hugging Face access token to get access to the gated repo meta-llama/Meta-Llama-3.1-8B on Hugging Face.
  2. Create a Weights & Biases API key to access the Weights & Biases dashboard for logging and monitoring.
  3. Request a SageMaker service quota for 1x ml.p4d.24xlarge and 1x ml.g5.2xlarge.
  4. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess, AmazonEC2FullAccess, AmazonElasticFileSystemFullAccess, and AWSCloudFormationFullAccess to give required access to SageMaker to run the examples. (This is for demonstration purposes. You should adjust this to your specific security requirements for production.)
  5. Create an Amazon SageMaker Studio domain (see Quick setup to Amazon SageMaker) to access Jupyter notebooks with the preceding role. Refer to the instructions to set permissions for Docker build.
  6. Log in to the notebook console and clone the GitHub repo:
$ git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
$ cd sagemaker-distributed-training-workshop/13-torchtune
  7. Run the notebook to set up the VPC and Amazon EFS using an AWS CloudFormation stack.

Review torchtune configs

The following figure illustrates the steps in our workflow.

You can look up the torchtune configs for your use case by directly using the tune CLI. For this post, we provide modified config files aligned with the SageMaker directory path structure:

sh-4.2$ cd config/
sh-4.2$ ls -ltr
-rw-rw-r-- 1 ec2-user ec2-user 1151 Aug 26 18:34 config_l3.1_8b_gen_orig.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1172 Aug 26 18:34 config_l3.1_8b_gen_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user  644 Aug 26 18:49 config_l3.1_8b_quant.yaml
-rw-rw-r-- 1 ec2-user ec2-user 2223 Aug 28 14:53 config_l3.1_8b_lora.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1223 Sep  4 14:28 config_l3.1_8b_eval_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1213 Sep  4 14:29 config_l3.1_8b_eval_original.yaml

torchtune uses these config files to select and configure the components (think models and tokenizers) during the execution of the recipes.

Build the container

As part of our example, we create a custom container to provide custom libraries like torch nightlies and torchtune. Complete the following steps:

sh-4.2$ cat Dockerfile
# Set the default values for the REGION and ACCOUNTID build arguments
ARG REGION=us-west-2
ARG ACCOUNTID
# SageMaker PyTorch image for TRAINING
FROM ${ACCOUNTID}.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
# Uninstall existing PyTorch packages
RUN pip uninstall torch torchvision transformer-engine -y
# Install latest release of PyTorch and torchvision
RUN pip install --force-reinstall torch==2.4.1 torchao==0.4.0 torchvision==0.19.1

Run the 1_build_container.ipynb notebook up to the following command, which pushes the container image to your Amazon ECR repository:

!sm-docker build . --repository accelerate:latest

sm-docker is a CLI tool designed for building Docker images in SageMaker Studio using AWS CodeBuild. We install the library as part of the notebook.

Next, we will run the 2_torchtune-llama3_1.ipynb notebook for all fine-tuning workflow tasks.

For every task, we review three artifacts:

  • torchtune configuration file
  • SageMaker task config with compute and torchtune recipe details
  • SageMaker task output

Run the fine-tuning task

In this section, we walk through the steps to run and monitor the fine-tuning task.

Run the fine-tuning job

The following code shows a shortened torchtune recipe configuration highlighting a few key components of the file for a fine-tuning job:

  • Model component including LoRA rank configuration
  • Meta Llama 3 tokenizer to tokenize the data
  • Checkpointer to read and write checkpoints
  • Dataset component to load the dataset
sh-4.2$ cat config_l3.1_8b_lora.yaml
# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /opt/ml/input/data/model/hf-model/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_files: [
    consolidated.00.pth
  ]
  …

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.samsum_dataset
  train_on_input: True
batch_size: 13

# Training
epochs: 1
gradient_accumulation_steps: 2

... and more ...

We use Weights & Biases for logging and monitoring our training jobs, which helps us track our model’s performance:

metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger
  …

Next, we define a SageMaker task that will be passed to our utility function in the script create_pytorch_estimator. This script creates the PyTorch estimator with all the defined parameters.

In the task, we use the lora_finetune_distributed torchrun recipe with the config config-l3.1-8b-lora.yaml on an ml.p4d.24xlarge instance. The use_downloaded_model parameter controls whether the base model is first downloaded from Hugging Face or an already downloaded copy is reused. The image_uri parameter defines the URI of the custom container.

sagemaker_tasks={
    "fine-tune":{
        "hyperparameters":{
            "tune_config_name":"config-l3.1-8b-lora.yaml",
            "tune_action":"fine-tune",
            "use_downloaded_model":"false",
            "tune_recipe":"lora_finetune_distributed"
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",        
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
    ... and more ...
}

To create and run the task, run the following code:

Task="fine-tune"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)
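
For reference, create_pytorch_estimator can be a thin wrapper around the SageMaker PyTorch estimator. The following sketch illustrates one possible shape and is not the exact code from the workshop repository; the entry point, source directory, and data channel handling are assumptions.

# Hypothetical sketch of a create_pytorch_estimator utility; the actual implementation in the
# workshop repository may differ. Entry point, source directory, and role handling are placeholders.
import sagemaker
from sagemaker.pytorch import PyTorch

def create_pytorch_estimator(hyperparameters, instance_count, instance_type, image_uri):
    session = sagemaker.Session()
    return PyTorch(
        entry_point="train.py",              # placeholder launcher that invokes the torchtune recipe
        source_dir="scripts",                # placeholder directory with the launcher and configs
        role=sagemaker.get_execution_role(),
        instance_count=instance_count,
        instance_type=instance_type,
        image_uri=image_uri,                 # custom container built earlier
        hyperparameters=hyperparameters,     # tune_config_name, tune_recipe, and so on
        sagemaker_session=session,
        keep_alive_period_in_seconds=1800,   # SageMaker warm pool to speed up subsequent tasks
    )

def execute_task(estimator):
    # Input channels (for example, the Amazon EFS location holding model artifacts and configs)
    # would be passed to fit() here.
    estimator.fit(wait=True)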

The following code shows the task output and reported status:

# Refer-Output

2024-08-16 17:45:32 Starting - Starting the training job...
...
...

1|140|Loss: 1.4883038997650146:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]
1|141|Loss: 1.4621509313583374:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]

Training completed with code: 0
2024-08-26 14:19:09,760 sagemaker-training-toolkit INFO     Reporting training SUCCESS

The final model is saved to Amazon EFS, which makes it available without download time penalties.

Monitor the fine-tuning job

You can monitor various metrics such as loss and learning rate for your training run through the Weights & Biases dashboard. The following figures show the results of the training run where we tracked GPU utilization, GPU memory utilization, and loss curve.

For the following graph, to optimize memory usage, torchtune uses only rank 0 to initially load the model into CPU memory. Rank 0 is therefore responsible for loading the model weights from the checkpoint.

The example is optimized to use GPU memory to its maximum capacity. Increasing the batch size further will lead to CUDA out-of-memory (OOM) errors.

The run took about 13 minutes to complete for one epoch, resulting in the loss curve shown in the following graph.

Run the model generation task

In the next step, we use the previously fine-tuned model weights to generate the answer to a sample prompt and compare it to the base model.

The following code shows the configuration of the generate recipe config_l3.1_8b_gen_trained.yaml. The following are key parameters:

  • FullModelMetaCheckpointer – We use this to load the trained model checkpoint meta_model_0.pt from Amazon EFS
  • CustomTemplate.SummarizeTemplate – We use this to format the prompt for inference
# torchtune - trained model generation config - config_l3.1_8b_gen_trained.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b
  
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/
  checkpoint_files: [
    meta_model_0.pt
  ]
  …

# Generation arguments; defaults taken from gpt-fast
instruct_template: CustomTemplate.SummarizeTemplate

... and more ...

Next, we configure the SageMaker task to run on a single ml.g5.2xlarge instance:

prompt=r'{"dialogue":"Amanda: I baked  cookies. Do you want some?rnJerry: Sure rnAmanda: I will bring you tomorrow :-)"}'

sagemaker_tasks={
    "generate_inference_on_trained":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_gen_trained.yaml ",
            "tune_action":"generate-trained",
            "use_downloaded_model":"true",
            "prompt":json.dumps(prompt)
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
 "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output of the SageMaker task, we see the model summary output and some stats like tokens per second:

#Refer- Output
...
Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure \r\nAmanda: I will bring you tomorrow :-)

Summary:
Amanda baked cookies. She will bring some to Jerry tomorrow.

INFO:torchtune.utils.logging:Time for inference: 1.71 sec total, 7.61 tokens/sec
INFO:torchtune.utils.logging:Memory used: 18.32 GB

... and more ...

We can generate inference from the original model using the original model artifact consolidated.00.pth:

# torchtune - trained original generation config - config_l3.1_8b_gen_orig.yaml
…  
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/hf-model/original/
  checkpoint_files: [
    consolidated.00.pth
  ]
  
... and more ...

The following code shows the comparison output from the base model run with the SageMaker task (generate_inference_on_original). We can see that the fine-tuned model performs subjectively better than the base model, correctly stating that it was Amanda who baked the cookies.

# Refer-Output 
---
Summary:
Jerry tells Amanda he wants some cookies. Amanda says she will bring him some cookies tomorrow.

... and more ...

Run the model quantization task

To speed up the inference and decrease the model artifact size, we can apply post-training quantization. torchtune relies on torchao for post-training quantization.

We configure the recipe to use Int8DynActInt4WeightQuantizer, which refers to int8 dynamic per token activation quantization combined with int4 grouped per axis weight quantization. For more details, refer to the torchao implementation.

# torchtune model quantization config - config_l3.1_8b_quant.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  …

quantizer:
  _component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256

We again use a single ml.g5.2xlarge instance and use SageMaker warm pool configuration to speed up the spin-up time for the compute nodes:

sagemaker_tasks={
"quantize_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_quant.yaml",
            "tune_action":"run-quant",
            "use_downloaded_model":"true"
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output, we see the location of the quantized model and how much memory we saved due to the process:

#Refer-Output
...

linear: layers.31.mlp.w1, in=4096, out=14336
linear: layers.31.mlp.w2, in=14336, out=4096
linear: layers.31.mlp.w3, in=4096, out=14336
linear: output, in=4096, out=128256
INFO:torchtune.utils.logging:Time for quantization: 7.40 sec
INFO:torchtune.utils.logging:Memory used: 22.97 GB
INFO:torchtune.utils.logging:Model checkpoint of size 8.79 GB saved to /opt/ml/input/data/model/quantized/meta_model_0-8da4w.pt

... and more ...

You can run model inference on the quantized model meta_model_0-8da4w.pt by updating the inference-specific configurations.

Run the model evaluation task

Finally, let’s evaluate our fine-tuned model in an objective manner by running an evaluation on the validation portion of our dataset.

torchtune integrates with EleutherAI’s evaluation harness and provides the eleuther_eval recipe.

For our evaluation, we use a custom task for the evaluation harness to evaluate the dialogue summarizations using the rouge metrics.

The recipe configuration points the evaluation harness to our custom evaluation task:

# torchtune trained model evaluation config - config_l3.1_8b_eval_trained.yaml

model:
...

include_path: "/opt/ml/input/data/config/tasks"
tasks: ["samsum"]
...

The following code is the SageMaker task that we run on a single ml.p4d.24xlarge instance:

sagemaker_tasks={
"evaluate_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_eval_trained.yaml",
            "tune_action":"run-eval",
            "use_downloaded_model":"true",
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",
    }
}

Run the model evaluation on ml.p4d.24xlarge:

Task="evaluate_trained_model"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)

The following tables show the task output for the fine-tuned model as well as the base model.

The following output is for the fine-tuned model.

 

Tasks   Version  Filter  n-shot  Metric  Direction  Value    ±  Stderr
samsum  2        none    None    rouge1             45.8661  ±  N/A
                 none    None    rouge2             23.6071  ±  N/A
                 none    None    rougeL             37.1828  ±  N/A

The following output is for the base model.

Tasks   Version  Filter  n-shot  Metric  Direction  Value    ±  Stderr
samsum  2        none    None    rouge1             33.6109  ±  N/A
                 none    None    rouge2             13.0929  ±  N/A
                 none    None    rougeL             26.2371  ±  N/A

Our fine-tuned model achieves a ROUGE-1 score of approximately 46 on the summarization task, approximately 12 points better than the base model.

Clean up

Complete the following steps to clean up your resources:

  1. Delete any unused SageMaker Studio resources.
  2. Optionally, delete the SageMaker Studio domain.
  3. Delete the CloudFormation stack to delete the VPC and Amazon EFS resources.

Conclusion

In this post, we discussed how you can fine-tune Meta Llama-like architectures using various fine-tuning strategies on your preferred compute and libraries, using custom dataset prompt templates with torchtune and SageMaker. This architecture gives you a flexible way of running fine-tuning jobs that are optimized for GPU memory and performance. We demonstrated this through fine-tuning a Meta Llama3.1 model using P4 and G5 instances on SageMaker and used observability tools like Weights & Biases to monitor loss curve, as well as CPU and GPU utilization.

We encourage you to use SageMaker training capabilities and Meta’s torchtune library to fine-tune Meta Llama-like architectures for your specific business use cases. To stay informed about upcoming releases and new features, refer to the torchtune GitHub repo and the official Amazon SageMaker training documentation.

Special thanks to Kartikay Khandelwal (Software Engineer at Meta), Eli Uriegas (Engineering Manager at Meta), Raj Devnath (Sr. Product Manager Technical at AWS) and Arun Kumar Lokanatha (Sr. ML Solution Architect at AWS) for their support to the launch of this post.


About the Authors

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy large language models efficiently on AWS.

Matthias Reso is a Partner Engineer at PyTorch working on open source, high-performance model optimization, distributed training (FSDP), and inference. He is a co-maintainer of llama-recipes and TorchServe.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. He serves as a voting member of the PyTorch Foundation Governing Board, where he contributes to the strategic advancement of open-source deep learning frameworks. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Read More

Integrate Amazon Bedrock Knowledge Bases with Microsoft SharePoint as a data source

Integrate Amazon Bedrock Knowledge Bases with Microsoft SharePoint as a data source

Amazon Bedrock Knowledge Bases provides foundation models (FMs) and agents in Amazon Bedrock with contextual information from your company’s private data sources for Retrieval Augmented Generation (RAG), enabling more relevant, accurate, and customized responses. Amazon Bedrock Knowledge Bases offers a fully managed RAG experience.

The data sources that you can connect to a knowledge base are continuously expanding. This post showcases how to use one of those data source connectors: Microsoft SharePoint, an integrated content management and collaboration tool that many organizations use for storing, organizing, and sharing their internal data. See Data source connectors for the full list of supported data source connectors.

Solution overview

The following are some pertinent features of the SharePoint data source within Amazon Bedrock Knowledge Bases:

  • It provides access to the information stored in SharePoint. The RAG architecture queries and retrieves relevant information from the SharePoint source to provide contextual responses based on the user’s input.
  • It provides the ability to extract structured data, metadata, and other information from documents ingested from SharePoint to provide relevant search results based on the user query.
  • It provides the ability to sync incremental SharePoint content updates on an ongoing basis.
  • It provides source attribution to the response generated by the FM.

In the following sections, we walk through the steps to create a knowledge base, configure your data source, and test the solution.

Prerequisites

The following are the prerequisites necessary to implement Amazon Bedrock Knowledge Bases with SharePoint as a connector:

Create a knowledge base and connect to the data source

Complete the following steps to set up a knowledge base on Amazon Bedrock and connect to a SharePoint data source:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Choose Create knowledge base.

kb-landing-view

  3. In the Knowledge base details section, optionally change the default name and enter a description for your knowledge base.
  4. In the IAM permissions section, select an IAM role that provides Amazon Bedrock permission to access other AWS services. You can let Amazon Bedrock create the service role or choose a custom role that you have created.
  5. In the Choose data source section, select SharePoint.
  6. Optionally, add tags to your knowledge base. For more information, see Tag resources.
  7. Choose Next.

kb-details-1

  8. In the Name and Description section, optionally change the default data source name and enter a description of the data source.
  9. In the Source section, provide the following information:
    1. For Site URLs, enter the site URLs to use for crawling and indexing the content for RAG.
    2. For Domain, enter the domain name associated with the data source. For example, if the site URL is https://deloittedasits.sharepoint.com/xyz.aspx, the domain value would be deloittedasits.
    3. Under Advanced settings, keep the default selections.

kb-details-name-desc

While converting your data into embeddings, Amazon Bedrock encrypts your data with a key that AWS owns and manages by default. To use your own AWS Key Management Service (AWS KMS) key, choose Customize encryption settings (Advanced) and choose a key. For more information, see Encryption of transient data storage during data ingestion.

You can also choose from the following options for the data deletion policy for your data source:

  • Delete – Deletes all underlying data belonging to the data source from the vector store upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted, only the underlying data. This flag is ignored if an AWS account is deleted.
  • Retain – Retains all underlying data in your vector store upon deletion of a knowledge base or data source resource.

For more information on managing your knowledge base, see Manage a data source.

ML-17173-kb-details-advanced-settings

  10. In the Authentication section, the supported authentication method is set to OAuth 2.0.
    1. For Tenant ID, enter your tenant ID. Refer to the section Register a new application in the Microsoft Azure Portal of this post to get the tenant ID.
    2. For AWS Secrets Manager secret, enter an AWS Secrets Manager secret. Refer to the section Create a Secrets Manager secret for the SharePoint data source of this post to create the secret.

The SharePoint data source will need credentials to connect to the SharePoint Online site using the Microsoft Graph API. To facilitate this, create a new Secrets Manager secret. These credentials will not be used in any access logs for the SharePoint Online Site.

kb-details-authentication

  11. In the Metadata Settings section, optionally select any content types that you want to include or exclude.

kb-details-metadata

  12. In the Content chunking and parsing section, select Default.

kb-details-content-chunking

  13. Choose Next.
  14. In the Embeddings model section, select Titan Embeddings G1 – Text or another embeddings model as appropriate.
  15. In the Vector database section, select Quick create a new vector store to create a vector store for the embeddings.
  16. Choose Next.

kb-details-embeddings

  17. On the Review and create page, verify the selections you made and choose Create.

The knowledge base creation should be complete.

kn-created-success

The knowledge base with SharePoint as the data source is now created. However, the data source needs to be synced in order to crawl the site URLs and index the associated content.

  18. To initiate this process, on the knowledge base details page, select your data source and choose Sync.

kb-sync
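
You can also start the sync programmatically through the Amazon Bedrock Agents API. The following is a minimal boto3 sketch; the knowledge base and data source IDs are placeholders from your own environment.

# Minimal sketch: starting a data source sync (ingestion job) with boto3.
# The knowledge base and data source IDs are placeholders from your own environment.
import boto3

bedrock_agent = boto3.client("bedrock-agent")

response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KB1234EXAMPLE",
    dataSourceId="DS1234EXAMPLE",
)

job = response["ingestionJob"]
print(job["ingestionJobId"], job["status"])  # track the job until it completes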

Register a new application in the Microsoft Azure Portal

In this section, we register a new application in the Microsoft Azure Portal. We capture the tenant ID from this step to use when configuring the data source for the Amazon Bedrock knowledge base. Complete the following steps:

  1. Open the Azure Portal and log in with your Microsoft account. If you don’t have an account, you can create one or contact your organization’s administration team.
  2. Choose New registration.
  3. Provide the following information:
    1. For Name, provide the name for your application. Let’s refer to this application as TargetApp. Amazon Bedrock Knowledge Bases uses TargetApp to connect to the SharePoint site to crawl and index the data.
    2. For Who can use this application or access this API, choose Accounts in this organizational directory only (<Tenant name> only – Single tenant).
    3. Choose Register.
    4. Note down the application (client) ID and the directory (tenant) ID on the Overview page. You’ll need them later when asked for TargetApp-ClientId and TenantId.
  4. Choose API permissions in the navigation pane.
  5. Configure the permissions as follows:
    1. Choose Add a permission.
    2. Choose Microsoft Graph.
    3. Choose Delegated permissions.
    4. Choose Read.All in the User section.
    5. Choose Read.All in the GroupMember section.
    6. Choose FullControl.All in the Sites section.
    7. Choose Add permissions. This permission allows the app to read data in your organization’s directory about the signed-in user.
    8. On the options menu (three dots), choose Remove permission.
    9. Remove the original User.Read – Delegated permission that was added by default.
    10. Choose Grant admin consent for the default directory.

  1. Choose Certificates & secrets in the navigation pane.
    1. Choose New client secret.
    2. For Description, enter a description, such as description of my client secret.
    3. Choose a value for Expires. In production, you’ll need to manually rotate your secret before it expires.
    4. Choose Add.
    5. Note down the value for your new secret. You’ll need it later when asked for your client secret (TargetApp-ClientSecret).
  2. Optionally, choose Owners to add any additional owners for the application. Owners will be able to manage permissions of the Azure AD app (TargetApp).

Create a Secrets Manager secret for the SharePoint data source

Complete the following steps to create a Secrets Manager secret to connect to the SharePoint online sites listed as site URLs within the data source:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. For Key/value pairs, enter the following:
    1. username
    2. password
    3. clientId
    4. clientSecret
  4. For Encryption key, choose aws/secretsmanager.
  5. Choose Next.
  6. In the Secret name and description section, enter the name of the secret and an optional description.
  7. Add any associated tags in the Tags section.
  8. Leave Resource permissions and Replicate secret as default.
  9. Choose Next.
  10. In the Configure rotation section, leave as default or modify according to your organizational policies.
  11. Choose Next.
  12. Review the options you selected and choose Store.
  13. On the secrets detail page, note your secret ARN value to be used as the secret when creating the Knowledge Base for Amazon Bedrock.
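If you prefer to create the secret programmatically, the following is a minimal sketch using boto3. The secret name is an example of our choosing, the key names match the key/value pairs listed earlier, and the placeholder values should be replaced with your own SharePoint credentials and Azure AD application values (TargetApp-ClientId and TargetApp-ClientSecret).

import json
import boto3

# Minimal sketch: store the SharePoint connection credentials in Secrets Manager.
# The four keys mirror the key/value pairs entered in the console; values are placeholders.
secretsmanager = boto3.client("secretsmanager")

secret_value = {
    "username": "<sharepoint-username>",
    "password": "<sharepoint-password>",
    "clientId": "<TargetApp-ClientId>",
    "clientSecret": "<TargetApp-ClientSecret>",
}

response = secretsmanager.create_secret(
    Name="sharepoint-datasource-secret",  # example name
    Description="Credentials for the Amazon Bedrock Knowledge Bases SharePoint data source",
    SecretString=json.dumps(secret_value),
)
print("Secret ARN:", response["ARN"])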

Test the solution

Complete the following steps to test the knowledge base you created:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Select the knowledge base you created and choose Test.

  1. Choose an appropriate model for testing and choose Apply.

  1. Enter your question for the content housed in the SharePoint site.

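You can also query the knowledge base outside the console. The following is a minimal sketch using the RetrieveAndGenerate API through boto3; the knowledge base ID, model ARN, and question are placeholders you should replace with your own values.

import boto3

# Minimal sketch: ask a question against the knowledge base with RetrieveAndGenerate.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Summarize the key points from the onboarding documents."},  # example question
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<knowledge-base-id>",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])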

Clean up

If you created a new knowledge base to experiment using this post and don’t plan to use it further, delete the knowledge base so that your AWS account doesn’t accumulate costs. For instructions, see Manage a knowledge base.

Conclusion

In this post, we showed you how to configure Amazon Bedrock Knowledge Bases with SharePoint Online as a data source. By connecting SharePoint Online as a data source, employees can interact with the organization’s knowledge and data stored in SharePoint using natural language, making it straightforward to find relevant information, extract key points, and derive valuable insights. This can significantly improve productivity, decision-making, and knowledge sharing within the organization.

Try this feature on the Amazon Bedrock console today! See Amazon Bedrock Knowledge Bases to learn more.


About the Authors

Surendar Gajavelli is a Sr. Solutions Architect based out of Nashville, Tennessee. He is a passionate technology enthusiast who enjoys working with customers and helping them build innovative solutions.

Abhi Patlolla is a Sr. Solutions Architect based out of the New York City region, helping customers in their cloud transformation, AI/ML, and data initiatives. He is a strategic and technical leader, advising executives and engineers on cloud strategies to foster innovation and positive impact.

Read More

Revolutionize logo design creation with Amazon Bedrock: Embracing generative art, dynamic logos, and AI collaboration

Revolutionize logo design creation with Amazon Bedrock: Embracing generative art, dynamic logos, and AI collaboration

In the field of technology and creative design, logo design and creation has adapted and evolved at a rapid pace. From the hieroglyphs of ancient Egypt to the sleek minimalism of today’s tech giants, the visual identities that define our favorite brands have undergone a remarkable transformation.

Today, the world of creative design is once again being transformed by the emergence of generative AI. Designers and brands now have opportunities to push the boundaries of creativity, crafting logos that are not only visually stunning but also responsive to their environments and tailored to the preferences of their target audiences.

Amazon Bedrock enables access to powerful generative AI models like Stable Diffusion through a user-friendly API. These models can be integrated into the logo design workflow, allowing designers to rapidly ideate, experiment, generate, and edit a wide range of unique visual images. Integrating it with the range of AWS serverless computing, networking, and content delivery services like AWS Lambda, Amazon API Gateway, and AWS Amplify facilitates the creation of an interactive tool to generate dynamic, responsive, and adaptive logos.

In this post, we walk through how AWS can help accelerate a brand’s creative efforts with access to a powerful image-to-image model from Stable Diffusion available on Amazon Bedrock to interactively create and edit art and logo images.

Image-to-image model

Stability AI’s image-to-image model, SDXL, is a deep learning model that generates images based on text descriptions, images, or other inputs. It first converts the text into numerical values that summarize the prompt, then uses those values to generate an image representation. Finally, it upscales the image representation into a high-resolution image. Stable Diffusion can also generate new images based on an initial image and a text prompt. For example, it can fill in a line drawing with colors, lighting, and a background that makes sense for the subject. Stable Diffusion can also be used for inpainting (modifying or filling in regions within an existing image) and outpainting (extending an image beyond its original borders).

One of its primary applications lies in advertising and marketing, where it can be used to create personalized ad campaigns and an unlimited number of marketing assets. Businesses can generate visually appealing and tailored images based on specific prompts, enabling them to stand out in a crowded marketplace and effectively communicate their brand message. In the media and entertainment sector, filmmakers, artists, and content creators can use this as a tool for developing creative assets and ideating with images.

Solution overview

The following diagram illustrates the solution architecture.

This architecture workflow involves the following steps:

  1. In the frontend UI, a user chooses from one of two options to get started:
    1. Generate an initial image.
    2. Provide an initial image link.
  2. The user provides a text prompt to edit the given image.
  3. The user chooses Call API to invoke API Gateway to begin processing on the backend.
  4. The API invokes a Lambda function, which uses the Amazon Bedrock API to invoke the Stability AI SDXL 1.0 model.
  5. The invoked model generates an image, and the output image is stored in an Amazon Simple Storage Service (Amazon S3) bucket.
  6. The backend services return the output image to the frontend UI.
  7. The user can use this generated image as a reference image and edit it, generate a new image, or provide a different initial image. They can continue this process until the model produces a satisfactory output.

Prerequisites

To set up this solution, complete the following prerequisites:

  1. Pick an AWS Region where you want to deploy the solution. We recommend using the us-east-1 Region.
  2. Obtain access to the Stability SDXL 1.0 model in Amazon Bedrock if you don’t have it already. For instructions, see Access Amazon Bedrock foundation models.
  3. If you prefer to use a separate S3 bucket for this solution, create a new S3 bucket.
  4. If you prefer to use localhost for testing the application instead of Amplify, make sure python3 is installed on your local machine.

Deploy the solution

To deploy the backend resources for the solution, we create a stack using an AWS CloudFormation template. You can upload the template directly, or upload it to an S3 bucket and link to it during the stack creation process. During the creation process, provide the appropriate variable names for apiGatewayName, apiGatewayStageName, s3BucketName, and lambdaFunctionName. If you created a new S3 bucket earlier, input that name in s3BucketName – this bucket is where output images are stored. When the stack creation is complete, all the backend resources are ready to be connected to the frontend UI.

The frontend resources play an integral part in creating an interactive environment for your end-users. Complete the following steps to integrate the frontend and backend:

  1. When the CloudFormation stack deployment is complete, open the created API from the API Gateway console.

  1. Choose Stages in the navigation pane, and on the Stage actions menu, choose Generate SDK.

  1. For Platform, choose JavaScript.

  1. Download and unzip the JavaScript SDK .zip file, which contains a folder called apiGateway-js-sdk.
  2. Download the frontend UI index.html file and place it in the unzipped folder.

This file is preconfigured to integrate with the JavaScript SDK; simply placing it in the folder completes the integration.

  1. After the index.html is placed in the folder, select the content of the folder and compress it into a .zip file (don’t compress the apiGateway-js-sdk folder itself.)

  1. On the Amplify console, choose Create new app.
  2. Select Deploy without Git, then choose Next.

  1. Upload the compressed .zip file, and change the application name and branch name if preferred.
  2. Choose Save and deploy.

The deployment will take a few seconds. When deployment is complete, there will be a domain URL that you can use to access the application. The application is ready to be tested at the domain URL.

CloudFormation template overview

Before we move on to testing the solution, let’s explore the CloudFormation template. This template sets up an API Gateway API with appropriate rules and paths, a Lambda function, and necessary permissions in AWS Identity and Access Management (IAM). Let’s dive deep into the content of the CloudFormation template to understand the resources created:

  • PromptProcessingAPI – This is the main API Gateway REST API. This API will be used to invoke the Lambda function. Other API Gateway resources, methods, and schemas created in the CloudFormation template are attached to this API.
  • ActionResource, ActionInputResource, PromptResource, PromptInputResource, and ProxyResource – These are API Gateway resources that define the URL path structure for the API. The path structure is /action/{actionInput}/prompt/{promptInput}/{proxy+}. The {promptInput} value is a placeholder variable for the prompt that users input in the frontend. Similarly, {actionInput} is the choice the user selected for how they want to generate the image. These are used in the backend Lambda function to process and generate images.
  • ActionInputMethod, PromptInputMethod, and ProxyMethod – These are API Gateway methods that define the integration with the Lambda function for the POST HTTP method.
  • ActionMethodCORS, ActionInputMethodCORS, PromptMethodCORS, PromptInputMethodCORS, and ProxyMethodCORS – These are API Gateway methods that handle the cross-origin resource sharing (CORS) support. These resources are crucial in integrating the frontend UI with backend resources. For more information on CORS, see What is CORS?
  • ResponseSchema and RequestSchema – These are API Gateway models that define the expected JSON schema for the response and request payloads, respectively.
  • Default4xxResponse and Default5xxResponse – These are the gateway responses that define the default response behavior for 4xx and 5xx HTTP status codes, respectively.
  • ApiDeployment – This resource deploys the API Gateway API after all of the preceding configurations have been set. After the deployment, the API is ready to use.
  • LambdaFunction – This creates a Lambda function and specifies the runtime, the service role for Lambda, and the reserved concurrency limit.
  • LambdaPermission1, LambdaPermission2, and LambdaPermission3 – These are permissions that allow the API Gateway API to invoke the Lambda function.
  • LambdaExecutionRole and lambdaLogGroup – The first resource is the IAM role attached to the Lambda function, allowing it to access other AWS services such as Amazon S3 and Amazon Bedrock. The second resource configures the Lambda function log group in Amazon CloudWatch.

Lambda function explanation

Let’s dive into the details of the Python code that generates and manipulates images using the Stability AI model. There are three ways of using the Lambda function: provide a text prompt to generate an initial image, upload an image and include a text prompt to adjust the image, or reupload a generated image and include a prompt to adjust it.

The code contains the following constants:

  • negative_prompts – A list of negative prompts used to guide the image generation.
  • style_preset – The style preset to use for image generation (for example, photographic, digital-art, or cinematic). We used digital-art for this post.
  • clip_guidance_preset – The Contrastive Language-Image Pretraining (CLIP) guidance preset to use (for example, FAST_BLUE, FAST_GREEN, NONE, SIMPLE, SLOW, SLOWER, SLOWEST).
  • sampler – The sampling algorithm to use for image generation (for example, DDIM, DDPM, K_DPMPP_SDE, K_DPMPP_2M, K_DPMPP_2S_ANCESTRAL, K_DPM_2, K_DPM_2_ANCESTRAL, K_EULER, K_EULER_ANCESTRAL, K_HEUN, K_LMS).
  • width – The width of the generated image.

handler(event, context) is the main entry point for the Lambda function. It processes the input event, which contains the promptInput and actionInput parameters. Based on the actionInput, it performs one of the following actions:

  • For GenerateInit, it generates a new image using the generate_image_with_bedrock function, uploads it to Amazon S3, and returns the file name and a pre-signed URL.
  • When you upload an existing image, it performs one of the following actions:
    • s3URL – It retrieves an image from a pre-signed S3 URL, generates a new image using the generate_image_with_bedrock function, uploads the new image to Amazon S3, and returns the file name and a pre-signed URL.
    • UseGenerated – It retrieves an image from a pre-signed S3 URL, generates a new image using the generate_image_with_bedrock function, uploads the new image to Amazon S3, and returns the file name and a pre-signed URL.

The function generate_image_with_bedrock(prompt, init_image_b64=None) generates an image using the Amazon Bedrock runtime service, which includes the following actions:

  • If an initial image is provided (base64-encoded), it uses that as the starting point for the image generation.
  • If no initial image is provided, it generates a new image based on the provided prompt.
  • The function sets various parameters for the image generation, such as the text prompts, configuration, and sampling method.
  • It then invokes the Amazon Bedrock model, retrieves the generated image as a base64-encoded string, and returns it.

To obtain more personalized outputs, the hyperparameter values in the function can be adjusted:

  • text_prompts – This is a list of dictionaries, where each dictionary contains a text prompt and an associated weight. For a positive text prompt (one that you want reflected in the output image), the weight is set to 1.0. For all of the negative text prompts, the weight is set to -1.0.
  • cfg_scale – This parameter controls the potential for randomness in the image. The default is 7, and 10 seems to work well from our observations. A higher value means the image will be more influenced by the text, but a value that’s too high or too low will result in visually poor-quality outputs.
  • init_image – This parameter is a base64-encoded string representing an initial image. The model uses this image as a starting point and modifies it based on the text prompts. For generating the first image, this parameter is not used.
  • start_schedule – This parameter controls the strength of the noise added to the initial image at the start of the generation process. A value of 0.6 means that the initial noise will be relatively low.
  • steps – This parameter specifies the number of steps (iterations) the model should take during the image generation process. In this case, it’s set to 50 steps.
  • style_preset – This parameter specifies a predefined style or aesthetic to apply to the generated image. Because we’re generating logo images, we use digital-art.
  • clip_guidance_preset – This parameter specifies a predefined guidance setting for the CLIP model, which is used to guide the image generation process based on the text prompts.
  • sampler – This parameter specifies the sampling algorithm used during the image generation process to repeatedly denoise the image to produce a high-quality output.
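To illustrate how these hyperparameters come together in a single request, the following is a minimal sketch of an SDXL invocation on Amazon Bedrock. It is not the exact function from the provided Lambda code; the negative prompts shown are illustrative, and the parameter values mirror the ones described above.

import json
import boto3

def generate_image_with_bedrock(prompt, init_image_b64=None):
    # Minimal sketch of an SDXL 1.0 invocation; values mirror the hyperparameters described above.
    bedrock_runtime = boto3.client("bedrock-runtime")

    negative_prompts = ["blurry", "low quality"]  # illustrative negative prompts
    body = {
        "text_prompts": [{"text": prompt, "weight": 1.0}]
        + [{"text": negative, "weight": -1.0} for negative in negative_prompts],
        "cfg_scale": 10,
        "steps": 50,
        "style_preset": "digital-art",
        "clip_guidance_preset": "FAST_GREEN",
        "sampler": "K_DPMPP_2M",
    }
    if init_image_b64:
        # Use the supplied image as the starting point for image-to-image generation.
        body["init_image"] = init_image_b64
        body["start_schedule"] = 0.6

    response = bedrock_runtime.invoke_model(
        modelId="stability.stable-diffusion-xl-v1",
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    return result["artifacts"][0]["base64"]

You can call the function with only a prompt for the first image, then pass the returned base64 string back as init_image_b64 to iterate on the result.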

Test and evaluate the application

The following screenshot shows a simple UI. You can choose to either generate a new image or edit an image using text prompts.

The following screenshots show iterations of sample logos we created using the UI. The text prompts are included under each image.

Clean up

To clean up, delete the CloudFormation stack and the S3 bucket you created.

Conclusion

In this post, we explored how you can use Stability AI and Amazon Bedrock to generate and edit images. By following the instructions and using the provided CloudFormation template and the frontend code, you can generate unique and personalized images and logos for your business. Try generating and editing your own logos, and let us know what you think in the comments. To explore more AI use cases, refer to AI Use Case Explorer.


About the authors

Pyone Thant Win is a Partner Solutions Architect focused on AI/ML and computer vision. Pyone is passionate about enabling AWS Partners through technical best practices and using the latest technologies to showcase the art of possible.

Nneoma Okoroafor is a Partner Solutions Architect focused on helping partners follow best practices by conducting technical validations. She specializes in assisting AI/ML and generative AI partners, providing guidance to make sure they’re using the latest technologies and techniques to deliver innovative solutions to customers.

Read More

Reinvent personalization with generative AI on Amazon Bedrock using task decomposition for agentic workflows

Reinvent personalization with generative AI on Amazon Bedrock using task decomposition for agentic workflows

Personalization has become a cornerstone of delivering tangible benefits to businesses and their customers. Generative AI and large language models (LLMs) offer new possibilities, although some businesses might hesitate due to concerns about consistency and adherence to company guidelines. This post presents an automated personalization solution that balances the innovative capabilities of LLMs with adherence to human directives and human-curated assets for a consistent and responsible personalization experience for your customers.

Our solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. For this post, we use Anthropic’s Claude models on Amazon Bedrock.

We present our solution through a fictional consulting company, OneCompany Consulting, using automatically generated personalized website content for accelerating business client onboarding for their consultancy service. The personalized content is built using generative AI by following human guidance and provided sources of truth. We employ task decomposition, using domain- and task-adapted LLMs for content personalization (UX designer/personalizer), image generation (artist), and building (builder/front-end developer) for the final delivery of HTML, CSS, and JavaScript files. The approach broadly mimics a human organization pursuing the same objective. This allows us to create cost-effective, more controlled, more accurate, and responsible personalized customer experiences, utilizing existing guidelines and assets originally designed for human-driven processes.

We provide our code base on GitHub for you to follow along, suggest possible enhancements and modifications, and help you innovate with generative AI in personalization. Generative AI on AWS can transform user experiences for customers while maintaining brand consistency and your desired customization.

Use case overview

Our fictional company, OneCompany Consulting, plans to use generative AI to automatically create personalized landing pages as their business clients sign in. Their clients have provided some basic public information during sign-up, such as state of location, industry vertical, company size, and their mission statement. In parallel, OneCompany maintains a market research repository gathered by their researchers, offers industry-specific services outlined in documents, and has compiled approved customer testimonials. UX/UI designers have established best practices and design systems applicable to all of their websites. These resources should be used as single sources of truth. Because we don’t have such expertise, we synthetically generate these assets to demonstrate the process that would otherwise be created by expert humans or other methods in real life.

The following diagram illustrates the process of generating a personalized landing page for business visitors after they sign up.

Fig 1. The process of customers signing up and the solution creating personalized websites using human-curated assets and guidelines.

We employed other LLMs available on Amazon Bedrock to synthetically generate fictitious reference materials to avoid potential biases that could arise from Anthropic’s Claude pre-training data. In practical scenarios, these resources would be created by humans and organizations, containing more comprehensive and exhaustive details. Nonetheless, our solution can still be applied in the same way to your own real assets.

  • Client profiles – We have three business clients in the construction, manufacturing, and mining industries, which are mid-to-enterprise companies. The process assumes the information in the company profiles is public and that the companies who signed up opted in to OneCompany Consulting to use for personalization. The following example is for the construction industry:
Profiles = {
    'Construction-Example': {
        'Name': 'Example Corp Construction',
        'Industry': 'Construction',
        'CompanySize': 1500,
        'CompanyType': 'Enterprise',
        'Location': 'New York City, NY',
        'Mission': 'Building a sustainable future for New York'
    }
}
  • Offerings – Offerings are documents that consolidate all offerings provided by OneCompany Consulting. The following is an example of a synthetically generated offering for the construction industry:
OneCompany Consulting Construction Consulting Services Offerings
Introduction
OneCompany Consulting is a premier construction consulting firm dedicated to... <Redacted for printing>
Our core values are:
1. Client-Centric Approach: We put our clients at the heart of everything we do... <Redacted for printing>
Our offerings include:
1. Pre-Construction Services - Feasibility Studies - Site Selection and Evaluation... <Redacted for printing>
5. Construction Technology Solutions - Construction Data Analytics and Reporting... <Redacted for printing>
  • Industry insights – Your LLMs can use industry pain points, news, and other resources to enrich personalized content. Because the industry news and resources are extensive, we use a Retrieval Augmented Generation (RAG) framework to retrieve related information. The following is an example for the manufacturing industry:
The Electric Vehicle Manufacturing Industry: Overcoming Challenges... <Redacted for printing>
Introduction
The electric vehicle (EV) industry has experienced unprecedented growth in recent years, driven by... <Redacted for printing>
This article will explore the key challenges facing the EV... <Redacted for printing>
Supply Chain Disruptions
The COVID-19 pandemic has highlighted the vulnerability of global supply chains, and the EV industry has not been... <Redacted for printing>
V. Regulatory Frameworks and Incentives
Regulatory frameworks and government incentives play a critical role in promoting EV... <Redacted for printing>

  • Testimonials – Synthetically generated customer testimonials are displayed for the visitors. In this solution, the LLM is asked to use the sentence without changes because it’s a testimonial. The following is an example:
"AnyCompany Consulting's expert consulting services have been invaluable in streamlining our operations and optimizing our construction processes, resulting in increased efficiency and cost savings"
- John Smith, CTO from Example Corp Solutions.
  • Design guidelines and systems – This part is for the instructions and rules to be followed for building the website. Our examples were manually created only for high-level guidance for simplicity.
    • Guidelines – The following are some examples from the design guidelines:
- Use a color palette that aligns with the customer's industry, ensuring sufficient color contrast for readability.
- Use visible font sizes and responsive units for better visibility and readability across screen sizes... <Redacted for printing> - ... <Redacted for printing>
  • Instructions – The following are some examples from the design instructions:
Header Design:
- Choose an attention-grabbing background color and font that aligns with the client's industry.
- Consider incorporating subtle animations or transitions
Hero Section:
- Choose an attention-grabbing background color and font that aligns with the client's industry.
- ... <Redacted for printing>

Solution overview

To create personalized websites efficiently, we employ task decomposition—breaking down the complex process into simpler, decoupled sub-tasks. This approach allows using smaller, cost-effective language models, creating targeted prompts and contexts for increased accuracy and faithfulness, isolating responses for straightforward troubleshooting, and achieving cost savings.

In our example, we decomposed the overall personalized website creation process into three steps, each handled by specialized agents: the personalizer for tailoring content, the artist for generating images, and the frontend engineer/builder for coding. For the personalizer, we used Claude Sonnet due to the relative complexity of the task compared to code generation handled by Haiku. However, Claude Haiku can also be used for the personalization task, potentially leading to further cost savings. Yet, Haiku may require more prescriptive prompts and examples to achieve similar results. We recommend that customers test both Sonnet and Haiku to determine the optimal balance between performance and cost for their specific use case. In our demonstration, we chose to use Sonnet with a relatively simple prompt to showcase its efficiency, but the flexibility of this approach allows for various LLMs to be integrated into the agentic workflow.

The following diagram illustrates our agentic workflow.

Fig 2. Workflow diagram of the agentic workflow made of specialized (task- and domain-adapted) LLMs.

The following diagram illustrates our solution architecture.

Fig 3. Solutions architecture

The workflow includes the following steps:

  1. The client profile is stored as key-value pairs in JSON format. However, the JSON needs to be converted into natural language to simplify the task for the downstream LLMs, so they don’t have to figure out the JSON schema and associated meaning.
  2. After the profile is converted into text that explains the profile, a RAG framework is launched using Amazon Bedrock Knowledge Bases to retrieve related industry insights (articles, pain points, and so on). Amazon Bedrock Knowledge Bases is a fully managed capability that helps you implement the RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. All the information retrieved from Amazon Bedrock Knowledge Bases is provided with citations to improve transparency and increase accuracy. In our example, if the client is a manufacturing customer building electric vehicles (EVs), then the related context will be retrieved from the human-curated or human-created research documents.
  3. Now we’re ready to send our prompt to the personalizer LLM with all the relevant information. In addition to the customer profile and industry insights, we include offerings, design guidance, and testimonials, and ask the personalizer LLM to create a detailed website description and description of visuals.
  4. The response from the personalizer LLM is divided into two paths by a regex method. The first part moves to the frontend developer LLM.
  5. The second part is sent to the image generator or artist LLM.
  6. For the frontend developer LLM, we also use system design-related materials (in our case, design guidelines) so the frontend developer builds the website described by the personalizer LLM while applying the rules in the design guidelines. Here, we also prompted the LLM to use the company logo (which is the unicorn of AWS GameDay) to demonstrate incorporating existing design elements into the design. In addition, our prompt asks the frontend developer LLM to write the JavaScript to make the testimonials display and call to action dynamic.
  7. At the end of the process, we create a consolidated HTML file, which includes CSS and JavaScript, and store it in an Amazon Simple Storage Service (Amazon S3) bucket so that the assets are ready to be deployed.

Prerequisites

For this post, you need the following prerequisites:

After you complete the prerequisites, you can use the accompanying Jupyter notebook, which has all the necessary steps to follow this post.

Test the solution

Let’s start with our example manufacturing client, who is building the next-generation EVs ('Manufacturing-Example'). First, the profile of this client in JSON format is converted into natural language as follows:

customer = build_profile(UserProfile)
print(customer)

Output:
"Your customer is Example Corp Manufacturing. Their industry is manufacturing. They have 2,500 employees." 
"They are an enterprise company, located in San Jose, CA."  
"Their mission statement is “Building the next generation Electric Vehicles."

Based on the customer profile, which gives the location and industry, the related background information is retrieved using RAG with kbId = <Knowledge_base_id>:

context_painpoints = Bedrock.doc_retrieve(query, kbId, numberOfResults=10)
contexts_painpoints = get_contexts(context_painpoints)
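Bedrock.doc_retrieve is a helper from the accompanying notebook. Under the hood, this kind of retrieval can be expressed with the Amazon Bedrock Knowledge Bases Retrieve API; the following minimal sketch assumes a placeholder knowledge base ID and an example query.

import boto3

# Minimal sketch: retrieve industry insights (pain points, articles) from the knowledge base.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

retrieval = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="<Knowledge_base_id>",
    retrievalQuery={"text": "Pain points and trends in electric vehicle manufacturing"},  # example query
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 10}},
)
contexts_painpoints = [result["content"]["text"] for result in retrieval["retrievalResults"]]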

At this stage, we have the context and industry pain points. We now need to gather human-curated sources of truth such as testimonials, design guidelines, requirements, and offerings. Because these references aren’t numerous, we include them in full in the system prompt. However, you can apply the RAG method used for industry insights to retrieve one or all of these references.

offerings = open("./references/offerings.txt", "r").read()
testimonials = open("./references/Testimonials.txt", "r").read()
design_steps= open("./references/RecommendedStepsToDesignNew.txt", "r").read()
design_guideline= open("./references/DesignGuidelineNew.txt", "r").read()

We’re ready to input our prompt to the personalizer LLM, which is Anthropic’s Claude 3 Sonnet (or Haiku, if you used it) on Amazon Bedrock. We used a prompt template that incorporates the parameters customer, testimonials, design_steps, and offerings, and we prompted the LLM to generate a response in two parts:

  • Detailed Website Description – Describing the website in text format
  • Visual Elements – Describing the visual elements used

These two responses will be used in two different workflows in the following steps. The following is an example response (response_personalized_website) for the manufacturing client:

Section 1: Detailed Website Description

Header:
The header section features a clean and modern design with a navy-blue background color (#003366) and white text. The AnyCompany Consulting logo is prominently displayed on the left side, while the main navigation menu is positioned on the right. The navigation menu includes the following links: "Home," "Services," "About," and "Contact". The header is fixed to the top of the page, ensuring easy access to the navigation throughout the user's experience.

Hero Section:
The hero section showcases a large, high-quality image (hero-example.jpg) that depicts a modern manufacturing facility with various machinery and workers in action. The image is overlaid with a semi-transparent black gradient to improve the readability of the text. The hero section's main heading reads "Transforming Manufacturing Excellence" in a large, bold font (font-size: 48px), and the subheading states "Unlock your full potential with AnyCompany Consulting's tailored solutions" in a slightly smaller font (font-size: 24px). A prominent "Get Started" call-to-action button is positioned below the subheading, encouraging the user to take the next step.

Offerings Section:
The offerings section is divided into three main categories, each with a corresponding icon and brief description:

1. Operational Excellence
- Icon: operations-icon.png
- Description: "Optimize your manufacturing processes and drive continuous improvement with our Lean and Six Sigma expertise."
2. Digital Transformation
- Icon: digital-icon.png
- Description: "Leverage the power of Industry 4.0 technologies to enhance productivity, efficiency, and data-driven decision-making."
- ... <Redacted for printing>

Dynamic Content for JavaScript:
The testimonials section features a dynamic horizontal slider that automatically cycles through the testimonials every 5 seconds. This functionality can be implemented using JavaScript, with the following elements:

1. Testimonial Data:
- An array of testimonial objects, each containing the following properties:
- name: "John Smith"
- title: "CTO, Example Corp Solutions"
- quote: "AnyCompany Consulting's expert consulting services have been invaluable in streamlining our operations and optimizing our construction processes, resulting in increased efficiency and cost savings."

2. Testimonial Slider:
- A container element to hold the testimonials
- A function to display the current testimonial
- A timer to automatically cycle through the testimonials every 5 seconds
- Smooth transition animations between testimonials

3. User Interaction:
- Ability to manually navigate through the testimonials (for example, previous and next buttons)
- Pause the automatic cycling when the user interacts with the slider

Section 2: Visual Elements

1. <VISUAL_LABEL>anycompany_logo.jpg</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This is the AnyCompany Consulting logo, featuring the company name in a bold, modern font. The logo is designed in a clean, minimalist style with a navy blue color scheme.</VISUAL_DESCRIPTION>

2. <VISUAL_LABEL>hero-example.jpg</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This image depicts a modern manufacturing facility with various machinery and workers in action. The scene showcases the dynamic and technologically advanced nature of the manufacturing industry, aligning with the customer's profile and the AnyCompany Consulting's expertise.</VISUAL_DESCRIPTION>

3. <VISUAL_LABEL>operations-icon.png</VISUAL_LABEL>
<VISUAL_DESCRIPTION>This icon represents the "Operational Excellence" offering. It features a gear or cog symbol, symbolizing the optimization and streamlining of manufacturing processes.</VISUAL_DESCRIPTION>

4. ... <Redacted for printing>

We use Stable Diffusion to generate visual assets based on the descriptions provided by the personalizer LLM. We extract the image descriptions enclosed within <VISUAL_LABEL> and <VISUAL_DESCRIPTION> tags using a regex pattern. These descriptions are then sent in batches to our artist LLM to create images.

pattern = r'&lt;VISUAL_LABEL&gt;(.+?)&lt;/VISUAL_LABEL&gt;s*&lt;VISUAL_DESCRIPTION&gt;(.+?)&lt;/VISUAL_DESCRIPTION&gt;'

The images are created and put into the S3 bucket to store your website assets.

Now we’re ready to create the website HTML, CSS, and JavaScript assets. We used the following prompt template, which uses response_personalized_website from the personalizer, actual testimonials, and the UI design guidelines:

prompt = f"""
You are an experienced frontend web developer specializing in creating accessible, responsive, and visually appealing websites. Your task is to generate the complete HTML, CSS, and JavaScript code that accurately implements the provided 'Website Description' while adhering to the specified guidelines.
<website description>
Know that this is your Design Guideline (Requirements):
{design_guideline}
You use the testimonials as follows:
{testimonials}
Website Description:
{response_personalized_website}
</website description>
Carefully read the 'Website Description' line by line, and then generate the HTML, CSS, and JavaScript code required to build the described website while following the specified design guidelines and requirements.
Provide the HTML, CSS, and JavaScript code directly, starting with the <!DOCTYPE html> declaration, without any preamble or introduction.
"""

After this step, you have all the necessary assets to preview your website. You can put the HTML and created files into a folder and use a web browser to see your created website.

The following screenshots show examples of generated personalized pages for EV manufacturing, mining, and construction clients, respectively. Each image is displayed for 3 seconds in GIF format. To experience the full quality and dynamic features of these pages, we recommend you visit the examples folder, download the folders, and open the corresponding main.html files with your internet browser.

Fig 4. Generated example personalized pages for three industry clients.

Highlights from the test

The solution automatically generated personalized web pages, including personalized images within the provided guardrails, prompts, and reference materials. The workflow considered appropriate color contrasts, such as dark backgrounds with white fonts for accessibility. The solution generated representative icons with consistent coloring and themes across the pages. The workflow also created industry-specific, engaging labels, descriptions, offerings, and pain points based on the source of truth references. The offering selection and pain point sections are especially noteworthy because they were tailored to the visitor. For example, the hero page showcased an EV on a production line, whereas a mining company with a “sustainability” motto received green icons and a focus on that topic. The construction company from New York had themes mentioning their specific points. The workflow is capable of creating dynamic assets as prompted, such as testimonials or call-to-action buttons. Additionally, the solution created consistent assets that can scale well and are compatible with multiple devices, as requested.

In this example, we did not fully exhaust the capabilities of personalization. However, we hope these examples can provide a simple starting point for your personalization use cases.

Clean up

To clean up, start by deleting the S3 bucket you created for your knowledge base. Then delete your knowledge base. Because we used Amazon Bedrock on demand, you won’t incur any cost unless you invoke the models. However, we recommend deleting the artifacts in SageMaker Studio or the SageMaker Studio domain if you used SageMaker Studio to follow along with this demo.

Suggested enhancements

You can extend this solution with some further enhancements, such as the following:

  • Use batch processing for cost-effective asset creation based on visitor profiles. You can use batch inference with Amazon Bedrock or batch transform with SageMaker.
  • Cluster similar client profiles to reduce design element variations for frugality and consistency.
  • Provide website templates and chain-of-thought descriptions to follow design patterns more prescriptively.
  • Use Haiku instead of Sonnet for further cost reduction. You may need more prescriptive and multi-shot prompts as you switch to Haiku for the personalization stage.
  • Retrieve existing company images and icons using semantic search instead of generating visuals. For example, you can build semantic image search using Amazon Titan.

Conclusions

In this post, we presented an automated solution to provide a consistent and responsible personalization experience for your customers. This approach uses smaller LLMs for website personalization tailored to businesses and industries. It decomposes the complex task into subtasks handled by task- and domain-adapted LLMs, adhering to company guidelines and human expertise. Using a fictional business consulting company scenario, we demonstrated the solution by generating personalized marketing content like text, images, HTML, CSS, and JavaScript code. The process employs techniques like RAG, prompt engineering with personas, and human-curated references to maintain output control.

By combining generative AI, curated data, and task decomposition, businesses can cost-effectively create accurate, personalized website experiences aligned with their branding and design systems.

Amazon Bedrock, which you can use to build generative AI applications, is at the center of this solution. To get started with Amazon Bedrock, we recommend following the quick start and familiarizing yourself with building generative AI applications.


About the Authors

Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect and lead GenAI Scientist Architect for Amazon on AWS, based in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. He maintains his connection to academia as a research affiliate at MIT. Outside of work, Burak is an enthusiast of yoga.

Chidi Prince John is a Data Scientist at Amazon. He designs, builds and deploys models for large-scale personalization in Amazon Payments. Chidi has a Master’s degree in Quantitative Management from Duke University and a Bachelor’s degree in Economics from the University of Nigeria. Outside of work, he is passionate about soccer and TV shows.

Dieter D’Haenens is a Senior Product Manager for Amazon, responsible for customer growth, delivering personalized experiences and driving the Amazon flywheel. Leveraging his expertise in retail and strategy, he is passionate about solving customer problems through scalable, innovative AI and ML solutions. Dieter holds a Bachelor of Science in Economics from Ghent University, a Master in General Management from Vlerick Business School, and a Master of Science in Business Analytics from Southern Methodist University. In his spare time, he enjoys traveling and sports.

Read More

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

In recent years, foundation model (FM) sizes have been increasing. It is important to consider the massive amount of compute often required to train these models. The compute clusters used in these scenarios are often composed of thousands of AI accelerators such as GPUs, or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

When using compute clusters of massive size, a single failure can often throw a training job off course and may require multiple hours of discovery and remediation from customers. According to a report from OPT-175B training, about 178,000 GPU hours were wasted due to various training failures, amounting to 16 percent of the total training time. Similarly, a study by Meta AI and Carnegie Mellon University found that, in the worst cases, 43 percent of compute time was wasted because of overheads due to hardware failures. This can adversely impact a customer’s ability to keep up with the pace of innovation in generative AI and can also increase the time-to-market for their models.

Amazon SageMaker HyperPod is a service that is purpose-built to accelerate FM training, removing the undifferentiated heavy-lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks to months without disruption. To make FM training more resilient to hardware failures, SageMaker HyperPod continually monitors cluster health, repairs and replaces faulty nodes without disrupting training, and uses customer-defined checkpoints to automatically resume training from the last point of failure.

Why SageMaker HyperPod?

SageMaker HyperPod offers several benefits that make it a good choice for FM training:

  • Standby pool of nodes at no additional cost – SageMaker HyperPod provisions and manages a pool of spare nodes on the customer’s behalf. These nodes are on standby and can be automatically used to replace faulty nodes during training. This makes it so that failures don’t interrupt or delay large-scale training jobs, and these spare nodes come at no additional cost to the user. With the SageMaker HyperPod auto-resume functionality, the service can dynamically swap out unhealthy nodes for spare ones to ensure the seamless continuation of the workload.
  • Cluster placement groups for optimized training – Each instance group is launched in a cluster placement group within the same network spine, in order to get the best inter-node latency and maximize bandwidth between nodes. This is ideal for tightly coupled workloads like distributed training where low-latency communication is essential for synchronizing gradient updates and ensuring that model training scales effectively across multiple GPUs.
  • Preconfigured deep learning AMI with essential libraries – The SageMaker HyperPod agent runs a SageMaker HyperPod DLAMI, which is built on top of AWS Deep Learning Base GPU AMI (Ubuntu 20.04). The SageMaker HyperPod DLAMI is bundled with additional packages to support open source tools such as Slurm and dependencies. Also included are SageMaker HyperPod cluster software packages, which support features such as cluster health check and auto-resume.
  • Reusable scaling scripts for rapid experimentation – HyperPod offers a set of scalable and reusable scripts that simplify the process of launching multiple training runs. These scripts streamline infrastructure setup and deployment and can be easily adapted for different training scenarios or to run many jobs in parallel, making large-scale training more manageable. By reducing repetitive tasks and providing reusable automation, these scripts empower users to quickly scale up or down, test different model variations, and iterate faster, improving productivity and reducing operational overhead.
  • Auto-resume functionality – This is one of the most valuable features of SageMaker HyperPod. When a node fails, SageMaker HyperPod automatically replaces it with a healthy node from the spare pool and resumes the job from the last saved checkpoint with minimal disruption to training. This is particularly crucial for long-running training jobs, where even minor interruptions can lead to significant delays.
  • Real-time performance dashboards with few-click setup – SageMaker HyperPod integrates seamlessly with real-time dashboards to monitor node health, GPU utilization, network traffic, and other key metrics. This can be done with just a few clicks, providing full visibility into training jobs and allowing teams to optimize performance in real-time.

In this post, we present to you an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod. We review components of the Slurm orchestrated SageMaker HyperPod cluster setup, primarily focusing on the resiliency and feature set of SageMaker HyperPod, including automatic fault detection and integration with open source tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Overview of SageMaker HyperPod resiliency

Some of the health check metrics used by SageMaker HyperPod include:

  • Accelerator issues – Checks for GPU issues, including DCGM policies such as XID errors, GPU health through nvidia-smi, and Trainium issues by reading from Neuron sysfs
  • Networking issues – Checks Elastic Fabric Adapter (EFA) devices
  • Stress checks – Run processes on accelerators and multiple threads on CPUs to achieve 100 percent utilization. This process determines the health of the CPU or accelerator. Specifically, DCGM Diagnostics Level 2 tests are run for GPUs, and CPU health is determined using the Linux stress tool

SageMaker HyperPod continuously performs health checks on crucial components, including GPUs, AWS Trainium cores, and EFA networking devices. This proactive approach allows the HyperPod health check agent to identify various hardware failures or potential performance degradation. When hardware failures are detected, SageMaker HyperPod identifies faulty instances and is also able to use its auto-resume functionality to initiate a replacement process without manual intervention. This feature automatically detects hardware failures, seamlessly replaces faulty instances, and resumes jobs from the last saved checkpoint. In addition, SageMaker HyperPod offers you the ability to manually replace a node in case a node is stuck with an issue that is not being fixed by the SageMaker HyperPod auto-resume functionality. You can manually change the state of the node to fail, and SageMaker HyperPod will replace it with a healthy instance. For a more in-depth dive into resiliency with SageMaker HyperPod, refer to the Resiliency section of this post.

Overview of SageMaker HyperPod observability

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate your cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana. The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your SageMaker HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster’s behavior. By using these services, you gain a centralized and unified view of your SageMaker HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads. The Observability section of this post goes into more detail on which metrics are exported and what the dashboards look like in Amazon Managed Grafana.

This post is primarily focused on Amazon Managed Service for Prometheus and Amazon Managed Grafana for observability. To explore more observability integrations with SageMaker HyperPod like Nvidia Nsight, refer to the validation and observability folder of the awsome-distributed-training GitHub repo.

These resiliency and observability features collectively contribute to a more reliable and efficient training environment, minimize downtime, and optimize resource usage. By directly integrating with Amazon Managed Service for Prometheus and Amazon Managed Grafana and abstracting the management of hardware failures and job resumption, SageMaker HyperPod allows data scientists and ML engineers to focus on model development rather than infrastructure management.

Mathstral model from Mistral AI

Mathstral is a model designed for math reasoning and scientific discovery. It is based on the original Mistral 7B model and features a 32k context window. The release of Mathstral aligns with Mistral AI’s broader effort to support academic and scientific research, particularly through their collaboration with Project Numina. As a 7B model, Mathstral sets a new standard in the performance and latency space for math and reasoning generation compared to similar models, and it can achieve significantly better results with more inference-time computation.

Overview of PyTorch FSDP

In distributed data parallel (DDP) training, each process or worker owns a replica of the model and processes a batch of data. Then, it uses all-reduce to sum up gradients over different workers. In DDP, the model weights and optimizer states are replicated across all workers. DDP maintains a full copy of the model on each GPU and requires enough memory on each GPU to store the entire model. For training larger FMs, using an approach like FSDP is recommended, since these FMs require more than a single GPU. FSDP is a type of data parallelism that shards model parameters, optimizer states, and gradients across DDP ranks. This approach reduces the memory requirements on individual GPUs and distributes the memory load across GPUs. With FSDP’s enhanced efficiency, researchers and developers can use fewer GPUs, thereby minimizing operational costs and achieving faster model convergence.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes the training of some very large models feasible by allowing them to be loaded into memory with a lower memory footprint. However, this comes at the cost of increased communication volume. For more information on FSDP, refer to PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.
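As a simple illustration of the sharding approach (not the actual Mathstral training script from the workshop), the following sketch wraps a causal language model with PyTorch FSDP inside a process launched by torchrun or Slurm; the Hugging Face model ID and hyperparameters are assumptions.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Illustrative sketch only: shard parameters, gradients, and optimizer states across
# the ranks launched by torchrun/Slurm (one process per GPU).
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mathstral-7B-v0.1")  # assumed model ID
model = FSDP(model, device_id=local_rank)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ...training loop: forward pass, backward pass, optimizer.step(), and periodic checkpointing...

A production script would typically add an auto-wrap policy, mixed precision, and periodic checkpointing so that SageMaker HyperPod can auto-resume from the last saved checkpoint.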

Solution overview

The following image shows the architecture diagram for the resources deployed as part of SageMaker HyperPod for our use case of training the Mathstral model. In your account, you will have a VPC provisioned with a public and a private subnet, and an S3 bucket synced to your FSx for Lustre file system through a data repository association. In the service team account, your cluster of P4de instances is provisioned, along with the head node and the login node, for you to submit the training job to your cluster.

Prerequisites

In the context of this post, we use four p4de.24xlarge instances. You can find more information on the p4de.24xlarge instance type at Amazon EC2 P4 Instances. To get the best inter-node latency, we launch these instances together in a cluster and only run jobs on a single instance group. You can also use a variety of other instance types to follow along with this post.

For more information on getting access to instances in a partition group, refer to the Getting Started section in this post. Note that Mathstral 7B at full precision (FP32) is approximately 26 GB in size so you need to make sure that your cluster configuration has sufficient GPU memory to load the model into memory along with the gradients, activations, and moments. This should account for a total of 107 GB in addition to the training assets required to kick off a job successfully. For demonstration purposes, we use FSDP for this continued pre-training job.

The following sections describe setting up your infrastructure and environment with SageMaker HyperPod. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop. The prerequisites and cluster setup parts of this workshop go over all the required components needed in order to set up your cluster. The workshop also provides resources to troubleshoot commonly faced issues during setup.

Set up your infrastructure

Deploy HyperPod VPC stack

To set up your cluster, you first need to create some resources. The following resources can be created by deploying this SageMaker HyperPod VPC CloudFormation stack. By default usw2-az4 is specified as the Availability Zone. Change this to reflect the Availability Zone where you have your cluster. This VPC stack creates the following resources:

  • Subnet – This is a private subnet in the Availability Zone ID that you choose to use
  • Security group – This allows SageMaker HyperPod to mount your Amazon FSx for Lustre file system
  • FSx for Lustre file system – This serves as the shared file system that all the nodes can access. It’s a 1.2 TB PERSISTENT_2 file system in the private subnet you create. It gets mounted at /fsx.
  • Linux environment – This provides a standardized development environment to work in
  • Amazon Simple Storage Service (Amazon S3) bucket – To push and store your lifecycle scripts
  • AWS Identity and Access Management (IAM) role – Role required for creating the SageMaker HyperPod cluster

Deploy the observability stack

In order to use the observability integration with SageMaker HyperPod, you need to deploy the SageMaker HyperPod Observability CloudFormation stack, which can then be used to monitor your cluster metrics in real time.

Set up your environment

Let’s move on to environment setup. In order to deploy this solution, you need to use a Linux-based development environment. This section briefly describes the steps required to set up your cluster. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop.

Set up your cluster

This section guides you through the process of deploying a cluster to train with. You need to set up the following:

  • Head node and compute nodes – The head node is composed of an m5.12xlarge instance, and the worker group consists of p4de.24xlarge instances. Refer to the following table for details on these instance types.
  • Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes
  • Placement groups enabled – A placement group will launch instances close together inside one physical data center in a single Availability Zone to maximize the bandwidth and reduce the latency between instances
  • Local storage – Each node will have an 8 TB local NVMe volume attached for local storage
  • Scheduler – SLURM will be used as a job scheduler
  • Accounting – As part of cluster configuration, a local MariaDB is deployed to keep track of job runtime information
Instance size  | GPU devices | Total GPU memory | vCPUs | CPU memory | EFA bandwidth
p4de.24xlarge  | 8           | 640 GB           | 96    | 1,152 GB   | 400 Gbps

Set up the AWS CLI

Before creating the cluster and its associated resources, you need to set up the AWS Command Line Interface (AWS CLI) using the latest version (or version 2.17.1 at a minimum).

To check the AWS CLI version, use the following command.

aws --version

To update the AWS CLI to the latest version, use the following command.

sudo ./aws/install --update

The AWS CLI plugin for Session Manager, a capability of AWS Systems Manager, must be installed to access your cluster. To install the Session Manager plugin on Amazon Linux 2, use the following command:

sudo yum install -y https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm

For detailed steps on installing and setting up the AWS CLI, follow the steps provided in the Install AWS CLI section of the Amazon SageMaker HyperPod workshop.

Source environment variables

An important part of the setup is to source in all the environment variables, using the output from the VPC CloudFormation stack deployed in a previous step. Use the following command.

curl 'https://static.us-east-1.prod.workshops.aws/public/e3e1b2f1-8140-43eb-a316-e76f569119dd/static/scripts/create_config.sh' --output create_config.sh
bash create_config.sh
source env_vars

Once you have sourced them in, confirm that they were correctly set using the following command.

cat env_vars

Set up lifecycle scripts

SageMaker HyperPod uses a collection of lifecycle scripts to bootstrap the cluster. These scripts are responsible for several actions, including setting up Slurm and mounting the FSx for Lustre file system. You need to customize these scripts in order to mount your FSx for Lustre file system. For detailed steps on setting up these lifecycle scripts, refer to the Set Up Lifecycle Scripts section of the workshop.

Make sure to complete the Enable Optional Lifecycle Scripts section after step 4 of the Set Up Lifecycle Scripts section. This step installs the exporter services on the cluster, which are needed to emit metrics to Amazon Managed Service for Prometheus.

Additionally, the observability stack requires the following two AWS managed IAM policies to be added to your AmazonSagemakerClusterExecutionRole prior to creating your cluster.

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess

Once you have uploaded the lifecycle scripts to Amazon S3, you can then create your cluster.

Create your cluster

To create your cluster, you need your cluster configuration. Because you use p4de.24xlarge for this example, copy the following cluster configuration.

source env_vars
cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
      {
        "InstanceGroupName": "controller-machine",
        "InstanceType": "ml.m5.12xlarge",
        "InstanceStorageConfigs": [
          {
            "EbsVolumeConfig": {
              "VolumeSizeInGB": 500
            }
          }
        ],
        "InstanceCount": 1,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      },
      {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4de.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      }
    ],
    "VpcConfig": {
      "SecurityGroupIds": ["$SECURITY_GROUP"],
      "Subnets":["$SUBNET_ID"]
    }
}
EOL

If you use a different instance type for your cluster, refer to the Create Cluster section of the workshop to create your cluster-config.json file.

SageMaker HyperPod also gives you the ability to update your clusters to increase the size of an existing worker group or create a new worker group to add additional instance types to your cluster. For steps on updating the cluster to create additional worker groups that use other instance types, refer to the Heterogeneous Clusters section of the workshop.

Once you’ve created the cluster-config.json file, follow the Create Cluster steps in the workshop to create the FSx for Lustre configuration (provisioning_parameters.json) file and upload it to Amazon S3. Then, you can validate the configuration using the validate-config.py file in the awsome-distributed-training GitHub repo.

Once this validation is completed, you can create your cluster. Use the following command.

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region $AWS_REGION

To check the state of your cluster, run the following command.

aws sagemaker list-clusters --output table

You should then be able to observe the cluster creating.

-----------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                          ListClusters                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                       ClusterSummaries                                                          ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
||                        ClusterArn                              |    ClusterName       | ClusterStatus |               CreationTime              ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
|| arn:aws:sagemaker:us-west-2:{cluster arn}                      |  ml-cluster          | Creating      | time taken to create                    ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|

Now that you’ve created a cluster, you can monitor the status in the SageMaker console. This will show you cluster status, running instances, and node groups and allow you to modify the cluster. In the SageMaker HyperPod console, find your cluster and select it, as shown in the following screenshot.

Once the cluster status changes to InService, you can connect using Secure Shell (SSH). Make sure that you completed the step in Set up the AWS CLI to install the Session Manager plugin. You can then use the easy-ssh.sh script from the repo to simplify the SSM command and connect to the controller-machine over SSH.

curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

Use the following command to switch to the ubuntu user.

sudo su - ubuntu

Refer to the Get to know your Cluster section in the SageMaker HyperPod workshop to familiarize yourself with the commands you need to use in the later sections.

Finally, set up SSH access to the compute nodes. To do this, add an SSH key pair to the /fsx/ubuntu directory. Because all the compute nodes mount this directory, you only have to do this once for the ubuntu user to access all the compute nodes. For instructions, refer to the SSH Access to compute section of the workshop.

Congrats on setting up your environment! Now that you’ve completed the necessary steps, you can move on to your training job.

Run your pre-training job

Follow these steps on your cluster head node:

  1. Navigate to your shared FSx for Lustre file system. If you followed the tutorial linked previously, it will be located at /fsx.
  2. Use the following command to clone the awsome-distributed-training repo.
cd /fsx
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/10.FSDP
  3. Run the 0.create_conda_env.sh script.

This script will first download and install Miniconda, then create a Conda environment called pt_fsdp. The Conda environment installs PyTorch on AWS, which is a package that is built to run PyTorch workloads on AWS. Specifically, it lets you use EFA out of the box, since OFI-NCCL is pre-built in the Conda package. PyTorch on AWS also provides the latest versions of CUDA, cuDNN, and NCCL for the best performance on GPU-based instances. Dependencies required to run your FSDP training job will be installed in this Conda environment, and since this Conda environment is created on the /fsx file system, it’ll be shared across all your training nodes.

bash 0.create_conda_env.sh

For this training job, you use the C4 dataset, which is several hundred gigabytes. Instead of downloading the whole thing, the create_streaming_dataloaders function will stream the dataset from HuggingFace, so there’s no data prep required for running this training.

If you want to use your own dataset instead, you can format it as a HuggingFace dataset and pass its location to the --dataset_path argument.
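
The streaming behavior is handled for you by the repo’s create_streaming_dataloaders function; conceptually, it looks roughly like the following sketch. This is an illustration only, not the repo’s implementation, and the dataset and tokenizer identifiers simply mirror the ones passed in the training arguments later in this post.

# Illustration of streaming C4 from Hugging Face: no upfront download is needed,
# and samples are fetched lazily as you iterate over the dataset.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("mistralai/mathstral-7B-v0.1")

for i, example in enumerate(dataset):
    tokens = tokenizer(example["text"], truncation=True, max_length=1024)
    print(len(tokens["input_ids"]))
    if i == 2:   # peek at a few samples only
        break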

Launch training

The script to launch the Mathstral training job can be found in 3.distributed-training-mistral-mathstral.sbatch. Depending on the number of nodes in your cluster, you can adjust it by modifying #SBATCH --nodes=4. Because you are using four p4de.24xlarge instances, it has been set to 4.

For the purpose of this post, you need to make sure that the FI_EFA variables for EFA are exported in the 3.distributed-training-mistral-mathstral.sbatch file. If you use instances not enabled for remote direct memory access (RDMA), such as the g5.12xlarge, comment out lines 21–22 of this file. These instances have EFA between nodes, but do not have the GPU direct RDMA access of p4d/e and p5 instances. In this walkthrough, we are using p4de instances, so we leave these lines uncommented.

## Plenty of EFA level variables
## Comment out for non-efa instances (G5, G4d, P3)
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4de
export FI_LOG_LEVEL=1
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO

Under User Variables, make sure to adjust GPUS_PER_NODE to match the number of GPUs on your instance type (8 for p4de).

You can also adjust the training parameters in TRAINING_ARGS. Additional parameters can be found in model/arguments.py.

We use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint will automatically select the most recent one. This way, if the training is interrupted for any reason, it will automatically pick up the most recent checkpoint.
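
The exact resume logic lives in the repo’s training code; the following is a simplified sketch of the idea only. The iter_XXXXXXX directory naming is an assumption based on the sample checkpoint listing shown later in this post.

# Simplified sketch: pick the most recent checkpoint in a directory, if any exists.
import os
import re

def latest_checkpoint(checkpoint_dir: str):
    if not os.path.isdir(checkpoint_dir):
        return None                                   # nothing to resume from
    steps = [d for d in os.listdir(checkpoint_dir) if re.fullmatch(r"iter_\d+", d)]
    if not steps:
        return None
    newest = max(steps, key=lambda d: int(d.split("_")[1]))
    return os.path.join(checkpoint_dir, newest)

# Resume from the latest checkpoint when one exists, otherwise start fresh
resume_path = latest_checkpoint("./checkpoints/mathstral-7B")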

Note: You may change these hyperparameters in the 3.distributed-training-mistral-mathstral.sbatch file. We are using arbitrary hyperparameters here for the sake of demonstration.

declare -a TRAINING_ARGS=(
    --train_batch_size=1 
    --val_batch_size=1 
    --max_steps=5000 
    --seed=42 
    --grad_clip=1.0 
    --weight_decay=0.2 
    --beta1=0.9 
    --beta2=0.95 
    --activation_checkpointing=1 
    --intermediate_size=14336 
    --num_key_value_heads=8 
    --logging_freq=1 
    --max_context_width=32768 
    --vocab_size=32768 
    --hidden_width=4096 
    --num_layers=32 
    --num_heads=32 
    --resid_pdrop=0.1 
    --embd_pdrop=0.1 
    --attn_pdrop=0.1 
    --summary_first_pdrop=0.1 
    --initializer_range=0.02 
    --model_type="mistral" 
    --rotary_pct=0.25 
    --rotary_emb_base=10000 
    --lr=0.0001 
    --lr_decay_style="cosine" 
    --min_lr=1e-5 
    --warmup=0.0032 
    --plateau=0.0 
    --dataset="c4" 
    --tokenizer="mistralai/mathstral-7B-v0.1" 
    --epochs=3 
    --checkpoint_dir="./checkpoints/mathstral-7B" 
    --resume_from_checkpoint="./checkpoints/mathstral-7B" 
    --checkpoint_freq=50 
    --validation_freq=500 
    --dataset_config_name="en" 
    --limit_all_gathers=1 
    --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html
    --offload_activations=1
)

To launch your training, run the following command.

sbatch 3.distributed-training-mistral-mathstral.sbatch

You’ll find a new file in the FSDP directory of the form slurm-[job-number].out. This will be continuously updated with your training logs. Don’t be worried if you notice a long stream of NCCL logs (we prefer to use NCCL_DEBUG=INFO for verbose logging). After about a minute, you should observe your Mathstral model training, with an output similar to the following.

...
+ TORCHRUN=./pt_fsdp/bin/torchrun
+ export TRAIN_SCRIPT=./train.py
+ TRAIN_SCRIPT=./train.py
+ TRAINING_ARGS=(--train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type="mistral" --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style="cosine" --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset="c4" --tokenizer="mistralai/mathstral-7B-v0.1" --epochs=3 --checkpoint_dir="./checkpoints/mathstral-7B" --resume_from_checkpoint="./checkpoints/mathstral-7B" --checkpoint_freq=50 --validation_freq=500 --dataset_config_name="en" --limit_all_gathers=1 --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html --offload_activations=1)
+ declare -a TRAINING_ARGS
+ AUTO_RESUME=
+ '[' -d /opt/sagemaker_cluster ']'
+ echo 'Detected Hyperpod cluster.. enabling --auto-resume=1'
Detected Hyperpod cluster.. enabling --auto-resume=1
+ AUTO_RESUME=--auto-resume=1
+ srun --auto-resume=1 -l ./pt_fsdp/bin/torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id=35 --rdzv_backend=c10d --rdzv_endpoint=ip-10-2-39-253 ./train.py --train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type=mistral --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style=cosine --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset=c4 --tokenizer=mistralai/mathstral-7B-v0.1 --epochs=3 --checkpoint_dir=./checkpoints/mathstral-7B --resume_from_checkpoint=./checkpoints/mathstral-7B --checkpoint_freq=50 --validation_freq=500 --dataset_config_name=en --limit_all_gathers=1 --sharding_strategy=full ' #' https://pytorch.org/docs/stable/fsdp.html --offload_activations=1
...
3: 2024-07-19 03:31:38 I [train.py:155] Creating Model
3: 2024-07-19 03:33:08 I [train.py:171] Created model with total parameters: 7248023552 (7.25 B)
3:...
3: 2024-07-19 03:33:23 I [train.py:209] Wrapped model with FSDP
3: 2024-07-19 03:33:23 I [train.py:226] Created optimizer
3: 2024-07-19 03:33:23 I [checkpoint.py:70] No Checkpoints Found
...
3: 2024-07-19 03:33:35 I [train.py:102] Batch 0 Loss: 11.19900, Speed: 5.10 samples/sec, lr: 0.000006
3: 2024-07-19 03:33:38 I [train.py:102] Batch 1 Loss: 11.18291, Speed: 10.96 samples/sec, lr: 0.000013
3: 2024-07-19 03:33:40 I [train.py:102] Batch 2 Loss: 11.09163, Speed: 11.22 samples/sec, lr: 0.000019
3: 2024-07-19 03:33:43 I [train.py:102] Batch 3 Loss: 10.86621, Speed: 11.19 samples/sec, lr: 0.000025
3: 2024-07-19 03:33:46 I [train.py:102] Batch 4 Loss: 10.58236, Speed: 11.12 samples/sec, lr: 0.000031
3: 2024-07-19 03:33:49 I [train.py:102] Batch 5 Loss: 10.08024, Speed: 11.18 samples/sec, lr: 0.000038
3: 2024-07-19 03:33:52 I [train.py:102] Batch 6 Loss: 10.15507, Speed: 11.23 samples/sec, lr: 0.000044
3: 2024-07-19 03:33:55 I [train.py:102] Batch 7 Loss: 9.97296, Speed: 10.42 samples/sec, lr: 0.000050
3: 2024-07-19 03:33:58 I [train.py:102] Batch 8 Loss: 10.13596, Speed: 11.21 samples/sec, lr: 0.000056
3: 2024-07-19 03:34:01 I [train.py:102] Batch 9 Loss: 9.93156, Speed: 11.10 samples/sec, lr: 0.000063

Observability

SageMaker HyperPod can optionally be integrated with Amazon Managed Service for Prometheus and Amazon Managed Grafana to export metrics about your cluster and cluster nodes to an Amazon Managed Grafana dashboard.

For more details about configuring Amazon Managed Service for Prometheus and Amazon Managed Grafana, refer to the Prometheus Configuration and Amazon Managed Grafana sections in the SageMaker HyperPod workshop.

Slurm Exporter dashboard

The Amazon Managed Grafana Slurm dashboard (ID: 4323) provides visualization options for monitoring Slurm clusters. Prometheus Slurm exporter is installed on the controller node of the cluster. Some of the metrics exported include:

  • Cluster overview – Displays the total number of nodes, jobs, and their states
  • Job metrics – Visualizes job counts and states over time
  • Node metrics – Shows node states, allocation, and available resources
  • Partition metrics – Monitors partition-specific metrics such as CPU, memory, and GPU utilization
  • Job efficiency – Calculates job efficiency based on resources used

The following screenshot of the exporter dashboard shows the continued pre-training job for Mathstral being completed successfully.

Node Exporter dashboard

The Amazon Managed Grafana Node Exporter Full dashboard (ID: 1860) offers visualization options for monitoring system metrics collected by the Prometheus Node Exporter installed on the cluster nodes. Some of the key metrics you can visualize include:

  • System overview – Displays CPU load averages and memory usage
  • Memory metrics – Visualizes memory utilization including total memory, free memory, and swap space
  • Disk usage – Monitors disk space utilization and availability
  • Network traffic – Shows network bytes received and transmitted over time
  • File system metrics – Analyzes file system usage and availability
  • Disk I/O metrics – Visualizes disk read and write activity

DCGM Exporter dashboard

The Amazon Managed Grafana NVIDIA DCGM Exporter dashboard (ID: 12239) offers visualization options for monitoring NVIDIA GPU metrics collected by the DCGM Exporter. Some of the key metrics you can visualize include:

  • GPU overview – Displays GPU utilization, temperatures, power usage, and memory usage
  • Temperature metrics – Visualizes GPU temperatures over time
  • Power usage – Monitors GPU power draw and power usage trends
  • Memory utilization – Analyzes GPU memory usage, including used, free, and total memory
  • Fan speed – Shows GPU fan speeds and variations
  • ECC errors – Tracks GPU memory ECC errors and pending errors

EFA Metrics dashboard

The Amazon Managed Grafana EFA Metrics dashboard (ID: 20579) offers visualization options for monitoring EFA related metrics collected by the EFA Node Exporter. Some of the key visualizations include:

  • EFA error metrics – Visualizes errors such as allocation errors, command errors, and memory map errors
  • EFA network traffic – Monitors received and transmitted bytes, packets, and work requests
  • EFA RDMA performance – Analyzes RDMA read and write operations, including bytes transferred and error rates
  • EFA port lifespan – Displays the lifespan of EFA ports over time
  • EFA keep-alive packets – Tracks the number of keep-alive packets received

FSx Metrics dashboard

The Amazon Managed Grafana FSx for Lustre dashboard (ID: 20906) offers visualization options for monitoring Amazon FSx for Lustre file system related metrics collected by Amazon CloudWatch. Some of the key visualizations include:

  • DataReadBytes – The number of bytes for file system read operations
  • DataWriteBytes – The number of bytes for file system write operations
  • DataReadOperations – The number of read operations
  • DataWriteOperations – The number of write operations
  • MetadataOperations – The number of metadata operations
  • FreeDataStorageCapacity – The amount of available storage capacity

These metrics provide insights into various aspects of your FSx for Lustre file systems.

Resiliency

As mentioned previously, one of the value propositions of SageMaker HyperPod is that it provides a variety of cluster resiliency features such as cluster health checks, auto-resume, and the option to manually replace faulty nodes.

Based on the status of these health checks, SageMaker HyperPod detects whether nodes in the cluster are healthy or not. If a node is deemed unhealthy by any of the health checks, SageMaker HyperPod uses its auto-resume feature to automatically replace the faulty node, without any manual intervention.

Additionally, users have the option to implement checkpointing in their training procedure. Checkpointing, combined with auto-resume, means that once a faulty node is replaced, the training job can resume from the last saved checkpoint. This way, despite a hardware failure, a user’s training job can run with minimal loss in progress.

In this section, we demonstrate the resiliency and auto-resume feature of SageMaker HyperPod by simulating a hardware failure scenario and pointing you towards some logs that indicate the success of a replacement job. We use the same submitted FSDP training job, which has the following two important components enabled:

  1. Checkpointing is enabled and implemented
  2. The --auto-resume=1 flag is set. You can verify this in the Slurm .out file.

The following section of the provided sbatch file sets the --auto-resume=1 flag.

AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi
srun ${AUTO_RESUME} -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

The sbatch file has the checkpointing flags checkpoint_freq, checkpoint_dir, resume_from_checkpoint, which tell the job how often to write checkpoints, where to write the checkpoints to, and what directory to read checkpoints from in case of failure, respectively.

Assuming that you already have your training job submitted, wait until a few checkpoints are written to the ./checkpoints directory (or the directory you specified for checkpoint_dir). You can check whether any checkpoints were written by running ls -lt checkpoints/. This should return an output that resembles the following.

total 74
-rw-rw-r--  1 ubuntu ubuntu     1 Dec  9 00:21 latest_checkpointed_iteration.txt
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:20 iter_0000002
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:11 iter_0000001

You may also check the progress of your training job by running tail -f slurm-<job-id>.log, where <job-id> can be derived by running squeue. You should observe an output that resembles the following.

1:  iteration        1/  508626 | consumed samples:          288 | elapsed time per iteration (ms): 440352.6 | learning rate: 0.000E+00 | global batch size:   288 | loss scale: 4294967296.0 | number of skipped iterations:   1 | number of nan iterations:   0 |0: saving checkpoint at iteration       1 to /fsx/checkpoints
0:   successfully saved checkpoint at iteration       1 to /fsx/checkpoints
1: (min, max) time across ranks (ms):
1:     save-checkpoint ................................: (81611.24, 81611.82)

Once you’ve confirmed that your training job is running and that you have checkpoints written, you are ready to simulate a hardware failure.

As part of running squeue, you received output that resembles the following.

JOBID PARTITION     NAME   USER ST  TIME  NODES NODELIST(REASON)
32          dev interact ubuntu  R  0:02      4  ip-10-2-9-98,...

This tells you what jobs are running and on what nodes. Locate your training job and choose any of the nodes allocated to it except the first one; the node you choose is the one you will inject an error into. Avoiding the first node is very important because PyTorch uses node 0 (that is, the first node) as the coordination node for your training job.

Once you’ve identified the node to inject the error onto, connect to it using SSH with the following command.

ssh <NODE ip>

You can inject an ECC error by running the following command.

dcgmi test --inject --gpuid 0 -f 319 -v 4

This simulates a double-bit error (DBE) on a GPU in your chosen node. Additionally, to simulate a job failure, you can kill the training job by taking the process ID (PID) of any of the Python processes running your FSDP training job. The -9 flag is the signal number for SIGKILL, which forces a process to stop immediately, without giving it a chance to clean up or perform any other actions.

ps -aux | grep python
kill -9 <PID>

Once the ECC error is injected and the Python process has stopped, you can exit out of your compute node. In the meantime, you can get the output of the slurmctld.log file using the following command.

tail -f /var/log/slurm/slurmctld.log

In there, you can observe the following lines, which show a failed job or node.

 [2024-07-19T04:13:03.313] sched: Allocate JobId=35 NodeList-ip-10-2-39-253, ip-10-2-40-102, ip-10-2-76-26, ip-10-2-108-162 #CPUs=192 Partition=dev
 [2024-07-19T04:50:31.682] _slurm_rpc_submit_batch_job: JobId=35 InitPrio=1 usec=727
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 reason set to: Action: Replace 
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 state set to FAILING

Pay attention to the line that says update_node: node ip-10-2-39-253 reason set to: Action: Replace, which is the log entry indicating that the node has failed and requires replacement.

If you look at your <slurm-job>.out file, you should observe logs like the following.

[Auto Resume] Info: JobID: 35 StepID: 0 Initiating communication with cluster agent to diagnose health of nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Response from cluster agent: JobID=35, ResumeAction=RETRYSTEP
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - replacing nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - Dropping unhealthy nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Succesfully shrink job to retain healthy nodes ...
srun: job 35 queued and waiting for resources

This shows that your training job (job 35) is paused and that SageMaker HyperPod has initiated the node replacement process. You can verify this by running squeue, where you will observe a job named auto-res. This is the auto-resume job that SageMaker HyperPod initiates to replace your faulty node.

JOBID PARTITION  NAME      USER ST    TIME NODES NODELIST(REASON)
35    dev    auto-res    ubuntu PD    0:00     4 (Resources)
...

You can also monitor your SageMaker HyperPod cluster using the AWS console. Under Instances, you should observe one of the nodes in worker-group-1 in Pending state, as shown in the following screenshot. This shows that the node is about to get replaced.

Once your node is replaced, you can observe the slurmctld.log file. Be on the alert for the following line:

update_node: node <YOUR-NODE-IP-ADDRESS> reason set to: AWS:Replaced

You can also verify that your node was successfully replaced using the HyperPod cluster tab in the Amazon SageMaker console.

Once your node is replaced, squeue should no longer display the auto-res job and should only display your original training job. The node is successfully replaced, without any manual intervention.

Because you enabled checkpointing, you can verify that the training job resumes from the latest checkpoint. In your <slurm-job>.out file, find the following lines, which show that a checkpoint was detected in the checkpoint directory (./checkpoints) and that the latest checkpoint was loaded, respectively.

...
Loading checkpoint from checkpoints/mathstral-10steps ...
...
Checkpoint loaded from checkpoints/mathstral-10steps ...
...

If you continue to monitor your <slurm-job>.out file, you should observe that your training job has resumed from the latest checkpoint.

Clean up

  1. To delete your cluster, enter the following command.
aws sagemaker delete-cluster --cluster-name ml-cluster

After the delete operation completes, confirm that the cluster no longer appears in the SageMaker HyperPod clusters section of the SageMaker console.

  2. To use the console to delete your SageMaker HyperPod VPC and Observability CloudFormation stacks, follow the directions at Delete a stack from the CloudFormation console. Alternatively, use the AWS CLI by entering the following command for each stack. Replace my-stack with the name of your stack.
aws cloudformation delete-stack \
    --stack-name my-stack

Conclusion

In this post, we provided a comprehensive guide on using Amazon SageMaker HyperPod for training large-scale models such as Mistral AI’s Mathstral using PyTorch Fully Sharded Data Parallel (FSDP). The process highlighted the efficiency of distributed training on SageMaker HyperPod, showcasing the critical role of resiliency and observability features in maintaining uninterrupted, scalable training environments.

Because of the integration with tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana for real-time monitoring, along with the robust cluster management capabilities of SageMaker HyperPod, ML practitioners can focus on model development rather than infrastructure management. The detailed steps for setting up the infrastructure, deploying the observability stack, and running a training job demonstrate how SageMaker HyperPod helps tackle the complexities of distributed training.

Moreover, the automatic health checks and the auto-resume feature significantly reduce downtime and minimize the impact of hardware failures so that large-scale training jobs can proceed with minimal interruptions. This level of resilience is crucial for maintaining the pace of innovation in AI research, especially when dealing with massive FMs.

By following the outlined procedures and using the powerful tools provided by AWS, data scientists and engineers can optimize their training workflows, reduce operational overhead, and accelerate the development of state-of-the-art models.

Getting Started

Interested in getting started with SageMaker HyperPod? Reach out to your AWS Account Team or email aws-frameworks-gtm@amazon.com. To begin experimenting with other examples on SageMaker HyperPod, refer to the awsome-distributed-training GitHub repo and the Amazon SageMaker HyperPod workshop.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with key GenAI foundation model providers, AWS service teams, strategic customers, founders, universities, venture ecosystems, and Amazon to develop technology strategy that enables the next generation of artificial intelligence, machine learning, and accelerated computing on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
