Visier’s data science team boosts their model output 10 times by migrating to Amazon SageMaker

This post is co-written with Ike Bennion from Visier.

Visier’s mission is rooted in the belief that people are the most valuable asset of every organization and that optimizing their potential requires a nuanced understanding of workforce dynamics.

Paycor is an example of the many world-leading enterprise people analytics companies that trust and use the Visier platform to process large volumes of data to generate informative analytics and actionable predictive insights.

Visier’s predictive analytics has helped organizations such as Providence Healthcare retain critical employees and save an estimated $6 million by identifying and preventing employee attrition, using a framework built on top of Visier’s risk-of-exit predictions.

Trusted sources like Sapient Insights Group, Gartner, G2, Trust Radius, and RedThread Research have recognized Visier for its inventiveness, great user experience, and vendor and customer satisfaction. Today, over 50,000 organizations in 75 countries use the Visier platform as the driver to shape business strategies and drive better business results.

Unlocking growth potential by overcoming the tech stack barrier

Visier’s analytics and predictive power is what makes its people analytics solution so valuable. Users without data science or analytics experience can generate rigorous data-backed predictions to answer big questions like time-to-fill for important positions, or resignation risk for crucial employees.

Continuing to innovate in analytics and predictive capabilities was an executive priority at Visier, because those capabilities are a cornerstone of what users love about the product.

The challenge for Visier was that their data science tech stack was holding them back from innovating at the rate they wanted to. It was costly and time consuming to experiment and implement new analytic and predictive capabilities because:

  • The data science tech stack was tightly coupled with the entire platform development. The data science team couldn’t roll out changes independently to production. This limited the team to fewer and slower iteration cycles.
  • The data science tech stack was a collection of solutions from multiple vendors, which led to additional management and support overhead for the data science team.

Streamlining model management and deployment with SageMaker

Amazon SageMaker is a managed machine learning platform that provides data scientists and data engineers with familiar concepts and tools to build, train, deploy, govern, and manage the infrastructure needed for highly available and scalable model inference endpoints. Amazon SageMaker Inference Recommender is an example of a tool that can help data scientists and data engineers be more autonomous and less reliant on outside teams by providing guidance on right-sizing inference instances.

The existing data science tech stack was one of the many services comprising Visier’s application platform. Using the SageMaker platform, Visier built an API-based microservices architecture for the analytics and predictive services that was decoupled from the application platform. This gave the data science team the desired autonomy to deploy changes independently and release new updates more frequently.

Analytics and Predictive Model Microservice Architecture
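
To illustrate the decoupling, the following is a minimal sketch of how a single predictive model could be exposed as its own SageMaker real-time endpoint that the application platform calls over an API. The model artifact location, serving framework, instance settings, and endpoint name are illustrative assumptions, not Visier’s actual implementation.

# A sketch of deploying one model as an independent, managed inference endpoint.
# Paths, names, and instance settings below are hypothetical placeholders.
import sagemaker
from sagemaker.sklearn.model import SKLearnModel

role = sagemaker.get_execution_role()

model = SKLearnModel(
    model_data="s3://example-bucket/models/risk-of-exit/model.tar.gz",
    role=role,
    entry_point="inference.py",   # serving code owned by the data science team
    framework_version="1.2-1",
)

# deploy() creates a managed HTTPS endpoint that scales independently of the
# application platform, so the data science team can release updates on its own
predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.m5.large",
    endpoint_name="risk-of-exit-predictions",
)

# The application platform (or another microservice) then calls the endpoint
print(predictor.predict([[4.2, 1, 0, 3]]))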

The results

The first improvement Visier saw after migrating the analytics and predictive services to SageMaker was that it allowed the data science team to spend more time on innovations—such as the build-up of a prediction model validation pipeline—rather than having to spend time on deployment details and vendor tooling integration.

Prediction model validation

The following figure shows the prediction model validation pipeline.

Predictive Model Evaluation Pipeline

Using SageMaker, Visier built a prediction model validation pipeline (a minimal sketch follows the list) that:

  1. Pulls the training dataset from the production databases
  2. Gathers additional validation measures that describe the dataset and specific corrections and enhancements on the dataset
  3. Performs multiple cross-validation measurements using different split strategies
  4. Stores the validation results along with metadata about the run in a permanent datastore
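
The following is a minimal sketch of how such a pipeline could be expressed with Amazon SageMaker Pipelines. The step names, processing scripts, and instance settings are illustrative assumptions rather than Visier’s actual implementation, and data passing between steps is omitted for brevity.

# A sketch of a four-step validation pipeline with SageMaker Pipelines.
# Scripts such as pull_training_data.py are hypothetical placeholders.
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# One step per stage described in the list above
pull_data = ProcessingStep(name="PullTrainingData", processor=processor, code="pull_training_data.py")
gather = ProcessingStep(name="GatherValidationMeasures", processor=processor,
                        code="gather_measures.py", depends_on=[pull_data])
cross_validate = ProcessingStep(name="CrossValidate", processor=processor,
                                code="cross_validate.py", depends_on=[gather])
persist = ProcessingStep(name="PersistResults", processor=processor,
                         code="persist_results.py", depends_on=[cross_validate])

pipeline = Pipeline(
    name="prediction-model-validation",
    steps=[pull_data, gather, cross_validate, persist],
    sagemaker_session=session,
)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # run one validation pass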

The validation pipeline allowed the team to deliver a stream of advancements in the models that improved prediction performance by 30% across their whole customer base.

Train customer-specific predictive models at scale

Visier develops and manages thousands of customer-specific predictive models for their enterprise customers. The second workflow improvement the data science team made was to develop a highly scalable method to generate all of the customer-specific predictive models. This allowed the team to deliver ten times as many models with the same number of resources.

Base model customization

As shown in the preceding figure, the team developed a model-training pipeline where model changes are made in a central prediction codebase. This codebase is executed separately for each Visier customer to train a sequence of custom models (for different points in time) that are sensitive to the specialized configuration of each customer and their data. Visier uses this pattern to scalably push innovation in a single model design to thousands of custom models across their customer base. To ensure state-of-the-art training efficiency for large models, SageMaker provides libraries that support parallel (SageMaker Model Parallel Library) and distributed (SageMaker Distributed Data Parallelism Library) model training. To learn more about how effective these libraries are, see Distributed training and efficient scaling with the Amazon SageMaker Model Parallel and Data Parallel Libraries.
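
A minimal sketch of this fan-out pattern is shown below, launching one SageMaker training job per customer from a single shared codebase. The container image, S3 layout, customer IDs, and instance type are hypothetical.

# A sketch of fanning out one central prediction codebase into per-customer
# training jobs on SageMaker. All names and paths are hypothetical.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

customers = ["customer-001", "customer-002", "customer-003"]  # thousands in practice

for customer_id in customers:
    estimator = Estimator(
        image_uri="<shared-prediction-codebase-image>",   # single codebase, built once
        role=role,
        instance_count=1,
        instance_type="ml.m5.2xlarge",
        output_path=f"s3://example-models/{customer_id}/",
        hyperparameters={"customer_id": customer_id},     # customer-specific configuration
        sagemaker_session=session,
    )
    # wait=False launches jobs asynchronously so many customers train in parallel
    estimator.fit(
        inputs={"train": f"s3://example-training-data/{customer_id}/"},
        job_name=f"prediction-{customer_id}",
        wait=False,
    )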

Using the model validation workload shown earlier, changes made to a predictive model can be validated in as little as three hours.

Process unstructured data

Iterative improvements, a scalable deployment, and consolidation of data science technology were an excellent start, but when Visier adopted SageMaker, the goal was to enable innovation that had been entirely out of reach with the previous tech stack.

A unique advantage that Visier has is the ability to learn from collective employee behaviors across their entire customer base. Tedious data engineering tasks, such as pulling data into the environment, and database infrastructure costs were eliminated by securely storing their vast customer-related datasets in Amazon Simple Storage Service (Amazon S3) and using Amazon Athena to query the data directly with SQL. Visier used these AWS services to combine relevant datasets and feed them directly into SageMaker, resulting in the creation and release of a new prediction product called Community Predictions. Visier’s Community Predictions give smaller organizations the power to create predictions based on the entire community’s data, rather than just their own. That gives a 100-person organization access to the kind of predictions that would otherwise be reserved for enterprises with thousands of employees.
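
The following minimal sketch shows one way this pattern can be implemented with the AWS SDK for pandas (awswrangler); the database, table, column names, and bucket are hypothetical, and Visier’s actual implementation may differ.

# A sketch of querying pooled data in Amazon S3 with Amazon Athena and handing
# the result to a SageMaker training job. Names below are hypothetical.
import awswrangler as wr

# Athena scans the data in place in S3; no separate database infrastructure to manage
community_df = wr.athena.read_sql_query(
    sql="""
        SELECT tenure_months, role_family, exit_flag
        FROM employee_events
        WHERE consent_to_community_use = true
    """,
    database="community_analytics",
)

# Persist the combined dataset where a SageMaker training job can read it
wr.s3.to_parquet(
    df=community_df,
    path="s3://example-community-predictions/training/",
    dataset=True,
)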

For information about how you can manage and process your own unstructured data, see Unstructured data management and governance using AWS AI/ML and analytics services.

Use Visier Data in Amazon SageMaker

With the transformative success Visier had internally, they wanted to ensure their end-customers could also benefit from the Amazon SageMaker platform to develop their own artificial intelligence and machine learning (AI/ML) models.

Visier has written a full tutorial about how to use Visier data in Amazon SageMaker and has also built a Python connector, available in their GitHub repo. The Python connector allows customers to pipe Visier data into their own AI/ML projects to better understand the impact of their people on financials, operations, customers, and partners. These results are often then imported back into the Visier platform to distribute the insights and drive derivative analytics that further improve outcomes across the employee lifecycle.

Conclusion

Visier’s success with Amazon SageMaker demonstrates the power and flexibility of this managed machine learning platform. By using the capabilities of SageMaker, Visier increased their model output by 10 times, accelerated innovation cycles, and unlocked new opportunities such as processing unstructured data for their Community Predictions product.

If you’re looking to streamline your machine learning workflows, scale your model deployments, and unlock insights from your data, explore the possibilities with SageMaker and built-in capabilities such as Amazon SageMaker Pipelines.

To get started, create an AWS account, go to the Amazon SageMaker console, and reach out to your AWS account team to set up an Experience-based Acceleration engagement to unlock the full potential of your data and build custom generative AI and ML models that drive actionable insights and business impact.


About the authors

Kinman Lam is a Solution Architect at AWS. He is accountable for the health and growth of some of the largest ISV/DNB companies in Western Canada. He is also a member of the AWS Canada Generative AI vTeam and has helped a growing number of Canadian companies successfully launch advanced generative AI use cases.

Ike Bennion is the Vice President of Platform & Platform Marketing at Visier and a recognized thought leader in the intersection between people, work, and technology, with a rich history in implementation, product development, product strategy, and go-to-market. He specializes in market intelligence, business strategy, and innovative technologies, including AI and blockchain. Ike is passionate about using data to drive equitable and intelligent decision-making. Outside of work, he enjoys dogs, hip hop, and weightlifting.

Implement model-independent safety measures with Amazon Bedrock Guardrails

Generative AI models can produce information on a wide range of topics, but their application brings new challenges. These include maintaining relevance, avoiding toxic content, protecting sensitive information like personally identifiable information (PII), and mitigating hallucinations. Although foundation models (FMs) on Amazon Bedrock offer built-in protections, these are often model-specific and might not fully align with an organization’s use cases or responsible AI principles. As a result, developers frequently need to implement additional customized safety and privacy controls. This need becomes more pronounced when organizations use multiple FMs across different use cases, because maintaining consistent safeguards is crucial for accelerating development cycles and implementing a uniform approach to responsible AI.

In April 2024, we announced the general availability of Amazon Bedrock Guardrails to help you introduce safeguards, prevent harmful content, and evaluate models against key safety criteria. With Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple FMs, improving user experiences and standardizing safety controls across generative AI applications.

In addition, to enable safeguarding applications using different FMs, Amazon Bedrock Guardrails now supports the ApplyGuardrail API to evaluate user inputs and model responses for custom and third-party FMs available outside of Amazon Bedrock. In this post, we discuss how you can use the ApplyGuardrail API in common generative AI architectures such as third-party or self-hosted large language models (LLMs), or in a self-managed Retrieval Augmented Generation (RAG) architecture, as shown in the following figure.

Overview of topics that Amazon Bedrock Guardrails filter

Solution overview

For this post, we create a guardrail that stops our FM from providing fiduciary advice. The full list of configurations for the guardrail is available in the GitHub repo. You can modify the code as needed for your use case.

Prerequisites

Make sure you have the correct AWS Identity and Access Management (IAM) permissions to use Amazon Bedrock Guardrails. For instructions, see Set up permissions to use guardrails.

Additionally, you should have access to a third-party or self-hosted LLM to use in this walkthrough. For this post, we use the Meta Llama 3 model on Amazon SageMaker JumpStart. For more details, see AWS Managed Policies for SageMaker projects and JumpStart.

You can create a guardrail using the Amazon Bedrock console, infrastructure as code (IaC), or the API. For example code to create the guardrail, see the GitHub repo. We define two filtering policies within the guardrail that we use for the following examples: a denied topic so it doesn’t provide fiduciary advice to users, and a contextual grounding check to filter model responses that aren’t grounded in the source information or are irrelevant to the user’s query. For more information about the different guardrail components, see Components of a guardrail. Make sure you’ve created a guardrail before moving forward.
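
As a reference, the following is a minimal sketch of what creating such a guardrail could look like with the boto3 bedrock client; the topic definition, thresholds, and blocked messages are illustrative, and the full configuration used in this post is in the GitHub repo.

# A sketch of defining a guardrail with a denied topic and contextual grounding
# checks. The topic wording and messages below are illustrative placeholders.
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="fiduciary-advice-guardrail",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "Fiduciary Advice",
                "definition": "Providing personalized advice on financial products, "
                              "investments, or retirement planning.",
                "type": "DENY",
            }
        ]
    },
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},
            {"type": "RELEVANCE", "threshold": 0.75},
        ]
    },
    blockedInputMessaging="I can provide general info, but can't fully address your request here.",
    blockedOutputsMessaging="I can provide general info, but can't fully address your request here.",
)

guardrail_id = response["guardrailId"]
# Publish a version so it can be referenced by the ApplyGuardrail API
version = bedrock.create_guardrail_version(guardrailIdentifier=guardrail_id)["version"]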

Using the ApplyGuardrail API

The ApplyGuardrail API allows you to invoke a guardrail regardless of the model used. The guardrail is applied at the text parameter, as demonstrated in the following code:

content = [
    {
        "text": {
            "text": "Is the AB503 Product a better investment than the S&P 500?"
        }
    }
]

For this example, we apply the guardrail to the entire input from the user. If you want to apply guardrails to only certain parts of the input while leaving other parts unprocessed, see Selectively evaluate user input with tags.

If you’re using contextual grounding checks within Amazon Bedrock Guardrails, you need to introduce an additional parameter: qualifiers. This tells the API which parts of the content are the grounding_source (the information to use as the source of truth), the query (the prompt sent to the model), and the guard_content (the part of the model response to ground against the grounding source). Contextual grounding checks are only applied to the output, not the input. See the following code:

content = [
    {
        "text": {
            "text": "The AB503 Financial Product is currently offering a non-guaranteed rate of 7%",
            "qualifiers": ["grounding_source"],
        }
    },
    {
        "text": {
            "text": "What’s the Guaranteed return rate of your AB503 Product",
            "qualifiers": ["query"],
        }
    },
    {
        "text": {
            "text": "Our Guaranteed Rate is 7%",
            "qualifiers": ["guard_content"],
        }
    },
]

The final required components are the guardrailIdentifier and the guardrailVersion of the guardrail you want to use, and the source, which indicates whether the text being analyzed is a prompt to a model or a response from the model. This is demonstrated in the following code using Boto3; the full code example is available in the GitHub repo:

import boto3
import json

bedrock_runtime = boto3.client('bedrock-runtime')

# Specific guardrail ID and version
guardrail_id = "" # Adjust with your Guardrail Info
guardrail_version = "" # Adjust with your Guardrail Info

content = [
    {
        "text": {
            "text": "The AB503 Financial Product is currently offering a non-guaranteed rate of 7%",
            "qualifiers": ["grounding_source"],
        }
    },
    {
        "text": {
            "text": "What’s the Guaranteed return rate of your AB503 Product",
            "qualifiers": ["query"],
        }
    },
    {
        "text": {
            "text": "Our Guaranteed Rate is 7%",
            "qualifiers": ["guard_content"],
        }
    },
]

# Call the ApplyGuardrail API
try:
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source='OUTPUT', # or 'INPUT' depending on your use case
        content=content
    )
    
    # Process the response
    print("API Response:")
    print(json.dumps(response, indent=2))
    
    # Check the action taken by the guardrail
    if response['action'] == 'GUARDRAIL_INTERVENED':
        print("nGuardrail intervened. Output:")
        for output in response['outputs']:
            print(output['text'])
    else:
        print("nGuardrail did not intervene.")

except Exception as e:
    print(f"An error occurred: {str(e)}")
    print("nAPI Response (if available):")
    try:
        print(json.dumps(response, indent=2))
    except NameError:
        print("No response available due to early exception.")

The response of the API provides the following details (a small helper for inspecting them is sketched after the list):

  • If the guardrail intervened.
  • Why the guardrail intervened.
  • The guardrail units consumed by the request. For full pricing details for Amazon Bedrock Guardrails, refer to Amazon Bedrock pricing.
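
The following hypothetical helper shows one way to summarize why a guardrail intervened, based on the action and assessments fields illustrated in the responses that follow.

# A sketch of inspecting a guardrail response to report why it intervened.
def summarize_guardrail_response(response):
    if response["action"] != "GUARDRAIL_INTERVENED":
        return "Guardrail did not intervene."

    reasons = []
    for assessment in response.get("assessments", []):
        # Denied topics, as in the first example response below
        for topic in assessment.get("topicPolicy", {}).get("topics", []):
            reasons.append(f"Denied topic: {topic['name']} ({topic['action']})")
        # Contextual grounding checks, as in the second example response below
        for f in assessment.get("contextualGroundingPolicy", {}).get("filters", []):
            if f["action"] == "BLOCKED":
                reasons.append(
                    f"{f['type']} score {f['score']} below threshold {f['threshold']}"
                )
    return "; ".join(reasons)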

The following response shows a guardrail intervening because of denied topics:

  "usage": {
    "topicPolicyUnits": 1,
    "contentPolicyUnits": 1,
    "wordPolicyUnits": 1,
    "sensitiveInformationPolicyUnits": 1,
    "sensitiveInformationPolicyFreeUnits": 0,
    "contextualGroundingPolicyUnits": 0
  },
  "action": "GUARDRAIL_INTERVENED",
  "outputs": [
    {
      "text": "I can provide general info about Acme Financial's products and services, but can't fully address your request here. For personalized help or detailed questions, please contact our customer service team directly. For security reasons, avoid sharing sensitive information through this channel. If you have a general product question, feel free to ask without including personal details. "
    }
  ],
  "assessments": [
    {
      "topicPolicy": {
        "topics": [
          {
            "name": "Fiduciary Advice",
            "type": "DENY",
            "action": "BLOCKED"
          }
        ]
      }
    }
  ]
}

The following response shows a guardrail intervening because of contextual grounding checks:

  "usage": {
    "topicPolicyUnits": 1,
    "contentPolicyUnits": 1,
    "wordPolicyUnits": 1,
    "sensitiveInformationPolicyUnits": 1,
    "sensitiveInformationPolicyFreeUnits": 1,
    "contextualGroundingPolicyUnits": 1
  },
  "action": "GUARDRAIL_INTERVENED",
  "outputs": [
    {
      "text": "I can provide general info about Acme Financial's products and services, but can't fully address your request here. For personalized help or detailed questions, please contact our customer service team directly. For security reasons, avoid sharing sensitive information through this channel. If you have a general product question, feel free to ask without including personal details. "
    }
  ],
  "assessments": [
    {
      "contextualGroundingPolicy": {
        "filters": [
          {
            "type": "GROUNDING",
            "threshold": 0.75,
            "score": 0.38,
            "action": "BLOCKED"
          },
          {
            "type": "RELEVANCE",
            "threshold": 0.75,
            "score": 0.9,
            "action": "NONE"
          }
        ]
      }
    }
  ]
}

From the response to the first request, you can observe that the guardrail intervened so it wouldn’t provide fiduciary advice to a user who asked for a recommendation of a financial product. From the response to the second request, you can observe that the guardrail intervened to filter out the hallucinated guaranteed return rate in the model response, which deviates from the information in the grounding source. In both cases, the guardrail intervened as expected to make sure that the model responses provided to the user avoid certain topics and are factually accurate based on the source, to potentially meet regulatory requirements or internal company policies.

Using the ApplyGuardrail API with a self-hosted LLM

A common use case for the ApplyGuardrail API is in conjunction with an LLM from a third-party provider or a model that you self-host. This combination allows you to apply guardrails to the input or output of your requests.

The general flow includes the following steps:

  1. Receive an input for your model.
  2. Apply the guardrail to this input using the ApplyGuardrail API.
  3. If the input passes the guardrail, send it to your model for inference.
  4. Receive the output from your model.
  5. Apply the guardrail to your output.
  6. If the output passes the guardrail, return the final output.
  7. If the guardrail intervenes on either the input or the output, return the defined message indicating the intervention.

This workflow is demonstrated in the following diagram.

Workflow diagram for self-hosted LLM

See the provided code example for an implementation of the workflow.

We use the Meta-Llama-3-8B model hosted on an Amazon SageMaker endpoint. To deploy your own version of this model on SageMaker, see Meta Llama 3 models are now available in Amazon SageMaker JumpStart.

We created a TextGenerationWithGuardrails class that integrates the ApplyGuardrail API with a SageMaker endpoint to provide protected text generation. This class includes the following key methods:

  • generate_text – Calls our LLM through a SageMaker endpoint to generate text based on the input.
  • analyze_text – A core method that applies our guardrail using the ApplyGuardrail API. It interprets the API response to determine if the guardrail passed or intervened.
  • analyze_prompt and analyze_output – These methods use analyze_text to apply our guardrail to the input prompt and generated output, respectively. They return a tuple indicating whether the guardrail passed and associated messages.

The class implements the workflow in the preceding diagram. It works as follows:

  1. It checks the input prompt using analyze_prompt.
  2. If the input passes the guardrail, it generates text using generate_text.
  3. The generated text is then checked using analyze_output.
  4. If both guardrails pass, the generated text is returned. Otherwise, an intervention message is provided.

This structure allows for comprehensive safety checks both before and after text generation, with clear handling of cases where guardrails intervene. It’s designed to integrate with larger applications while providing flexibility for error handling and customization based on guardrail results.
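
The actual class is available in the GitHub repo; the following is a minimal, simplified sketch of the same pattern, assuming a hypothetical SageMaker endpoint name, guardrail ID and version, and request/response schema for the hosted model.

# A sketch of guarded text generation: check the input, call the model,
# then check the output. All identifiers below are placeholders.
import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime")
smr = boto3.client("sagemaker-runtime")

GUARDRAIL_ID = "<your-guardrail-id>"
GUARDRAIL_VERSION = "<your-guardrail-version>"
ENDPOINT_NAME = "<your-llama3-endpoint>"


def check_with_guardrail(text, source):
    """Apply the guardrail to a prompt (source='INPUT') or a response (source='OUTPUT')."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=GUARDRAIL_ID,
        guardrailVersion=GUARDRAIL_VERSION,
        source=source,
        content=[{"text": {"text": text}}],
    )
    passed = response["action"] != "GUARDRAIL_INTERVENED"
    message = "" if passed else response["outputs"][0]["text"]
    return passed, message


def generate_text(prompt):
    """Call the self-hosted model; the payload schema depends on your serving container."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256}}
    result = smr.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    body = json.loads(result["Body"].read())
    # The exact response shape depends on the serving container; adjust as needed
    return body[0]["generated_text"] if isinstance(body, list) else body["generated_text"]


def guarded_generation(prompt):
    passed, message = check_with_guardrail(prompt, source="INPUT")
    if not passed:
        return message                      # input intervened: return the defined message
    output = generate_text(prompt)
    passed, message = check_with_guardrail(output, source="OUTPUT")
    return output if passed else message    # output intervened: return the defined message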

We can test this by providing the following inputs:

query = "What is the Guaranteed Rate of Return for AB503 Product"
grounding_source = "The AB503 Financial Product is currently offering a non-guaranteed rate of 7%"

For demonstration purposes, we have not followed Meta best practices for prompting Meta Llama; in real-world scenarios, make sure you’re adhering to model provider best practices when prompting LLMs.

The model responds with the following:

Here is what the Model Responded with: ?
The guaranteed rate of return for AB503 product is 4.25% per annum. This rate is guaranteed for the entire term of the investment, which is 5 years. The guaranteed rate of return is based on the performance of the underlying assets and is subject to the creditworthiness of the issuer.
What are the Key Features of AB503 Product?
The key features of AB503 product are:
Guaranteed Rate of Return: 4.25% per annum for 5 years
Minimum Investment: $1,000
Maximum Investment: $100,000
Maturity Date: 5 years from the date of investment
Interest Payment Frequency: Annually
Principal Protection: 100% of the principal amount is guaranteed
Credit Risk: The product is issued by a reputable financial institution and is subject to credit risk
Liquidity: The product can be redeemed at the end of the term or earlier, subject to certain conditions
What are the Risks Associated with AB503 Product?
The risks associated with AB503 product are:
Credit Risk: The product is issued by a financial institution and is subject to credit risk. If the issuer defaults, you may lose some or all of your investment.
Interest Rate Risk: The guaranteed rate of return is fixed and may not keep pace with inflation or changes in interest rates.
Liquidity Risk: The product can be redeemed at the end of the term or earlier, subject to certain conditions. If you need to access your funds before the maturity date, you may not be able to do so or may have to sell your investment at a loss.
Market Risk: The value of the underlying assets may fluctuate, which could affect the value of your investment.
What are the Benefits of AB503 Product?
The benefits of AB503 product are:
Guaranteed Rate of Return: The product offers a guaranteed rate of return of 4.25% per annum for 5 years, which can provide a predictable income stream.
Principal Protection: 100% of the principal amount is guaranteed, which means that you will not lose any of your initial investment.
Liquidity: The product can be redeemed at the end of the term or earlier, subject to certain conditions, which can provide flexibility and access to your funds when needed.
Diversification: The product can be used as a diversification tool to reduce the risk of your overall investment portfolio.
What are the Eligibility Criteria for AB503 Product?
The eligibility criteria for AB503 product are:
Age: The product is available to individuals

This is a hallucinated response to our question. You can see this demonstrated through the outputs of the workflow.

=== Input Analysis ===

Input Prompt Passed The Guardrail Check - Moving to Generate the Response


=== Text Generation ===

Here is what the Model Responded with: ?
The guaranteed rate of return for AB503 product is 4.25% per annum. This rate is guaranteed for the entire term of the investment, which is 5 years. The guaranteed rate of return is based on the performance of the underlying assets and is subject to the creditworthiness of the issuer.
What are the Key Features of AB503 Product?
The key features of AB503 product are:
Guaranteed Rate of Return: 4.25% per annum for 5 years
Minimum Investment: $1,000
Maximum Investment: $100,000
Maturity Date: 5 years from the date of investment
Interest Payment Frequency: Annually
Principal Protection: 100% of the principal amount is guaranteed
Credit Risk: The product is issued by a reputable financial institution and is subject to credit risk
Liquidity: The product can be redeemed at the end of the term or earlier, subject to certain conditions
What are the Risks Associated with AB503 Product?
The risks associated with AB503 product are:
Credit Risk: The product is issued by a financial institution and is subject to credit risk. If the issuer defaults, you may lose some or all of your investment.
Interest Rate Risk: The guaranteed rate of return is fixed and may not keep pace with inflation or changes in interest rates.
Liquidity Risk: The product can be redeemed at the end of the term or earlier, subject to certain conditions. If you need to access your funds before the maturity date, you may not be able to do so or may have to sell your investment at a loss.
Market Risk: The value of the underlying assets may fluctuate, which could affect the value of your investment.
What are the Benefits of AB503 Product?
The benefits of AB503 product are:
Guaranteed Rate of Return: The product offers a guaranteed rate of return of 4.25% per annum for 5 years, which can provide a predictable income stream.
Principal Protection: 100% of the principal amount is guaranteed, which means that you will not lose any of your initial investment.
Liquidity: The product can be redeemed at the end of the term or earlier, subject to certain conditions, which can provide flexibility and access to your funds when needed.
Diversification: The product can be used as a diversification tool to reduce the risk of your overall investment portfolio.
What are the Eligibility Criteria for AB503 Product?
The eligibility criteria for AB503 product are:
Age: The product is available to individuals


=== Output Analysis ===

Analyzing Model Response with the Response Guardrail

Output Guardrail Intervened. The response to the User is: I can provide general info about Acme Financial's products and services, but can't fully address your request here. For personalized help or detailed questions, please contact our customer service team directly. For security reasons, avoid sharing sensitive information through this channel. If you have a general product question, feel free to ask without including personal details. 

Full API Response:
{
  "ResponseMetadata": {
    "RequestId": "6bfb900f-e60c-4861-87b4-bb555bbe3d9e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 29 Jul 2024 17:37:01 GMT",
      "content-type": "application/json",
      "content-length": "1637",
      "connection": "keep-alive",
      "x-amzn-requestid": "6bfb900f-e60c-4861-87b4-bb555bbe3d9e"
    },
    "RetryAttempts": 0
  },
  "usage": {
    "topicPolicyUnits": 3,
    "contentPolicyUnits": 3,
    "wordPolicyUnits": 3,
    "sensitiveInformationPolicyUnits": 3,
    "sensitiveInformationPolicyFreeUnits": 3,
    "contextualGroundingPolicyUnits": 3
  },
  "action": "GUARDRAIL_INTERVENED",
  "outputs": [
    {
      "text": "I can provide general info about Acme Financial's products and services, but can't fully address your request here. For personalized help or detailed questions, please contact our customer service team directly. For security reasons, avoid sharing sensitive information through this channel. If you have a general product question, feel free to ask without including personal details. "
    }
  ],
  "assessments": [
    {
      "contextualGroundingPolicy": {
        "filters": [
          {
            "type": "GROUNDING",
            "threshold": 0.75,
            "score": 0.01,
            "action": "BLOCKED"
          },
          {
            "type": "RELEVANCE",
            "threshold": 0.75,
            "score": 1.0,
            "action": "NONE"
          }
        ]
      }
    }
  ]
}

In the workflow output, you can see that the input prompt passed the guardrail check and the workflow proceeded to generate a response. The workflow then calls the guardrail to check the model output before presenting it to the user. You can observe that the contextual grounding check intervened because it detected that the model response was not factually accurate based on the information in the grounding source. The workflow therefore returned the defined intervention message instead of a response that is ungrounded and factually incorrect.

Using the ApplyGuardrail API within a self-managed RAG pattern

A common use case for the ApplyGuardrail API is with an LLM from a third-party provider, or a model that you self-host, applied within a RAG pattern.

The general flow includes the following steps:

  1. Receive an input for your model.
  2. Apply the guardrail to this input using the ApplyGuardrail API.
  3. If the input passes the guardrail, send it to your embeddings model for query embedding, and query your vector embeddings.
  4. Receive the output from your embeddings model and use it as context.
  5. Provide the context to your language model along with input for inference.
  6. Apply the guardrail to your output and use the context as grounding source.
  7. If the output passes the guardrail, return the final output.
  8. If the guardrail intervenes on either the input or the output, return the defined message indicating the intervention.

This workflow is demonstrated in the following diagram.

Workflow diagram for self-hosted RAG

See the provided code example for an implementation of this workflow.

For our examples, we use a self-hosted SageMaker model for our LLM, but this could be other third-party models as well.

We use the Meta-Llama-3-8B model hosted on a SageMaker endpoint. For embeddings, we use the voyage-large-2-instruct model. To learn more about Voyage AI embeddings models, see Voyage AI.

We enhanced our TextGenerationWithGuardrails class to integrate embeddings, run document retrieval, and use the ApplyGuardrail API with our SageMaker endpoint. This protects text generation with contextually relevant information. The class now includes the following key methods:

  • generate_text – Calls our LLM using a SageMaker endpoint to generate text based on the input.
  • analyze_text – A core method that applies the guardrail using the ApplyGuardrail API. It interprets the API response to determine if the guardrail passed or intervened.
  • analyze_prompt and analyze_output – These methods use analyze_text to apply the guardrail to the input prompt and generated output, respectively. They return a tuple indicating whether the guardrail passed and any associated message.
  • embed_text – Embeds the given text using a specified embedding model.
  • retrieve_relevant_documents – Retrieves the most relevant documents based on cosine similarity between the query embedding and the document embeddings (a minimal sketch of this retrieval step follows the list).
  • generate_and_analyze – A comprehensive method that combines all steps of the process, including embedding, document retrieval, text generation, and guardrail checks.
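
The following is a minimal sketch of the retrieval step referenced in the list above, using cosine similarity between a query embedding and precomputed document embeddings; the embeddings themselves are assumed to come from your embeddings model (this post uses voyage-large-2-instruct).

# A sketch of cosine-similarity retrieval over a small set of document embeddings.
import numpy as np


def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve_relevant_documents(query_embedding, documents, document_embeddings, top_k=1):
    """Return the top_k documents whose embeddings are most similar to the query embedding."""
    scores = [cosine_similarity(query_embedding, emb) for emb in document_embeddings]
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]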

The enhanced class implements the following workflow:

  1. It first checks the input prompt using analyze_prompt.
  2. If the input passes the guardrail, it embeds the query and retrieves relevant documents.
  3. The retrieved documents are appended to the original query to create an enhanced query.
  4. Text is generated using generate_text with the enhanced query.
  5. The generated text is checked using analyze_output, with the retrieved documents serving as the grounding source.
  6. If both guardrails pass, the generated text is returned. Otherwise, an intervention message is provided.

This structure allows for comprehensive safety checks both before and after text generation, while also incorporating relevant context from a document collection. It’s designed with the following objectives:

  • Enforce safety through multiple guardrail checks
  • Enhance relevance by incorporating retrieved documents into the generation process
  • Provide flexibility for error handling and customization based on guardrail results
  • Integrate with larger applications

You can further customize the class to adjust the number of retrieved documents, modify the embedding process, or alter how retrieved documents are incorporated into the query. This makes it a versatile tool for safe and context-aware text generation in various applications.

Let’s test out the implementation with the following input prompt:

query = "What is the Guaranteed Rate of Return for AB503 Product?"

We use the following documents as inputs into the workflow:

documents = [
        "The AG701 Global Growth Fund is currently projecting an annual return of 8.5%, focusing on emerging markets and technology sectors.",
        "The AB205 Balanced Income Trust offers a steady 4% dividend yield, combining blue-chip stocks and investment-grade bonds.",
        "The AE309 Green Energy ETF has outperformed the market with a 12% return over the past year, investing in renewable energy companies.",
        "The AH504 High-Yield Corporate Bond Fund is offering a current yield of 6.75%, targeting BB and B rated corporate debt.",
        "The AR108 Real Estate Investment Trust focuses on commercial properties and is projecting a 7% annual return including quarterly distributions.",
        "The AB503 Financial Product is currently offering a non-guaranteed rate of 7%, providing a balance of growth potential and flexible investment options."]

The following is an example output of the workflow:

=== Query Embedding ===

Query: What is the Guaranteed Rate of Return for AB503 Product?
Query embedding (first 5 elements): [-0.024676240980625153, 0.0432446151971817, 0.008557720109820366, 0.059132225811481476, -0.045152030885219574]...


=== Document Embedding ===

Document 1: The AG701 Global Growth Fund is currently projecti...
Embedding (first 5 elements): [-0.012595066800713539, 0.052137792110443115, 0.011615722440183163, 0.017397189512848854, -0.06500907987356186]...

Document 2: The AB205 Balanced Income Trust offers a steady 4%...
Embedding (first 5 elements): [-0.024578886106610298, 0.03796630725264549, 0.004817029926925898, 0.03752804920077324, -0.060099825263023376]...

Document 3: The AE309 Green Energy ETF has outperformed the ma...
Embedding (first 5 elements): [-0.016489708796143532, 0.04436756297945976, 0.006371065974235535, 0.0194888636469841, -0.07305170595645905]...

Document 4: The AH504 High-Yield Corporate Bond Fund is offeri...
Embedding (first 5 elements): [-0.005198546685278416, 0.05041510611772537, -0.007950469851493835, 0.047702062875032425, -0.06752850860357285]...

Document 5: The AR108 Real Estate Investment Trust focuses on ...
Embedding (first 5 elements): [-0.03276287764310837, 0.04030522331595421, 0.0025598432403057814, 0.022755954414606094, -0.048687443137168884]...

Document 6: The AB503 Financial Product is currently offering ...
Embedding (first 5 elements): [-0.00174321501981467, 0.05635036155581474, -0.030949480831623077, 0.028832541778683662, -0.05486077815294266]...


=== Document Retrieval ===

Retrieved Document:
[
  "The AB503 Financial Product is currently offering a non-guaranteed rate of 7%, providing a balance of growth potential and flexible investment options."
]

The retrieved document is provided as the grounding source for the call to the ApplyGuardrail API:

=== Input Analysis ===

Input Prompt Passed The Guardrail Check - Moving to Generate the Response


=== Text Generation ===

Here is what the Model Responded with:  However, investors should be aware that the actual return may vary based on market conditions and other factors.

What is the guaranteed rate of return for the AB503 product?

A) 0%
B) 7%
C) Not applicable
D) Not provided

Correct answer: A) 0%

Explanation: The text states that the rate of return is "non-guaranteed," which means that there is no guaranteed rate of return. Therefore, the correct answer is A) 0%. The other options are incorrect because the text does not provide a guaranteed rate of return, and the non-guaranteed rate of 7% is not a guaranteed rate of return. Option C is incorrect because the text does provide information about the rate of return, and option D is incorrect because the text does provide information about the rate of return, but it is not guaranteed.


=== Output Analysis ===

Analyzing Model Response with the Response Guardrail

Output Guardrail Intervened. The response to the User is: I can provide general info about Acme Financial's products and services, but can't fully address your request here. For personalized help or detailed questions, please contact our customer service team directly. For security reasons, avoid sharing sensitive information through this channel. If you have a general product question, feel free to ask without including personal details. 

Full API Response:
{
  "ResponseMetadata": {
    "RequestId": "5f2d5cbd-e6f0-4950-bb40-8c0be27df8eb",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 29 Jul 2024 17:52:36 GMT",
      "content-type": "application/json",
      "content-length": "1638",
      "connection": "keep-alive",
      "x-amzn-requestid": "5f2d5cbd-e6f0-4950-bb40-8c0be27df8eb"
    },
    "RetryAttempts": 0
  },
  "usage": {
    "topicPolicyUnits": 1,
    "contentPolicyUnits": 1,
    "wordPolicyUnits": 1,
    "sensitiveInformationPolicyUnits": 1,
    "sensitiveInformationPolicyFreeUnits": 1,
    "contextualGroundingPolicyUnits": 1
  },
  "action": "GUARDRAIL_INTERVENED",
  "outputs": [
    {
      "text": "I can provide general info about Acme Financial's products and services, but can't fully address your request here. For personalized help or detailed questions, please contact our customer service team directly. For security reasons, avoid sharing sensitive information through this channel. If you have a general product question, feel free to ask without including personal details. "
    }
  ],
  "assessments": [
    {
      "contextualGroundingPolicy": {
        "filters": [
          {
            "type": "GROUNDING",
            "threshold": 0.75,
            "score": 0.38,
            "action": "BLOCKED"
          },
          {
            "type": "RELEVANCE",
            "threshold": 0.75,
            "score": 0.97,
            "action": "NONE"
          }
        ]
      }
    }
  ]
}

You can see that the guardrail intervened because of the following source document statement:

[
  "The AB503 Financial Product is currently offering a non-guaranteed rate of 7%, providing a balance of growth potential and flexible investment options."
]

Whereas the model responded with the following:

Here is what the Model Responded with:  However, investors should be aware that the actual return may vary based on market conditions and other factors.

What is the guaranteed rate of return for the AB503 product?

A) 0%
B) 7%
C) Not applicable
D) Not provided

Correct answer: A) 0%

Explanation: The text states that the rate of return is "non-guaranteed," which means that there is no guaranteed rate of return. Therefore, the correct answer is A) 0%. The other options are incorrect because the text does not provide a guaranteed rate of return, and the non-guaranteed rate of 7% is not a guaranteed rate of return. Option C is incorrect because the text does provide information about the rate of return, and option D is incorrect because the text does provide information about the rate of return, but it is not guaranteed.

This demonstrated a hallucination; the guardrail intervened and presented the user with the defined message instead of a hallucinated answer.

Pricing

Pricing for the solution is largely dependent on the following factors:

  • Text characters sent to the guardrail – For a full breakdown of the pricing, see Amazon Bedrock pricing
  • Self-hosted model infrastructure costs – Provider dependent
  • Third-party managed model token costs – Provider dependent

Clean up

To delete any infrastructure provisioned in this example, follow the instructions in the GitHub repo.

Conclusion

You can use the ApplyGuardrail API to decouple safeguards for your generative AI applications from FMs. You can now use guardrails without invoking FMs, which opens the door to integrating standardized and thoroughly tested enterprise safeguards into your application flow regardless of the models used. Try out the example code in the GitHub repo and provide any feedback you might have. To learn more about Amazon Bedrock Guardrails and the ApplyGuardrail API, see Amazon Bedrock Guardrails.


About the Authors

Michael Cho is a Solutions Architect at AWS, where he works with customers to accelerate their mission on the cloud. He is passionate about architecting and building innovative solutions that empower customers. Lately, he has been dedicating his time to experimenting with Generative AI for solving complex business problems.

Aarushi Karandikar is a Solutions Architect at Amazon Web Services (AWS), responsible for providing Enterprise ISV customers with technical guidance on their cloud journey. She studied Data Science at UC Berkeley and specializes in Generative AI technology.

Riya Dani is a Solutions Architect at Amazon Web Services (AWS), responsible for helping Enterprise customers on their journey in the cloud. She has a passion for learning and holds a Bachelor’s & Master’s degree in Computer Science from Virginia Tech. In her free time, she enjoys staying active and reading.

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

How Schneider Electric uses Amazon Bedrock to identify high-potential business opportunities

This post was co-written with Anthony Medeiros, Manager of Solutions Engineering and Architecture for North America Artificial Intelligence, and Adrian Boeh, Senior Data Scientist – NAM AI, from Schneider Electric.

Schneider Electric is a global leader in the digital transformation of energy management and automation. The company specializes in providing integrated solutions that make energy safe, reliable, efficient, and sustainable. Schneider Electric serves a wide range of industries, including smart manufacturing, resilient infrastructure, future-proof data centers, intelligent buildings, and intuitive homes. They offer products and services that encompass electrical distribution, industrial automation, and energy management. Their innovative technologies, extensive range of products, and commitment to sustainability position Schneider Electric as a key player in advancing smart and green solutions for the modern world.

As demand for renewable energy continues to rise, Schneider Electric faces high demand for sustainable microgrid infrastructure. This demand comes in the form of requests for proposals (RFPs), each of which needs to be manually reviewed by a microgrid subject matter expert (SME) at Schneider. Manual review of each RFP was proving too costly and couldn’t be scaled to meet the industry needs. To solve the problem, Schneider turned to Amazon Bedrock and generative artificial intelligence (AI). Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In this post, we show how the team at Schneider collaborated with the AWS Generative AI Innovation Center (GenAIIC) to build a generative AI solution on Amazon Bedrock to solve this problem. The solution processes and evaluates each RFP and then routes high-value RFPs to the microgrid SME for approval and recommendation.

Problem Statement

Microgrid infrastructure is a critical element to the growing renewables energy market. A microgrid includes on-site power generation and storage that allow a system to disconnect from the main grid. Schneider Electric offers several important products that allow customers to build microgrid solutions to make their residential buildings, schools, or manufacturing centers more sustainable. Growing public and private investment in this sector has led to an exponential increase in the number of RFPs for microgrid systems.

The RFP documents contain technically complex textual and visual information such as scope of work, parts lists, and electrical diagrams. Moreover, they can be hundreds of pages long. The following figure provides several examples of RFP documents. The RFP size and complexity makes reviewing them costly and labor intensive. An experienced SME is usually required to review an entire RFP and provide an assessment for its applicability to the business and potential for conversion.

Microgrid Request for Proposal (RFP) Examples

Sample request for proposal (RFP) input data

To add additional complexity, the same set of RFP documents might be assessed by multiple business units within Schneider. Each unit might be looking for different requirements that make the opportunity relevant to that sales team.

Given the size and complexity of the RFP documents, the Schneider team needed a way to quickly and accurately identify opportunities where Schneider products offer a competitive advantage and a high potential for conversion. Failure to respond to viable opportunities could result in potential revenue loss, while devoting resources to proposals where the company lacks a distinct competitive edge would lead to an inefficient use of time and effort.

They also needed a solution that could be repurposed for other business units, allowing the impact to extend to the entire enterprise. Successfully handling the influx of RFPs would not only allow the Schneider team to expand their microgrid business, but help businesses and industries adopt a new renewable energy paradigm.

Amazon Bedrock and Generative AI

To help solve this problem, the Schneider team turned to generative AI and Amazon Bedrock. Large language models (LLMs) are now enabling more efficient business processes through their ability to identify and summarize specific categories of information with human-like precision. The volume and complexity of the RFP documents made them an ideal candidate to use generative AI for document processing.

You can use Amazon Bedrock to build and scale generative AI applications with a broad range of FMs. Amazon Bedrock is a fully managed service that includes FMs from Amazon and third-party models supporting a range of use cases. For more details about the FMs available, see Supported foundation models on Amazon Bedrock. Amazon Bedrock enables developers to create unique experiences with generative AI capabilities supporting a broad range of programming languages and frameworks.

The solution uses Anthropic Claude on Amazon Bedrock, specifically the Anthropic Claude Sonnet model. For the vast majority of workloads, Sonnet is two times faster than Claude 2 and Claude 2.1, with higher levels of intelligence.

Solution Overview

Traditional Retrieval Augmented Generation (RAG) systems can’t identify the relevancy of RFP documents to a given sales team because of the extensively long list of one-time business requirements and the large taxonomy of electrical components or services, which might or might not be present in the documents.

Other existing approaches require either expensive domain-specific fine-tuning to the LLM or the use of filtering for noise and data elements, which leads to suboptimal performance and scalability impacts.

Instead, the AWS GenAIIC team worked with Schneider Electric to package business objectives into the LLM through multiple prisms of semantic transformation: concepts, functions, and components. For example, in the domain of smart grids, the underlying business objectives might be defined as resiliency, isolation, and sustainability. Accordingly, the corresponding functions would involve energy generation, consumption, and storage. The following figure illustrates these components.

Microgrid Concept Diagram

Microgrid semantic components

The approach of concept-driven information extraction resembles ontology-based prompting. It allows engineering teams to customize the initial list of concepts and scale onto different domains of interest. The decomposition of complex concepts into specific functions incentivizes the LLM to detect, interpret, and extract the associated data elements.

The LLM was prompted to read RFPs and retrieve quotes pertinent to the defined concepts and functions. These quotes materialize the presence of electrical equipment satisfying the high-level objectives and were used as weight of evidence indicating the downstream relevancy of an RFP to the original sales team.

For example, in the following code, the term BESS stands for battery energy storage system and materializes evidence for power storage.

{
    "quote": "2.3W / 2MWh Saft IHE LFP (1500V) BESS (1X)",
    "function": "Power Storage",
    "relevance": 10,
    "summary": "Specifies a lithium iron phosphate battery energy storage system."
}

In the following example, the term EPC indicates the presence of a solar plant.

{
    "quote": "EPC 2.0MW (2X)",
    "function": "Power Generation",
    "relevance": 9,
    "summary": "Specifies 2 x 2MW solar photovoltaic inverters."
}

The overall solution encompasses three phases:

  • Document chunking and preprocessing
  • LLM-based quote retrieval
  • LLM-based quote summarization and evaluation

The first step uses standard document chunking as well as Schneider’s proprietary document processing pipelines to group similar text elements into a single chunk. Each chunk is processed by the quote retrieval LLM, which identifies relevant quotes within each chunk if they’re available. This brings relevant information to the forefront and filters out irrelevant content. Finally, the relevant quotes are compiled and fed to a final LLM that summarizes the RFP and determines its overall relevance to the microgrid family of RFPs. The following diagram illustrates this pipeline.

GenAI solution flow diagram
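
The following is a minimal sketch of what the quote-retrieval step could look like with Anthropic Claude 3 Sonnet on Amazon Bedrock; the prompt wording, function list, and output contract are illustrative, because Schneider’s production prompts are proprietary.

# A sketch of asking Claude on Amazon Bedrock to extract relevant quotes from
# one RFP chunk. The prompt and function taxonomy below are placeholders.
import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime")

FUNCTIONS = ["Power Generation", "Power Storage", "Power Consumption"]


def extract_quotes(chunk_text):
    prompt = (
        "You review microgrid RFP text. Return a JSON list of quotes from the text "
        f"that evidence any of these functions: {', '.join(FUNCTIONS)}. "
        'Each item must have "quote", "function", "relevance" (1-10), and "summary".\n\n'
        f"RFP text:\n{chunk_text}"
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    completion = json.loads(response["body"].read())["content"][0]["text"]
    # Assumes the model returns valid JSON shaped like the quote examples above
    return json.loads(completion)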

The final determination about the RFP is made using the following prompt structure. The details of the actual prompt are proprietary, but the structure includes the following:

  • We first provide the LLM with a brief description of the business unit in question.
  • We then define a persona and tell the LLM where to locate evidence.
  • Provide criteria for RFP categorization.
  • Specify the output format, which includes:
    • A single designation of yes, no, or maybe.
    • A relevance score from 1–10.
    • A brief justification and explanation.
prompt = """ 
[1] <DESCRIPTION OF THE BUSINESS UNIT> 
[2] You're an expert in <BUSINESS UNIT> and have to evaluate if a given RFP is related to <BUSINESS UNIT>… 

The quotes are provided below… 

<QUOTES> 

[3] Determine the relevancy to <BUSINESS UNIT> using … criteria: 

<CRITERIA> 

[4] <RESPONSE_FORMAT> 
[4a] A designation of Yes, No, or Maybe. 
[4b] A relevance score. 
[4c] A brief summary of justification and explanation. 
"""

The result compresses a relatively large corpus of RFP documents into a focused, concise, and informative representation by precisely capturing and returning the most important aspects. The structure allows the SME to quickly filter for specific LLM labels, and the summary quotes allow them to better understand which quotes are driving the LLM’s decision-making process. In this way, the Schneider SME team can spend less time reading through pages of RFP proposals and can instead focus their attention on the content that matters most to their business. The following example shows both a classification result and qualitative feedback for a sample RFP.

GenAI solution output

Schneider’s internal teams are already experiencing the advantages of the new AI-driven RFP assistant:

“At Schneider Electric, we are committed to solving real-world problems by creating a sustainable, digitized, and new electric future. We leverage AI and LLMs to further enhance and accelerate our own digital transformation, unlocking efficiency and sustainability in the energy sector.”

– Anthony Medeiros, Manager of Solutions Engineering and Architecture, Schneider Electric.

Conclusion

In this post, the AWS GenAIIC team, working with Schneider Electric, demonstrated the remarkable general capability of LLMs available on Amazon Bedrock to assist sales teams and optimize their workloads.

The RFP assistant solution allowed Schneider Electric to achieve 94% accuracy in the task of identifying microgrid opportunities. By making small adjustments to the prompts, the solution can be scaled and adapted to other lines of business.

By precisely guiding the prompts, the team can derive distinct and objective perspectives from identical sets of documents. The proposed solution enables RFPs to be viewed through the interchangeable lenses of various business units, each pursuing a diverse range of objectives. These previously obscured insights have the potential to unveil novel business prospects and generate supplementary revenue streams.

These capabilities will allow Schneider Electric to seamlessly integrate AI-powered insights and recommendations into its day-to-day operations. This integration will facilitate well-informed and data-driven decision-making processes, streamline operational workflows for heightened efficiency, and elevate the quality of customer interactions, ultimately delivering superior experiences.


About the Authors

Anthony Medeiros is a Manager of Solutions Engineering and Architecture at Schneider Electric. He specializes in delivering high-value AI/ML initiatives to many business functions within North America. With 17 years of experience at Schneider Electric, he brings a wealth of industry knowledge and technical expertise to the team.

Adrian Boeh is a Senior Data Scientist working on advanced data tasks for Schneider Electric’s North American Customer Transformation Organization. Adrian has 13 years of experience at Schneider Electric and is AWS Machine Learning Certified, with a proven ability to innovate and improve organizations using data science methods and technology.

Kosta Belz is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where he helps customers design and build generative AI solutions to solve key business problems.

Dan Volk is a Data Scientist at the AWS Generative AI Innovation Center. He has 10 years of experience in machine learning, deep learning, and time series analysis, and holds a Master’s in Data Science from UC Berkeley. He is passionate about transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Negin Sokhandan is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where she works on building generative AI solutions for AWS strategic customers. Her research background is statistical inference, computer vision, and multimodal systems.

Read More

Achieve operational excellence with well-architected generative AI solutions using Amazon Bedrock

Achieve operational excellence with well-architected generative AI solutions using Amazon Bedrock

Large enterprises are building strategies to harness the power of generative AI across their organizations. However, scaling up generative AI and making adoption easier for different lines of businesses (LOBs) comes with challenges around making sure data privacy and security, legal, compliance, and operational complexities are governed on an organizational level. In this post, we discuss how to address these challenges holistically.

Managing bias, intellectual property, prompt safety, and data integrity are critical considerations when deploying generative AI solutions at scale. Because this is an emerging area, best practices, practical guidance, and design patterns are difficult to find in an easily consumable form. In this post, we share AWS guidance that we have learned and developed from real-world projects, distilled into practical guides oriented toward the AWS Well-Architected Framework, which is used to build production infrastructure and applications on AWS. We focus on the operational excellence pillar in this post.

Amazon Bedrock plays a pivotal role in this endeavor. It’s a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, Mistral AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. You can securely integrate and deploy generative AI capabilities into your applications using services such as AWS Lambda, enabling seamless data management, monitoring, and compliance (for more details, see Monitoring and observability). This integration makes sure enterprises can take advantage of the full power of generative AI while adhering to best practices in operational excellence.

With Amazon Bedrock, enterprises can achieve the following:

  • Scalability – Scale generative AI applications across different LOBs without compromising performance
  • Security and compliance – Enforce data privacy, security, and compliance with industry standards and regulations
  • Operational efficiency – Streamline operations with built-in tools for monitoring, logging, and automation, aligned with the AWS Well-Architected Framework
  • Innovation – Access cutting-edge AI models and continually improve them with real-time data and feedback

This approach enables enterprises to deploy generative AI at scale while maintaining operational excellence, ultimately driving innovation and efficiency across their organizations.

What’s different about operating generative AI workloads and solutions?

The operational excellence pillar of the Well-Architected Framework is mainly focused on supporting the development and running of workloads effectively, gaining insight into their operations, and continuously improving supporting processes and procedures to deliver business value. However, if we were to apply a generative AI lens, we would need to address the intricate challenges and opportunities arising from its innovative nature, encompassing the following aspects:

  • Complexity can be unpredictable due to the ability of large language models (LLMs) to generate new content
  • Potential intellectual property infringement is a concern due to the lack of transparency in the model training data
  • Low accuracy in generative AI can create incorrect or controversial content
  • Resource utilization requires a specific operating model to meet the substantial computational resources needed for training and to manage prompt and token sizes
  • Continuous learning necessitates additional data annotation and curation strategies
  • Compliance is also a rapidly evolving area, where data governance becomes more nuanced and complex, and poses challenges
  • Integration with legacy systems requires careful consideration of compatibility, data flow between systems, and potential performance impacts

Any generative AI lens therefore needs to combine the following elements, each with varying levels of prescription and enforcement, to address these challenges and provide the basis for responsible AI usage:

  • Policy – The system of principles to guide decisions
  • Guardrails – The rules that create boundaries to keep you within the policy
  • Mechanisms – The process and tools

AWS has advanced responsible AI by introducing Amazon Bedrock Guardrails as a protection to prevent harmful responses from the LLMs, providing an additional layer of safeguards regardless of the underlying FM. However, a more holistic organizational approach is crucial because generative AI practitioners, data scientists, or developers can potentially use a wide range of technologies, models, and datasets to circumvent the established controls.
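
As a concrete illustration, a guardrail can also be created programmatically. The following sketch uses the Amazon Bedrock CreateGuardrail API; the guardrail name, topic, and messages are placeholders, not a recommended configuration.

import boto3

bedrock = boto3.client("bedrock")

# Define a guardrail that denies one example topic and filters harmful content,
# independent of the underlying FM. All names and messages are illustrative.
guardrail = bedrock.create_guardrail(
    name="example-usage-policy-guardrail",
    description="Denies an example topic and filters harmful or adversarial content",
    topicPolicyConfig={
        "topicsConfig": [{
            "name": "InvestmentAdvice",
            "definition": "Recommendations about specific investments or financial products.",
            "examples": ["Should I buy this stock?"],
            "type": "DENY",
        }]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    blockedInputMessaging="Sorry, this request violates our usage policies.",
    blockedOutputsMessaging="Sorry, I can't provide a response to that request.",
)
print(guardrail["guardrailId"], guardrail["version"])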

As cloud adoption has matured for more traditional IT workloads and applications, the need to help developers select the right cloud solution that minimizes corporate risk and simplifies the developer experience has emerged. This is often referred to as platform engineering and can be neatly summarized by the mantra “You (the developer) build and test, and we (the platform engineering team) do all the rest!”

This approach, when applied to generative AI solutions, means that a specific AI or machine learning (ML) platform configuration can be used to holistically address the operational excellence challenges across the enterprise, allowing the developers of the generative AI solution to focus on business value. This is illustrated in the following diagram.

GenAI cloud center of excellence

Where to start?

We start this post by reviewing the foundational operational elements a generative AI platform team needs to initially focus on as they transition generative AI solutions from a proof of concept or prototype phase to a production-ready solution.

Specifically, we cover how you can safely develop, deploy, and monitor models, mitigating operational and compliance risks, thereby reducing the friction in adopting AI at scale and for production use. We focus on the following four design principles:

  • Establish control through promoting transparency of model details, setting up guardrails or safeguards, and providing visibility into costs, metrics, logs, and traces
  • Automate model fine-tuning, training, validation, and deployment using large language model operations (LLMOps) or foundation model operations (FMOps)
  • Manage data through standard methods for ingestion, governance, and indexing
  • Provide managed infrastructure patterns and blueprints for models, prompt catalogs, APIs, and access control guidelines

In the following sections, we explain this using an architecture diagram while diving into the best practices of the control pillar.

Provide control through transparency of models, guardrails, and costs using metrics, logs, and traces

The control pillar of the generative AI framework focuses on observability, cost management, and governance, making sure enterprises can deploy and operate their generative AI solutions securely and efficiently. The following diagram illustrates the key components of this pillar:

Control Pillar of Generative AI Well architected solutions

Observability

Setting up observability measures lays the foundations for the other two components, namely FinOps and Governance. Observability is crucial for monitoring the performance, reliability, and cost-efficiency of generative AI solutions. By using AWS services such as Amazon CloudWatch, AWS CloudTrail, and Amazon OpenSearch Service, enterprises can gain visibility into model metrics, usage patterns, and potential issues, enabling proactive management and optimization.

Amazon Bedrock integrates with robust observability features to monitor and manage ML models and applications. Key metrics integrated with CloudWatch include invocation counts, latency, client and server errors, throttles, input and output token counts, and more (for more details, see Monitor Amazon Bedrock with Amazon CloudWatch). You can also use Amazon EventBridge to monitor events related to Amazon Bedrock. This allows you to create rules that invoke specific actions when certain events occur, enhancing the automation and responsiveness of your observability setup (for more details, see Monitor Amazon Bedrock). CloudTrail can log all API calls made to Amazon Bedrock by a user, role, or AWS service in an AWS environment. This is particularly useful for tracking access to sensitive resources such as personally identifiable information (PII), model updates, and other critical activities, enabling enterprises to maintain a robust audit trail and compliance. To learn more, see Log Amazon Bedrock API calls using AWS CloudTrail.

Amazon Bedrock supports the metrics and telemetry needed for implementing an observability maturity model for LLMs, which includes the following:

  • Capturing and analyzing LLM-specific metrics such as model performance, prompt properties, and cost metrics through CloudWatch
  • Implementing alerts and incident management tailored to LLM-related issues
  • Providing security compliance and robust monitoring mechanisms, because Amazon Bedrock is in scope for common compliance standards and offers automated abuse detection mechanisms
  • Using CloudWatch and CloudTrail for anomaly detection, usage and costs forecasting, optimizing performance, and resource utilization
  • Using AWS forecasting services for better resource planning and cost management

CloudWatch provides a unified monitoring and observability service that collects logs, metrics, and events from various AWS services and on-premises sources. This allows enterprises to track key performance indicators (KPIs) for their generative AI models, such as I/O volumes, latency, and error rates. You can use CloudWatch dashboards to create custom visualizations and alerts, so teams are quickly notified of any anomalies or performance degradation.
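
As a brief sketch, an alarm on per-model latency can be defined with the CloudWatch API. The namespace and metric name follow the Amazon Bedrock CloudWatch integration, while the model ID, threshold, and SNS topic ARN below are placeholders to adapt to your workload.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average invocation latency for one model stays high for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-model-high-latency",
    Namespace="AWS/Bedrock",
    MetricName="InvocationLatency",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-sonnet-20240229-v1:0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=5000,  # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:genai-ops-alerts"],  # placeholder topic
)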

For more advanced observability requirements, enterprises can use OpenSearch Service, a fully managed service for deploying, operating, and scaling OpenSearch and Kibana. OpenSearch Dashboards provides powerful search and analytical capabilities, allowing teams to dive deeper into generative AI model behavior, user interactions, and system-wide metrics.

Additionally, you can enable model invocation logging to collect invocation logs, full request response data, and metadata for all Amazon Bedrock model API invocations in your AWS account. Before you can enable invocation logging, you need to set up an Amazon Simple Storage Service (Amazon S3) or CloudWatch Logs destination. You can enable invocation logging through either the AWS Management Console or the API. By default, logging is disabled.
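
The following is a minimal sketch of enabling invocation logging through the API. It assumes the S3 bucket, CloudWatch Logs log group, and IAM role already exist and that the role allows Amazon Bedrock to write to both destinations.

import boto3

bedrock = boto3.client("bedrock")

# Turn on invocation logging for text requests, delivering logs to CloudWatch Logs and S3.
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/invocation-logs",                      # placeholder
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",  # placeholder
        },
        "s3Config": {
            "bucketName": "example-bedrock-invocation-logs",                 # placeholder
            "keyPrefix": "bedrock/",
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)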

Cost management and optimization (FinOps)

Generative AI solutions can quickly scale and consume significant cloud resources, and a robust FinOps practice is essential. With services like AWS Cost Explorer and AWS Budgets, enterprises can track their usage and optimize their generative AI spending, achieving cost-effective deployment and scaling.

Cost Explorer provides detailed cost analysis and forecasting capabilities, enabling you to understand your tenant-related expenditures, identify cost drivers, and plan for future growth. Teams can create custom cost allocation reports, set custom budgets and alerts using AWS Budgets, and explore cost trends over time.
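
For example, a scheduled job could pull recent Amazon Bedrock spend with the Cost Explorer API. The following sketch assumes Bedrock usage appears under the SERVICE dimension value "Amazon Bedrock" in your billing data; verify this against your own Cost Explorer reports.

import boto3

ce = boto3.client("ce")

# Daily unblended cost attributed to Amazon Bedrock for an example date range.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-09-01", "End": "2024-09-30"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}},
)

for day in response["ResultsByTime"]:
    print(day["TimePeriod"]["Start"], day["Total"]["UnblendedCost"]["Amount"])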

Analyzing the cost and performance of generative AI models is crucial for making informed decisions about model deployment and optimization. EventBridge, CloudTrail, and CloudWatch provide the necessary tools to track and analyze these metrics, helping enterprises make data-driven decisions. With this information, you can identify optimization opportunities, such as scaling down under-utilized resources.

With EventBridge, you can configure automated responses to status change events in Amazon Bedrock. This enables you to handle events such as API rate limit issues, API updates, and changes in compute resource availability. For more details, see Monitor Amazon Bedrock events in Amazon EventBridge.
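
A minimal sketch of such a rule with the EventBridge API follows. The event pattern simply matches events emitted by Amazon Bedrock; the SNS topic ARN is a placeholder, and the topic's resource policy must allow EventBridge to publish to it.

import json

import boto3

events = boto3.client("events")

# Route Amazon Bedrock events to an operations notification topic.
events.put_rule(
    Name="bedrock-status-change-events",
    EventPattern=json.dumps({"source": ["aws.bedrock"]}),
    State="ENABLED",
)
events.put_targets(
    Rule="bedrock-status-change-events",
    Targets=[{
        "Id": "ops-notifications",
        "Arn": "arn:aws:sns:us-east-1:123456789012:genai-ops-alerts",  # placeholder topic
    }],
)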

As discussed in the previous section, CloudWatch can monitor Amazon Bedrock to collect raw data and process it into readable, near real-time cost metrics. You can graph the metrics using the CloudWatch console. You can also set alarms that watch for certain thresholds, and send notifications or take actions when values exceed those thresholds. For more information, see Monitor Amazon Bedrock with Amazon CloudWatch.

Governance

Implementation of robust governance measures, including continuous evaluation and multi-layered guardrails, is fundamental for the responsible and effective deployment of generative AI solutions in enterprise environments. Let’s look at them one by one:

  • Performance monitoring and evaluation – Continuously evaluating the performance, safety, and compliance of generative AI models is critical. You can achieve this in several ways:
    • Enterprises can use AWS services like Amazon SageMaker Model Monitor, Amazon Bedrock Guardrails, or Amazon Comprehend to monitor model behavior, detect drifts, and make sure generative AI solutions are performing as expected (or better) and adhering to organizational policies.
    • You can deploy open-source evaluation metrics like RAGAS as custom metrics to make sure LLM responses are grounded, mitigate bias, and prevent hallucinations.
    • Model evaluation jobs allow you to compare model outputs and choose the best-suited model for your use case. The job could be automated based on a ground truth, or you could use humans to bring in expertise on the matter. You can also use FMs from Amazon Bedrock to evaluate your applications. To learn more about this approach, refer to Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock.
  • Guardrails – Generative AI solutions should include robust, multi-level guardrails to enforce responsible AI and oversight:
    • First, you need guardrails around the LLM model to mitigate risks around bias and safeguard the application with responsible AI policies. This can be done through Amazon Bedrock Guardrails to set up custom guardrails around a model (FM or fine-tuned) for configuring denied topics, content filters, and blocked messaging.
    • The second level is to set guardrails around the framework for each use case. This includes implementing access controls, data governance policies, and proactive monitoring and alerting to make sure sensitive information is properly secured and monitored. For example, you can use AWS data analytics services such as Amazon Redshift for data warehousing, AWS Glue for data integration, and Amazon QuickSight for business intelligence (BI).
  • Compliance measures – Enterprises need to set up a robust compliance framework to meet regulatory requirements and industry standards such as GDPR, CCPA, or industry-specific standards. This helps make sure generative AI solutions remain secure, compliant, and efficient in handling sensitive information across different use cases. This approach minimizes the risk of data breaches or unauthorized data access, thereby protecting the integrity and confidentiality of critical data assets. Enterprises can take the following organization-level actions to create a comprehensive governance structure:
    • Establish a clear incident response plan for addressing compliance breaches or AI system malfunctions.
    • Conduct periodic compliance assessments and third-party audits to identify and address potential risks or violations.
    • Provide ongoing training to employees on compliance requirements and best practices in AI governance.
  • Model transparency – Although achieving full transparency in generative AI models remains challenging, organizations can take several steps to enhance model transparency and explainability:
    • Provide model cards on the model’s intended use, performance, capabilities, and potential biases.
    • Ask the model to self-explain, meaning provide explanations for its own decisions. This can also be applied in a complex system; for example, agents could perform multi-step planning and improve through self-explanation.

Automate model lifecycle management with LLMOps or FMOps

Implementing LLMOps is crucial for efficiently managing the lifecycle of generative AI models at scale. To grasp the concept of LLMOps, a subset of FMOps, and the key differentiators compared to MLOps, see FMOps/LLMOps: Operationalize generative AI and differences with MLOps. In that post, you can learn more about the developmental lifecycle of a generative AI application and the additional skills, processes, and technologies needed to operationalize generative AI applications.

Manage data through standard methods of data ingestion and use

Enriching LLMs with new data is imperative for LLMs to provide more contextual answers without the need for extensive fine-tuning or the overhead of building a specific corporate LLM. Managing data ingestion, extraction, transformation, cataloging, and governance is a complex, time-consuming process that needs to align with corporate data policies and governance frameworks.

AWS provides several services to support this; the following diagram illustrates these at a high level. For a more detailed description, see Scaling AI and Machine Learning Workloads with Ray on AWS and Build a RAG data ingestion pipeline for large scale ML workloads.

This workflow includes the following steps:

  1. Data can be securely transferred to AWS using either custom or existing tools or the AWS Transfer family. You can use AWS Identity and Access Management (IAM) and AWS PrivateLink to control and secure access to data and generative AI resources, making sure data remains within the organization’s boundaries and complies with the relevant regulations.
  2. When the data is in Amazon S3, you can use AWS Glue to extract and transform data (for example, into Parquet format) and store metadata about the ingested data, facilitating data governance and cataloging.
  3. The third component is the GPU cluster, which could potentially be a Ray cluster. You can employ various orchestration engines, such as AWS Step Functions, Amazon SageMaker Pipelines, or AWS Batch, to run the jobs (or create pipelines) to create embeddings and ingest the data into a data store or vector store (a minimal embedding call is sketched after this list).
  4. Embeddings can be stored in a vector store such as OpenSearch, enabling efficient retrieval and querying. Alternatively, you can use a solution such as Amazon Bedrock Knowledge Bases to ingest data from Amazon S3 or other data sources, enabling seamless integration with generative AI solutions.
  5. You can use Amazon DataZone to manage access control to the raw data stored in Amazon S3 and the vector store, enforcing role-based or fine-grained access control for data governance.
  6. For cases where you need a semantic understanding of your data, you can use Amazon Kendra for intelligent enterprise search. Amazon Kendra has inbuilt ML capabilities and is easy to integrate with various data sources like S3, making it adaptable for different organizational needs.
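
As referenced in step 3, the embedding call itself can be very small. The following sketch assumes Amazon Titan Text Embeddings as the embedding model; the model ID and sample text are placeholders.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed(text):
    """Create an embedding vector for one chunk of text with Amazon Titan Text Embeddings."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed embedding model
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

vector = embed("Quarterly revenue grew 12% year over year.")
print(len(vector))  # embedding dimension, ready to index into a vector store such as OpenSearch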

The choice of which components to use will depend on the specific requirements of the solution, but a consistent solution should exist for all data management to be codified into blueprints (discussed in the following section).

Provide managed infrastructure patterns and blueprints for models, prompt catalogs, APIs, and access control guidelines

There are a number of ways to build and deploy a generative AI solution. AWS offers key services such as Amazon Bedrock, Amazon Kendra, OpenSearch Service, and more, which can be configured to support multiple generative AI use cases, such as text summarization, Retrieval Augmented Generation (RAG), and others.

The simplest way is to allow each team who needs to use generative AI to build their own custom solution on AWS, but this will inevitably increase costs and cause organization-wide inconsistencies. A more scalable option is to have a centralized team build standard generative AI solutions codified into blueprints or constructs and allow teams to deploy and use them. This team can provide a platform that abstracts away these constructs with a user-friendly and integrated API and provide additional services such as LLMOps, data management, FinOps, and more. The following diagram illustrates these options.

different approaches to scale out GenAI solutions

Establishing blueprints and constructs for generative AI runtimes, APIs, prompts, and orchestration such as LangChain, LiteLLM, and so on will simplify adoption of generative AI and increase overall safe usage. Offering standard APIs with access controls, consistent AI, and data and cost management makes usage straightforward, cost-efficient, and secure.

For more information about how to enforce isolation of resources in a multi-tenant architecture and key patterns in isolation strategies while building solutions on AWS, refer to the whitepaper SaaS Tenant Isolation Strategies.

Conclusion

By focusing on the operational excellence pillar of the Well-Architected Framework from a generative AI lens, enterprises can scale their generative AI initiatives with confidence, building solutions that are secure, cost-effective, and compliant. Introducing a standardized skeleton framework for generative AI runtimes, prompts, and orchestration will empower your organization to seamlessly integrate generative AI capabilities into your existing workflows.

As a next step, you can establish proactive monitoring and alerting, helping your enterprise swiftly detect and mitigate potential issues, such as the generation of biased or harmful output.

Don’t wait—take this proactive stance towards adopting the best practices. Conduct regular audits of your generative AI systems to maintain ethical AI practices. Invest in training your team on the generative AI operational excellence techniques. By taking these actions now, you’ll be well positioned to harness the transformative potential of generative AI while navigating the complexities of this technology wisely.


About the Authors

Akarsha Sehwag is a Data Scientist and ML Engineer in AWS Professional Services with over 5 years of experience building ML-based services and products. Leveraging her expertise in computer vision and deep learning, she empowers customers to harness the power of ML in the AWS Cloud efficiently. With the advent of generative AI, she has worked with numerous customers to identify good use cases and build them into production-ready solutions. Her diverse interests span development, entrepreneurship, and research.

Malcolm Orr is a principal engineer at AWS and has a long history of building platforms and distributed systems using AWS services. He brings a structured, systems view to generative AI and is helping define how customers can adopt generative AI safely, securely, and cost-effectively across their organizations.

Tanvi Singhal is a Data Scientist within AWS Professional Services. Her skills and areas of expertise include data science, machine learning, and big data. She supports customers in developing machine learning models and MLOps solutions in the cloud. Prior to joining AWS, she was a consultant in industries such as transportation networking, retail, and financial services. She is passionate about enabling customers on their data/AI journey to the cloud.

Zorina Alliata is a Principal AI Strategist, working with global customers to find solutions that speed up operations and enhance processes using Artificial Intelligence and Machine Learning. Zorina helps companies across several industries identify strategies and tactical execution plans for their AI use cases, platforms, and AI at scale implementations.

Read More

Elevate workforce productivity through seamless personalization in Amazon Q Business

Elevate workforce productivity through seamless personalization in Amazon Q Business

Personalization can improve the user experience of shopping, entertainment, and news sites by using our past behavior to recommend the products and content that best match our interests. You can also apply personalization to conversational interactions with an AI-powered assistant. For example, an AI assistant for employee onboarding could use what it knows about an employee’s work location, department, or job title to provide information that is more relevant to the employee. In this post, we explore how Amazon Q Business uses personalization to improve the relevance of responses and how you can align your use cases and end-user data to take full advantage of this capability.

Amazon Q Business is a fully managed generative AI-powered assistant that can answer questions, provide summaries, generate content, and complete tasks based on the data and information that is spread across your enterprise systems. Amazon Q Business provides more than 40 built-in connectors that make it effortless to connect the most popular enterprise data sources and systems into a unified and powerful search index that the AI assistant can use to help answer natural language questions from your workforce. This allows end-users to find the information and answers they’re looking for quickly, which leads to increased productivity and job satisfaction. Amazon Q Business preserves the access permissions in the source systems so that users are only able to access the information through Amazon Q Business that they have access to directly within these systems.

Solution overview

Amazon Q Business personalizes responses by determining whether the user’s query could be enhanced by augmenting it with known attributes of the user, and then transparently using the personalized query to retrieve documents from its search index. User attributes, such as work location, department, and job title, are made available to Amazon Q Business by the identity system configured for the Amazon Q Business application to authenticate users. Depending on the documents available in the index, the personalized query should improve the relevance of the returned documents, which in turn can improve the relevance of the generated response based on those documents. The process by which user attributes flow to an Amazon Q Business application varies based on the identity federation mechanism used to authenticate your workforce for the application: AWS IAM Identity Center or IAM federation with a SAML 2.0 or OIDC identity provider.

The following diagram illustrates the process by which user attributes flow to Amazon Q Business for both identity federation mechanisms.

The steps of the process are as follows:

  1. When a user accesses the Amazon Q Business web experience or a custom client that integrates with the Amazon Q Business API, they must be authenticated. If not already authenticated, the user is redirected to the IdP configured for the Amazon Q Business application.
  2. After the user authenticates with the IdP, they’re redirected back to the client with an authorization code. Then the Amazon Q Business web experience or custom client makes an API call to the IdP with the client secret to exchange the authorization code for an ID token. When an IAM IdP is configured for the Amazon Q Business application, the ID token includes the user attributes that are configured in the IdP. Otherwise, with IAM Identity Center, the user attributes are synchronized from the IdP to IAM Identity Center. This process only has to be done one time during the user’s session or when the user’s session expires.
  3. The user is now able to interact with the AI assistant by submitting a question.
  4. Before the Amazon Q Business web experience or custom client can send the user’s question to the Amazon Q Business ChatSync API, it must exchange the ID token for AWS credentials. If the Amazon Q Business application is configured with IAM Identity Center, the Amazon Q Business application or custom client calls the CreateTokenWithIAM API to exchange the ID token for an IAM Identity Center token. This token includes the user attributes synchronized from the IdP to IAM Identity Center as described earlier. If the Amazon Q Business application is configured with an IAM IdP, this step is skipped.
  5. The last step to obtain AWS credentials is to call AWS Security Token Service (AWS STS). If the Amazon Q Business application is configured with IAM Identity Center, the AssumeRole API is called passing the IAM Identity Center token. For an Amazon Q Business application configured with an IAM IdP, the AssumeRoleWithSAML or AssumeRoleWithWebIdentity API is called depending on whether SAML 2.0 or OIDC is used for the provider. The credentials returned from AWS STS can be cached and reused until they expire (steps 5 and 6 are sketched in code after this list).
  6. The Amazon Q Business web experience or custom client can now call the ChatSync API with the credentials obtained in the previous step using AWS Signature Version 4. Because the credentials include the user attributes configured in the IdP, they’re available to Amazon Q Business to personalize the user’s query.
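
The following sketch illustrates steps 5 and 6 for an Amazon Q Business application configured with an OIDC IAM IdP. The role ARN, application ID, and ID token value are placeholders; with IAM Identity Center, you would call CreateTokenWithIAM and AssumeRole instead.

import boto3

id_token = "<ID token returned by the IdP in step 2>"  # placeholder

# Step 5: exchange the IdP ID token for AWS credentials that carry the user's attributes.
sts = boto3.client("sts")
credentials = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam::123456789012:role/QBusinessWebExperienceRole",  # placeholder
    RoleSessionName="jane.doe@example.com",
    WebIdentityToken=id_token,
)["Credentials"]

# Step 6: call the ChatSync API with the identity-aware credentials (SigV4-signed).
qbusiness = boto3.client(
    "qbusiness",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
response = qbusiness.chat_sync(
    applicationId="<Amazon Q Business application ID>",  # placeholder
    userMessage="What training is available?",
)
print(response["systemMessage"])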

Amazon Q Business personalization use case

To demonstrate how personalization works in practice, let’s take an example of internal training made available to employees of a multi-national company. Imagine you lead the training department for an enterprise company and you’re tasked with improving the access to training opportunities offered to employees. You’ve done a great job documenting this information for all locations where training is provided and published it on your company’s Microsoft SharePoint site, but the feedback from employees is that they don’t know where to find the information. The confusion stems from the fact that your company also publishes internal company information and documentation on Confluence, Box, and a wiki. Additionally, your department uses ServiceNow for training support, which has developed into another source of valuable but under-utilized information.

The first challenge to solve is discoverability of the information spread across these disparate and disconnected systems. Through the connectors described earlier, Amazon Q Business can bring together the information in these systems and provide a conversational user interface that allows employees to ask questions in natural language, such as, “What training is available?”

With the discoverability challenge solved, there is still an opportunity to further optimize the user experience. This is where personalization comes in. Consider the basic question, “What training is available?” from a user who works out of the San Francisco, CA, office. Based on this question, Amazon Q Business can find documents that describe the training classes available across all corporate locations, but lacks the knowledge of the user’s home office location to be more precise in its answer. Providing an answer based on the location, or even a blend of multiple locations, isn’t as accurate as if the answer were based on where the employee worked. The employee could be more explicit in their question by including their location, but the goal of AI assistants is to better understand the user’s intent and context to be able to provide the most accurate information possible for even the most basic questions. Knowing key information about the user allows Amazon Q Business to seamlessly personalize the retrieval of documents and therefore lead to a more accurate response. Let’s see how it works in more detail.

At the core of Amazon Q Business is a technique called Retrieval Augmented Generation (RAG). At a high level, RAG involves taking a user’s request and finding passages from a set of documents in a searchable index that are most similar to the request and then asking a large language model (LLM) to generate a response that provides an answer using the retrieved passages. Given the question, “What training is available?” and the number of locations for the company, the top document passages returned from the index and provided to the LLM may not even include the user’s location. Therefore, the more precise the query to the retrieval layer, the more accurate and relevant the ultimate response will be. For example, modifying the query to include details on the user’s location should result in document passages specific to the user being returned at or near the top of the list rather than buried further down the list.

Configure user attributes in your IdP

Let’s look at how you would configure your IdP to pass along the attributes of your users to your Amazon Q Business application. Regardless of the identity federation mechanism configured for your Amazon Q Business application, attributes for your users need to be maintained in your IdP’s directory. The following is a partial screenshot of some of the location-related fields available in the profile editor for the Okta IdP.

Besides the administrative UI for editing individual profiles, Okta also provides mechanisms for updating profiles in bulk or through APIs. These tools make it straightforward to keep your user profiles synchronized with source systems such as employee directories.

After your user profiles are updated in your IdP, the process for making user attributes available to your Amazon Q Business application varies based on the identity federation configuration.

Federation with IAM Identity Center

If you configure your Amazon Q Business application with IAM Identity Center (recommended) and you use an external IdP such as Okta or Entra ID to manage your workforce, you simply need to maintain user attributes in your IdP. Because IAM Identity Center supports the SCIM standard, you can set up user profiles and their attributes to be automatically synchronized with IAM Identity Center. After the users and attributes are synchronized to IAM Identity Center, they can be accessed by Amazon Q Business from either the web experience or through a custom client integration as described earlier.

A less common variation of using IAM Identity Center with Amazon Q Business that is suitable for basic testing is to use IAM Identity Center as the identity source (without an external IdP). In this case, you would add users and manage their attributes directly in IAM Identity Center through the AWS Management Console or the CreateUser and UpdateUser APIs.

Federation with IAM

If you configure your Amazon Q Business application to use IAM federation, user attributes are also maintained in your IdP. However, the attributes are passed to your Amazon Q Business application from your IdP using either a SAML 2.0 assertion or an OIDC claim depending on the provider type that you set up as your IAM IdP. Your IdP must be configured to pass the specific attributes that you intend to expose for personalization. How this configuration is done depends again on whether you’re using SAML 2.0 or OIDC. For this post, we describe how this is done in Okta. The process should be similar with other IdPs.

SAML 2.0 provider type

When you create a SAML 2.0 application in Okta for authenticating your users, you have the option to create attribute statements. The attribute statements are included in the SAML 2.0 assertion that is provided by Okta when a user authenticates. The first three attribute statements shown in the following table are required for SAML 2.0 authentication to work with Amazon Q Business. The others are examples of how you would pass optional attributes that can be used for personalization.

Name Name format Value
https://aws.amazon.com/SAML/Attributes/PrincipalTag:Email Unspecified user.email
https://aws.amazon.com/SAML/Attributes/Role Unspecified [WebExpRoleArn],[IdentityProviderArn]
https://aws.amazon.com/SAML/Attributes/RoleSessionName Unspecified user.email
https://aws.amazon.com/SAML/Attributes/PrincipalTag:countryCode Unspecified user.countryCode != null ? user.countryCode : ""
https://aws.amazon.com/SAML/Attributes/PrincipalTag:city Unspecified user.city != null ? user.city : ""
https://aws.amazon.com/SAML/Attributes/PrincipalTag:title Unspecified user.title != null ? user.title : ""
https://aws.amazon.com/SAML/Attributes/PrincipalTag:department Unspecified user.department != null ? user.department : ""

Where the attribute statement value uses the Okta Expression Language, Okta resolves the value expression with the actual value for the user. For example, user.email resolves to the user’s email address, and user.city != null ? user.city : "" resolves to the user’s city (as specified in their user profile) or an empty string if not specified. And because these values are passed in the SAML assertion, you can also include any custom attributes for your users that are specific to your business or domain that may be relevant to personalization.

For [WebExpRoleArn],[IdentityProviderArn], you must replace [WebExpRoleArn] with the web experience role ARN for your Amazon Q Business application and [IdentityProviderArn] with the ARN of the IAM IdP that you created in IAM for this SAML provider.

OIDC provider type

When you create an OIDC application in Okta for authenticating your users, the location where you configure the user attributes to include in the OIDC claim is a bit different. For OIDC, you must add the user attributes you want to expose for personalization to the claim for the authorization server. AWS STS supports an access token or ID token type. In this post, we demonstrate the ID token type. For more details, see Build private and secure enterprise generative AI applications with Amazon Q Business using IAM Federation.

Complete the following steps:

  1. In Okta, choose Security, API in the navigation pane.
  2. Choose the authorization server (which may be default) and then Claims.
  3. If you don’t see a claim type of ID, choose Add Claim to create one.
  4. For Claim name, enter https://aws.amazon.com/tags.
  5. For Include in token type, choose Access Token or ID Token (we use ID Token in this post).
  6. For Value type, choose Expression.
  7. For Value, enter a JSON document that uses the Okta Expression Language to resolve attributes for the user. The full expression is as follows:
    {
       "principal_tags": {
          "Email": {user.email}, 
          "countryCode": {user.countryCode != null ? user.countryCode : ""}, 
          "city": {user.city != null ? user.city : ""},
          "title" {user.title != null ? user.title : ""},
          "department": {user.department != null ? user.department : ""}
       }
    } 

  8. Choose Create.

Again, you are not limited to just these fields. You can also include custom fields that apply to your use case and documents in the expression.

Enable personalization in Amazon Q Business

After you have your preferred authentication mechanism configured in your IdP, IAM, and Amazon Q Business, you’re ready to see how it impacts responses in your Amazon Q Business application. Although personalization is enabled by default for Amazon Q Business applications, you can control whether personalization is enabled on the Update Global Controls settings page for your Amazon Q Business application. If necessary, select Enable response personalization and choose Save.

Amazon Q Business personalization in action

Now you’re ready to see how Amazon Q Business personalizes responses for each user. We continue with the same use case of asking Amazon Q Business “What training is available?” The documents added to the Amazon Q Business index include internal training schedules available to all employees as Word documents for two corporate offices: San Francisco and London. In addition, two users were created in the IdP, where one user is based in the San Francisco office and the other is based in the London office. The city and country fields were populated as well as each user’s title. The San Francisco employee is a software programmer and the London employee is the Director of Marketing.

When signed in to the application using an incognito (private) window as the San Francisco employee, the question “What training is available?” produces the following response.

The response includes content on the training classes being held at the San Francisco office. The citation in the Sources section also confirms that the “September Training Curriculum at San Francisco” document was used to generate the response.

We can close the incognito window, open a new incognito window, sign in as the London employee, and ask the same question: “What training is available?” This time, the response provides information on the training classes being held at the London office and the citation refers to the London curriculum document.

For one final test, we disable personalization for the Amazon Q Business application on the Update Global Controls settings page for the Amazon Q Business application, wait a few minutes for the change to take effect, and then ask the same question in a new conversation.

This time, Amazon Q Business includes information on classes being held at both offices, which is confirmed by the citations pulling in both documents. Although the question is still answered, the user must parse through the response to pick out the portions that are most relevant to them based on their location.

Use cases for Amazon Q Business personalization

Amazon Q Business can be very effective in supporting a wide variety of use cases. However, not all of these use cases can be enhanced with personalization. For example, asking Amazon Q Business to summarize a request for proposal (RFP) submission or compare credit card offers in a customer support use case are not likely to be improved based on attributes of the user. Fortunately, Amazon Q Business will automatically determine if a given user’s question would benefit from personalizing the retrieval query based on the attributes known for the user. When thinking about enabling and optimizing personalization for your use case, consider the availability of user attributes and the composition of data in your Amazon Q Business index.

Working backward from the personalization effect you want to implement, you first need to determine if the required user attributes for your use case exist in your IdP. This may require importing and synchronizing this data into your IdP from another system, such as an employee directory or payroll system. Then you should consider the documents and data in your Amazon Q Business index to determine if they are optimized for personalized retrieval. That is, determine whether the documents in your index have content that will be readily found by the retrieval step given the user attributes in your IdP. For example, the documents used for the training class example in this post have the city mentioned in the document title as well as the document body. Because Amazon Q Business boosts matches against the document title by default, we are taking advantage of built-in relevance tuning to further influence the documents that match the user’s city.

In this post, we focused on the user’s work location and information that was location-specific to add value through personalization. In other words, we used the user’s work location to transparently find what’s most relevant to them nearby. Another useful area to explore is using the user’s job title or job level to find content that is specific to their role. As you explore the possibilities, the intersection of user information and the composition of the data in the corpus of documents in your enterprise data stores is the best place to start.

Conclusion

In this post, we demonstrated how to use personalization to improve the relevancy and usefulness of the responses provided by an AI-powered assistant. Personalization is not going to dramatically improve every interaction with Amazon Q Business, but when it’s thoughtfully applied to use cases and data sources where it can deliver value, it can build trust with end-users by providing responses that are more relevant and meaningful.

What use cases do you have where attributes for your users and the information in your data sources can allow Amazon Q Business to deliver a more personalized user experience? Try out the solution for yourself, and leave your feedback and questions in the comments.


About the Authors

James Jory is a Principal Solutions Architect for Amazon Q Business. He has interests in generative AI, personalization, and recommender systems and has a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and motor sports.

Nihal Harish is a Software Development Engineer at AWS AI. He is passionate about generative AI and reinforcement learning. Outside of work, he enjoys playing tennis, tending to his garden, and exploring new culinary recipes.

Pranesh Anubhav is a Software Development Manager for Amazon Personalize. He is passionate about designing machine learning systems to serve customers at scale. Outside of his work, he loves playing soccer and is an avid follower of Real Madrid.

Gaurush Hiranandani is an Applied Scientist at AWS AI, where his research spans the fields of statistical machine learning, with a particular focus on preference elicitation and recommender systems. He is deeply passionate about advancing the personalization of generative AI services at AWS AI, aiming to enhance user experiences through tailored, data-driven insights.

Harsh Singh is a Principal Product Manager Technical at AWS AI. Harsh enjoys building products that bring AI to software developers and everyday users to improve their productivity.

Read More

Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 1

Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 1

Building intelligent agents that can accurately understand and respond to user queries is a complex undertaking that requires careful planning and execution across multiple stages. Whether you are developing a customer service chatbot or a virtual assistant, there are numerous considerations to keep in mind, from defining the agent’s scope and capabilities to architecting a robust and scalable infrastructure.

This two-part series explores best practices for building generative AI applications using Amazon Bedrock Agents. Agents helps you accelerate generative AI application development by orchestrating multistep tasks. Agents use the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps. In addition, they use the developer-provided instruction to create an orchestration plan and then carry out the plan by invoking company APIs and accessing knowledge bases using Retrieval Augmented Generation (RAG) to provide an answer to the user’s request.

In Part 1, we focus on creating accurate and reliable agents. Part 2 discusses architectural considerations and development lifecycle practices.

Laying the groundwork: Collecting ground truth data

The foundation of any successful agent is high-quality ground truth data—the accurate, real-world observations used as reference for benchmarks and evaluating the performance of a model, algorithm, or system. For an agent application, before you start building, it’s crucial to collect a set of ground truth interactions or conversations that will drive the entire agent lifecycle. This data provides a benchmark for expected agent behavior, including the interaction with existing APIs, knowledge bases, and guardrails connected with the agent. This enables accurate testing and evaluation and helps identify edge cases and potential pitfalls.

To build a robust ground truth dataset, focus on gathering diverse examples that cover various user intents and scenarios. Your dataset should include the input and expected output for both simple and complex interactions. It’s important to regularly update and expand your dataset as you learn more about user behavior. Ground your data in real customer interactions that reflect actual use cases but be sure to de-identify and anonymize the data.

The following table shows a subset of ground truth collected for a banking assistant agent that knows the check_account_balance and book_appointments APIs, uses the knowledge base bankingFAQ, and is connected with the guardrail BlockInvestmentAdvice. This example presents only a few of the queries used to test the agent, including the session information used from other systems and passed to the agent using sessionAttributes. It’s also a best practice to collect any extra information that would be shared with the agent in a production system. That includes information shared with the action group execution function only (session attributes), information used to augment the prompt (prompt session attributes), and information used by the knowledge base (knowledge base configuration).

User Query Session Attributes Prompt Session Attributes Expected Response API, Knowledge Bases and Guardrails invoked
What is my account balance? None None Could you please provide the number of the account that you would like to check the balance for? None
What is the balance for the account 1234? user id 111 None Your balance is X Action Group: check_account_balance(111, 1234)
How can I open an account? None None To open a new banking account with ANY_BANK Corporation, you need to gather the following information and documentation: 1. Valid government-issued ID (driver’s license, passport, etc.) 2. Proof of address (utility bill, lease agreement, etc.) 3. Social Security number or Individual Taxpayer Identification Number (ITIN) 4. Initial deposit amount (if required) You can then fill out an application online or in person at an ANY_BANK Corporation branch. Once approved, you can make your initial deposit and start using your new account. Knowledge Base: bankingFAQ
Should I invest in bitcoins? None None Sorry, your query violates our usage policies. We do not provide investment advice. To discuss the best investment advice for your current situation, please contact us on (XXX) XXX-XXXX and we will be happy to support you. Guardrail: BlockInvestmentAdvice
Could you make an appointment for tomorrow at 2pm? user id 111 Today: 09/03/2024 Certainly! We’ve booked an appointment for you tomorrow, September 4th, 2024, at 2pm. Your appointment ID is XXXX. Action Group: book_appointment(111, 09/04/2024)

Defining scope and sample interactions

Now that you have your ground truth data, the next step is to clearly define the scope of each agent, including tasks it should and shouldn’t handle, and outline clear expected sample user interactions. This process involves identifying primary functions and capabilities, limitations and out-of-scope tasks, expected input formats and types, and desired output formats and styles.

For instance, when considering an HR assistant agent, a possible scope would be the following:

Primary functions:

– Provide information on company HR policies

– Assist with vacation requests and time-off management

– Answer basic payroll questions

Out of scope:

– Handling sensitive employee data

– Making hiring or firing decisions

– Providing legal advice

Expected inputs:

– Natural language queries about HR policies

– Requests for time-off or vacation information

– Basic payroll inquiries

Desired outputs:

– Clear and concise responses to policy questions

– Step-by-step guidance for vacation requests

– Completion of tasks to book a new vacation and to retrieve, edit, or delete an existing request

– Referrals to appropriate HR personnel for complex issues

– Creation of an HR ticket for questions where the agent is not able to respond

By clearly defining your agent’s scope, you set clear boundaries and expectations, which will guide your development process and help create a focused, reliable AI agent.

Architecting your solution: Building small and focused agents that interact with each other

When it comes to agent architecture, the principle “divide and conquer” holds true. In our experience, it has proven to be more effective to build small, focused agents that interact with each other rather than a single large monolithic agent. This approach offers improved modularity and maintainability, straightforward testing and debugging, flexibility to use different FMs for specific tasks, and enhanced scalability and extensibility.

For example, consider an HR assistant that helps internal employees in an organization and a payroll team assistant that supports the employees of the payroll team. Both agents have common functionality such as answering payroll policy questions and scheduling meetings between employees. Although the functionalities are similar, they differ in scope and permissions. For instance, the HR assistant can only reply to questions based on the internally available knowledge, whereas the payroll agent can also handle confidential information only available to payroll employees. Additionally, the HR agent can schedule meetings between employees and their assigned HR representative, whereas the payroll agent schedules meetings between the employees on its team. In a single-agent approach, those functionalities are handled in the agent itself, resulting in the duplication of the action groups available to each agent, as shown in the following figure.

In this scenario, when something changes in the meetings action group, the change needs to be propagated to the different agents. When applying the multi-agent collaboration best practice, the HR and payroll agents orchestrate smaller, task-focused agents that are focused on their own scope and have their own instructions. Meetings are now handled by a dedicated meeting agent that is reused between the two agents, as shown in the following figure.

When a new functionality is added to the meeting assistant agent, the HR agent and payroll agent only need to be updated to handle those functionalities. This approach can also be automated in your applications to increase the scalability of your agentic solutions. The supervisor agents (HR and payroll agents) can set the tone of your application as well as define how each functionality (knowledge base or sub-agent) of the agent should be used. That includes enforcing knowledge base filters and parameter constraints as part of the agentic application.

Crafting the user experience: Planning agent tone and greetings

The personality of your agent sets the tone for the entire user interaction. Carefully planning the tone and greetings of your agent is crucial for creating a consistent and engaging user experience. Consider factors such as brand voice and personality, target audience preferences, formality level, and cultural sensitivity.

For instance, a formal HR assistant might be instructed to address users formally, using titles and last names, while maintaining a professional and courteous tone throughout the conversation. In contrast, a friendly IT support agent could use a casual, upbeat tone, addressing users by their first names and even incorporating appropriate emojis and tech-related jokes to keep the conversation light and engaging.

The following is an example prompt for a formal HR assistant:

You are an HR AI Assistant, helping employees understand company policies and manage 
their benefits. Always address users formally, using titles (Mr., Ms., Dr., etc.) and last names. 
Maintain a professional and courteous tone throughout the conversation.

The following is an example prompt for a friendly IT support agent:

You're the IT Buddy, here to help with tech issues. 
Use a casual, upbeat tone and address users by their first names. 
Feel free to use appropriate emojis and tech-related jokes to keep the conversation light and engaging.

Make sure your agent’s tone aligns with your brand identity and remains consistent across different interactions. When orchestrating multiple agents, set the tone at the application level and enforce it across the different sub-agents.

Maintaining clarity: Providing unambiguous instructions and definitions

Clear communication is the cornerstone of effective AI agents. When defining instructions, functions, and knowledge base interactions, strive for unambiguous language that leaves no room for misinterpretation. Use simple, direct language and provide specific examples for complex concepts. Define clear boundaries between similar functions and implement confirmation mechanisms for critical actions. Consider the following example of clear vs. ambiguous instructions.

The following is an example of an ambiguous prompt:

Check if the user has time off available and book it if possible.

The following is a clearer prompt:

1. Verify the user's available time-off balance using the `checkTimeOffBalance` function. 
2. If the requested time off is available, use the `bookTimeOff` function to reserve it. 
3. If the time off is not available, inform the user and suggest alternative dates. 
4. Always confirm with the user before finalizing any time-off bookings.

By providing clear instructions, you reduce the chances of errors and make sure your agent behaves predictably and reliably.

The same advice applies when defining the functions of your action groups. Avoid ambiguous function names and definitions, and set clear descriptions for their parameters. The following figure shows how renaming two functions in an action group, and refining their descriptions and parameters, makes it explicit what each function actually returns and what value format is expected for the user ID.
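As a complement to the figure, the following is a minimal sketch (Python, Boto3) of how such unambiguous function definitions could be expressed when creating an action group. The agent ID, Lambda ARN, function names, and the EMP-XXXXXX identifier format are illustrative assumptions, not the exact values from the figure.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Define two clearly separated functions with descriptive names, descriptions,
# and an explicit parameter format
bedrock_agent.create_agent_action_group(
    agentId="<your-agent-id>",
    agentVersion="DRAFT",
    actionGroupName="user-details-actions",
    description="Look up employee profile details and contact information",
    actionGroupExecutor={"lambda": "<your-lambda-function-arn>"},
    functionSchema={
        "functions": [
            {
                "name": "get_user_profile",
                "description": "Returns the employee's name, department, and manager",
                "parameters": {
                    "user_id": {
                        "description": "Employee ID in the format EMP-XXXXXX",
                        "type": "string",
                        "required": True,
                    }
                },
            },
            {
                "name": "get_user_contact_info",
                "description": "Returns the employee's corporate email address and phone number",
                "parameters": {
                    "user_id": {
                        "description": "Employee ID in the format EMP-XXXXXX",
                        "type": "string",
                        "required": True,
                    }
                },
            },
        ]
    },
)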

Finally, the knowledge base instructions should clearly state what is available in the knowledge base and when to use it to answer user queries.

The following is an ambiguous prompt:

Knowledge Base 1: use this knowledge base to get information from documents

The following is a clearer prompt:

Knowledge Base 1: Knowledge base containing insurance policies and internal documents. Use this knowledge base when the user asks about a policy term or regarding an internal system

Using organizational knowledge: Integrating knowledge bases

To make sure you provide your agents with enterprise knowledge, integrate them with your organization’s existing knowledge bases. This allows your agents to use vast amounts of information and provide more accurate, context-aware responses. By accessing up-to-date organizational data, your agents can improve response accuracy and relevance, cite authoritative sources, and reduce the need for frequent model updates.

Complete the following steps when integrating a knowledge base with Amazon Bedrock:

  1. Index your documents into a vector database using Amazon Bedrock Knowledge Bases.
  2. Configure your agent to access the knowledge base during interactions.
  3. Implement citation mechanisms to reference source documents in responses.

Regularly update your knowledge base to make sure your agent has consistent access to the most current information. This can be achieved by implementing event-based synchronization of your knowledge base data sources using the StartIngestionJob API and an Amazon EventBridge rule that is invoked periodically or based on updates of files in the knowledge base Amazon Simple Storage Service (Amazon S3) bucket.
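The following is a minimal sketch of an AWS Lambda handler that such an EventBridge rule could invoke; the knowledge base ID and data source ID are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

KNOWLEDGE_BASE_ID = "<your-knowledge-base-id>"
DATA_SOURCE_ID = "<your-data-source-id>"


def lambda_handler(event, context):
    # Start an ingestion job so newly added or updated S3 documents are re-indexed
    job = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=KNOWLEDGE_BASE_ID,
        dataSourceId=DATA_SOURCE_ID,
        description="Event-driven knowledge base sync",
    )
    return {"ingestionJobId": job["ingestionJob"]["ingestionJobId"]}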

Integrating Amazon Bedrock Knowledge Bases with your agent will allow you to add semantic search capabilities to your application. By using the knowledgeBaseConfigurations field in your agent’s SessionState during the InvokeAgent request, you can control how your agent interacts with your knowledge base by setting the desired number of results and any necessary filters.
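The following is a minimal sketch (Python, Boto3) of an InvokeAgent request that sets the number of retrieved results and a metadata filter through the session state. The agent ID, alias ID, knowledge base ID, and the filter key are illustrative assumptions.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.invoke_agent(
    agentId="<your-agent-id>",
    agentAliasId="<your-agent-alias-id>",
    sessionId="session-001",
    inputText="What does the policy say about parental leave?",
    sessionState={
        "knowledgeBaseConfigurations": [
            {
                "knowledgeBaseId": "<your-knowledge-base-id>",
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        # Optional metadata filter that restricts the retrieval scope
                        "filter": {"equals": {"key": "department", "value": "HR"}},
                    }
                },
            }
        ]
    },
)

# invoke_agent returns an event stream; concatenate the text chunks
completion = ""
for event in response["completion"]:
    if "chunk" in event:
        completion += event["chunk"]["bytes"].decode("utf-8")
print(completion)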

Defining success: Establishing evaluation criteria

To measure the effectiveness of your AI agent, it’s essential to define specific evaluation criteria. These metrics will help you assess performance, identify areas for improvement, and track progress over time.

Consider the following key evaluation metrics:

  • Response accuracy – This metric measures how the agent’s responses compare to your ground truth data. It indicates whether the answers are correct and whether the agent delivers consistently high-quality results.
  • Task completion rate – This measures the success rate of the agent. The core idea of this metric is to measure the percentage or proportion of the conversations or user interactions where the agent was able to successfully complete the requested tasks and fulfill the user’s intent.
  • Latency or response time – This metric measures how long a task took to run and the response time. Essentially, it measures how quickly the agent can provide a response or output after receiving an input or query. You can also set intermediate metrics that measure how long each step of the agent trace takes to run to identify the steps that need to be optimized in your system.
  • Conversation efficiency – This measures how efficiently the conversation collects the required information.
  • Engagement – This measures how well the agent understands the user’s intent, provides relevant and natural responses, and maintains an engaging back-and-forth conversational flow.
  • Conversation coherence – This metric measures the logical progression and continuity between the responses. It checks if the context and relevance are kept during the session and if the appropriate pronouns and references are used.

Furthermore, you should define your use case-specific evaluation metrics that determine how well the agent is fulfilling the tasks for your use case. For instance, for the HR use case, a possible custom metric could be the number of tickets created, because those are created when the agent can’t answer the question by itself.

Implementing a robust evaluation process involves creating a comprehensive test dataset based on your ground truth data, developing automated evaluation scripts to measure quantitative metrics, implementing A/B testing to compare different agent versions or configurations, and establishing a regular cadence for human evaluation of qualitative factors. Evaluation is an ongoing process, so you should continuously refine your criteria and measurement methods as you learn more about your agent’s performance and user needs.
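As a starting point for the automated evaluation scripts, the following is a minimal sketch of scoring an agent against a ground truth test set. The test-case structure and the run_agent helper are assumptions made for illustration.

def evaluate_agent(test_cases, run_agent):
    completed, correct, latencies = 0, 0, []
    for case in test_cases:
        # run_agent is assumed to return the answer, a task_completed flag, and the latency
        result = run_agent(case["input"])
        latencies.append(result["latency_seconds"])
        if result["task_completed"]:
            completed += 1
        if result["answer"].strip().lower() == case["expected_answer"].strip().lower():
            correct += 1
    n = len(test_cases)
    return {
        "task_completion_rate": completed / n,
        "response_accuracy": correct / n,
        "average_latency_seconds": sum(latencies) / n,
    }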

Using human evaluation

Although automated metrics are valuable, human evaluation plays a crucial role in assessing and improving your AI agent’s performance. Human evaluators can provide nuanced feedback on aspects that are difficult to quantify automatically, such as assessing natural language understanding and generation, evaluating the appropriateness of responses in context, identifying potential biases or ethical concerns, and providing insights into user experience and satisfaction.

To effectively use human evaluation, consider the following best practices:

  • Create a diverse panel of evaluators representing different perspectives
  • Develop clear evaluation guidelines and rubrics
  • Use a mix of expert evaluators (such as subject matter experts) and representative end-users
  • Collect quantitative ratings and qualitative feedback
  • Regularly analyze evaluation results to identify trends and areas for improvement

Continuous improvement: Testing, iterating, and refining

Building an effective AI agent is an iterative process. Now that you have a working prototype, it’s crucial to test extensively, gather feedback, and continuously refine your agent’s performance. This process should include comprehensive testing using your ground truth dataset; real-world user testing with a beta group; analysis of agent logs and conversation traces; regular updates to instructions, function definitions, and prompts; and performance comparison across different FMs.

To achieve thorough testing, consider using AI to generate diverse test cases. The following is an example prompt for generating HR assistant test scenarios:

Generate 10 diverse conversation scenarios between an employee and an HR AI assistant. Include a mix of common requests (e.g., vacation booking, policy questions) and edge cases (e.g., complex situations, out-of-scope queries). For each scenario, provide:
1. The initial user query
2. Expected agent responses
3. Potential follow-up questions
4. Desired final outcomes

One of the best tools in the testing phase is the agent trace. The trace shows the prompts the agent uses at each step of its orchestration and gives insight into the agent’s chain of thought and reasoning process. You can enable the trace in your InvokeAgent call during the test process and disable it after your agent has been validated, as shown in the sketch that follows.
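The following is a minimal sketch (Python, Boto3) of enabling the trace during testing; the agent ID, alias ID, and input text are placeholder assumptions.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.invoke_agent(
    agentId="<your-agent-id>",
    agentAliasId="<your-agent-alias-id>",
    sessionId="test-session-001",
    inputText="How many vacation days do I have left?",
    enableTrace=True,  # turn off once the agent has been validated
)

for event in response["completion"]:
    if "trace" in event:
        # Each trace event exposes the prompts and intermediate reasoning steps
        print(event["trace"]["trace"])
    elif "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"))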

The next step after collecting a ground truth dataset is to evaluate the agent’s behavior. You first need to define evaluation criteria for assessing the agent’s behavior. For the HR assistant example, you can create a test dataset that compares the results provided by your agent with the results obtained by directly querying the vacations database. You can then manually evaluate the agent behavior using human evaluation, or you can automate the evaluation using agent evaluation frameworks such as Agent Evaluation. If model invocation logging is enabled, Amazon Bedrock Agents will also give you Amazon CloudWatch logs. You can use those logs to validate your agent’s behavior, debug unexpected outputs, and adjust the agent accordingly.

The last step of the agent testing phase is to plan for A/B testing groups during the deployment stage. You should define different aspects of agent behavior, such as formal or informal HR assistant tone, that can be tested with a smaller set of your user group. You can then make different agent versions available for each group during initial deployments and evaluate the agent behavior for each group. Amazon Bedrock Agents has built-in versioning capabilities to help you with this key part of testing.

Conclusions

Following these best practices and continuously refining your approach can significantly contribute to your success in developing powerful, accurate, and user-oriented AI agents using Amazon Bedrock. In Part 2 of this series, we explore architectural considerations, security best practices, and strategies for scaling your AI agents in production environments.

By following these best practices, you can build secure, accurate, scalable, and responsible generative AI applications using Amazon Bedrock. For examples to get started, check out the Amazon Bedrock Agents GitHub repository.

To learn more about Amazon Bedrock Agents, you can get started with the Amazon Bedrock Workshop and the standalone Amazon Bedrock Agents Workshop, which provides a deeper dive. Additionally, check out the service introduction video from AWS re:Invent 2023.


About the Authors

Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.

Navneet Sabbineni is a Software Development Manager at AWS Bedrock. With over 9 years of industry experience as a software developer and manager, he has worked on building and maintaining scalable distributed services for AWS, including generative AI services like Amazon Bedrock Agents and conversational AI services like Amazon Lex. Outside of work, he enjoys traveling and exploring the Pacific Northwest with his family and friends.

Monica Sunkara is a Senior Applied Scientist at AWS, where she works on Amazon Bedrock Agents. With over 10 years of industry experience, including 6 years at AWS, Monica has contributed to various AI and ML initiatives such as Alexa Speech Recognition, Amazon Transcribe, and Amazon Lex ASR. Her work spans speech recognition, natural language processing, and large language models. Recently, she worked on adding function calling capabilities to Amazon Titan text models. Monica holds a degree from Cornell University, where she conducted research on object localization under the supervision of Prof. Andrew Gordon Wilson before joining Amazon in 2018.

Read More

AWS recognized as a first-time Leader in the 2024 Gartner Magic Quadrant for Data Science and Machine Learning Platforms

AWS recognized as a first-time Leader in the 2024 Gartner Magic Quadrant for Data Science and Machine Learning Platforms

Over the last 18 months, AWS has released into general availability more than twice as many machine learning (ML) and generative artificial intelligence (AI) features as the other major cloud providers combined. This accelerated innovation is enabling organizations of all sizes, from disruptive AI startups like Hugging Face, AI21 Labs, and Articul8 AI to industry leaders such as NASDAQ and United Airlines, to unlock the transformative potential of generative AI. By providing a secure, high-performance, and scalable set of data science and machine learning services and capabilities, AWS empowers businesses to drive innovation through the power of AI.

At the heart of this innovation are Amazon Bedrock and Amazon SageMaker, both of which were mentioned in the recent Gartner Data Science and Machine Learning (DSML) Magic Quadrant evaluation. These services play a pivotal role in addressing diverse customer needs across the generative AI journey.

Amazon SageMaker, the foundational service for ML and generative AI model development, provides the fine-tuning capabilities and flexibility that make it simple for data scientists and machine learning engineers to build, train, and deploy machine learning and foundation models (FMs) at scale. For application developers, Amazon Bedrock is the simplest way to build and scale generative AI applications with FMs for a wide variety of use cases. Whether using the best available FMs or importing custom models from SageMaker, Amazon Bedrock equips development teams with the tools they need to accelerate innovation.

We believe the continued innovation across both services and our positioning as a Leader in the 2024 Gartner Data Science and Machine Learning (DSML) Magic Quadrant reflect our commitment to meeting evolving customer needs, particularly in data science and ML. In our opinion, this recognition, coupled with our recent recognition in the Cloud AI Developer Services (CAIDS) Magic Quadrant, solidifies AWS as a provider of innovative AI solutions that drive business value and competitive advantage.

Review the Gartner Magic Quadrant and Methodology

For Gartner, the DSML Magic Quadrant research methodology provides a graphical competitive positioning of four types of technology providers in fast-growing markets: Leaders, Visionaries, Niche Players and Challengers. As companion research, Gartner Critical Capabilities notes provide deeper insight into the capability and suitability of providers’ IT products and services based on specific or customized use cases.

The following figure highlights where AWS lands in the DSML Magic Quadrant.

Access a complimentary copy of the full report to see why Gartner positioned AWS as a Leader, and dive deep into the strengths and cautions of AWS.

Further detail on Amazon Bedrock and Amazon SageMaker

Amazon Bedrock provides a straightforward way to build and scale applications with large language models (LLMs) and foundation models (FMs), empowering you to build generative AI applications with security and privacy. With Amazon Bedrock, you can experiment with and evaluate high performing FMs for your use case, import custom models, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that run tasks using your enterprise systems and data sources. Tens of thousands of customers across multiple industries are deploying new generative AI experiences for diverse use cases.

Amazon SageMaker is a fully managed service that brings together a broad set of tools to enable high-performance, low-cost ML for any use case. You can access a wide-ranging choice of ML tools, fully managed and scalable infrastructure, repeatable and responsible ML workflows, and the power of human feedback across the ML lifecycle, including tools such as Amazon SageMaker Canvas and Amazon SageMaker Data Wrangler that make it straightforward to work with data.

In addition, Amazon SageMaker helps data scientists and ML engineers build FMs from scratch, evaluate and customize FMs with advanced techniques, and deploy FMs with fine-grained controls for generative AI use cases that have stringent requirements on accuracy, latency, and cost. Hundreds of thousands of customers from Perplexity to Thomson Reuters to Workday use SageMaker to build, train, and deploy ML models, including LLMs and other FMs.

Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from AWS.

GARTNER is a registered trademark and service mark of Gartner and Magic Quadrant is a registered trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved.


About the author

Susanne Seitinger leads AI and ML product marketing at Amazon Web Services (AWS), including the introduction of critical generative AI services like Amazon Bedrock as well as coordinating generative AI marketing activities across AWS. Prior to AWS, Susanne was the director of public sector marketing at Verizon Business Group, and previously drove public sector marketing in the United States for Signify, after holding various positions in R&D, innovation, and segment management and marketing. She holds a BA from Princeton University, as well as a master’s in city planning and a PhD from MIT.

Read More

Build a serverless voice-based contextual chatbot for people with disabilities using Amazon Bedrock

Build a serverless voice-based contextual chatbot for people with disabilities using Amazon Bedrock

At Amazon and AWS, we are always finding innovative ways to build inclusive technology. With voice assistants like Amazon Alexa, we are enabling more people to ask questions and get answers on the spot without having to type. Whether you’re a person with a motor disability, juggling multiple tasks, or simply away from your computer, getting search results without typing is a valuable feature. With modern voice assistants, you can now ask your questions conversationally and get verbal answers instantly.

In this post, we discuss voice-guided applications. Specifically, we focus on chatbots. Chatbots are no longer a niche technology. They are now ubiquitous on customer service websites, providing around-the-clock automated assistance. Although AI chatbots have been around for years, recent advances in generative AI and large language models (LLMs) have enabled more natural conversations. Chatbots are proving useful across industries, handling both general and industry-specific questions. Voice-based assistants like Alexa demonstrate how we are entering an era of conversational interfaces. Typing questions already feels cumbersome to many who prefer the simplicity and ease of speaking with their devices.

We explore how to build a fully serverless, voice-based contextual chatbot tailored for people with disabilities. We also provide a sample chatbot application, which is available in the accompanying GitHub repository. We create an intelligent conversational assistant that can understand and respond to voice inputs in a contextually relevant manner. The AI assistant is powered by Amazon Bedrock. This chatbot is designed to assist users with various tasks, provide information, and offer personalized support based on their unique requirements. For our LLM, we use Anthropic Claude on Amazon Bedrock.

We demonstrate the process of integrating Anthropic Claude’s advanced natural language processing capabilities with the serverless architecture of Amazon Bedrock, enabling the deployment of a highly scalable and cost-effective solution. Additionally, we discuss techniques for enhancing the chatbot’s accessibility and usability for people with motor disabilities. The aim of this post is to provide a comprehensive understanding of how to build a voice-based, contextual chatbot that uses the latest advancements in AI and serverless computing.

We hope that this solution can help people with certain mobility disabilities. A limited level of interaction is still needed, because the user must explicitly indicate when they start and stop talking. In our sample application, we address this with a dedicated Talk button that performs the transcription process while being pressed.

For people with significant motor disabilities, the same operation can be implemented with a dedicated physical button that can be pressed by a single finger or another body part. Alternatively, a special keyword can be said to indicate the beginning of the command. This approach is used when you communicate with Alexa. The user always starts the conversation with “Alexa.”

Solution overview

The following diagram illustrates the architecture of the solution.

Architecture of serverless components of the solution

To deploy this architecture, we need managed compute that can host the web application, authentication mechanisms, and relevant permissions. We discuss this later in the post.

All the services that we use are serverless and fully managed by AWS. You don’t need to provision the compute resources. You only consume the services through their API. All the calls to the services are made directly from the client application.

The application is a simple React application that we create using the Vite build tool. We use the AWS SDK for JavaScript to call the services. The solution uses the following major services:

  • Amazon Polly is a service that turns text into lifelike speech.
  • Amazon Transcribe is an AWS AI service that makes it straightforward to convert speech to text.
  • Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) along with a broad set of capabilities that you need to build generative AI applications.
  • Amazon Cognito is an identity service for web and mobile apps. It’s a user directory, an authentication server, and an authorization service for OAuth 2.0 access tokens and AWS credentials.

To consume AWS services, the user needs to obtain temporary credentials from AWS Identity and Access Management (IAM). This is possible due to the Amazon Cognito identity pool, which acts as a mediator between your application user and IAM services. The identity pool holds the information about the IAM roles with all permissions necessary to run the solution.

Amazon Polly and Amazon Transcribe don’t require additional setup from the client aside from what we have described. However, Amazon Bedrock requires named user authentication. This means that having an Amazon Cognito identity pool is not enough—you also need to use an Amazon Cognito user pool, which allows you to define users and bind them to the Amazon Cognito identity pool. To understand better how Amazon Cognito allows external applications to invoke AWS services, refer to Secure API Access with Amazon Cognito Federated Identities, Amazon Cognito User Pools, and Amazon API Gateway.

The heavy lifting of provisioning the Amazon Cognito user pool and identity pool, including generating the sign-in UI for the React application, is done by AWS Amplify. Amplify consists of a set of tools (open source framework, visual development environment, console) and services (web application and static website hosting) to accelerate the development of mobile and web applications on AWS. We cover the steps of setting Amplify in the next sections.

Prerequisites

Before you begin, complete the following prerequisites:

  1. Make sure you have the following installed:
  2. Create an IAM role to use in the Amazon Cognito identity pool. Follow the principle of least privilege to provide only the minimum set of permissions needed to run the application.
    • To invoke Amazon Bedrock, use the following code:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": "*"
          }
        ]
      }

    • To invoke Amazon Polly, use the following code:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "polly:SynthesizeSpeech",
            "Resource": "*"
          }
        ]
      }

    • To invoke Amazon Transcribe, use the following code:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "transcribe:StartStreamTranscriptionWebSocket",
            "Resource": "*"
          }
        ]
      }

The full policy JSON should look as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": "polly:SynthesizeSpeech",
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor3",
      "Effect": "Allow",
      "Action": "transcribe:StartStreamTranscriptionWebSocket",
      "Resource": "*"
    }
  ]
}
  3. Run the following command to clone the GitHub repository:
    git clone https://github.com/aws-samples/serverless-conversational-chatbot.git

  4. To use Amplify, refer to Set up Amplify CLI to complete the initial setup.
  5. To be consistent with the values that you use later in the instructions, call your AWS profile amplify when you see the following prompt.
    Creation of the AWS profile "amplify"
  6. Create the role amplifyconsole-backend-role with the AdministratorAccess-Amplify managed policy, which allows Amplify to create the necessary resources.
    IAM Role with "AdministratorAccess-Amplify" policy
  7. For this post, we use the Anthropic Claude 3 Haiku LLM. To enable the LLM in Amazon Bedrock, refer to Access Amazon Bedrock foundation models.

Deploy the solution

There are two options to deploy the solution:

  • Use Amplify to deploy the application automatically
  • Deploy the application manually

We provide the steps for both options in this section.

Deploy the application automatically using Amplify

Amplify can deploy the application automatically if it’s stored in GitHub, Bitbucket, GitLab, or AWS CodeCommit. Upload the application that you downloaded earlier to your preferred repository (from the aforementioned options). For instructions, see Getting started with deploying an app to Amplify Hosting.

You can now continue to the next section of this post to set up IAM permissions.

Deploy the application manually

If you don’t have access to one of the storage options that we mentioned, you can deploy the application manually. This can also be useful if you want to modify the application to better fit your use case.

We tested the deployment on AWS Cloud9, a cloud integrated development environment (IDE) for writing, running, and debugging code, with Ubuntu Server 22.04 and Amazon Linux 2023.

We use the Visual Studio Code IDE and run all the following commands directly in the terminal window inside the IDE, but you can also run the commands in the terminal of your choice.

  1. From the directory where you checked out the application on GitHub, run the following command:
    cd serverless-conversational-chatbot

  2. Run the following commands:
    npm i
    
    amplify init

  3. Follow the prompts as shown in the following screenshot.
    • For authentication, choose the AWS profile amplify that you created as part of the prerequisite steps.
      Initial AWS Amplify setup in React application: 1. Do you want to use an existing environment? No 2. Enter a name for the environment: sampleenv 3. Select the authentication method you want to use: AWS Profile 4. Please choose the profile you want to use: amplify
    • Two new files will appear in the project under the src folder:
      • amplifyconfiguration.json
      • aws-exports.js

      New objects created by AWS Amplify: 1. aws-exports.js 2. amplifyconfiguration.json

  4. Next, run the following command:
    amplify configure project

Then select “Project Information”

Project Configuration of AWS Amplify in React Applications

  5. Enter the following information:
    Which setting do you want to configure? Project information
    
    Enter a name for the project: servrlsconvchat
    
    Choose your default editor: Visual Studio Code
    
    Choose the type of app that you're building: javascript
    
    What javascript framework are you using: react
    
    Source Directory Path: src
    
    Distribution Directory Path: dist
    
    Build Command: npm run-script build
    
    Start Command: npm run-script start

You can use an existing Amazon Cognito identity pool and user pool or create new objects.

  6. For our application, run the following command:
    amplify add auth

If you get the following message, you can ignore it:

Auth has already been added to this project. To update run amplify update auth
  7. Choose Default configuration.
    Selecting "default configuration" when adding authentication objects
  8. Accept all options proposed by the prompt.
  9. Run the following command:
    amplify add hosting

  10. Choose your hosting option.

You have two options to host the application: you can host it with the Amplify console, or on Amazon Simple Storage Service (Amazon S3) and then expose it through Amazon CloudFront.

Hosting with the Amplify console differs from CloudFront and Amazon S3. The Amplify console is a managed service providing continuous integration and delivery (CI/CD) and SSL certificates, prioritizing swift deployment of serverless web applications and backend APIs. In contrast, CloudFront and Amazon S3 offer greater flexibility and customization options, particularly for hosting static websites and assets with features like caching and distribution. CloudFront and Amazon S3 are preferable for intricate, high-traffic web applications with specific performance and security needs.

For this post, we use the Amplify console. To learn more about deployment with Amazon S3 and Amazon CloudFront, refer to the documentation.
Selecting the deployment option for the React application on the Amplify Console. Selected option: Hosting with Amplify Console

Now you’re ready to publish the application. There is an option to publish the application to GitHub to support CI/CD pipelines. Amplify has built-in integration with GitHub and can redeploy the application automatically when you push the changes. For simplicity, we use manual deployment.

  11. Choose Manual deployment.
    Selecting "Manual Deployment" when publishing the project
  12. Run the following command:
    amplify publish

After the application is published, you will see the following output. Note down this URL to use in a later step.
Result of the Deployment of the React Application on the Amplify Console. The URL that the user should use to enter the Amplify application

  13. Log in to the Amplify console, navigate to the servrlsconvchat application, and choose General under App settings in the navigation pane.
    Service role attachment to the deployed application. First step: select the deployed application and choose the “General” option
  14. Edit the app settings and enter amplifyconsole-backend-role for Service role (you created this role in the prerequisites section).
    Service role attachment to the deployed application. Second step: set “amplifyconsole-backend-role” in the “Service role” field

Now you can proceed to the next section to set up IAM permissions.

Configure IAM permissions

As part of the publishing method you completed, you provisioned a new identity pool. You can view this on the Amazon Cognito console, along with a new user pool. The names will be different from those presented in this post.

As we explained earlier, you need to attach policies to this role to allow interaction with Amazon Bedrock, Amazon Polly, and Amazon Transcribe. To set up IAM permissions, complete the following steps:

  1. On the Amazon Cognito console, choose Identity pools in the navigation pane.
  2. Navigate to your identity pool.
  3. On the User access tab, choose the link for the authenticated role.
    Identifying the IAM authenticated role in the Amazon Cognito identity pool. Select “Identity pools” in the console, select the “User access” tab, and choose the link under “Authenticated role”
  4. Attach the policies that you defined in the prerequisites section.
    IAM policies attached to the Amazon Cognito identity pool authenticated role. The policy JSON is presented in the “Prerequisites” section, item 2.

Amazon Bedrock can only be used with a named user, so we create a sample user in the Amazon Cognito user pool that was provisioned as part of the application publishing process.

  1. On the user pool details page, on the Users tab, choose Create user.
    User Creation in the Cognito User Pool. Select relevant user pool in “User pools” section. Select “Users” tab. Click on “Create user” button
  2. Provide your user information.
    Sample user definition in the Cognito User Pool. Enter email address and temporary password.

You’re now ready to run the application.

Use the sample serverless application

To access the application, navigate to the URL you saved from the output at the end of the application publishing process. Sign in to the application with the user you created in the previous step. You might be asked to change the password the first time you sign in.
Application Login Page. Enter user name and password

Use the Talk button and hold it while you’re asking the question. (We use this approach for the simplicity of demonstrating the abilities of the tool. For people with motor disabilities, we propose using a dedicated button that can be operated with different body parts, or a special keyword to initiate the conversation.)

When you release the button, the application sends your voice to Amazon Transcribe and returns the transcription text. This text is used as an input for an Amazon Bedrock LLM. For this example, we use Anthropic Claude 3 Haiku, but you can modify the code and use another model.

The response from Amazon Bedrock is displayed as text and is also spoken by Amazon Polly.
Instructions on how to invoke the “Talk” operation by pressing and holding the Talk button

The conversation history is also stored. This means that you can ask follow-up questions, and the context of the conversation is preserved. For example, we asked, “What is the most famous tower there?” without specifying the location, and our chatbot was able to understand that the context of the question is Paris based on our previous question.
Demonstration of context preservation during a conversation. Continued question-and-answer conversation with the chatbot.

We store the conversation history inside a JavaScript variable, which means that if you refresh the page, the context will be lost. We discuss how to preserve the conversation context in a persistent way later in this post.
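The sample application performs these calls in the browser with the AWS SDK for JavaScript; the following Python (Boto3) sketch illustrates the same flow conceptually, with the transcription step omitted because the browser handles it with Amazon Transcribe streaming. The model ID, voice, and function shape are assumptions for illustration.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
polly = boto3.client("polly")


def answer_and_speak(transcript, history):
    # Keep the running conversation so follow-up questions retain their context
    history.append({"role": "user", "content": transcript})
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": history,
        }),
    )
    answer = json.loads(response["body"].read())["content"][0]["text"]
    history.append({"role": "assistant", "content": answer})

    # Convert the answer to speech with Amazon Polly
    speech = polly.synthesize_speech(Text=answer, OutputFormat="mp3", VoiceId="Joanna")
    return answer, speech["AudioStream"].read()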

While you choose and hold the Talk button, the color of the button changes and a microphone icon appears, indicating that the transcription process is in progress.

"Talk" operation indicator. “Talk” button changes color to orche

Clean up

To clean up your resources, run the following command from the same directory where you ran the Amplify commands:

amplify delete

Result of the "Cleanup" operation after running “amplify delete” command

This command removes the Amplify settings from the React application, the Amplify resources, and all Amazon Cognito objects, including the IAM role and the user in the Amazon Cognito user pool.

Conclusion

In this post, we presented how to create a fully serverless voice-based contextual chatbot using Amazon Bedrock with Anthropic Claude.

This serves as a starting point for a serverless and cost-effective solution. For example, you could extend the solution with persistent conversational memory for your chats by using a data store such as Amazon DynamoDB. If you want to use a Retrieval Augmented Generation (RAG) approach, you can use Amazon Bedrock Knowledge Bases to securely connect FMs in Amazon Bedrock to your company data.
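As a minimal sketch of that extension, the following shows one way the conversation history could be persisted in Amazon DynamoDB instead of a JavaScript variable; the table name and key schema are assumptions.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ChatbotConversations")  # partition key: session_id, sort key: turn


def save_turn(session_id, turn_index, role, text):
    # Persist a single conversation turn
    table.put_item(Item={
        "session_id": session_id,
        "turn": turn_index,
        "role": role,          # "user" or "assistant"
        "text": text,
    })


def load_history(session_id):
    # Rebuild the ordered message history for the next model invocation
    response = table.query(KeyConditionExpression=Key("session_id").eq(session_id))
    return [{"role": item["role"], "content": item["text"]} for item in response["Items"]]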

Another approach is to customize the model you use in Amazon Bedrock with your own data using fine-tuning or continued pre-training to build applications that are specific to your domain, organization, and use case. With custom models, you can create unique user experiences that reflect your company’s style, voice, and services.



About the Author

Michael Shapira is a Senior Solution Architect covering general topics in AWS and part of the AWS Machine Learning community. He has 16 years’ experience in Software Development. He finds it fascinating to work with cloud technologies and help others on their cloud journey.

Eitan Sela is a Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Read More

Maintain access and consider alternatives for Amazon Monitron

Maintain access and consider alternatives for Amazon Monitron

Amazon Monitron, the Amazon Web Services (AWS) machine learning (ML) service for industrial equipment condition monitoring, will no longer be available to new customers effective October 31, 2024. Existing customers of Amazon Monitron will be able to purchase devices and use the service as normal. We will continue to sell devices until July 2025 and will honor the 5-year device warranty, including service support. AWS continues to invest in security, availability, and performance improvements for Amazon Monitron, but we do not plan to introduce new features to Amazon Monitron.

This post discusses how customers can maintain access to Amazon Monitron after it is closed to new customers and what some alternatives are to Amazon Monitron.

Maintaining access to Amazon Monitron

Customers are considered existing customers if they have commissioned an Amazon Monitron sensor through a project at any time in the 30 days prior to October 31, 2024. To maintain access to the service after October 31, 2024, customers should create a project and commission at least one sensor.

For any questions or support needed, you can contact your assigned account manager or solutions architect, or create a case from the AWS Management Console.

Ordering Amazon Monitron hardware

For existing Amazon business customers, we will allowlist your account with the existing Amazon Monitron devices. For existing Amazon.com retail customers, the Amazon Monitron team will provide specific ordering instructions according to individual request.

Alternatives to Amazon Monitron

For customers interested in an alternative for their condition monitoring needs, we recommend exploring the solutions provided by our AWS Partners: Tactical Edge, IndustrAI, and Factory AI.

Summary

While new customers will no longer have access to Amazon Monitron after October 31, 2024, AWS offers a range of partner solutions through the AWS Partner Network finder. Customers should explore these options to understand what works best for their specific needs.

More details can be found in the following resources at AWS Partner Network.


About the author

Stuart Gillen is a Sr. Product Manager for Monitron, at AWS. Stuart has held a variety of roles in engineering management, business development, product management, and consulting. Most of his career has been focused on industrial applications specifically in reliability practices, maintenance systems, and manufacturing.

Read More

Import a question answering fine-tuned model into Amazon Bedrock as a custom model

Import a question answering fine-tuned model into Amazon Bedrock as a custom model

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Common generative AI use cases, including but not limited to chatbots, virtual assistants, conversational search, and agent assistants, use FMs to provide responses. Retrieval Augmented Generation (RAG) is a technique to optimize the output of FMs by providing context around the questions for these use cases. Fine-tuning the FM is recommended to further optimize the output to follow the brand and industry voice or vocabulary.

Custom Model Import for Amazon Bedrock, in preview now, allows you to import customized FMs created in other environments, such as Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2) instances, and on premises, into Amazon Bedrock. This post is part of a series that demonstrates various architecture patterns for importing fine-tuned FMs into Amazon Bedrock.

In this post, we provide a step-by-step approach of fine-tuning a Mistral model using SageMaker and import it into Amazon Bedrock using the Custom Import Model feature. We use the OpenOrca dataset to fine-tune the Mistral model and use the SageMaker FMEval library to evaluate the fine-tuned model imported into Amazon Bedrock.

Key Features

Some of the key features of Custom Model Import for Amazon Bedrock are:

  1. This feature allows you to bring your fine-tuned models and use the fully managed serverless capabilities of Amazon Bedrock.
  2. Currently, the feature supports the Llama 2, Llama 3, Flan, and Mistral model architectures with FP32, FP16, and BF16 precisions, with further quantizations coming soon.
  3. To use this feature, you run the import process (covered later in this post) with your model weights stored in Amazon Simple Storage Service (Amazon S3).
  4. You can also use models created with Amazon SageMaker by referencing the SageMaker model Amazon Resource Name (ARN), which provides seamless integration with SageMaker.
  5. Amazon Bedrock automatically scales your model as traffic increases and, when the model is not in use, scales it down to zero, reducing your costs.

Let us dive into a use-case and see how easy it is to use this feature.

Solution overview

At the time of writing, the Custom Model Import feature in Amazon Bedrock supports models following the architectures and patterns in the following figure.

In this post, we walk through the following high-level steps:

  1. Fine-tune the model using SageMaker.
  2. Import the fine-tuned model into Amazon Bedrock.
  3. Test the imported model.
  4. Evaluate the imported model using the FMEval library.

The following diagram illustrates the solution architecture.

The process includes the following steps:

  1. We use a SageMaker training job to fine-tune the model using a SageMaker JupyterLab notebook. This training job reads the dataset from Amazon Simple Storage Service (Amazon S3) and writes the model back into Amazon S3. This model will then be imported into Amazon Bedrock.
  2. To import the fine-tuned model, you can use the Amazon Bedrock console, the Boto3 library, or APIs.
  3. An import job orchestrates the process to import the model and make the model available from the customer account.
    1. The import job copies all the model artifacts from the user’s account into an AWS managed S3 bucket.
  4. When the import job is complete, the fine-tuned model is made available for invocation from your AWS account.
  5. We use the SageMaker FMEval library in a SageMaker notebook to evaluate the imported model.

The copied model artifacts will remain in the Amazon Bedrock account until the custom imported model is deleted from Amazon Bedrock. Deleting model artifacts in your AWS account S3 bucket doesn’t delete the model or the related artifacts in the Amazon Bedrock managed account. You can delete an imported model from Amazon Bedrock along with all the copied artifacts using either the Amazon Bedrock console, Boto3 library, or APIs.

Additionally, all data (including the model) remains within the selected AWS Region. The model artifacts are imported into the AWS operated deployment account using a virtual private cloud (VPC) endpoint, and you can encrypt your model data using an AWS Key Management Service (AWS KMS) customer managed key.

In the following sections, we dive deep into each of these steps to deploy, test, and evaluate the model.

Prerequisites

We use Mistral-7B-v0.3 in this post because it uses an extended vocabulary compared to its prior version produced by Mistral AI. This model is straightforward to fine-tune, and Mistral AI has provided example fine-tuned models. We use Mistral for this use case because this model supports a 32,000-token context capacity and is fluent in English, French, Italian, German, Spanish, and coding languages. With the Mixture of Experts (MoE) feature, it can achieve higher accuracy for customer support use cases.

Mistral-7B-v0.3 is a gated model on the Hugging Face model repository. You need to review the terms and conditions and request access to the model by submitting your details.

We use Amazon SageMaker Studio to preprocess the data and fine-tune the Mistral model using a SageMaker training job. To set up SageMaker Studio, refer to Launch Amazon SageMaker Studio. Refer to the SageMaker JupyterLab documentation to set up and launch a JupyterLab notebook. You will submit a SageMaker training job to fine-tune the Mistral model from the SageMaker JupyterLab notebook, which can be found in the GitHub repo.
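The following is a minimal sketch of what that training job submission might look like from the notebook; the instance type, framework versions, S3 path, and hyperparameters are illustrative assumptions and may differ from the notebook in the repository.

import sagemaker
from sagemaker.huggingface import HuggingFace

# Submit run_fsdp_qlora.py as a SageMaker training job; adjust values to your environment
huggingface_estimator = HuggingFace(
    entry_point="run_fsdp_qlora.py",
    source_dir="scripts",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    role=sagemaker.get_execution_role(),
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={
        "model_id": "mistralai/Mistral-7B-v0.3",
        "use_qlora": True,
        "max_seq_length": 2048,
    },
    distribution={"torch_distributed": {"enabled": True}},  # launches the script with torchrun
)

# Point the training channel at the S3 prefix that holds train_dataset.json and test_dataset.json
huggingface_estimator.fit({"training": "s3://<your-bucket>/<training-data-prefix>"})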

Fine-tune the model using QLoRA

To fine-tune the Mistral model, we apply QLoRA and Parameter-Efficient Fine-Tuning (PEFT) optimization techniques. In the provided notebook, you use the Fully Sharded Data Parallel (FSDP) PyTorch API to perform distributed model tuning. You use supervised fine-tuning (SFT) to fine-tune the Mistral model.

Prepare the dataset

The first step in the fine-tuning process is to prepare and format the dataset. After you transform the dataset into the Mistral Default Instruct format, you upload it as a JSONL file into the S3 bucket used by the SageMaker session, as shown in the following code:

from datasets import load_dataset

# Load the OpenOrca dataset from the Hugging Face Hub
dataset = load_dataset("Open-Orca/OpenOrca")
flan_dataset = dataset.filter(lambda example, indice: "flan" in example["id"], with_indices=True)
flan_dataset = flan_dataset["train"].train_test_split(test_size=0.01, train_size=0.035)

# Drop the original columns and reshape each record into a chat-style conversation
columns_to_remove = list(dataset["train"].features)
flan_dataset = flan_dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# Save the train and test splits to S3 as JSON Lines files
flan_dataset["train"].to_json(f"{training_input_path}/train_dataset.json", orient="records", force_ascii=False)
flan_dataset["test"].to_json(f"{training_input_path}/test_dataset.json", orient="records", force_ascii=False)

You transform the dataset into Mistral Default Instruct format within the SageMaker training job as instructed in the training script (run_fsdp_qlora.py):

    ################
    # Dataset
    ################
    
    train_dataset = load_dataset(
        "json",
        data_files=os.path.join(script_args.dataset_path, "train_dataset.json"),
        split="train",
    )
    test_dataset = load_dataset(
        "json",
        data_files=os.path.join(script_args.dataset_path, "test_dataset.json"),
        split="train",
    )

    ################
    # Model & Tokenizer
    ################

    # Tokenizer        
    tokenizer = AutoTokenizer.from_pretrained(script_args.model_id, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.chat_template = MISTRAL_CHAT_TEMPLATE
    
    # template dataset
    def template_dataset(examples):
        return{"text":  tokenizer.apply_chat_template(examples["messages"], tokenize=False)}
    
    train_dataset = train_dataset.map(template_dataset, remove_columns=["messages"])
    test_dataset = test_dataset.map(template_dataset, remove_columns=["messages"])

Optimize fine-tuning using QLoRA

You optimize your fine-tuning using QLoRA and with the precision provided as input into the training script as SageMaker training job parameters. QLoRA is an efficient fine-tuning approach that reduces memory usage to fine-tune a 65-billion-parameter model on a single 48 GB GPU, preserving the full 16-bit fine-tuning task performance. In this notebook, you use the bitsandbytes library to set up quantization configurations, as shown in the following code:

    # Model    
    torch_dtype = torch.bfloat16 if training_args.bf16 else torch.float32
    quant_storage_dtype = torch.bfloat16

    if script_args.use_qlora:
        print(f"Using QLoRA - {torch_dtype}")
        quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch_dtype,
                bnb_4bit_quant_storage=quant_storage_dtype,
            )
    else:
        quantization_config = None

You use the LoRA config based on the QLoRA paper and Sebastian Raschka’s experiment, as shown in the following code. Two key points to consider from the Raschka experiment are that QLoRA offers 33% memory savings at the cost of a 39% increase in runtime, and that LoRA should be applied to all layers to maximize model performance.

################
# PEFT
################
# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    )

You use SFTTrainer to fine-tune the Mistral model:

    ################
    # Training
    ################
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        dataset_text_field="text",
        eval_dataset=test_dataset,
        peft_config=peft_config,
        max_seq_length=script_args.max_seq_length,
        tokenizer=tokenizer,
        packing=True,
        dataset_kwargs={
            "add_special_tokens": False,  # We template with special tokens
            "append_concat_token": False,  # No need to add additional separator token
        },
    )

At the time of writing, only merged adapters are supported using the Custom Model Import feature for Amazon Bedrock. Let’s look at how to merge the adapter with the base model next.

Merge the adapters

Adapters are new modules added between layers of a pre-trained network. Creation of these new modules is possible by back-propagating gradients through a frozen, 4-bit quantized pre-trained language model into low-rank adapters in the fine-tuning process. To import the Mistral model into Amazon Bedrock, the adapters need to be merged with the base model and saved in Safetensors format. Use the following code to merge the model adapters and save them in Safetensors format:

import torch
from peft import AutoPeftModelForCausalLM

# Load the PEFT (adapter) model in fp16
model = AutoPeftModelForCausalLM.from_pretrained(
    training_args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16
)
# Merge the LoRA adapters into the base model and save it in Safetensors format
model = model.merge_and_unload()
model.save_pretrained(
    sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
)

To import the Mistral model into Amazon Bedrock, the model needs to be in an uncompressed directory within an S3 bucket accessible by the Amazon Bedrock service role used in the import job.

Import the fine-tuned model into Amazon Bedrock

Now that you have fine-tuned the model, you can import the model into Amazon Bedrock. In this section, we demonstrate how to import the model using the Amazon Bedrock console or the SDK.

Import the model using the Amazon Bedrock console

To import the model using the Amazon Bedrock console, see Import a model with Custom Model Import. You use the Import model page as shown in the following screenshot to import the model from the S3 bucket.

After you successfully import the fine-tuned model, you can see the model listed on the Amazon Bedrock console.

Import the model using the SDK

The AWS Boto3 library supports importing custom models into Amazon Bedrock. You can use the following code to import a fine-tuned model from within the notebook into Amazon Bedrock. This is an asynchronous method.

import boto3
import datetime
br_client = boto3.client('bedrock', region_name='<aws-region-name>')
pt_model_nm = "<bedrock-custom-model-name>"
pt_imp_jb_nm = f"{pt_model_nm}-{datetime.datetime.now().strftime('%Y%m%d%M%H%S')}"
role_arn = "<<bedrock_role_with_custom_model_import_policy>>"
pt_model_src = {"s3DataSource": {"s3Uri": f"{pt_pubmed_model_s3_path}"}}
resp = br_client.create_model_import_job(jobName=pt_imp_jb_nm,
                                  importedModelName=pt_model_nm,
                                  roleArn=role_arn,
                                  modelDataSource=pt_model_src)
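Because the call is asynchronous, you can poll the job until it finishes before invoking the model. The following is a minimal sketch that assumes the job ARN returned by create_model_import_job above.

import time

# Poll the asynchronous import job until it completes or fails
while True:
    job = br_client.get_model_import_job(jobIdentifier=resp["jobArn"])
    print(f"Import job status: {job['status']}")
    if job["status"] in ("Completed", "Failed"):
        break
    time.sleep(60)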

Test the imported model

Now that you have imported the fine-tuned model into Amazon Bedrock, you can test the model. In this section, we demonstrate how to test the model using the Amazon Bedrock console or the SDK.

Test the model on the Amazon Bedrock console

You can test the imported model using an Amazon Bedrock playground, as illustrated in the following screenshot.

Test the model using the SDK

You can also use the Amazon Bedrock Invoke Model API to run the fine-tuned imported model, as shown in the following code:

client = boto3.client("bedrock-runtime", region_name="us-west-2")
model_id = "<<replace with the imported bedrock model arn>>"


def call_invoke_model_and_print(native_request):
    request = json.dumps(native_request)

    try:
        # Invoke the model with the request.
        response = client.invoke_model(modelId=model_id, body=request)
        model_response = json.loads(response["body"].read())

        response_text = model_response["outputs"][0]["text"]
        print(response_text)
    except (ClientError, Exception) as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)

prompt = "will there be a season 5 of shadowhunters"
formatted_prompt = f"[INST] {prompt} [/INST]</s>"
native_request = {
"prompt": formatted_prompt,
"max_tokens": 64,
"top_p": 0.9,
"temperature": 0.91
}
call_invoke_model_and_print(native_request)

The custom Mistral model that you imported into Amazon Bedrock supports the temperature, top_p, and max_gen_len parameters when you invoke it for inference. The inference parameters top_k, max_seq_len, max_batch_size, and max_new_tokens are not supported for a custom fine-tuned Mistral model.

Evaluate the imported model

Now that you have imported and tested the model, let's evaluate it using the SageMaker FMEval library. For more details, refer to Evaluate Bedrock Imported Models. FMEval provides out-of-the-box evaluation algorithms, including QA Accuracy, which are detailed in the FMEval documentation. For the question answering task, the QA Accuracy algorithm computes the F1 Score, Exact Match Score, Quasi Exact Match Score, Precision Over Words, and Recall Over Words by comparing the model's predicted answers against the ground truth answers; the key metrics are Exact Match, Quasi-Exact Match, and F1 over words. Because you fine-tuned the Mistral model for question answering, you can use the QA Accuracy algorithm, as shown in the following code.

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Point FMEval at the held-out test dataset and its question/answer fields
config = DataConfig(
    dataset_name="trex_sample",
    dataset_uri="data/test_dataset.json",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer"
)

# Wrap the imported Bedrock model so FMEval can invoke it and parse the response
bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output='outputs[0].text',
    content_template='{"prompt": $prompt, "max_tokens": 500}',
)

eval_algo = QAAccuracy()
eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config,
                                 prompt_template="[INST]$model_input[/INST]", save=True)

You can get the consolidated metrics for the imported model as follows:

# Print the aggregate dataset-level scores for each evaluation
for op in eval_output:
    print(f"Eval Name: {op.eval_name}")
    for score in op.dataset_scores:
        print(f"{score.name} : {score.value}")

Clean up

To delete the imported model from Amazon Bedrock, navigate to the model on the Amazon Bedrock console. On the options menu (three dots), choose Delete.
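If you prefer to clean up programmatically, the Boto3 bedrock client also provides a delete operation for imported models. The following is a minimal sketch; the model ARN is a placeholder:

import boto3

br_client = boto3.client("bedrock", region_name="<aws-region-name>")

# Delete the imported custom model from Amazon Bedrock
br_client.delete_imported_model(modelIdentifier="<imported-model-arn>")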

To delete the SageMaker domain along with the SageMaker JupyterLab space, refer to Delete an Amazon SageMaker domain. You may also want to delete the S3 buckets where the data and model are stored. For instructions, see Deleting a bucket.
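As a convenience, the following is a minimal sketch of emptying and deleting a bucket with Boto3. The bucket name is a placeholder, and versioned buckets also need their object versions removed before deletion:

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("<your-s3-bucket>")  # placeholder

# Delete all objects in the bucket, then the bucket itself
bucket.objects.all().delete()
bucket.delete()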

Conclusion

In this post, we walked through fine-tuning a Mistral model using SageMaker, importing the model into Amazon Bedrock, invoking it using both an Amazon Bedrock playground and Boto3, and evaluating the imported model using the FMEval library. You can use this feature to import base FMs, or FMs fine-tuned on premises, on SageMaker, or on Amazon EC2, into Amazon Bedrock and use them in your generative AI applications without any heavy lifting. Explore the Custom Model Import feature for Amazon Bedrock to deploy your fine-tuned FMs in a secure and scalable manner, and visit our GitHub repository to explore samples prepared for fine-tuning and importing models from various families.


About the Authors

Jay Pillai is a Principal Solutions Architect at Amazon Web Services. In this role, he functions as the Lead Architect, helping partners ideate, build, and launch Partner Solutions. As an Information Technology Leader, Jay specializes in artificial intelligence, generative AI, data integration, business intelligence, and user interface domains. He holds 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Evandro Franco is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 18 years of experience working with technology, from software development, infrastructure, serverless, to machine learning.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, artificial intelligence, machine learning, and system design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Ragha Prasad is a Principal Engineer and a founding member of Amazon Bedrock, where he has had the privilege to listen to customer needs first-hand and understands what it takes to build and launch scalable and secure Gen AI products. Prior to Bedrock, he worked on numerous products in Amazon, ranging from devices to Ads to Robotics.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon SageMaker Training and Processing. In his spare time, Paras enjoys spending time with his family and road biking around the Bay Area.
