Email your conversations from Amazon Q

As organizations navigate the complexities of the digital realm, generative AI has emerged as a transformative force, empowering enterprises to enhance productivity, streamline workflows, and drive innovation. To maximize the value of insights generated by generative AI, it is crucial to provide simple ways for users to preserve and share these insights using commonly used tools such as email.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It is redefining the way businesses approach data-driven decision-making, content generation, and secure task management. By using the custom plugin capability of Amazon Q Business, you can extend its functionality to support sending emails directly from Amazon Q applications, allowing you to store and share the valuable insights gleaned from your conversations with this powerful AI assistant.

Amazon Simple Email Service (Amazon SES) is an email service provider that provides a simple, cost-effective way for you to send and receive email using your own email addresses and domains. Amazon SES offers many email tools, including email sender configuration options, email deliverability tools, flexible email deployment options, sender and identity management, email security, email sending statistics, email reputation dashboard, and inbound email services.

This post explores how you can integrate Amazon Q Business with Amazon SES to email conversations to specified email addresses.

Solution overview

The following diagram illustrates the solution architecture.

architecture diagram

The workflow includes the following steps:

  1. Create an Amazon Q Business application with an Amazon Simple Storage Service (Amazon S3) data source. Amazon Q uses Retrieval Augmented Generation (RAG) to answer user questions.
  2. Configure an AWS IAM Identity Center instance for your Amazon Q Business application environment with users and groups added. Amazon Q Business supports both organization- and account-level IAM Identity Center instances.
  3. Create a custom plugin that uses an OpenAPI schema to invoke an Amazon API Gateway REST API. This API sends emails to the users.
  4. Store OAuth information in AWS Secrets Manager and provide the secret information to the plugin.
  5. Provide AWS Identity and Access Management (IAM) roles to access the secrets in Secrets Manager.
  6. The custom plugin takes the user to an Amazon Cognito sign-in page. The user provides credentials to log in. After authentication, the user session is stored in the Amazon Q Business application for subsequent API calls.
  7. Post-authentication, the custom plugin will pass the token to API Gateway to invoke the API.
  8. You can help secure your API Gateway REST API from common web exploits, such as SQL injection and cross-site scripting (XSS) attacks, using AWS WAF.
  9. An AWS Lambda function hosted in Amazon Virtual Private Cloud (Amazon VPC) calls the Amazon SES SDK (a minimal sketch of such a function follows this list).
  10. Lambda uses IAM permissions to make the SDK call to Amazon SES.
  11. Amazon SES sends an email using SMTP to the verified email addresses provided by the user.
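The following is a minimal sketch of what such a Lambda handler might look like. It is an illustration, not the exact function deployed by the CloudFormation template; the request fields (emailContent, toEmailAddress, fromEmailAddress) match the plugin's OpenAPI schema shown later in this post, and the subject line is an assumption.

import json
import boto3

ses = boto3.client("ses", region_name="us-east-1")


def lambda_handler(event, context):
    # With an API Gateway proxy integration, the JSON payload arrives in event["body"]
    body = json.loads(event.get("body") or "{}")

    ses.send_email(
        Source=body["fromEmailAddress"],  # must be an SES-verified identity
        Destination={"ToAddresses": [body["toEmailAddress"]]},
        Message={
            "Subject": {"Data": "Your Amazon Q Business conversation"},  # assumed subject
            "Body": {"Text": {"Data": body["emailContent"]}},
        },
    )

    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Email sent successfully."}),
    }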

In the following sections, we walk through the steps to deploy and test the solution. This solution is supported only in the us-east-1 AWS Region.

Prerequisites

Complete the following prerequisites:

  1. Have a valid AWS account.
  2. Enable an IAM Identity Center instance and capture the Amazon Resource Name (ARN) of the IAM Identity Center instance from the settings page.
  3. Add users and groups to IAM Identity Center.
  4. Have an IAM role in the account that has sufficient permissions to create the necessary resources. If you have administrator access to the account, no action is necessary.
  5. Enable Amazon CloudWatch Logs for API Gateway. For more information, see How do I turn on CloudWatch Logs to troubleshoot my API Gateway REST API or WebSocket API?
  6. Have two email addresses to send and receive emails that you can verify using the link sent to you. Do not use existing verified identities in Amazon SES for these email addresses. Otherwise, the AWS CloudFormation template will fail.
  7. Have an Amazon Q Business Pro subscription to create Amazon Q apps.
  8. Have the service-linked IAM role AWSServiceRoleForQBusiness. If you don’t have one, create it with the qbusiness.amazonaws.com service name.
  9. Enable AWS CloudTrail logging for operational and risk auditing. For instructions, see Creating a trail for your AWS account.
  10. Enable budget policy notifications to help protect from unwanted billing.

Deploy the solution resources

In this step, we use a CloudFormation template to deploy a Lambda function, configure the REST API, and create identities. Complete the following steps:

  1. Open the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Create stack.
  3. Download the CloudFormation template and upload it in the Specify template section.
  4. Choose Next.

cloud formation upload screen

  5. For Stack name, enter a name (for example, QIntegrationWithSES).
  6. In the Parameters section, provide the following:
    1. For IDCInstanceArn, enter your IAM Identity Center instance ARN.
    2. For LambdaName, enter the name of your Lambda function.
    3. For Fromemailaddress, enter the address to send email from.
    4. For Toemailaddress, enter the address to receive email.
  7. Choose Next.

cloud formation parameter capture screen

  8. Keep the other values as default and select I acknowledge that AWS CloudFormation might create IAM resources in the Capabilities section.
  9. Choose Submit to create the CloudFormation stack.
  10. After the successful deployment of the stack, on the Outputs tab, make a note of the value for apiGatewayInvokeURL. You will need this later to create a custom plugin.

Verification emails will be sent to the Toemailaddress and Fromemailaddress values provided as input to the CloudFormation template.

  11. Verify the newly created email identities using the link in the email (or check the verification status programmatically, as sketched below).
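If you want to confirm the verification status without waiting for the Amazon SES console to refresh, a short boto3 check like the following can help; the email addresses are placeholders for the Fromemailaddress and Toemailaddress values you passed to the template.

import boto3

ses = boto3.client("ses", region_name="us-east-1")

# Placeholders: the Fromemailaddress and Toemailaddress stack parameter values
identities = ["sender@example.com", "recipient@example.com"]

response = ses.get_identity_verification_attributes(Identities=identities)
for identity, attributes in response["VerificationAttributes"].items():
    # The status stays "Pending" until the verification link in the email is used
    print(identity, attributes["VerificationStatus"])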

This post doesn’t cover auto scaling of Lambda functions. For more information about how to integrate Lambda with Application Auto Scaling, see AWS Lambda and Application Auto Scaling.

To configure AWS WAF on API Gateway, refer to Use AWS WAF to protect your REST APIs in API Gateway.

This is sample code, for non-production usage. You should work with your security and legal teams to meet your organizational security, regulatory, and compliance requirements before deployment.

Create Amazon Cognito users

This solution uses Amazon Cognito to authorize users to make a call to API Gateway. The CloudFormation template creates a new Amazon Cognito user pool.

Complete the following steps to create a user in the newly created user pool and capture information about the user pool:

  1. On the AWS CloudFormation console, navigate to the stack you created.
  2. On the Resources tab, choose the link next to the physical ID for CognitoUserPool.

cloudformation resource tab

  3. On the Amazon Cognito console, choose Users under User management in the navigation pane.
  4. Choose Create user.
  5. Enter an email address and password of your choice, then choose Create user.

adding user to IDC screen

  6. In the navigation pane, choose App clients under Applications.
  7. Capture the client ID and client secret. You will need these later during custom plugin development.
  8. On the Login pages tab, copy the values for Allowed callback URLs. You will need these later during custom plugin development.
  9. In the navigation pane, choose Branding.
  10. Capture the Amazon Cognito domain. You will need this information to update the OpenAPI specification; the sketch after this list shows how these values fit into the OAuth flow.
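Amazon Q Business performs the OAuth exchange for you once the plugin is configured. The following sketch only illustrates how the captured values relate to the authorization code flow defined in the plugin's OpenAPI schema; the domain, client ID, client secret, callback URL, and authorization code shown are placeholders, and the requests library is assumed to be available.

import base64
import requests

# Placeholders for the values captured from the Amazon Cognito console
cognito_domain = "https://your-domain.auth.us-east-1.amazoncognito.com"
client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
redirect_uri = "https://your-allowed-callback-url"
authorization_code = "CODE_RETURNED_AFTER_SIGN_IN"

# The token endpoint authenticates the app client with HTTP basic auth
basic_auth = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()

response = requests.post(
    f"{cognito_domain}/oauth2/token",
    headers={
        "Authorization": f"Basic {basic_auth}",
        "Content-Type": "application/x-www-form-urlencoded",
    },
    data={
        "grant_type": "authorization_code",
        "code": authorization_code,
        "redirect_uri": redirect_uri,
    },
    timeout=10,
)
response.raise_for_status()
tokens = response.json()  # contains access_token, id_token, and refresh_token
print("Access token:", tokens["access_token"][:20], "...")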

Upload documents to Amazon S3

This solution uses the fully managed Amazon S3 data source to seamlessly power a RAG workflow, eliminating the need for custom integration and data flow management.

For this post, we use sample articles to upload to Amazon S3. Complete the following steps:

  1. On the AWS CloudFormation console, navigate to the stack you created.
  2. On the Resources tab, choose the link for the physical ID of AmazonQDataSourceBucket.

cloud formation resource tab filtered by Qdatasource bucket

  3. Upload the sample articles file to the S3 bucket. For instructions, see Uploading objects, or use the short boto3 sketch that follows.
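If you prefer to script this step, a boto3 upload such as the following also works; the bucket name and file name are placeholders for the AmazonQDataSourceBucket physical ID and your local sample articles file.

import boto3

s3 = boto3.client("s3")

# Placeholders: the AmazonQDataSourceBucket physical ID and your local file
bucket_name = "amazon-q-data-source-bucket-placeholder"
local_file = "sample-articles.pdf"

# The Amazon Q Business S3 data source will index this object on the next sync
s3.upload_file(local_file, bucket_name, local_file)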

Add users to the Amazon Q Business application

Complete the following steps to add users to the newly created Amazon Q Business application:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Choose the application you created using the CloudFormation template.
  3. Under User access, choose Manage user access.

Amazon Q manage users screen

  4. On the Manage access and subscriptions page, choose Add groups and users.

add users and groups screen

  5. Select Assign existing users and groups, then choose Next.
  6. Search for your IAM Identity Center user group.

  7. Choose the group and choose Assign to add the group and its users.
  8. Make sure that the current subscription is Q Business Pro.
  9. Choose Confirm.

confirm subscription screen

Sync Amazon Q data sources

To sync the data source, complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. Choose Data Sources under Enhancements in the navigation pane.
  3. From the Data sources list, select the data source you created through the CloudFormation template.
  4. Choose Sync now to sync the data source.

sync data source

It takes some time to sync with the data source. Wait until the sync status is Completed.

sync completed

Create an Amazon Q custom plugin

In this section, you create the Amazon Q custom plugin for sending emails. Complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. Under Enhancements in the navigation pane, choose Plugins.
  3. Choose Add plugin.

add custom plugin screen

  4. Choose Create custom plugin.
  5. For Plugin name, enter a name (for example, email-plugin).
  6. For Description, enter a description.
  7. Select Define with in-line OpenAPI schema editor.

You can also upload API schemas to Amazon S3 by choosing Select from S3. This is the recommended approach for production use cases.

Your API schema must have an API description, structure, and parameters for your custom plugin.

  8. Select JSON for the schema format.
  9. Enter the following schema, providing your API Gateway invoke URL and Amazon Cognito domain URL:
{
    "openapi": "3.0.0",
    "info": {
        "title": "Send Email API",
        "description": "API to send email from SES",
        "version": "1.0.0"
    },
    "servers": [
        {
            "url": "< API Gateway Invoke URL >"
        }
    ],
    "paths": {
        "/": {
            "post": {
                "summary": "send email to the user and returns the success message",
                "description": "send email to the user and returns the success message",
                "security": [
                    {
                        "OAuth2": [
                            "email/read"
                        ]
                    }
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/sendEmailRequest"
                            }
                        }
                    }
                },
                "responses": {
                    "200": {
                        "description": "Successful response",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/sendEmailResponse"
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "sendEmailRequest": {
                "type": "object",
                "required": [
                                "emailContent",
                                "toEmailAddress",
                                "fromEmailAddress"

                ],
                "properties": {
                    "emailContent": {
                        "type": "string",
                        "description": "Body of the email."
                    },
                    "toEmailAddress": {
                      "type": "string",
                      "description": "To email address."
                    },
                    "fromEmailAddress": {
                          "type": "string",
                          "description": "To email address."
                    }
                }
            },
            "sendEmailResponse": {
                "type": "object",
                "properties": {
                    "message": {
                        "type": "string",
                        "description": "Success or failure message."
                    }
                }
            }
        },
        "securitySchemes": {
            "OAuth2": {
                "type": "oauth2",
                "description": "OAuth2 client credentials flow.",
                "flows": {
                    "authorizationCode": {
                        "authorizationUrl": "<Cognito Domain>/oauth2/authorize",
                        "tokenUrl": "<Cognito Domain>/oauth2/token",
                        "scopes": {
                            "email/read": "read the email"    
                        }
                    }
                }      
            }
        }
    }
}    

custom plugin screen

  10. Under Authentication, select Authentication required.
  11. For AWS Secrets Manager secret, choose Create and add new secret.

adding authorization

  12. In the Create an AWS Secrets Manager secret pop-up, enter the following values captured earlier from Amazon Cognito:
    1. Client ID
    2. Client secret
    3. OAuth callback URL

  13. For Choose a method to authorize Amazon Q Business, leave the default selection as Create and use a new service role.
  14. Choose Add plugin to add your plugin.

Wait for the plugin to be created and the build status to show as Ready.

The maximum size of an OpenAPI schema in JSON or YAML is 1 MB.

To maximize accuracy with the Amazon Q Business custom plugin, follow the best practices for configuring OpenAPI schema definitions for custom plugins.

Test the solution

To test the solution, complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. In the Web experience settings section, find the deployed URL.
  3. Open the web experience deployed URL.
  4. Use the credentials of the user created earlier in IAM Identity Center to log in to the web experience.

amazon q web experience login page

  5. Choose the desired multi-factor authentication (MFA) device to register. For more information, see Register an MFA device for users.
  6. After you log in to the web portal, choose the appropriate application to open the chat interface.

Amazon Q portal

  7. In the Amazon Q portal, enter “summarize attendance and leave policy of the company.”

Amazon Q Business provides answers to your questions from the uploaded documents.

Summarize question

You can now email this conversation using the custom plugin built earlier.

  8. On the options menu (three vertical dots), choose Use a Plugin to see the email-plugin created earlier.

  9. Choose email-plugin and enter “Email the summary of this conversation.”
  10. Amazon Q will ask you to provide the email address to send the conversation to. Provide the verified identity configured as part of the CloudFormation template.

email parameter capture

  11. After you enter your email address, the authorization page appears. Enter your Amazon Cognito user email ID and password to authenticate and choose Sign in.

This step verifies that you’re an authorized user.

The email will be sent to the specified inbox.
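If you want to verify the backend independently of the chat interface, you can call the API Gateway endpoint directly with a token obtained from your Amazon Cognito user pool (for example, using the token exchange sketched earlier) and the requests library. The following is a sketch only: the invoke URL is a placeholder for the apiGatewayInvokeURL stack output, the request fields come from the plugin's OpenAPI schema, and depending on how the Cognito authorizer is configured you might need to send the ID token instead of the access token.

import requests

api_invoke_url = "https://your-api-id.execute-api.us-east-1.amazonaws.com/prod/"  # placeholder
access_token = "TOKEN_FROM_COGNITO"  # placeholder

payload = {
    "emailContent": "Summary of the attendance and leave policy conversation.",
    "toEmailAddress": "recipient@example.com",   # verified SES identity
    "fromEmailAddress": "sender@example.com",    # verified SES identity
}

response = requests.post(
    api_invoke_url,
    json=payload,
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=10,
)
print(response.status_code, response.json())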

You can further personalize the emails by using email templates.

Securing the solution

Security is a shared responsibility between you and AWS, described as security of the cloud versus security in the cloud. Keep in mind the following best practices:

  • To build a secure email application, we recommend you follow best practices for Security, Identity & Compliance to help protect sensitive information and maintain user trust.
  • For access control, we recommend that you protect AWS account credentials and set up individual users with IAM Identity Center or IAM.
  • You can store customer data securely and encrypt sensitive information at rest using AWS managed keys or customer managed keys.
  • You can implement logging and monitoring systems to detect and respond to suspicious activities promptly.
  • Amazon Q Business can be configured to help meet your security and compliance objectives.
  • You can maintain compliance with relevant data protection regulations, such as GDPR or CCPA, by implementing proper data handling and retention policies.
  • You can implement guardrails to define global controls and topic-level controls for your application environment.
  • You can enable AWS Shield on your network to help prevent DDoS attacks.
  • You should follow best practices of Amazon Q access control list (ACL) crawling to help protect your business data. For more details, see Enable or disable ACL crawling safely in Amazon Q Business.
  • We recommend using the aws:SourceArn and aws:SourceAccount global condition context keys in resource policies to limit the permissions that Amazon Q Business gives another service to the resource. For more information, refer to Cross-service confused deputy prevention.

By combining these security measures, you can create a robust and trustworthy application that protects both your business and your customers’ information.

Clean up

To avoid incurring future charges, delete the resources that you created and clean up your account. Complete the following steps:

  1. Empty the contents of the S3 bucket that was created as part of the CloudFormation stack.
  2. Delete the Lambda function UpdateKMSKeyPolicyFunction that was created as a part of the CloudFormation stack.
  3. Delete the CloudFormation stack.
  4. Delete the identities in Amazon SES.
  5. Delete the Amazon Q Business application.

Conclusion

The integration of Amazon Q Business, a state-of-the-art generative AI-powered assistant, with Amazon SES, a robust email service provider, unlocks new possibilities for businesses to harness the power of generative AI. By seamlessly connecting these technologies, organizations can not only gain productive insights from their business data, but also email them to their inbox.

Ready to supercharge your team’s productivity? Empower your employees with Amazon Q Business today! Unlock the potential of custom plugins and seamless email integration. Don’t let valuable conversations slip away—you can capture and share insights effortlessly. Additionally, explore our library of built-in plugins.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the AWS Generative AI Innovation Center.


About the Authors

Sujatha Dantuluri is a seasoned Senior Solutions Architect in the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences. She has contributed to IEEE standards and is passionate about empowering others through her engaging presentations and thought-provoking ideas.

NagaBharathi Challa is a solutions architect supporting Department of Defense team at AWS. She works closely with customers to effectively use AWS services for their mission use cases, providing architectural best practices and guidance on a wide range of services. Outside of work, she enjoys spending time with family and spreading the power of meditation.

Pranit Raje is a Solutions Architect in the AWS India team. He works with ISVs in India to help them innovate on AWS. He specializes in DevOps, operational excellence, infrastructure as code, and automation using DevSecOps practices. Outside of work, he enjoys going on long drives with his beloved family, spending time with them, and watching movies.

Dr Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.

Read More

Unlock cost-effective AI inference using Amazon Bedrock serverless capabilities with an Amazon SageMaker trained model

In this post, I’ll show you how to use Amazon Bedrock—with its fully managed, on-demand API—with your Amazon SageMaker trained or fine-tuned model.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Previously, if you wanted to use your own custom fine-tuned models in Amazon Bedrock, you either had to self-manage your inference infrastructure in SageMaker or train the models directly within Amazon Bedrock, which requires costly provisioned throughput.

With Amazon Bedrock Custom Model Import, you can use new or existing models that have been trained or fine-tuned within SageMaker using Amazon SageMaker JumpStart. You can import the supported architectures into Amazon Bedrock, allowing you to access them on demand through the Amazon Bedrock fully managed invoke model API.

Solution overview

At the time of writing, Amazon Bedrock supports importing custom models from the following architectures:

  • Mistral
  • Flan
  • Meta Llama 2 and Llama 3

For this post, we use a Hugging Face Flan-T5 Base model.

In the following sections, I walk you through the steps to train a model in SageMaker JumpStart and import it into Amazon Bedrock. Then you can interact with your custom model through the Amazon Bedrock playgrounds.

Prerequisites

Before you begin, verify that you have an AWS account with Amazon SageMaker Studio and Amazon Bedrock access.

If you don’t already have an instance of SageMaker Studio, see Launch Amazon SageMaker Studio for instructions to create one.

Train a model in SageMaker JumpStart

Complete the following steps to train a Flan model in SageMaker JumpStart:

  1. Open the AWS Management Console and go to SageMaker Studio.

Amazon SageMaker Console

  2. In SageMaker Studio, choose JumpStart in the navigation pane.

With SageMaker JumpStart, machine learning (ML) practitioners can choose from a broad selection of publicly available FMs using pre-built machine learning solutions that can be deployed in a few clicks.

  3. Search for and choose the Hugging Face Flan-T5 Base model.

Amazon SageMaker JumpStart Page

On the model details page, you can review a short description of the model, how to deploy it, how to fine-tune it, and what format your training data needs to be in to customize the model.

  4. Choose Train to begin fine-tuning the model on your training data.

Flan-T5 Base Model Card

Create the training job using the default settings. The defaults populate the training job with recommended settings.

  5. The example in this post uses a prepopulated example dataset. When using your own data, enter its location in the Data section, making sure it meets the format requirements.

Fine-tune model page

  6. Configure the security settings such as AWS Identity and Access Management (IAM) role, virtual private cloud (VPC), and encryption.
  7. Note the value for Output artifact location (S3 URI) to use later.
  8. Submit the job to start training.

You can monitor your job by selecting Training on the Jobs dropdown menu. When the training job status shows as Completed, the job has finished. With default settings, training takes about 10 minutes.

Training Jobs
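If you prefer to poll the job from code rather than the console, a describe call such as the following works; the training job name is a placeholder for the name SageMaker JumpStart assigned to your job, and the artifact location it prints is the same S3 URI noted earlier.

import boto3

sm = boto3.client("sagemaker")

# Placeholder: the training job name shown on the Jobs page in SageMaker Studio
job_name = "jumpstart-example-flan-t5-base-training-job"

response = sm.describe_training_job(TrainingJobName=job_name)
print(response["TrainingJobStatus"])                   # InProgress | Completed | Failed | Stopped
print(response["ModelArtifacts"]["S3ModelArtifacts"])  # output artifact location (S3 URI)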

Import the model into Amazon Bedrock

After the model has completed training, you can import it into Amazon Bedrock. Complete the following steps:

  1. On the Amazon Bedrock console, choose Imported models under Foundation models in the navigation pane.
  2. Choose Import model.

Amazon Bedrock - Custom Model Import

  3. For Model name, enter a recognizable name for your model.
  4. Under Model import settings, select Amazon SageMaker model and select the radio button next to your model.

Importing a model from Amazon SageMaker

  5. Under Service access, select Create and use a new service role and enter a name for the role.
  6. Choose Import model.

Creating a new service role

  7. Wait for the model import to complete, which takes about 15 minutes.

Successful model import

  8. Under Playgrounds in the navigation pane, choose Text.
  9. Choose Select model.

Using the model in Amazon Bedrock text playground

  10. For Category, choose Imported models.
  11. For Model, choose flan-t5-fine-tuned.
  12. For Throughput, choose On-demand.
  13. Choose Apply.

Selecting the fine-tuned model for use

You can now interact with your custom model. In the following screenshot, we use our example custom model to summarize a description about Amazon Bedrock.

Using the fine-tuned model
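Outside of the playground, the imported model can also be called programmatically through the Amazon Bedrock Runtime InvokeModel API. The sketch below is illustrative only: the model ARN is a placeholder for your imported model, and the request body is an assumption, because imported models expect the native input format of their architecture; check the expected payload for your model before relying on this shape.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder: the ARN shown for your model on the Imported models page
model_arn = "arn:aws:bedrock:us-east-1:123456789012:imported-model/abcdef123456"

# Assumed payload shape; adjust to the native request format of your model architecture
body = json.dumps({"prompt": "Summarize: Amazon Bedrock is a fully managed service that ..."})

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=body,
    contentType="application/json",
    accept="application/json",
)
print(json.loads(response["body"].read()))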

Clean up

Complete the following steps to clean up your resources:

  1. If you’re not going to continue using SageMaker, delete your SageMaker domain.
  2. If you no longer want to maintain your model artifacts, delete the Amazon Simple Storage Service (Amazon S3) bucket where your model artifacts are stored.
  3. To delete your imported model from Amazon Bedrock, on the Imported models page on the Amazon Bedrock console, select your model, and then choose the options menu (three dots) and select Delete.

Clean-Up

Conclusion

In this post, we explored how the Custom Model Import feature in Amazon Bedrock enables you to use your own custom trained or fine-tuned models for on-demand, cost-efficient inference. By integrating SageMaker model training capabilities with the fully managed, scalable infrastructure of Amazon Bedrock, you now have a seamless way to deploy your specialized models and make them accessible through a simple API.

Whether you prefer the user-friendly SageMaker Studio console or the flexibility of SageMaker notebooks, you can train and import your models into Amazon Bedrock. This allows you to focus on developing innovative applications and solutions, without the burden of managing complex ML infrastructure.

As the capabilities of large language models continue to evolve, the ability to integrate custom models into your applications becomes increasingly valuable. With the Amazon Bedrock Custom Model Import feature, you can now unlock the full potential of your specialized models and deliver tailored experiences to your customers, all while benefiting from the scalability and cost-efficiency of a fully managed service.

To dive deeper into fine-tuning on SageMaker, see Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart. To get more hands-on experience with Amazon Bedrock, check out our Building with Amazon Bedrock workshop.


About the Author

Joseph Sadler is a Senior Solutions Architect on the Worldwide Public Sector team at AWS, specializing in cybersecurity and machine learning. With public and private sector experience, he has expertise in cloud security, artificial intelligence, threat detection, and incident response. His diverse background helps him architect robust, secure solutions that use cutting-edge technologies to safeguard mission-critical systems.

Read More

Align and monitor your Amazon Bedrock powered insurance assistance chatbot to responsible AI principles with AWS Audit Manager

Generative AI applications are gaining widespread adoption across various industries, including regulated industries such as financial services and healthcare. As these advanced systems accelerate in playing a critical role in decision-making processes and customer interactions, customers should work towards ensuring the reliability, fairness, and compliance of generative AI applications with industry regulations. To address this need, AWS generative AI best practices framework was launched within AWS Audit Manager, enabling auditing and monitoring of generative AI applications. This framework provides step-by-step guidance on approaching generative AI risk assessment, collecting and monitoring evidence from Amazon Bedrock and Amazon SageMaker environments to assess your risk posture, and preparing to meet future compliance requirements.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents can be used to configure specialized agents that run actions seamlessly based on user input and your organization’s data. These managed agents play conductor, orchestrating interactions between FMs, API integrations, user conversations, and knowledge bases loaded with your data.

Insurance claim lifecycle processes typically involve several manual tasks that are painstakingly managed by human agents. An Amazon Bedrock-powered insurance agent can assist human agents and improve existing workflows by automating repetitive actions as demonstrated in the example in this post, which can create new claims, send pending document reminders for open claims, gather claims evidence, and search for information across existing claims and customer knowledge repositories.

Generative AI applications should be developed with adequate controls for steering the behavior of FMs. Responsible AI considerations such as privacy, security, safety, controllability, fairness, explainability, transparency and governance help ensure that AI systems are trustworthy. In this post, we demonstrate how to use the AWS generative AI best practices framework on AWS Audit Manager to evaluate this insurance claim agent from a responsible AI lens.

Use case

In this example of an insurance assistance chatbot, the customer’s generative AI application is designed with Amazon Bedrock Agents to automate tasks related to the processing of insurance claims and Amazon Bedrock Knowledge Bases to provide relevant documents. This allows users to directly interact with the chatbot when creating new claims and receiving assistance in an automated and scalable manner.

User interacts with Amazon Bedrock Agents, which in turn retrieves context from the Amazon Bedrock Knowledge Base or can make various API calls for defined functions

The user can interact with the chatbot using natural language queries to create a new claim, retrieve an open claim using a specific claim ID, receive a reminder for documents that are pending, and gather evidence about specific claims.

The agent then interprets the user’s request and determines if actions need to be invoked or information needs to be retrieved from a knowledge base. If the user request invokes an action, action groups configured for the agent will invoke different API calls, which produce results that are summarized as the response to the user. Figure 1 depicts the system’s functionalities and AWS services. The code sample for this use case is available in GitHub and can be expanded to add new functionality to the insurance claims chatbot.

How to create your own assessment of the AWS generative AI best practices framework

  1. To create an assessment using the generative AI best practices framework on Audit Manager, go to the AWS Management Console and navigate to AWS Audit Manager.
  2. Choose Create assessment.

Choose Create Assessment on the AWS Audit Manager dashboard

  3. Specify the assessment details, such as the name and an Amazon Simple Storage Service (Amazon S3) bucket to save assessment reports to. Select AWS Generative AI Best Practices Framework for assessment.

Specify assessment details and choose the AWS Generative AI Best Practices Framework v2

  4. Select the AWS accounts in scope for assessment. If you’re using AWS Organizations and you have enabled it in Audit Manager, you will be able to select multiple accounts at once in this step. One of the key features of AWS Organizations is the ability to perform various operations across multiple AWS accounts simultaneously.

Add the AWS accounts in scope for the assessment

  5. Next, select the audit owners to manage the preparation for your organization. When it comes to auditing activities within AWS accounts, it’s considered a best practice to create a dedicated role specifically for auditors or auditing purposes. This role should be assigned only the permissions required to perform auditing tasks, such as reading logs, accessing relevant resources, or running compliance checks.

Specify audit owners

  6. Finally, review the details and choose Create assessment.

Review and create assessment

Principles of AWS generative AI best practices framework

Generative AI implementations can be evaluated based on eight principles in the AWS generative AI best practices framework. For each, we will define the principle and explain how Audit Manager conducts an evaluation.

Accuracy

A core principle of trustworthy AI systems is accuracy of the application and/or model. Measures of accuracy should consider computational measures and human-AI teaming. It is also important that AI systems are well tested in production and should demonstrate adequate performance in the production setting. Accuracy measurements should always be paired with clearly defined and realistic test sets that are representative of conditions of expected use.

For the use case of an insurance claims chatbot built with Amazon Bedrock Agents, you will use the large language model (LLM) Claude Instant from Anthropic, which you won’t need to further pre-train or fine-tune. Hence, it is relevant for this use case to demonstrate the performance of the chatbot through performance metrics for its tasks, using the following:

  • A prompt benchmark
  • Source verification of documents ingested in knowledge bases or databases that the agent has access to
  • Integrity checks of the connected datasets as well as the agent
  • Error analysis to detect the edge cases where the application is erroneous
  • Schema compatibility of the APIs
  • Human-in-the-loop validation.

To measure the efficacy of the assistance chatbot, you will use promptfoo—a command line interface (CLI) and library for evaluating LLM apps. This involves three steps:

  1. Create a test dataset containing prompts with which you test the different features.
  2. Invoke the insurance claims assistant on these prompts and collect the responses. Additionally, the traces of these responses are also helpful in debugging unexpected behavior.
  3. Set up evaluation metrics that can be derived in an automated manner or using human evaluation to measure the quality of the assistant.

In the example of an insurance assistance chatbot, designed with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases, there are four tasks:

  • getAllOpenClaims: Gets the list of all open insurance claims. Returns all claim IDs that are open.
  • getOutstandingPaperwork: Gets the list of pending documents that need to be uploaded by the policy holder before the claim can be processed. The API takes in only one claim ID and returns the list of documents that are pending to be uploaded. This API should be called for each claim ID.
  • getClaimDetail: Gets all details about a specific claim given a claim ID.
  • sendReminder: Send a reminder to the policy holder about pending documents for the open claim. The API takes in only one claim ID and its pending documents at a time, sends the reminder, and returns the tracking details for the reminder. This API should be called for each claim ID you want to send reminders for.

For each of these tasks, you will create sample prompts to create a synthetic test dataset. The idea is to generate sample prompts with expected outcomes for each task. For the purposes of demonstrating the ideas in this post, you will create only a few samples in the synthetic test dataset. In practice, the test dataset should reflect the complexity of the task and possible failure modes for which you would want to test the application. Here are the sample prompts that you will use for each task:

  • getAllOpenClaims
    • What are the open claims?
    • List open claims.
  • getOutstandingPaperwork
    • What are the missing documents from {{claim}}?
    • What is missing from {{claim}}?
  • getClaimDetail
    • Explain the details to {{claim}}
    • What are the details of {{claim}}
  • sendReminder
    • Send reminder to {{claim}}
    • Send reminder to {{claim}}. Include the missing documents and their requirements.
  • Also include sample prompts for a set of unwanted results to make sure that the agent only performs the tasks that are predefined and doesn’t provide out of context or restricted information.
    • List all claims, including closed claims
    • What is 2+2?

Set up

You can start with the example of an insurance claims agent by cloning the use case of Amazon Bedrock-powered insurance agent. After you create the agent, set up promptfoo. Now, you will need to create a custom script that can be used for testing. This script should be able to invoke your application for a prompt from the synthetic test dataset. We created a Python script, invoke_bedrock_agent.py, with which we invoke the agent for a given prompt.

python invoke_bedrock_agent.py "What are the open claims?"
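The script itself is not reproduced in this post. A minimal sketch of what it might look like is shown below, using the Bedrock Agents Runtime InvokeAgent API; the agent ID and alias ID are placeholders for the insurance claims agent you deployed.

import sys
import uuid
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholders: the agent ID and alias ID of your insurance claims agent
AGENT_ID = "YOUR_AGENT_ID"
AGENT_ALIAS_ID = "YOUR_AGENT_ALIAS_ID"


def invoke_agent(prompt: str) -> str:
    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=str(uuid.uuid4()),  # a fresh session per test prompt
        inputText=prompt,
    )
    # The response is an event stream; concatenate the returned text chunks
    completion = ""
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            completion += chunk["bytes"].decode("utf-8")
    return completion


if __name__ == "__main__":
    print(invoke_agent(sys.argv[1]))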

Step 1: Save your prompts

Create a text file of the sample prompts to be tested. As seen in the following, a claim can be a parameter that is inserted into the prompt during testing.

%%writefile prompts_getClaimDetail.txt
Explain the details to {{claim}}.
---
What are the details of {{claim}}.

Step 2: Create your prompt configuration with tests

For prompt testing, we defined test prompts per task. The YAML configuration file uses a format that defines test cases and assertions for validating prompts. Each prompt is processed through a series of sample inputs defined in the test cases. Assertions check whether the prompt responses meet the specified requirements. In this example, you use the prompts for task getClaimDetail and define the rules. There are different types of tests that can be used in promptfoo. This example uses keywords and similarity to assess the contents of the output. Keywords are checked using a list of values that are present in the output. Similarity is checked through the embedding of the FM’s output to determine if it’s semantically similar to the expected value.

%%writefile promptfooconfig.yaml
prompts: [prompts_getClaimDetail.txt] # text file that has the prompts
providers: ['bedrock_agent_as_provider.js'] # custom provider setting
defaultTest:
  options:
    provider:
      embedding:
        id: huggingface:sentence-similarity:sentence-transformers/all-MiniLM-L6-v2
tests:
  - description: 'Test via keywords'
    vars:
      claim: claim-008 # a claim that is open
    assert:
      - type: contains-any
        value:
          - 'claim'
          - 'open'
  - description: 'Test via similarity score'
    vars: 
      claim: claim-008 # a claim that is open
    assert:
      - type: similar
        value: 'Providing the details for claim with id xxx: it is created on xx-xx-xxxx, last activity date on xx-xx-xxxx, status is x, the policy type is x.'
        threshold: 0.6

Step 3: Run the tests

Run the following commands to test the prompts against the set rules.

npx promptfoo@latest eval -c promptfooconfig.yaml
npx promptfoo@latest share

The promptfoo library generates a user interface where you can view the exact set of rules and the outcomes. The user interface for the tests that were run using the test prompts is shown in the following figure.

Prompfoo user interface for the tests that were run using the test prompts

For each test, you can view the details: the prompt, the output, the test that was performed, and the reason for the result. You see the prompt test result for getClaimDetail in the following figure, using the similarity score against the expected result, given as a sentence.

promptfoo user interface showing prompt test result for getClaimDetail

Similarly, using the similarity score against the expected result, you get the test result for getOpenClaims as shown in the following figure.

Promptfoo user interface showing test result for getOpenClaims

Step 4: Save the output

For the final step, you want to attach evidence for both the FM as well as the application as a whole to the control ACCUAI 3.1: Model Evaluation Metrics. To do so, save the output of your prompt testing into an S3 bucket. In addition, the performance metrics of the FM can be found in the model card, which is also first saved to an S3 bucket. Within Audit Manager, navigate to the corresponding control, ACCUAI 3.1: Model Evaluation Metrics, select Add manual evidence and Import file from S3 to provide both model performance metrics and application performance as shown in the following figure.

Add manual evidence and Import file from S3 to provide both model performance metrics and application performance
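Attaching manual evidence can also be scripted with the Audit Manager BatchImportEvidenceToAssessmentControl API, as in the following sketch; the assessment ID, control set ID, control ID, and S3 path are placeholders that you would look up in your own assessment (for example, for the control ACCUAI 3.1).

import boto3

auditmanager = boto3.client("auditmanager", region_name="us-east-1")

# Placeholders: look these values up in your assessment and evidence bucket
response = auditmanager.batch_import_evidence_to_assessment_control(
    assessmentId="11111111-2222-3333-4444-555555555555",
    controlSetId="YOUR_CONTROL_SET_ID",
    controlId="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    manualEvidence=[
        {"s3ResourcePath": "s3://your-evidence-bucket/promptfoo-results.json"},
    ],
)
print(response.get("errors", []))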

In this section, we showed you how to test a chatbot and attach the relevant evidence. In the insurance claims chatbot, we did not customize the FM and thus the other controls—including ACCUAI 3.2: Regular Retraining for Accuracy, ACCUAI 3.11: Null Values, ACCUAI 3.12: Noise and Outliers, and ACCUAI 3.15: Update Frequency—are not applicable. Hence, we will not include these controls in the assessment performed for the use case of an insurance claims assistant.

We showed you how to test a RAG-based chatbot for controls using a synthetic test benchmark of prompts and add the results to the evaluation control. Based on your application, one or more controls in this section might apply and be relevant to demonstrate the trustworthiness of your application.

Fair

Fairness in AI includes concerns for equality and equity by addressing issues such as harmful bias and discrimination.

Fairness of the insurance claims assistant can be tested through the model responses when user-specific information is presented to the chatbot. For this application, it’s desirable to see no deviations in the behavior of the application when the chatbot is exposed to user-specific characteristics. To test this, you can create prompts containing user characteristics and then test the application using a process similar to the one described in the previous section. This evaluation can then be added as evidence to the control for FAIRAI 3.1: Bias Assessment.

An important element of fairness is having diversity in the teams that develop and test the application. This helps ensure that different perspectives are addressed in the AI development and deployment lifecycle so that the final behavior of the application addresses the needs of diverse users. The details of the team structure can be added as manual evidence for the control FAIRAI 3.5: Diverse Teams. Organizations might also already have ethics committees that review AI applications. The structure of the ethics committee and the assessment of the application can be included as manual evidence for the control FAIRAI 3.6: Ethics Committees.

Moreover, the organization can also improve fairness by incorporating features to improve accessibility of the chatbot for individuals with disabilities. By using Amazon Transcribe to stream transcription of user speech to text and Amazon Polly to play back speech audio to the user, voice can be used with an application built with Amazon Bedrock as detailed in Amazon Bedrock voice conversation architecture.

Privacy

NIST defines privacy as the norms and practices that help to safeguard human autonomy, identity, and dignity. Privacy values such as anonymity, confidentiality, and control should guide choices for AI system design, development, and deployment. The insurance claims assistant example doesn’t include any knowledge bases or connections to databases that contain customer data. If it did, additional access controls and authentication mechanisms would be required to make sure that customers can only access data they are authorized to retrieve.

Additionally, to discourage users from providing personally identifiable information (PII) in their interactions with the chatbot, you can use Amazon Bedrock Guardrails. By using the PII filter and adding the guardrail to the agent, PII entities in user queries or model responses will be redacted and pre-configured messaging will be provided instead. After guardrails are implemented, you can test them by invoking the chatbot with prompts that contain dummy PII. These model invocations are logged in Amazon CloudWatch; the logs can then be appended as automated evidence for privacy-related controls including PRIAI 3.10: Personal Identifier Anonymization or Pseudonymization and PRIAI 3.9: PII Anonymization.
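Guardrails can be created in the Amazon Bedrock console or programmatically. The following is a minimal sketch using the CreateGuardrail API; the guardrail name, messaging, and the specific PII entities and actions are illustrative assumptions, not the exact configuration used in this post.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_guardrail(
    name="insurance-chatbot-pii-guardrail",  # illustrative name
    description="Masks or blocks PII in user prompts and model responses",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
        ]
    },
    blockedInputMessaging="Sorry, this request contains sensitive information and cannot be processed.",
    blockedOutputsMessaging="Sorry, the response was blocked because it contained sensitive information.",
)
print(response["guardrailId"], response["version"])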

In the following figure, a guardrail was created to filter PII and unsupported topics. The user can test and view the trace of the guardrail within the Amazon Bedrock console using natural language. For this use case, the user asked a question whose answer would require the FM to provide PII. The trace shows that sensitive information has been blocked because the guardrail detected PII in the prompt.

Guardrail trace showing that the request was blocked because PII was detected in the prompt

As a next step, under the Guardrail details section of the agent builder, the user adds the PII guardrail, as shown in the figure below.

Guardrail details section of the agent builder with the PII guardrail added

Amazon Bedrock is integrated with CloudWatch, which allows you to track usage metrics for audit purposes. As described in Monitoring generative AI applications using Amazon Bedrock and Amazon CloudWatch integration, you can enable model invocation logging. Using CloudWatch Logs Insights, you can then query the model invocations. The logs provide detailed information about each model invocation, including the input prompt, the generated output, and any intermediate steps or reasoning. You can use these logs to demonstrate transparency and accountability.

Model invocation logging can be used to collect invocation logs, including full request data, response data, and metadata for all calls performed in your account. This can be enabled by following the steps described in Monitor model invocation using CloudWatch Logs.

You can then export the relevant CloudWatch logs from Log Insights for this model invocation as evidence for relevant controls. You can filter for bedrock-logs and choose to download them as a table, as shown in the figure below, so the results can be uploaded as manual evidence for AWS Audit Manager.

filter for bedrock-logs and choose to download them
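The same query can be run programmatically with CloudWatch Logs Insights, which makes it easier to save the results as evidence. In the following sketch, the log group name is a placeholder for the group you configured for model invocation logging, and the query string is a simple example.

import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Placeholder: the log group configured for Amazon Bedrock model invocation logging
log_group = "/your/bedrock/model-invocation-log-group"

query_id = logs.start_query(
    logGroupName=log_group,
    startTime=int(time.time()) - 24 * 3600,  # last 24 hours
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | sort @timestamp desc | limit 20",
)["queryId"]

# Poll until the query finishes, then print each matching invocation record
results = logs.get_query_results(queryId=query_id)
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=query_id)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})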

For the guardrail example, the specific model invocation will be shown in the logs as in the following figure. Here, the prompt and the user who ran it are captured. Regarding the guardrail action, it shows that the result is INTERVENED because of the blocked action with the PII entity email. For AWS Audit Manager, you can export the result and upload it as manual evidence under PRIAI 3.9: PII Anonymization.

Add the Guardrail intervened behavior as evidence to the AWS Audit Manager assessment

Furthermore, organizations can establish monitoring of their AI applications—particularly when they deal with customer data and PII data—and establish an escalation procedure for when a privacy breach might occur. Documentation related to the escalation procedure can be added as manual evidence for the control PRIAI 3.6: Escalation Procedures – Privacy Breach.

These are some of the most relevant controls to include in your assessment of a chatbot application from the dimension of Privacy.

Resilience

In this section, we show you how to improve the resilience of an application and add the corresponding evidence to the controls defined in the Resilience section of the AWS generative AI best practices framework.

AI systems, as well as the infrastructure in which they are deployed, are said to be resilient if they can withstand unexpected adverse events or unexpected changes in their environment or use. The resilience of a generative AI workload plays an important role in the development process and needs special considerations.

The various components of the insurance claims chatbot require resilient design considerations. Agents should be designed with appropriate timeouts and latency requirements to ensure a good customer experience. Data pipelines that ingest data to the knowledge base should account for throttling and use backoff techniques. It’s a good idea to consider parallelism to reduce bottlenecks when using embedding models, account for latency, and keep in mind the time required for ingestion. Considerations and best practices should be implemented for vector databases, the application tier, and monitoring the use of resources through an observability layer. Having a business continuity plan with a disaster recovery strategy is a must for any workload. Guidance for these considerations and best practices can be found in Designing generative AI workloads for resilience. Details of these architectural elements should be added as manual evidence in the assessment.

Responsible

Key principles of responsible design are explainability and interpretability. Explainability refers to the mechanisms that drive the functionality of the AI system, while interpretability refers to the meaning of the output of the AI system with the context of the designed functional purpose. Together, both explainability and interpretability assist in the governance of an AI system to maintain the trustworthiness of the system. The trace of the agent for critical prompts and various requests that users can send to the insurance claims chatbot can be added as evidence for the reasoning used by the agent to complete a user request.

The logs gathered from Amazon Bedrock offer comprehensive insights into the model’s handling of user prompts and the generation of corresponding answers. The figure below shows a typical model invocation log. By analyzing these logs, you can gain visibility into the model’s decision-making process. This logging functionality can serve as a manual audit trail, fulfilling RESPAI 3.4: Auditable Model Decisions.

typical model invocation log

Another important aspect of maintaining responsible design, development, and deployment of generative AI applications is risk management. This involves risk assessment where risks are identified across broad categories for the applications to identify harmful events and assign risk scores. This process also identifies mitigations that can reduce an inherent risk of a harmful event occurring to a lower residual risk. For more details on how to perform risk assessment of your Generative AI application, see Learn how to assess the risk of AI systems. Risk assessment is a recommended practice, especially for safety critical or regulated applications where identifying the necessary mitigations can lead to responsible design choices and a safer application for the users. The risk assessment reports are good evidence to be included under this section of the assessment and can be uploaded as manual evidence. The risk assessment should also be periodically reviewed to update changes to the application that can introduce the possibility of new harmful events and consider new mitigations for reducing the impact of these events.

Safe

AI systems should “not under defined conditions, lead to a state in which human life, health, property, or the environment is endangered.” (Source: ISO/IEC TS 5723:2022) For the insurance claims chatbot, safety principles should be followed to prevent interactions with users outside the limits of the defined functions. Amazon Bedrock Guardrails can be used to define topics that are not supported by the chatbot. The intended use of the chatbot should also be transparent to users to guide them in the best use of the AI application. An unsupported topic could include providing investment advice, which can be blocked by creating a guardrail with investment advice defined as a denied topic, as described in Guardrails for Amazon Bedrock helps implement safeguards customized to your use case and responsible AI policies.

After this functionality is enabled as a guardrail, the model will prohibit unsupported actions. The instance illustrated in the following figure depicts a scenario where requesting investment advice is a restricted behavior, leading the model to decline providing a response.

Guardrail can help to enforce restricted behavior

After the model is invoked, the user can navigate to CloudWatch to view the relevant logs. In cases where the model denies or intervenes in certain actions, such as providing investment advice, the logs will reflect the specific reasons for the intervention, as shown in the following figure. By examining the logs, you can gain insights into the model’s behavior, understand why certain actions were denied or restricted, and verify that the model is operating within the intended guidelines and boundaries. For the controls defined under the safety section of the assessment, you might want to design more experiments by considering various risks that arise from your application. The logs and documentation collected from the experiments can be attached as evidence to demonstrate the safety of the application.

Log insights from Amazon Bedrock shows the details of how Amazon Bedrock Guardrails intervened

Secure

NIST defines AI systems to be secure when they maintain confidentiality, integrity, and availability through protection mechanisms that prevent unauthorized access and use. Applications developed using generative AI should build defenses for adversarial threats including but not limited to prompt injection, data poisoning if a model is being fine-tuned or pre-trained, and model and data extraction exploits through AI endpoints.

Your information security teams should conduct standard security assessments that have been adapted to address the new challenges with generative AI models and applications—such as adversarial threats—and consider mitigations such as red-teaming. To learn more on various security considerations for generative AI applications, see Securing generative AI: An introduction to the Generative AI Security Scoping Matrix. The resulting documentation of the security assessments can be attached as evidence to this section of the assessment.

Sustainable

Sustainability refers to the “state of the global system, including environmental, social, and economic aspects, in which the needs of the present are met without compromising the ability of future generations to meet their own needs.”

Some actions that contribute to a more sustainable design of generative AI applications include considering and testing smaller models to achieve the same functionality, optimizing hardware and data storage, and using efficient training algorithms. To learn more about how you can do this, see Optimize generative AI workloads for environmental sustainability. Considerations implemented for achieving more sustainable applications can be added as evidence for the controls related to this part of the assessment.

Conclusion

In this post, we used the example of an insurance claims assistant powered by Amazon Bedrock Agents and looked at various principles that you need to consider when getting this application audit ready using the AWS generative AI best practices framework on Audit Manager. We defined each principle of safeguarding applications for trustworthy AI and provided some best practices for achieving the key objectives of the principles. Finally, we showed you how these development and design choices can be added to the assessment as evidence to help you prepare for an audit.

The AWS generative AI best practices framework provides a purpose-built tool that you can use for monitoring and governance of your generative AI projects on Amazon Bedrock and Amazon SageMaker. To learn more, see:


About the Authors

Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organisation. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, and explainability. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.

Irem Gokcek is a Data Architect in the AWS Professional Services team, with expertise spanning both Analytics and AI/ML. She has worked with customers from various industries such as retail, automotive, manufacturing and finance to build scalable data architectures and generate valuable insights from the data. In her free time, she is passionate about swimming and painting.

Fiona McCann is a Solutions Architect at Amazon Web Services in the public sector. She specializes in AI/ML with a focus on Responsible AI. Fiona has a passion for helping nonprofit customers achieve their missions with cloud solutions. Outside of building on AWS, she loves baking, traveling, and running half marathons in cities she visits.

Read More

London Stock Exchange Group uses Amazon Q Business to enhance post-trade client services

London Stock Exchange Group uses Amazon Q Business to enhance post-trade client services

This post was co-written with Ben Doughton, Head of Product Operations – LCH, Iulia Midus, Site Reliability Engineer – LCH, and Maurizio Morabito, Software and AI specialist – LCH (part of London Stock Exchange Group, LSEG).

In the financial industry, quick and reliable access to information is essential, but searching for data or facing unclear communication can slow things down. An AI-powered assistant can change that. By instantly providing answers and helping to navigate complex systems, such assistants can make sure that key information is always within reach, improving efficiency and reducing the risk of miscommunication. Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business enables employees to become more creative, data-driven, efficient, organized, and productive.

In this blog post, we explore a client services agent assistant application developed by the London Stock Exchange Group (LSEG) using Amazon Q Business. We will discuss how Amazon Q Business saved time in generating answers, including summarizing documents, retrieving answers to complex Member enquiries, and combining information from different data sources (while providing in-text citations to the data sources used for each answer).

The challenge

The London Clearing House (LCH) Group of companies includes leading multi-asset class clearing houses and is part of the Markets division of LSEG PLC (LSEG Markets). LCH provides proven risk management capabilities across a range of asset classes, including over-the-counter (OTC) and listed interest rates, fixed income, foreign exchange (FX), credit default swap (CDS), equities, and commodities.

As the LCH business continues to grow, the LCH team has been continuously exploring ways to improve their support to customers (members) and to increase LSEG’s impact on customer success. As part of LSEG’s multi-stage AI strategy, LCH has been exploring the role that generative AI services can have in this space. One of the key capabilities that LCH is interested in is a managed conversational assistant that requires minimal technical knowledge to build and maintain. In addition, LCH has been looking for a solution that is focused on its knowledge base and that can be quickly kept up to date. For this reason, LCH was keen to explore techniques such as Retrieval Augmented Generation (RAG). Following a review of available solutions, the LCH team decided to build a proof-of-concept around Amazon Q Business.

Business use case

Realizing value from generative AI relies on a solid business use case. LCH has a broad base of customers raising queries to their client services (CS) team across a diverse and complex range of asset classes and products. Example queries include: “What is the eligible collateral at LCH?” and “Can members clear NIBOR IRS at LCH?” This requires CS team members to refer to detailed service and policy documentation sources to provide accurate advice to their members.

Historically, the CS team has relied on producing product FAQs for LCH members to refer to and, where required, an in-house knowledge center for CS team members to refer to when answering complex customer queries. To improve the customer experience and boost employee productivity, the CS team set out to investigate whether generative AI could help answer questions from individual members, thus reducing the number of customer queries. The goal was to increase the speed and accuracy of information retrieval within the CS workflows when responding to the queries that inevitably come through from customers.

Project workflow

The CS use case was developed through close collaboration between LCH and Amazon Web Services (AWS) and involved the following steps:

  1. Ideation: The LCH team carried out a series of cross-functional workshops to examine different large language model (LLM) approaches, including prompt engineering, RAG, and custom model fine-tuning and pre-training. They considered different technologies such as Amazon SageMaker and Amazon SageMaker JumpStart and evaluated trade-offs between development effort and model customization. Amazon Q Business was selected because of its built-in enterprise search web crawler capability and its ease of deployment without the need to deploy and manage an LLM. Another attractive feature was the ability to clearly provide source attribution and citations. This enhanced the reliability of the responses, allowing users to verify facts and explore topics in greater depth (important aspects to increase their overall trust in the responses received).
  2. Knowledge base creation: The CS team built data source connectors for the LCH website, FAQs, customer relationship management (CRM) software, and internal knowledge repositories and included the Amazon Q Business built-in index and retriever in the build.
  3. Integration and testing: The application was secured using a third-party identity provider (IdP), so users are managed through their existing enterprise IdP, and AWS Identity and Access Management (IAM) was used to authenticate users when they signed in to Amazon Q Business. Testing was carried out to verify the factual accuracy of responses, evaluating the performance and quality of the AI-generated answers; the system demonstrated a high level of factual accuracy. Wider improvements in business performance were also demonstrated, including response times of a few seconds. Tests were undertaken with both unstructured and structured data within the documents.
  4. Phased rollout: The CS AI assistant was rolled out in a phased approach to provide thorough, high-quality answers. In the future, there are plans to integrate their Amazon Q Business application with existing email and CRM interfaces, and to expand its use to additional use cases and functions within LSEG. 

Solution overview

In this solution overview, we’ll explore the LCH-built Amazon Q Business application.

The LCH admin team developed a web-based interface that serves as a gateway for their internal client services team to interact with the Amazon Q Business API and other AWS services, including Amazon Elastic Container Service (Amazon ECS), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon Bedrock. The interface is secured using SAML 2.0 IAM federation to maintain secure access to the chat interface, retrieves answers from a pre-indexed knowledge base, and validates the responses using Anthropic’s Claude v2 LLM.

The following figure illustrates the architecture for the LCH client services application.

Architectural Design of the Solution

The workflow consists of the following steps:

  1. The LCH team set up the Amazon Q Business application using a SAML 2.0 IAM IdP. (The example in the blog post shows connecting with Okta as the IdP for Amazon Q Business. However, the LCH team built the application using a third-party solution as the IdP instead of Okta). This architecture allows LCH users to sign in using their existing identity credentials from their enterprise IdP, while they maintain control over which users have access to their Amazon Q Business application.
  2. The application had two data sources as part of the configuration for their Amazon Q Business application:
    1. An S3 bucket to store and index their internal LCH documents. This allows the Amazon Q Business application to access and search through their internal product FAQ PDF documents as part of providing responses to user queries. Indexing the documents in Amazon S3 makes them readily available for the application to retrieve relevant information.
    2. In addition to internal documents, the team has also set up their public-facing LCH website as a data source using a web crawler that can index and extract information from their rulebooks.
  3. The LCH team opted for a custom user interface (UI) instead of the built-in web experience provided by Amazon Q Business to have more control over the frontend by directly accessing the Amazon Q Business API. The application’s frontend was developed using an open source application framework and hosted on Amazon ECS. The frontend application accesses an Amazon API Gateway REST API endpoint to interact with the business logic implemented in AWS Lambda functions.
  4. The architecture consists of two Lambda functions:
    1. An authorizer Lambda function is responsible for authorizing the frontend application to access the Amazon Q Business API by generating temporary AWS credentials.
    2. A ChatSync Lambda function is responsible for accessing the Amazon Q Business ChatSync API to start an Amazon Q Business conversation.
  5. The architecture includes a Validator Lambda function, which is used by the admin to validate the accuracy of the responses generated by the Amazon Q Business application.
    1. The LCH team has stored a golden answer knowledge base in an S3 bucket, consisting of approximately 100 questions and answers about their product FAQs and rulebooks collected from their live agents. This knowledge base serves as a benchmark for the accuracy and reliability of the AI-generated responses.
    2. By comparing the Amazon Q Business chat responses against their golden answers, LCH can verify that the AI-powered assistant is providing accurate and consistent information to their customers.
    3. The Validator Lambda function retrieves data from a DynamoDB table and sends it to Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs). With Amazon Bedrock, you can quickly experiment with and evaluate top FMs for a given use case, privately customize them with existing data using techniques such as fine-tuning and RAG, and build agents that execute tasks using enterprise systems and data sources.
    4. The Amazon Bedrock service uses Anthropic’s Claude v2 model to validate the Amazon Q Business application queries and responses against the golden answers stored in the S3 bucket.
    5. Anthropic’s Claude v2 model returns a score for each question and answer, in addition to a total score, which is then provided to the application admin for review.
    6. The Amazon Q Business application returned answers within a few seconds for each question. The overall expectation is that Amazon Q Business saves time for each live agent on each question by providing quick and correct responses.

This validation process helped LCH to build trust and confidence in the capabilities of Amazon Q Business, enhancing the overall customer experience.
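To make the validation step more concrete, the following minimal sketch illustrates the kind of scoring call the Validator Lambda function could make against Anthropic’s Claude v2 on Amazon Bedrock. It is not LCH’s actual code; the prompt wording, scoring scale, and example answers are assumptions.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def score_against_golden_answer(question, chat_response, golden_answer):
    """Ask Claude v2 to rate how closely an Amazon Q Business answer matches the golden answer."""
    # Claude v2 uses the legacy text-completion prompt format on Amazon Bedrock.
    prompt = (
        "\n\nHuman: You are validating an AI assistant's answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Assistant answer: {chat_response}\n"
        f"Reference (golden) answer: {golden_answer}\n"
        "Return a score from 0 to 10 for factual agreement, followed by a one-line justification."
        "\n\nAssistant:"
    )
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 200, "temperature": 0}),
    )
    return json.loads(response["body"].read())["completion"]

# Example usage with a hypothetical question and golden answer pair
print(score_against_golden_answer(
    "What is the eligible collateral at LCH?",
    "LCH accepts cash and certain government securities as collateral...",
    "Eligible collateral includes cash in major currencies and selected government bonds...",
))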

Conclusion

This post provides an overview of LSEG’s experience in adopting Amazon Q Business to support LCH client services agents for B2B query handling. This specific use case was built by working backward from a business goal to improve customer experience and staff productivity in a complex, highly technical area of the trading life cycle (post-trade). The variety and large size of enterprise data sources and the regulated environment that LSEG operates in makes this post particularly relevant to customer service operations dealing with complex query handling. Managed, straightforward-to-use RAG is a key capability within a wider vision of providing technical and business users with an environment, tools, and services to use generative AI across providers and LLMs. You can get started with this tool by creating a sample Amazon Q Business application.


About the Authors

Ben Doughton is a Senior Product Manager at LSEG with over 20 years of experience in Financial Services. He leads product operations, focusing on product discovery initiatives, data-informed decision-making and innovation. He is passionate about machine learning and generative AI as well as agile, lean and continuous delivery practices.

Maurizio Morabito is a Software and AI specialist at LCH. He was one of the early adopters of neural networks in 1990–1992, before a long hiatus working in technology and finance companies in Asia and Europe, finally returning to machine learning in 2021. Maurizio is now leading the way to implement AI in LSEG Markets, following the motto “Tackling the Long and the Boring.”

Iulia Midus is a recent IT Management graduate and currently working in Post-trade. The main focus of the work so far has been data analysis and AI, and looking at ways to implement these across the business.

Magnus Schoeman is a Principal Customer Solutions Manager at AWS. He has 25 years of experience across private and public sectors where he has held leadership roles in transformation programs, business development, and strategic alliances. Over the last 10 years, Magnus has led technology-driven transformations in regulated financial services operations (across Payments, Wealth Management, Capital Markets, and Life & Pensions).

Sudha Arumugam is an Enterprise Solutions Architect at AWS, advising large Financial Services organizations. She has over 13 years of experience in creating reliable software solutions to complex problems. She has extensive experience in serverless, event-driven architectures and technologies and is passionate about machine learning and AI. She enjoys developing mobile and web applications.

Elias Bedmar is a Senior Customer Solutions Manager at AWS. He is a technical and business program manager helping customers be successful on AWS. He supports large migration and modernization programs, cloud maturity initiatives, and adoption of new services. Elias has experience in migration delivery, DevOps engineering and cloud infrastructure.

Marcin Czelej is a Machine Learning Engineer at AWS Generative AI Innovation and Delivery. He combines over 7 years of experience in C/C++ and assembler programming with extensive knowledge in machine learning and data science. This unique skill set allows him to deliver optimized and customised solutions across various industries. Marcin has successfully implemented AI advancements in sectors such as e-commerce, telecommunications, automotive, and the public sector, consistently creating value for customers.

Zmnako Awrahman, Ph.D., is a generative AI Practice Manager at AWS Generative AI Innovation and Delivery with extensive experience in helping enterprise customers build data, ML, and generative AI strategies. With a strong background in technology-driven transformations, particularly in regulated industries, Zmnako has a deep understanding of the challenges and opportunities that come with implementing cutting-edge solutions in complex environments.

Read More

Evaluate large language models for your machine translation tasks on AWS

Evaluate large language models for your machine translation tasks on AWS

Large language models (LLMs) have demonstrated promising capabilities in machine translation (MT) tasks. Depending on the use case, they are able to compete with neural translation models such as Amazon Translate. LLMs particularly stand out for their natural ability to learn from the context of the input text, which allows them to pick up on cultural cues and produce more natural sounding translations. For instance, the sentence “Did you perform well?” might be translated into French as “Avez-vous bien performé?” The target translation can vary widely depending on the context. If the question is asked in the context of sport, such as “Did you perform well at the soccer tournament?”, the natural French translation would be very different. It is critical for AI models to capture not only the context, but also the cultural specificities to produce a more natural sounding translation. One of LLMs’ most fascinating strengths is their inherent ability to understand context.

A number of our global customers are looking to take advantage of this capability to improve the quality of their translated content. Localization relies on both automation and humans-in-the-loop in a process called Machine Translation Post Editing (MTPE). Building solutions that help enhance translated content quality presents multiple benefits:

  • Potential cost savings on MTPE activities
  • Faster turnaround for localization projects
  • Better experience for content consumers and readers overall with enhanced quality

LLMs have also shown gaps with regards to MT tasks, such as:

  • Inconsistent quality over certain language pairs
  • No standard pattern to integrate past translation knowledge, also known as translation memory (TM)
  • Inherent risk of hallucination

Switching MT workloads to LLM-driven translation should be considered on a case-by-case basis. However, the industry is seeing enough potential to consider LLMs as a valuable option.

This blog post with accompanying code presents a solution to experiment with real-time machine translation using foundation models (FMs) available in Amazon Bedrock. It can help collect more data on the value of LLMs for your content translation use cases.

Steering the LLMs’ output

Translation memory and TMX files are important concepts and file formats used in the field of computer-assisted translation (CAT) tools and translation management systems (TMSs).

Translation memory

A translation memory is a database that stores previously translated text segments (typically sentences or phrases) along with their corresponding translations. The main purpose of a TM is to aid human or machine translators by providing them with suggestions for segments that have already been translated before. This can significantly improve translation efficiency and consistency, especially for projects involving repetitive content or similar subject matter.

Translation Memory eXchange (TMX) is a widely used open standard for representing and exchanging TM data. It is an XML-based file format that allows for the exchange of TMs between different CAT tools and TMSs. A typical TMX file contains a structured representation of translation units, which are groupings of the same text translated into multiple languages.
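As a concrete illustration of this structure, the following minimal Python sketch parses translation units from a TMX file using only the standard library. The file name and language codes are assumptions, not part of this solution’s code.

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # TMX uses the xml:lang attribute

def parse_tmx(path):
    """Return a list of translation units, each mapping a language code to its segment text."""
    units = []
    root = ET.parse(path).getroot()
    for tu in root.iter("tu"):                 # <tu> = one translation unit
        segments = {}
        for tuv in tu.findall("tuv"):          # <tuv> = that unit in one language
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None and seg.text:
                segments[lang] = seg.text
        if segments:
            units.append(segments)
    return units

# Example: print English/French pairs from a hypothetical TMX file
for unit in parse_tmx("subtitles_memory.tmx"):
    if "en" in unit and "fr" in unit:
        print(unit["en"], "->", unit["fr"])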

Integrating TM with LLMs

The use of TMs in combination with LLMs can be a powerful approach for improving the quality and efficiency of machine translation. The following are a few potential benefits:

  • Improved accuracy and consistency – LLMs can benefit from the high-quality translations stored in TMs, which can help improve the overall accuracy and consistency of the translations produced by the LLM. The TM can provide the LLM with reliable reference translations for specific segments, reducing the chances of errors or inconsistencies.
  • Domain adaptation – TMs often contain translations specific to a particular domain or subject matter. By using a domain-specific TM, the LLM can better adapt to the terminology, style, and context of that domain, leading to more accurate and natural translations.
  • Efficient reuse of human translations – TMs store human-translated segments, which are typically of higher quality than machine-translated segments. By incorporating these human translations into the LLM’s training or inference process, the LLM can learn from and reuse these high-quality translations, potentially improving its overall performance.
  • Reduced post-editing effort – When the LLM can accurately use the translations stored in the TM, the need for human post-editing can be reduced, leading to increased productivity and cost savings.

Another approach to integrating TM data with LLMs is to use fine-tuning in the same way you would fine-tune a model for business domain content generation, for instance. For customers operating in global industries, potentially translating to and from over 10 languages, this approach can prove to be operationally complex and costly. The solution proposed in this post relies on LLMs’ context learning capabilities and prompt engineering. It enables you to use an off-the-shelf model as is without involving machine learning operations (MLOps) activity.

Solution overview

The LLM translation playground is a sample application providing the following capabilities:

  • Experiment with LLM translation capabilities using models available in Amazon Bedrock
  • Create and compare various inference configurations
  • Evaluate the impact of prompt engineering and Retrieval Augmented Generation (RAG) on translation with LLMs
  • Configure supported language pairs
  • Import, process, and test translation using your existing TMX file with multiple LLMs
  • Custom terminology conversion
  • Performance, quality, and usage metrics, including BLEU, BERT Score, METEOR, and CHRF

The following diagram illustrates the translation playground architecture. The numbers are color-coded to represent two flows: the translation memory ingestion flow (orange) and the text translation flow (gray). The solution offers two TM retrieval modes for users to choose from: vector and document search. This is covered in detail later in the post.

Streamlit Application Architecture

The TM ingestion flow (orange) consists of the following steps:

  1. The user uploads a TMX file to the playground UI.
  2. Depending on which retrieval mode is being used, the appropriate adapter is invoked.
  3. When using the Amazon OpenSearch Service adapter (document search), translation unit groupings are parsed and stored into an index dedicated to the uploaded file. When using the FAISS adapter (vector search), translation unit groupings are parsed and turned into vectors using the selected embedding model from Amazon Bedrock.
  4. When using the FAISS adapter, translation units are stored into a local FAISS index along with the metadata.

The text translation flow (gray) consists of the following steps:

  1. The user enters the text they want to translate along with source and target language.
  2. The request is sent to the prompt generator.
  3. The prompt generator invokes the appropriate knowledge base according to the selected mode.
  4. The prompt generator receives the relevant translation units.
  5. Amazon Bedrock is invoked using the generated prompt as input along with customization parameters.

The translation playground could be adapted into a scalable serverless solution as represented by the following diagram using AWS Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon API Gateway.

Serverless Solution Architecture Diagram

Strategy for TM knowledge base

The LLM translation playground offers two options to incorporate the translation memory into the prompt. Each option is available through its own page within the application:

  • Vector store using FAISS – In this mode, the application processes the .tmx file the user uploaded, indexes it, and stores it locally into a vector store (FAISS).
  • Document store using Amazon OpenSearch Serverless – Only standard document search using Amazon OpenSearch Serverless is supported. To test vector search, use the vector store option (using FAISS).

In vector store mode, the translation segments are processed as follows:

  1. Embed the source segment.
  2. Extract metadata:
    • Segment language
    • System-generated <tu> segment unique identifier
  3. Store the source segment vectors along with the metadata and the segment itself in plain text as a document (see the sketch after this list).
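The following minimal sketch (not the playground’s actual code) illustrates this ingestion path using the Amazon Titan Text Embeddings V2 model and a local FAISS index. The model ID, sample segments, and metadata fields are assumptions for illustration.

import json
import boto3
import faiss
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime")
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"   # assumption: Titan Text Embeddings V2

def embed(text):
    """Embed a source segment with the selected Amazon Bedrock embedding model."""
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"], dtype="float32")

# Hypothetical source segments parsed from the uploaded TMX file
segments = [
    {"tu_id": "tu-1", "lang": "en", "text": "Did you perform well?"},
    {"tu_id": "tu-2", "lang": "en", "text": "The match starts at noon."},
]

vectors = np.vstack([embed(s["text"]) for s in segments])
index = faiss.IndexFlatL2(vectors.shape[1])   # exact L2 search over the segment vectors
index.add(vectors)                            # row i corresponds to segments[i] (the metadata)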

The translation customization section allows you to select the embedding model. You can choose either Amazon Titan Text Embeddings V2 or Cohere Embed Multilingual v3. Amazon Titan Text Embeddings V2 includes multilingual support for over 100 languages in pre-training. Cohere Embed supports 108 languages.

In document store mode, the language segments are not embedded and are stored following a flat structure. Two metadata attributes are maintained across the documents:

  • Segment Language
  • System generated <tu> segment unique identifier

Translation Memory Chunking

Prompt engineering

The application uses prompt engineering techniques to incorporate several types of inputs for the inference. The following sample XML illustrates the prompt’s template structure:

<prompt>
  <system_prompt>…</system_prompt>
  <source_language>EN</source_language>
  <target_language>FR</target_language>
  <translation_memory_pairs>
    <source_language>…</source_language>
    <target_language>…</target_language>
  </translation_memory_pairs>
  <custom_terminology_pairs>
    <source_language>…</source_language>
    <target_language>…</target_language>
  </custom_terminology_pairs>
  <user_prompt>…</user_prompt>
</prompt>
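The following minimal sketch shows how such a template might be rendered and sent to a model through the Amazon Bedrock Converse API. The model ID, inference parameters, and helper names are assumptions rather than the playground’s actual implementation.

import boto3

PROMPT_TEMPLATE = """<prompt>
<system_prompt>You are a professional translator. Use the translation memory pairs
and custom terminology below to guide style, tone, and terminology.</system_prompt>
<source_language>{source_lang}</source_language>
<target_language>{target_lang}</target_language>
<translation_memory_pairs>{tm_pairs}</translation_memory_pairs>
<custom_terminology_pairs>{terminology}</custom_terminology_pairs>
<user_prompt>{source_text}</user_prompt>
</prompt>"""

def translate(source_text, tm_pairs="", terminology="", source_lang="EN", target_lang="FR",
              model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Render the prompt template and send it to an Amazon Bedrock model via the Converse API."""
    prompt = PROMPT_TEMPLATE.format(
        source_lang=source_lang, target_lang=target_lang,
        tm_pairs=tm_pairs, terminology=terminology, source_text=source_text,
    )
    bedrock_runtime = boto3.client("bedrock-runtime")
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.2, "maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]

print(translate("Did you perform well at the soccer tournament?"))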

Prerequisites

The project code uses the Python version of the AWS Cloud Development Kit (AWS CDK). To run the project code, make sure that you have fulfilled the AWS CDK prerequisites for Python.

The project also requires that the AWS account is bootstrapped to allow the deployment of the AWS CDK stack.

Install the UI

To deploy the solution, first install the UI (Streamlit application):

  1. Clone the GitHub repository using the following command:
git clone https://github.com/aws-samples/llm-translation-playground.git
  2. Navigate to the deployment directory:
cd llm-translation-playground
  3. Install and activate a Python virtual environment:
python3 -m venv .venv
source .venv/bin/activate
  4. Install Python libraries:
python -m pip install -r requirements.txt

Deploy the AWS CDK stack

Complete the following steps to deploy the AWS CDK stack:

  1. Move into the deployment folder:
cd deployment/cdk
  2. Configure the AWS CDK context parameters file context.json. For collection_name, use the OpenSearch Serverless collection name. For example:

"collection_name": "search-subtitles"

  3. Deploy the AWS CDK stack:
cdk deploy
  4. Validate successful deployment by reviewing the OpsServerlessSearchStack stack on the AWS CloudFormation console. The status should read CREATE_COMPLETE.
  5. On the Outputs tab, make note of the OpenSearchEndpoint attribute value.

Cloudformation Stack Output

Configure the solution

The stack creates an AWS Identity and Access Management (IAM) role with the right level of permission needed to run the application. The LLM translation playground assumes this role automatically on your behalf. To achieve this, modify the role or principal under which you are planning to run the application so you are allowed to assume the newly created role. You can use the pre-created policy and attach it to your role. The policy Amazon Resource Name (ARN) can be retrieved as a stack output under the key LLMTranslationPlaygroundAppRoleAssumePolicyArn, as illustrated in the preceding screenshot. You can do so from the IAM console after selecting your role and choosing Add permissions. If you prefer to use the AWS Command Line Interface (AWS CLI), refer to the following sample command line:

aws iam attach-role-policy --role-name <role-name> --policy-arn <policy-arn>

Finally, configure the .env file in the utils folder as follows:

  • APP_ROLE_ARN – The ARN of the role created by the stack (stack output LLMTranslationPlaygroundAppRoleArn)
  • HOST – OpenSearch Serverless collection endpoint (without https)
  • REGION – AWS Region the collection was deployed into
  • INGESTION_LIMIT – Maximum number of translation units (<tu> tags) indexed per TMX file you upload

Run the solution

To start the translation playground, run the following commands:

cd llm-translation-playground/source
streamlit run LLM_Translation_Home.py

Your default browser should open a new tab or window displaying the Home page.

LLM Translation Playground Home

Simple test case

Let’s run a simple translation test using the phrase mentioned earlier: “Did you perform well?”

Because we’re not using a knowledge base for this test case, we can use either a vector store or document store. For this post, we use a document store.

  1. Choose With Document Store.
  2. For Source Text, enter the text to be translated.
  3. Choose your source and target languages (for this post, English and French, respectively).
  4. You can experiment with other parameters, such as model, maximum tokens, temperature, and top-p.
  5. Choose Translate.

Translation Configuration Page

The translated text appears in the bottom section. For this example, the translated text, although accurate, is close to a literal translation, which is not a common phrasing in French.

English-French Translation Test 1

  1. We can rerun the same test after slightly modifying the initial text: “Did you perform well at the soccer tournament?”

We’re now introducing some situational context in the input. The translated text should be different and closer to a more natural translation. The new output literally means “Did you play well at the soccer tournament?”, which is consistent with the initial intent of the question.

English-French Translation Test 2

Also note the completion metrics on the left pane, displaying latency, input/output tokens, and quality scores.

This example highlights the ability of LLMs to naturally adapt the translation to the context.

Adding translation memory

Let’s test the impact of using a translation memory TMX file on the translation quality.

  1. Copy the text contained within test/source_text.txt and paste it into the Source text field.
  2. Choose French as the target language and run the translation.
  3. Copy the text contained within test/target_text.txt and paste it into the reference translation field.

Translation Memory Configuration

  1. Choose Evaluate and notice the quality scores on the left.
  2. In the Translation Customization section, choose Browse files and choose the file test/subtitles_memory.tmx.

This will index the translation memory into the OpenSearch Service collection previously created. The indexing process can take a few minutes.

  1. When the indexing is complete, select the created index from the index dropdown.
  2. Rerun the translation.

You should see a noticeable increase in the quality score. For instance, we’ve seen up to 20 percentage points improvement in BLEU score with the preceding test case. Using prompt engineering, we were able to steer the model’s output by providing sample phrases directly pulled from the TMX file. Feel free to explore the generated prompt for more details on how the translation pairs were introduced.
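If you want to reproduce quality scores such as BLEU outside the playground, a library like sacrebleu can compute them from a hypothesis and a reference translation, as in the following sketch. The example sentences are hypothetical, and the playground may compute its metrics differently.

# pip install sacrebleu
import sacrebleu

hypothesis = ["As-tu bien joué au tournoi de football ?"]          # model translation (hypothetical)
reference = [["As-tu bien joué lors du tournoi de football ?"]]    # reference translation (hypothetical)

bleu = sacrebleu.corpus_bleu(hypothesis, reference)
chrf = sacrebleu.corpus_chrf(hypothesis, reference)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")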

You can replicate a similar test case with Amazon Translate by launching an asynchronous job customized using parallel data.

Prompt Engineering

Here we took a simplistic retrieval approach, which consists of loading all of the samples as part of the same TMX file, matching the source and target language. You can enhance this technique by using metadata-driven filtering to collect the relevant pairs according to the source text. For example, you can classify the documents by theme or business domain, and use category tags to select language pairs relevant to the text and desired output.

Semantic similarity for translation memory selection

In vector store mode, the application allows you to upload a TMX and create a local index that uses semantic similarity to select the translation memory segments. First, we retrieve the segment with the highest similarity score based on the text to be translated and the source language. Then we retrieve the corresponding segment matching the target language and parent translation unit ID.
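A minimal sketch of this two-step lookup follows, building on the embed function, index, and segments metadata from the earlier ingestion sketch; the function and field names are assumptions for illustration.

def retrieve_tm_pair(query_text, source_lang, target_lang, index, segments, k=3):
    """Find the TM source segment most similar to the query, then return its
    target-language counterpart from the same translation unit."""
    query_vec = embed(query_text).reshape(1, -1)   # embed() as defined in the earlier sketch
    _, neighbor_ids = index.search(query_vec, k)   # top-k nearest source segments
    for i in neighbor_ids[0]:
        hit = segments[i]
        if hit["lang"] != source_lang:
            continue
        # Look up the segment in the target language that shares the parent <tu> identifier.
        for s in segments:
            if s["tu_id"] == hit["tu_id"] and s["lang"] == target_lang:
                return hit["text"], s["text"]
    return None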

To try it out, upload the file in the same way as shown earlier. Depending on the size of the file, this can take a few minutes. There is a maximum limit of 200 MB. You can use the sample file as in the previous example or one of the other samples provided in the code repository.

This approach differs from the static index search because it’s assumed that the source text is semantically close to segments representative enough of the expected style and tone.

TMX File Upload Widget

Adding custom terminology

Custom terminology allows you to make sure that your brand names, character names, model names, and other unique content get translated to the desired result. Given that LLMs are pre-trained on massive amounts of data, they can likely already identify unique names and render them accurately in the output. If there are names for which you want to enforce a strict and literal translation, you can try the custom terminology feature of this translation playground. Simply provide the source and target language pairs separated by a semicolon in the Translation Customization section. For instance, if you want to keep the phrase “Gen AI” untranslated regardless of the language, you can configure the custom terminology as illustrated in the following screenshot.

Custom Terminology

Clean up

To delete the stack, navigate to the deployment folder and run cdk destroy.

Further considerations

Using existing TMX files with generative AI-based translation systems can potentially improve the quality and consistency of translations. The following are some steps to use TMX files for generative AI translations:

  • TMX data pipeline – TMX files contain structured translation units, but the format might need to be preprocessed to extract the source and target text segments in a format that can be consumed by the generative AI model. This involves extract, transform, and load (ETL) pipelines able to parse the XML structure, handle encoding issues, and add metadata.
  • Incorporate quality estimation and human review – Although generative AI models can produce high-quality translations, it is recommended to incorporate quality estimation techniques and human review processes. You can use automated quality estimation models to flag potentially low-quality translations, which can then be reviewed and corrected by human translators.
  • Iterate and refine – Translation projects often involve iterative cycles of translation, review, and improvement. You can periodically retrain or fine-tune the generative AI model with the updated TMX file, creating a virtuous cycle of continuous improvement.

Conclusion

The LLM translation playground presented in this post enables you to evaluate the use of LLMs for your machine translation needs. The key features of this solution include:

  • Ability to use translation memory – The solution allows you to integrate your existing TM data, stored in the industry-standard TMX format, directly into the LLM translation process. This helps improve the accuracy and consistency of the translations by using high-quality human-translated content.
  • Prompt engineering capabilities – The solution showcases the power of prompt engineering, demonstrating how LLMs can be steered to produce more natural and contextual translations by carefully crafting the input prompts. This includes the ability to incorporate custom terminology and domain-specific knowledge.
  • Evaluation metrics – The solution includes standard translation quality evaluation metrics, such as BLEU, BERT Score, METEOR, and CHRF, to help you assess the quality and effectiveness of the LLM-powered translations compared to your existing machine translation workflows.

As the industry continues to explore the use of LLMs, this solution can help you gain valuable insights and data to determine if LLMs can become a viable and valuable option for your content translation and localization workloads.

To dive deeper into the fast-moving field of LLM-based machine translation on AWS, check out the following resources:


About the Authors

Narcisse Zekpa is a Sr. Solutions Architect based in Boston. He helps customers in the Northeast U.S. accelerate their business transformation through innovative and scalable solutions on the AWS Cloud. He is passionate about enabling organizations to transform their business using advanced analytics and AI. When Narcisse is not building, he enjoys spending time with his family, traveling, running, cooking, and playing basketball.

Ajeeb Peter is a Principal Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience in software development, architecture, and analytics from industries like finance and telecom.

Read More

Parameta accelerates client email resolution with Amazon Bedrock Flows

Parameta accelerates client email resolution with Amazon Bedrock Flows

This blog post is co-written with Siokhan Kouassi and Martin Gregory at Parameta. 

When financial industry professionals need reliable over-the-counter (OTC) data solutions and advanced analytics, they can turn to Parameta Solutions, the data powerhouse behind TP ICAP. With a focus on data-led solutions, Parameta Solutions makes sure that these professionals have the insights they need to make informed decisions. Managing thousands of client service requests efficiently while maintaining accuracy is crucial for Parameta’s reputation as a trusted data provider. Through a simple yet effective application of Amazon Bedrock Flows, Parameta transformed their client service operations from a manual, time-consuming process into a streamlined workflow in just two weeks.

Parameta empowers clients with comprehensive industry insights, from price discovery to risk management, and pre- to post-trade analytics. Their services are fundamental to clients navigating the complexities of OTC transactions and workflow effectively. Accurate and timely responses to technical support queries are essential for maintaining service quality.

However, Parameta’s support team faced a common challenge in the financial services industry: managing an increasing volume of email-based client requests efficiently. The traditional process involved multiple manual steps—reading emails, understanding technical issues, gathering relevant data, determining the correct routing path, and verifying information in databases. This labor-intensive approach not only consumed valuable time, but also introduced risks of human error that could potentially impact client trust.

Recognizing the need for modernization, Parameta sought a solution that could maintain their high standards of service while significantly reducing resolution times. The answer lay in using generative AI through Amazon Bedrock Flows, enabling them to build an automated, intelligent request handling system that would transform their client service operations. Amazon Bedrock Flows provide a powerful, low-code solution for creating complex generative AI workflows with an intuitive visual interface and with a set of APIs in the Amazon Bedrock SDK. By seamlessly integrating foundation models (FMs), prompts, agents, and knowledge bases, organizations can rapidly develop flexible, efficient AI-driven processes tailored to their specific business needs.

In this post, we show you how Parameta used Amazon Bedrock Flows to transform their manual client email processing into an automated, intelligent workflow that reduced resolution times from weeks to days while maintaining high accuracy and operational control.

Client email triage

For Parameta, every client email represents a critical touchpoint that demands both speed and accuracy. The challenge of email triage extends beyond simple categorization—it requires understanding technical queries, extracting precise information, and providing contextually appropriate responses.

The email triage workflow involves multiple critical steps:

  • Accurately classifying incoming technical support emails
  • Extracting relevant entities like data products or time periods
  • Validating if all required information is present for the query type
  • Consulting internal knowledge bases and databases for context
  • Generating either complete responses or specific requests for additional information

The manual handling of this process led to time-consuming back-and-forth communications, the risk of overlooking critical details, and inconsistent response quality. With that in mind, Parameta identified this as an opportunity to develop an intelligent system that could automate this entire workflow while maintaining their high standard of accuracy and professionalism.

Path to the solution

When evaluating solutions for email triage automation, several approaches appeared viable, each with its own pros and cons. However, not all of them were effective for Parameta.

Traditional NLP pipelines and ML classification models

Traditional natural language processing pipelines struggle with email complexity due to their reliance on rigid rules and poor handling of language variations, making them impractical for dynamic client communications. The inconsistency in email structures and terminology, which varies significantly between clients, further complicates their effectiveness. These systems depend on predefined patterns, which are difficult to maintain and adapt when faced with such diverse inputs, leading to inefficiencies and brittleness in handling real-world communication scenarios. Machine learning (ML) classification models offer improved categorization, but introduce complexity by requiring separate, specialized models for classification, entity extraction, and response generation, each with its own training data and contextual limitations.

Deterministic LLM-based workflows

Parameta’s solution demanded more than just raw large language model (LLM) capabilities—it required a structured approach while maintaining operational control. Amazon Bedrock Flows provided this critical balance through the following capabilities:

  • Orchestrated prompt chaining – Multiple specialized prompts work together in a deterministic sequence, each optimized for specific tasks like classification, entity extraction, or response generation.
  • Multi-conditional workflows – Support for complex business logic with the ability to branch flows based on validation results or extracted information completeness.
  • Version management – Simple switching between different prompt versions while maintaining workflow integrity, enabling rapid iteration without disrupting the production pipeline.
  • Component integration – Seamless incorporation of other generative AI capabilities like Amazon Bedrock Agents or Amazon Bedrock Knowledge Bases, creating a comprehensive solution.
  • Experimentation framework – The ability to test and compare different prompt variations while maintaining version control. This is crucial for optimizing the email triage process.
  • Rapid iteration and tight feedback loop – The system allows for quick testing of new prompts and immediate feedback, facilitating continuous improvement and adaptation.

This structured approach to generative AI through Amazon Bedrock Flows enabled Parameta to build a reliable, production-grade email triage system that maintains both flexibility and control.

Solution overview

Parameta’s solution demonstrates how Amazon Bedrock Flows can transform complex email processing into a structured, intelligent workflow. The architecture comprises three key components, as shown in the following diagram: orchestration, structured data extraction, and intelligent response generation.

Orchestration

Amazon Bedrock Flows serves as the central orchestrator, managing the entire email processing pipeline. When a client email arrives through Microsoft Teams, the workflow proceeds through the following stages:

  • The workflow initiates through Amazon API Gateway, which takes the email and uses an AWS Lambda function to extract the text contained in the email and store it in Amazon Simple Storage Service (Amazon S3).
  • Amazon Bedrock Flows coordinates the sequence of operations, starting with the email from Amazon S3 (see the invocation sketch after this list).
  • Version management streamlines controlled testing of prompt variations.
  • Built-in conditional logic handles different processing paths.
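The following minimal sketch shows one way such a flow can be invoked from the ingestion Lambda function using the InvokeFlow API. The flow and alias IDs are placeholders, and the node names assume the default flow input node; this is not Parameta’s actual code.

import boto3

# Placeholder flow and alias identifiers.
FLOW_ID = "FLOW1234"
FLOW_ALIAS_ID = "ALIAS1234"

client = boto3.client("bedrock-agent-runtime")

def triage_email(email_text):
    """Send the raw email text into the Amazon Bedrock flow and collect the flow outputs."""
    response = client.invoke_flow(
        flowIdentifier=FLOW_ID,
        flowAliasIdentifier=FLOW_ALIAS_ID,
        inputs=[{
            "content": {"document": email_text},
            "nodeName": "FlowInputNode",     # assumption: default input node name
            "nodeOutputName": "document",
        }],
    )
    outputs = []
    for event in response["responseStream"]:   # the flow returns an event stream
        if "flowOutputEvent" in event:
            outputs.append(event["flowOutputEvent"]["content"]["document"])
    return outputs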

Structured data extraction

A sequence of specialized prompts within the flow handles the critical task of information processing:

  • The classification prompt identifies the type of technical inquiry
  • The entity extraction prompt discovers key data points
  • The validation prompt verifies completeness of required information

These prompts work in concert to transform unstructured emails into actionable data, with each prompt optimized for its specific task.

Intelligent response generation

The final stage uses advanced AI capabilities for response creation:

  • An Amazon Bedrock agent synthesizes information from multiple sources.
  • Response generation adapts based on validation results:
    • Specific information requests for incomplete queries
    • Comprehensive solutions for complete inquiries
  • Delivery back to clients using Microsoft Teams

The following diagram illustrates the flow for the email triaging system.

This structured approach allows Parameta to maintain consistent, high-quality responses while significantly reducing processing time for client inquiries.

Solution walkthrough

Let’s walk through how Parameta’s email triage system processes a typical client inquiry. We start with the following sample client email:

Dear Support Team,

Could you please verify the closing price for the Dollar ATM swaption (USD_2Y_1Y) as of March 15, 2024? We need this for our end-of-day reconciliation.

Best regards,

John Smith

Portfolio Manager, ABC Investments

The classification prompt classifies this as a price verification request based on the content and intent. It uses the email as the input, and the output is type: price_verification_request.

The entity extraction prompt uses the preceding email, and provides the following output:

{
"product_type": "Interest Rate Option",
"ticker": "USD_2Y_1Y",
"date_requested": "2024-03-15",
"data_source": "ICAP",
"request_type": "closing_price"
}

The workflow then performs validation using Amazon Bedrock Flows. This requires the following checks for price verification:

  • Product identifier (USD_2Y_1Y present)
  • Date (March 15, 2024 present)
  • Price type (closing specified)

When all required entities are found, the workflow proceeds to the Amazon Bedrock agent.

The agent submits the following query to the knowledge base: “Product specifications and market context for Interest Rate option USD_2Y_1Y.”

The agent submits the following Snowflake query (using Athena):

SELECT closing_price, settlement_date, currency, market_status
FROM otc_derivatives.daily_prices
WHERE instrument_id = 'USD_2Y_1Y'
AND trade_date = '2024-03-15'
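The post describes this as a Snowflake query run through Athena; the following minimal sketch shows how an agent action group Lambda function might execute such a lookup with the Amazon Athena APIs. The database name, output location, and table layout are assumptions, not Parameta’s implementation.

import time
import boto3

athena = boto3.client("athena")

# Hypothetical database and results location for the agent's action group.
DATABASE = "otc_derivatives"
OUTPUT_LOCATION = "s3://example-athena-results/"

def get_closing_price(instrument_id, trade_date):
    """Run the price lookup through Amazon Athena and return the result rows."""
    # Note: a production implementation should validate/parameterize inputs instead of f-strings.
    query = (
        "SELECT closing_price, settlement_date, currency, market_status "
        "FROM daily_prices "
        f"WHERE instrument_id = '{instrument_id}' AND trade_date = '{trade_date}'"
    )
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]
    while True:   # simple polling loop; use backoff in production
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]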

The generated response is as follows:

Dear John,

Regarding your price verification request for USD_2Y_1Y (USD ATM Swaption Implied Volatility Mid 2Y vs 1Y) as of March 15, 2024:
The closing price has been verified in our records. The market was active on the requested date with normal trading volumes.

Please note this information is provided under your existing data service agreement.

If you need any further clarification, please don’t hesitate to ask.

Best regards,

Parameta Support

Benefits

Parameta quickly transitioned from implementation to achieving impactful results, thanks to the substantial benefits provided by Amazon Bedrock Flows across various areas:

  • Operational efficiency
    • Development teams accelerated prompt optimization by quickly testing different variations for email classification and entity extraction
    • Time-to-insight reduced from weeks to days through rapid prompt iteration and immediate feedback on performance
    • Quick adjustments to validation rules without rebuilding the entire workflow
  • Team collaboration
    • Modification of prompts through a simplified interface without deep AWS knowledge
    • Support teams gained the ability to understand and adjust the response process
    • Cross-functional teams collaborated on prompt improvements using familiar interfaces
  • Model transparency
    • Clear visibility into why emails were classified into specific categories
    • Understanding of entity extraction decisions helped refine prompts for better accuracy
    • Ability to trace decisions through the workflow enhanced trust in automated responses
  • Observability and governance
    • Comprehensive observability provided stakeholders with a holistic view of the end-to-end process
    • Built-in controls provided appropriate oversight of the automated system, aligning with governance and compliance requirements
    • Transparent workflows enabled stakeholders to monitor, audit, and refine the system effectively, providing accountability and reliability

These benefits directly translated to Parameta’s business objectives: faster response times to client queries, more accurate classifications, and improved ability to maintain and enhance the system across teams. The structured yet flexible nature of Amazon Bedrock Flows enabled Parameta to achieve these gains while maintaining control over their critical client communications.

Key takeaways and best practices

When implementing Amazon Bedrock Flows, consider these essential learnings:

  • Prompt design principles
    • Design modular prompts that handle specific tasks for better maintainability of the system
    • Keep prompts focused and concise to optimize token usage
    • Include clear input and output specifications for better maintainability and robustness
    • Diversify model selection for different tasks within the flow:
      • Use lighter models for simple classifications
      • Reserve advanced models for complex reasoning
      • Create resilience through model redundancy
  • Flow architecture
    • Start with a clear validation strategy early in the flow
    • Include error handling in prompt design
    • Consider breaking complex flows into smaller, manageable segments
  • Version management
    • Implement proper continuous integration and continuous delivery (CI/CD) pipelines for flow deployment
    • Establish approval workflows for flow changes
    • Document flow changes and their impact including metrics
  • Testing and implementation
    • Create comprehensive test cases covering a diverse set of scenarios
    • Validate flow behavior with sample datasets
    • Constantly monitor flow performance and token usage in production
    • Start with smaller workflows and scale gradually
  • Cost optimization
    • Review and optimize prompt lengths regularly
    • Monitor token usage patterns
    • Balance between model capability and cost when selecting models

Consider these practices derived from real-world implementation experience to help successfully deploy Amazon Bedrock Flows while maintaining efficiency and reliability.

Testimonials

“As the CIO of our company, I am thoroughly impressed by how rapidly our team was able to leverage Amazon Bedrock Flows to create an innovative solution to a complex business problem. The low barrier to entry of Amazon Bedrock Flows allowed our team to quickly get up to speed and start delivering results. This tool is democratizing generative AI, making it easier for everyone in the business to get hands-on with Amazon Bedrock, regardless of their technical skill level. I can see this tool being incredibly useful across multiple parts of our business, enabling seamless integration and efficient problem-solving.”

– Roland Anderson, CIO at Parameta Solutions

“As someone with a tech background, using Amazon Bedrock Flows for the first time was a great experience. I found it incredibly intuitive and user-friendly. The ability to refine prompts based on feedback made the process seamless and efficient. What impressed me the most was how quickly I could get started without needing to invest time in creating code or setting up infrastructure. The power of generative AI applied to business problems is truly transformative, and Amazon Bedrock has made it accessible for tech professionals like myself to drive innovation and solve complex challenges with ease.”

– Martin Gregory, Market Data Support Engineer, Team Lead at Parameta Solutions

Conclusion

In this post, we showed how Parameta uses Amazon Bedrock Flows to build an intelligent client email processing workflow that reduces resolution times from days to minutes while maintaining high accuracy and control. As organizations increasingly adopt generative AI, Amazon Bedrock Flows offers a balanced approach, combining the flexibility of LLMs with the structure and control that enterprises require.

For more information, refer to Build an end-to-end generative AI workflow with Amazon Bedrock Flows. For code samples, see Run Amazon Bedrock Flows code samples. Visit the Amazon Bedrock console to start building your first flow, and explore our AWS Blog for more customer success stories and implementation patterns.


About the Authors

Siokhan Kouassi is a Data Scientist at Parameta Solutions with expertise in statistical machine learning, deep learning, and generative AI. His work focuses on implementing efficient ETL data analytics pipelines and on solving business problems through automation, experimenting and innovating with AWS services using a code-first approach with the AWS CDK.

Martin Gregory is a Senior Market Data Technician at Parameta Solutions with over 25 years of experience. He has recently played a key role in transitioning Market Data systems to the cloud, leveraging his deep expertise to deliver seamless, efficient, and innovative solutions for clients.

Talha Chattha is a Senior Generative AI Specialist SA at AWS, based in Stockholm. With 10+ years of experience working with AI, Talha now helps establish practices to ease the path to production for generative AI workloads. Talha is an expert in Amazon Bedrock and supports customers across EMEA. He is passionate about meta-agents, scalable on-demand inference, advanced RAG solutions, and optimized prompt engineering with LLMs. When not shaping the future of AI, he explores the scenic European landscapes and delicious cuisines.

Jumana Nagaria is a Prototyping Architect at AWS, based in London. She builds innovative prototypes with customers to solve their business challenges. She is passionate about cloud computing and believes in giving back to the community by inspiring women to join tech and encouraging young girls to explore STEM fields. Outside of work, Jumana enjoys travelling, reading, painting, and spending quality time with friends and family.

Hin Yee Liu is a prototype Engagement Manager at AWS, based in London. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training.

Read More

Efficiently build and tune custom log anomaly detection models with Amazon SageMaker

Efficiently build and tune custom log anomaly detection models with Amazon SageMaker

In this post, we walk you through the process to build an automated mechanism using Amazon SageMaker to process your log data, run training iterations over it to obtain the best-performing anomaly detection model, and register it with the Amazon SageMaker Model Registry for your customers to use it.

Log-based anomaly detection involves identifying anomalous data points in log datasets for discovering execution anomalies, as well as suspicious activities. It usually comprises parsing log data into vectors or machine-understandable tokens, which you can then use to train custom machine learning (ML) algorithms for determining anomalies.

You can adjust the inputs or hyperparameters for an ML algorithm to obtain a combination that yields the best-performing model. This process is called hyperparameter tuning and is an essential part of machine learning. Choosing appropriate hyperparameter values is crucial for success, and it’s usually performed iteratively by experts, which can be time-consuming. Added to this are the general data-related processes such as loading data from appropriate sources, parsing and processing them with custom logic, storing the parsed data back to storage, and loading them again for training custom models. Moreover, these tasks need to be done repetitively for each combination of hyperparameters, which doesn’t scale well with increasing data and new supplementary steps. You can use Amazon SageMaker Pipelines to automate all these steps into a single execution flow. In this post, we demonstrate how to set up this entire workflow.

Solution overview

Contemporary log anomaly detection techniques such as Drain-based detection [1] or DeepLog [2] follow the same general approach: perform custom processing on logs, train custom anomaly detection models, and obtain the best-performing model with an optimal set of hyperparameters. To build an anomaly detection system using such techniques, you need to write custom scripts for processing as well as for training. SageMaker provides support for developing scripts by extending in-built algorithm containers, or by building your own custom containers. Moreover, you can combine these steps as a series of interconnected stages using SageMaker Pipelines. The following figure shows an example architecture:

The workflow consists of the following steps:

  1. The log training data is initially stored in an Amazon Simple Storage Service (Amazon S3) bucket, from where it’s picked up by the SageMaker processing step of the SageMaker pipeline.
  2. After the pipeline is started, the processing step loads the Amazon S3 data into SageMaker containers and runs custom processing scripts that parse and process the logs before uploading them to a specified Amazon S3 destination. This processing could be either decentralized with a single script running on one or more instances, or it could be run in parallel over multiple instances using a distributed framework like Apache Spark. We discuss both approaches in this post.
  3. After processing, the data is automatically picked up by the SageMaker tuning step, where multiple training iterations with unique hyperparameter combinations are run for the custom training script.
  4. Finally, the SageMaker model step creates a SageMaker model using the best-trained model obtained from the tuning step and registers it to the SageMaker Model Registry for consumers to use. These consumers, for example, could be testers, who use models trained on different datasets by different pipelines to compare their effectiveness and generality, before deploying them to a public endpoint.

We walk through implementing the solution with the following high-level steps:

  1. Perform custom data processing, using either a decentralized or distributed approach.
  2. Write custom SageMaker training scripts that automatically tune the resulting models with a range of hyperparameters.
  3. Select the best-tuned model, create a custom SageMaker model from it, and register it to the SageMaker Model Registry.
  4. Combine all the steps in a SageMaker pipeline and run it.

Prerequisites

You should have the following prerequisites:

Process the data

To start, upload the log dataset to an S3 bucket in your AWS account. You can use the AWS Command Line Interface (AWS CLI) using Amazon S3 commands, or use the AWS Management Console. To process the data, you use a SageMaker processing step as the first stage in your SageMaker pipeline. This step spins up a SageMaker container and runs a script that you provide for custom processing. There are two ways to do this: decentralized or distributed processing. SageMaker provides Processor classes for both approaches. You can choose either approach for your custom processing depending on your use case.

Decentralized processing with ScriptProcessor

In the decentralized approach, a single custom script runs on one or more standalone instances and processes the input data. The SageMaker Python SDK provides the ScriptProcessor class, which you can use to run your custom processing script in a SageMaker processing step. For small datasets, a single instance can usually suffice for performing data processing. Increasing the number of instances is recommended if your dataset is large and can be split into multiple independent components, which can all be processed separately (this can be done using the ShardedByS3Key parameter, which we discuss shortly).

If you have custom dependencies (which can often be the case during R&D processes), you can extend an existing container and customize it with your dependencies before providing it to the ScriptProcessor class. For example, if you’re using the Drain technique, you need the logparser Python library for log parsing, in which case you write a simple Dockerfile that installs it along with the usual Python ML libraries:

FROM python:3.7-slim-buster
RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3 logparser3 boto3
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

You can use a Python SageMaker notebook instance in your AWS account to create such a Dockerfile and save it to an appropriate folder, such as docker. To build a container using this Dockerfile, enter the following code into a main driver program in a Jupyter notebook on your notebook instance:

import boto3
from sagemaker import get_execution_role

region = boto3.session.Session().region_name
role = get_execution_role()
account_id = boto3.client("sts").get_caller_identity().get("Account")
ecr_repository = "sagemaker-processing-my-container"
tag = ":latest"

uri_suffix = "amazonaws.com"
if region in ["cn-north-1", "cn-northwest-1"]:
    uri_suffix = "amazonaws.com.cn"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

This code creates an Amazon Elastic Container Registry (Amazon ECR) repository where your custom container image will be stored (the repository will be created if it’s not already present). The container image is then built, tagged with the repository name (and :latest), and pushed to the ECR repository.

The next step is writing your actual processing script. For more information on writing a processing script using ScriptProcessor, refer to Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation. The following are a few key points to remember:

  • A SageMaker processing step loads the data from an input location (Amazon S3 or local developer workspace) to an input path specified by you under the /opt/ml/processing directory of your container. It then runs your script in the container and uploads the output data from your specified path under /opt/ml/processing to an Amazon S3 destination you’ve specified.
  • Customer log datasets can sometimes consist of multiple subsets without any inter-dependencies amongst them. For these cases, you can parallelize your processing by making your processing script run over multiple instances in a single processing step, with each instance processing one of these independent subsets. It's a good practice to write the script so that each instance works only on its own subset and runs independently of the others, which avoids duplicating work across instances. A minimal example script is sketched after this list.
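
The following is a minimal sketch of what such a preprocessing.py could look like. The file pattern, the simple line parser, and the output file name are illustrative placeholders for your own parsing logic (for example, Drain via the logparser3 library installed in the custom container):

# preprocessing.py -- minimal sketch of a custom processing script (illustrative only).
# SageMaker mounts the S3 input under /opt/ml/processing/input and uploads whatever
# the script writes to /opt/ml/processing/train back to the S3 output destination.
import glob
import os

import pandas as pd

INPUT_DIR = "/opt/ml/processing/input"
OUTPUT_DIR = "/opt/ml/processing/train"


def parse_line(line: str) -> dict:
    # Placeholder for real log parsing (for example, Drain via logparser3).
    parts = line.rstrip("\n").split(" ", 2)
    return {"timestamp": parts[0], "message": parts[-1]}


if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    records = []
    # Each instance only sees the shard of files mounted into its container,
    # so the same script runs unchanged on one or many instances.
    for path in glob.glob(os.path.join(INPUT_DIR, "*.log")):
        with open(path) as f:
            records.extend(parse_line(line) for line in f)
    pd.DataFrame(records).to_csv(os.path.join(OUTPUT_DIR, "parsed_logs.csv"), index=False)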

When your script is ready, you can instantiate the SageMaker ScriptProcessor class for running it on your custom container (created in the previous step) by adding the following code to your driver program:

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    ScriptProcessor,
)
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

script_processor = ScriptProcessor(
    command=["python3"],
    image_uri=processing_repository_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

script_processor_run_args = script_processor.run(
    code="preprocessing.py",
    inputs=[
        ProcessingInput(
            source="s3://amzn-s3-demo-bucket-pca-detect/processing_input/",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="training", source="/opt/ml/processing/train")
    ],
)

step_processing = ProcessingStep(
    name="PreprocessStep",
    step_args=script_processor_run_args,
)

In the preceding code, a ScriptProcessor class is instantiated to run the python3 command for your custom Python script. You provide the following information:

  • You provide the ECR URI of your custom container image and pass the SageMaker PipelineSession to the class. When you specify the PipelineSession, the ScriptProcessor doesn't actually begin execution when you call its run() method; rather, it defers until the SageMaker pipeline as a whole is invoked.
  • In the run() method, you specify the preprocessing script along with the appropriate ProcessingInput and ProcessingOutput objects. These specify where the data will be mounted in your custom container from Amazon S3, and where it will later be uploaded in Amazon S3 from your container's output folder. The output channel is named training, and the final Amazon S3 output location will be s3://<amzn-s3-demo-bucket-pca-detect>/<job-name>/output/<output-name>.

You can also control how the input data is distributed across instances through a distribution setting, which can be either ShardedByS3Key or FullyReplicated, depending on whether you're splitting your S3 dataset across multiple ScriptProcessor instances or replicating a full copy to each one. You specify the number of instances in the instance_count parameter of your ScriptProcessor class.
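
With the SageMaker Python SDK, one common way to express this choice is the s3_data_distribution_type argument of ProcessingInput, paired with instance_count on the processor. The following sketch shows the sharded variant and reuses the bucket path from the earlier example:

# Sketch: shard the S3 prefix across instances instead of replicating it to each one.
from sagemaker.processing import ProcessingInput

sharded_input = ProcessingInput(
    source="s3://amzn-s3-demo-bucket-pca-detect/processing_input/",
    destination="/opt/ml/processing/input",
    s3_data_distribution_type="ShardedByS3Key",  # default is "FullyReplicated"
)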

Once instantiated, you pass the arguments returned by the processor's run() call to the SageMaker processing step along with an appropriate name.

Distributed processing with PySparkProcessor

An alternative to the decentralized processing is distributed processing. Distributed processing is particularly effective when you need to process large amounts of log data. Apache Spark is a popular engine for distributed data processing. It uses in-memory caching and optimized query execution for fast analytic queries against datasets of all sizes. SageMaker provides the PySparkProcessor class within the SageMaker Python SDK for running Spark jobs. For an example of performing distributed processing with PySparkProcessor on SageMaker processing, see Distributed Data Processing using Apache Spark and SageMaker Processing. The following are a few key points to note:

  • To install custom dependencies in your Spark container, you can either build a custom container image (similar to the decentralized processing example) or use the subprocess Python module to install them with pip at runtime. For example, to run the anomaly detection technique on Spark, you need the argformat module, which you can install along with other dependencies as follows:
import subprocess
subprocess.run(["pip3", "install", "argformat", "scipy", "scikit-learn", "logparser3"])
  • Spark transformations are powerful operations to process your data, and Spark actions are the operations that actually perform the requested transformations on your data. The collect() method is a Spark action that brings all the data from worker nodes to the main driver node. It’s a good practice to use it in conjunction with filter functions so you don’t run into memory issues when working with large log datasets.
  • You should also try to partition your input data based on the total number of cores you plan to have in your SageMaker cluster. The official Spark recommendation is to have approximately 2–3 times the number of partitions as the total number of cores in your cluster (see the sketch after this list).
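
The following is a minimal sketch of how these two recommendations might look inside a Spark processing script. The input path, the ERROR filter, and the partition count (based on three ml.m5.xlarge instances with four vCPUs each) are illustrative assumptions:

# Sketch: repartition by available cores and filter before collecting.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-preprocessing").getOrCreate()

# One row per raw log line.
logs = spark.read.text("/opt/ml/processing/input/")

# Three ml.m5.xlarge instances = 12 vCPUs; aim for roughly 2-3 partitions per core.
logs = logs.repartition(24)

# Filter first so collect() only brings a small result set back to the driver.
error_lines = logs.filter(F.col("value").contains("ERROR")).collect()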

When your Spark processing script is ready, you can instantiate the SageMaker PySparkProcessor class for running it by adding the following lines to your driver program:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

spark_processor = PySparkProcessor(
    base_job_name="hdfs-spark-job",
    framework_version="3.1",
    role=role,
    sagemaker_session=pipeline_session,
    instance_count=3,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=6000,
)

spark_processor_run_args = spark_processor.run(
    submit_app="./sagemaker_spark_processing.py",
    spark_event_logs_s3_uri="s3://amzn-s3-demo-bucket-pca-detect/logs/spark_event_logs",
    logs=True,
)

step_processing = ProcessingStep(
    name="SparkPreprocessStep",
    step_args=spark_processor_run_args,
)

The preceding code instantiates a PySparkProcessor instance with three nodes in the SageMaker cluster with Spark v3.1 installed in them. You submit your Spark processing code to it along with the Amazon S3 location where your event logs would be uploaded. These logs can be useful for debugging.

In the run() method invocation, you don’t need to specify your inputs and outputs, which can be the case if these are fixed Amazon S3 destinations already known to your processing code. Otherwise, you can specify them using the ProcessingInput and ProcessingOutput parameters just like in the decentralized example.

Post-instantiation, the step arguments returned by the PySparkProcessor run() call are passed to a SageMaker processing step with an appropriate name. Its execution won't be triggered until the pipeline is run.

Train and tune the model

Now that your processing steps are complete, you can proceed to the model training step. The training algorithm could either be a classical anomaly detection model like Drain-based detection or a neural-network based model like DeepLog. Every model takes in certain hyperparameters that influence how the model is trained. To obtain the best-performing model, the model is usually executed and validated multiple times over a wide range of hyperparameters. This can be a time-consuming manual process and can instead be automated using SageMaker hyperparameter tuning jobs. Tuning jobs perform hyperparameter optimization by running your training script with a specified range of hyperparameter values and obtaining the best model based on the metrics you specify. You can predefine these metrics if you use built-in SageMaker algorithms or define them for your custom training algorithm.

You first need to write your training script for your anomaly detection model. Keep the following in mind:

  • SageMaker makes artifacts available to your container under the /opt/ml container directory. You should use this when fetching your artifacts. For more details on the SageMaker container structure, see SageMaker AI Toolkits Containers Structure.
  • For using a tuning job, you need to make sure that your code doesn't hardcode hyperparameter values but instead reads them from the /opt/ml/input/config/hyperparameters.json file in your container, where SageMaker places it.
  • When using a custom training script, you also need to add a custom training metric to your script that can be used by the tuning job to find the best model. For this, you should print your desired metrics in your training script using a logger or print function. For example, you could print out custom_metric_value: 91, which indicates that your custom metric's value is 91. We demonstrate later in this post how SageMaker can be informed about this metric. A minimal training script sketch follows this list.
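
The following is a minimal sketch of what such a training script could look like. The PCA model, the feature file name (parsed_logs.csv), and the use of scikit-learn and joblib are illustrative assumptions that stand in for your actual anomaly detection algorithm and container dependencies:

# train.py -- minimal sketch of a custom training script (illustrative only).
import json
import os

import joblib
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

HYPERPARAMS_PATH = "/opt/ml/input/config/hyperparameters.json"
TRAIN_CHANNEL = "/opt/ml/input/data/training"
MODEL_DIR = "/opt/ml/model"

if __name__ == "__main__":
    # SageMaker passes hyperparameters as strings, so cast them explicitly.
    with open(HYPERPARAMS_PATH) as f:
        hyperparams = json.load(f)
    max_components = int(hyperparams.get("max_components", 5))

    # Assumes the processed data contains numeric feature columns.
    features = pd.read_csv(os.path.join(TRAIN_CHANNEL, "parsed_logs.csv"))
    numeric = features.select_dtypes(include=[np.number]).to_numpy()

    model = PCA(n_components=min(max_components, numeric.shape[1]))
    model.fit(numeric)

    # Log the tuning metric in the exact format the metric regex expects.
    score = float(np.sum(model.explained_variance_ratio_)) * 100
    print(f"custom_metric_value: {score:.2f}")

    joblib.dump(model, os.path.join(MODEL_DIR, "model.joblib"))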

When your training script is ready, you can use it inside a SageMaker container. SageMaker provides a wide range of built-in algorithm containers that you can use to run your training code. However, there might be cases when you need to build your own training containers, for example when you need custom libraries installed or you plan to use a new algorithm that SageMaker doesn't provide built in. In such a case, you can build your own containers in two ways:

  • Extend one of the prebuilt SageMaker framework or algorithm containers with your additional dependencies.
  • Build a fully custom container image (similar to the processing container created earlier) and push it to Amazon ECR.

After you create your training container image, you need to define the hyperparameter ranges for your tuning job. For example, if you’re using a custom adaptation of the PCA algorithm (like in Drain-based detection), you add the following lines to your driver program:

from sagemaker.tuner import IntegerParameter

hyperparameter_ranges = {
    "max_components": IntegerParameter(1, 30, scaling_type="Auto")
}

The preceding code indicates that your hyperparameter max_components is an integer and it ranges from 1–30. The auto scaling type indicates that SageMaker will choose the best scale for hyperparameter changes. For more details on other scaling options, see Hyperparameter scaling types.

Then you can use the following code to fully configure your training and tuning steps in the driver program:

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner
from sagemaker.workflow.steps import TuningStep

metric_definitions = [
    {"Name": "custom_metric", "Regex": r"custom_metric_value: ([0-9\.]+)"}
]

estimator = Estimator(
    image_uri=training_image_uri,
    role=role,
    base_job_name="new-training-job",
    sagemaker_session=pipeline_session,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://amzn-s3-demo-bucket-pca-detect/models/",
    metric_definitions=metric_definitions,
)

parameter_tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="custom_metric",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=30,
    max_parallel_jobs=5,
    strategy="Bayesian",
    objective_type="Maximize",
    early_stopping_type="Auto",
)

hpo_args = parameter_tuner.fit(
    inputs={
        "training": TrainingInput(
            s3_data=step_processing.properties.ProcessingOutputConfig.Outputs["training"].S3Output.S3Uri,
            s3_data_type="S3Prefix",
            distribution="FullyReplicated",
        )
    }
)

step_tuning = TuningStep(
    name="AnomalyDetectionTuning",
    step_args=hpo_args,
)

In the preceding code, a SageMaker Estimator instance is created using your custom training image’s ECR URI. SageMaker Estimators help in training your models and orchestrating their training lifecycles. The Estimator is provided with a suitable role and the PipelineSession is designated as its SageMaker session.

You provide the location where your trained model should be stored to the Estimator and supply it with custom metric definitions that you created. For the example metric custom_metric_value: 91, the definition to the Estimator includes its name along with its regex. The regex informs SageMaker how to pick up the metric’s values from training logs in Amazon CloudWatch. The tuning job uses these values to find the best-performing model. You also specify where the output model should be uploaded in the output_path parameter.

You then use this Estimator to instantiate your HyperparameterTuner. Its parameters include the total and maximum parallel number of training jobs, search strategy (for more details on strategies, see Understand the hyperparameter tuning strategies available in Amazon SageMaker AI), and whether you want to use early stopping. Early stopping can be set to Auto so that SageMaker automatically stops model training when it doesn’t see improvements in your custom logged metric.

After the HyperparameterTuner is instantiated, you can call its fit() method. In its input parameter, you specify the output Amazon S3 URI from the processing step as the input location for obtaining training data in your tuning step. This way, you don't need to specify the Amazon S3 URI yourself and it's passed between steps implicitly. You also specify the S3Prefix data type and the distribution setting depending on whether you're using multiple instances or not.

Once instantiated, the HyperparameterTuner is passed to the tuning step, where it becomes part of your SageMaker pipeline. The training configuration is now complete!

Register the model

You can now choose the best model from the tuning step to create a SageMaker model and publish it to the SageMaker Model Registry. You can use the following driver program code:

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.workflow.model_step import ModelStep

best_model = Model(
    image_uri=training_image_uri,
    model_data=step_tuning.get_top_model_s3_uri(
        top_k=0,
        s3_bucket="amzn-s3-demo-bucket-pca-detect",
        prefix="models",
    ),
)

pipeline_model = PipelineModel(
    models=[best_model],
    role=role,
    sagemaker_session=pipeline_session,
)

register_model_step_args = pipeline_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="PCAAnomalyDetection",
)

step_model_registration = ModelStep(
    name="NewRegistry",
    step_args=register_model_step_args,
)

The code instantiates a SageMaker model using the Amazon S3 URI of the best model obtained from the tuning step. The top_k attribute of the get_top_model_s3_uri() method indicates that you’re interested in only obtaining the best-trained model.

After the model is instantiated, you can use it to create a SageMaker PipelineModel so that your pipeline can work directly with your model. You then call the register() method of PipelineModel to register your model to the SageMaker Model Registry. In the register() call, you specify the name of the new model package group where your model will be registered, along with the content types it accepts in inference requests and returns in responses.

Finally, a SageMaker ModelStep is invoked with the instantiated PipelineModel to carry out the model registration process.

Create and run a pipeline

You’ve now reached the final step where all your steps will be tied together in a SageMaker pipeline. Add the following code to your driver program to complete your pipeline creation steps:

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="Anomaly-Detection-Pipeline",
    steps=[
        step_processing,
        step_tuning,
        step_model_registration,
    ],
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role)
pipeline.start()

This code instantiates the SageMaker Pipeline construct and provides it with all the steps defined so far: processing, tuning, and model registration. The upsert() call creates (or updates) the pipeline using the provided role, and start() then runs it.

The pipeline invocation could be on-demand using code (using pipeline.start() as shown earlier) or it could be event-driven using Amazon EventBridge rules. For example, you can create an EventBridge rule that triggers when new training data is uploaded to your S3 buckets and specify your SageMaker pipeline as the target for this rule. This makes sure that when new data is uploaded to your training bucket, your SageMaker pipeline is automatically invoked. For more details on SageMaker and EventBridge integration, refer to Schedule Pipeline Runs.
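
As a sketch of the event-driven option, the following Boto3 snippet creates such a rule. The rule name, bucket, account ID, pipeline ARN, and IAM role ARN are placeholders, and the bucket must have EventBridge notifications enabled for "Object Created" events to be emitted:

# Sketch: start the pipeline whenever a new object lands in the training bucket.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="start-anomaly-detection-pipeline",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["amzn-s3-demo-bucket-pca-detect"]}},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="start-anomaly-detection-pipeline",
    Targets=[{
        "Id": "anomaly-detection-pipeline",
        # Placeholder ARNs; EventBridge needs a role it can assume to start the pipeline.
        "Arn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/Anomaly-Detection-Pipeline",
        "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeSageMakerPipelineRole",
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
)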

On invocation, your SageMaker pipeline runs your custom processing script in the processing step and uploads the processed data to your specified Amazon S3 destination. It then starts a tuning job with your custom training code, iteratively trains multiple models with your supplied hyperparameters, and selects the best model based on your custom metric. The following screenshot shows that the best model was selected when tuning was complete.

Finally, the best model is selected, and a model package resource is created with it in your model registry. Your customers can use it to deploy your model.

You have now completed all the steps in processing, training, tuning, and registering your custom anomaly detection model automatically with the aid of a SageMaker pipeline that was initiated using your driver program.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the SageMaker notebook instance used for this post.
  2. Delete the model package resource that was created using the best-tuned model.
  3. Delete any Amazon S3 data that was used for this post.

Conclusion

In this post, we demonstrated the building, training, tuning, and registering of an anomaly detection system with custom processing code, custom training code, and custom training metrics. We ran these steps automatically with the aid of a SageMaker pipeline, which was run by invoking a single main driver program. We also discussed the different ways of processing our data, and how it could be done using the various constructs and tools that SageMaker provides in a user-friendly and straightforward manner.

Try this approach for building your own custom anomaly detection model, and share your feedback in the comments.

References

[1] https://ieeexplore.ieee.org/document/8029742

[2] https://dl.acm.org/doi/pdf/10.1145/3133956.3134015


About the Author

Nitesh Sehwani is an SDE with the EC2 Threat Detection team where he’s involved in building large-scale systems that provide security to our customers. In his free time, he reads about art history and enjoys listening to mystery thrillers.

Optimizing costs of generative AI applications on AWS

The report The economic potential of generative AI: The next productivity frontier, published by McKinsey & Company, estimates that generative AI could add an equivalent of $2.6 trillion to $4.4 trillion in value to the global economy. The largest value will be added across four areas: customer operations, marketing and sales, software engineering, and R&D.

The potential for such large business value is galvanizing tens of thousands of enterprises to build their generative AI applications in AWS. However, many product managers and enterprise architect leaders want a better understanding of the costs, cost-optimization levers, and sensitivity analysis.

This post addresses these cost considerations so you can optimize your generative AI costs in AWS.

The post assumes a basic familiarity with foundation models (FMs) and large language models (LLMs), tokens, vector embeddings, and vector databases in AWS. With Retrieval Augmented Generation (RAG) being one of the most common frameworks used in generative AI solutions, the post explains costs in the context of a RAG solution and the respective optimization pillars on Amazon Bedrock.

In Part 2 of this series, we will cover how to estimate business value and the influencing factors.

Cost and performance optimization pillars

Designing performant and cost-effective generative AI applications is essential for realizing the full potential of this transformative technology and driving widespread adoption within your organization.

Forecasting and managing costs and performance in generative AI applications is driven by the following optimization pillars:

  • Model selection, choice, and customization – We define these as follows:
    • Model selection – This process involves identifying the optimal model that meets a wide variety of use cases, followed by model validation, where you benchmark against high-quality datasets and prompts to identify successful model contenders.
    • Model choice – This refers to the choice of an appropriate model because different models have varying pricing and performance attributes.
    • Model customization – This refers to choosing the appropriate techniques to customize the FMs with training data to optimize the performance and cost-effectiveness according to business-specific use cases.
  • Token usage – Analyzing token usage consists of the following:
    • Token count – The cost of using a generative AI model depends on the number of tokens processed. This can directly impact the cost of an operation.
    • Token limits – Understanding token limits and what drives token count, and putting guardrails in place to limit token count can help you optimize token costs and performance.
    • Token caching – Caching at the application layer or LLM layer for commonly asked user questions can help reduce the token count and improve performance.
  • Inference pricing plan and usage patterns – We consider two pricing options:
    • On-Demand – Ideal for most models, with charges based on the number of input/output tokens, with no guaranteed token throughput.
    • Provisioned Throughput – Ideal for workloads demanding guaranteed throughput, but with relatively higher costs.
  • Miscellaneous factors – Additional factors can include:
    • Security guardrails – Applying content filters for personally identifiable information (PII), harmful content, undesirable topics, and detecting hallucinations improves the safety of your generative AI application. These filters can perform and scale independently of LLMs and have costs that are directly proportional to the number of filters and the tokens examined.
    • Vector database – The vector database is a critical component of most generative AI applications. As the amount of data usage in your generative AI application grows, vector database costs can also grow.
    • Chunking strategy – Chunking strategies such as fixed size chunking, hierarchical chunking, or semantic chunking can influence the accuracy and costs of your generative AI application.

Let’s dive deeper to examine these factors and associated cost-optimization tips.

Retrieval Augmented Generation

RAG helps an LLM answer questions specific to your corporate data, even though the LLM was never trained on your data.

As illustrated in the following diagram, the generative AI application reads your corporate trusted data sources, chunks it, generates vector embeddings, and stores the embeddings in a vector database. The vectors and data stored in a vector database are often called a knowledge base.

The generative AI application uses the vector embeddings to search and retrieve chunks of data that are most relevant to the user’s question and augment the question to generate the LLM response. The following diagram illustrates this workflow.

The workflow consists of the following steps:

  1. A user asks a question using the generative AI application.
  2. A request to generate embeddings is sent to the LLM.
  3. The LLM returns embeddings to the application.
  4. These embeddings are searched against vector embeddings stored in a vector database (knowledge base).
  5. The application receives context relevant to the user question from the knowledge base.
  6. The application sends the user question and the context to the LLM.
  7. The LLM uses the context to generate an accurate and grounded response.
  8. The application sends the final response back to the user.

Amazon Bedrock is a fully managed service providing access to high-performing FMs from leading AI providers through a unified API. It offers a wide range of LLMs to choose from.

In the preceding workflow, the generative AI application invokes Amazon Bedrock APIs to send text to an LLM like Amazon Titan Embeddings V2 to generate text embeddings, and to send prompts to an LLM like Anthropic’s Claude Haiku or Meta Llama to generate a response.
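
As a sketch of these two calls with the AWS SDK for Python (Boto3), the following snippet generates an embedding and a grounded answer. The model IDs, prompt, and retrieved context are placeholders; use the models enabled in your account:

# Sketch: one embedding call and one chat call against the Bedrock Runtime API.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Text embedding with Amazon Titan Text Embeddings V2.
embed_response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": "How do I reset my device?"}),
)
embedding = json.loads(embed_response["body"].read())["embedding"]

# Grounded answer with Anthropic's Claude 3 Haiku via the Converse API.
answer = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": "Context: <retrieved chunks>\n\nQuestion: How do I reset my device?"}],
    }],
    inferenceConfig={"maxTokens": 512},
)
print(answer["output"]["message"]["content"][0]["text"])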

The generated text embeddings are stored in a vector database such as Amazon OpenSearch Service, Amazon Relational Database Service (Amazon RDS), Amazon Aurora, or Amazon MemoryDB.

A generative AI application such as a virtual assistant or support chatbot might need to carry a conversation with users. A multi-turn conversation requires the application to store a per-user question-answer history and send it to the LLM for additional context. This question-answer history can be stored in a database such as Amazon DynamoDB.

The generative AI application could also use Amazon Bedrock Guardrails to detect off-topic questions, ground responses to the knowledge base, detect and redact PII information, and detect and block hate or violence-related questions and answers.

Now that we have a good understanding of the various components in a RAG-based generative AI application, let’s explore how these factors influence costs while running your application in AWS using RAG.

Directional costs for small, medium, large, and extra large scenarios

Consider an organization that wants to help their customers with a virtual assistant that can answer their questions any time with a high degree of accuracy, performance, consistency, and safety. The performance and cost of the generative AI application depends directly on a few major factors in the environment, such as the velocity of questions per minute, the volume of questions per day (considering peak and off-peak), the amount of knowledge base data, and the LLM that is used.

Although this post explains the factors that influence costs, it can be useful to know the directional costs, based on some assumptions, to get a relative understanding of various cost components for a few scenarios such as small, medium, large, and extra large environments.

The following table is a snapshot of directional costs for four different scenarios with varying volume of user questions per month and knowledge base data.

. | Small | Medium | Large | Extra Large
Inputs | . | . | . | .
Total questions per month | 500,000 | 2,000,000 | 5,000,000 | 7,020,000
Knowledge base data size in GB (actual text size on documents) | 5 | 25 | 50 | 100
Annual costs (directional)* | . | . | . | .
Amazon Bedrock On-Demand costs using Anthropic’s Claude 3 Haiku | $5,785 | $23,149 | $57,725 | $81,027
Amazon OpenSearch Service provisioned cluster costs | $6,396 | $13,520 | $20,701 | $39,640
Amazon Bedrock Titan Text Embeddings V2 costs | $396 | $5,826 | $7,320 | $13,585
Total annual costs (directional) | $12,577 | $42,495 | $85,746 | $134,252
Unit cost per 1,000 questions (directional) | $2.10 | $1.80 | $1.40 | $1.60

These costs are based on assumptions. Costs will vary if assumptions change. Cost estimates will vary for each customer. The data in this post should not be used as a quote and does not guarantee the cost for actual use of AWS services. The costs, limits, and models can change over time.

For the sake of brevity, we use the following assumptions:

  • Amazon Bedrock On-Demand pricing model
  • Anthropic’s Claude 3 Haiku LLM
  • AWS Region us-east-1
  • Token assumptions for each user question:
    • Total input tokens to LLM = 2,571
    • Output tokens from LLM = 149
    • Average of four characters per token
    • Total tokens = 2,720
  • There are other cost components such as DynamoDB to store question-answer history, Amazon Simple Storage Service (Amazon S3) to store data, and AWS Lambda or Amazon Elastic Container Service (Amazon ECS) to invoke Amazon Bedrock APIs. However, these costs are not as significant as the cost components mentioned in the table.

We refer to this table in the remainder of this post. In the next few sections, we cover Amazon Bedrock costs and the key factors influencing them, vector embedding costs, vector database costs, and Amazon Bedrock Guardrails costs. In the final section, we cover how chunking strategies influence some of these cost components.

Amazon Bedrock costs

Amazon Bedrock has two pricing models: On-Demand (used in the preceding example scenario) and Provisioned Throughput.

With the On-Demand model, an LLM has a maximum requests (questions) per minute (RPM) and tokens per minute (TPM) limit. The RPM and TPM are typically different for each LLM. For more information, see Quotas for Amazon Bedrock.

In the extra large use case, with 7 million questions per month, assuming 10 hours per day and 22 business days per month, it translates to 532 questions per minute (532 RPM). This is well below the maximum limit of 1,000 RPM for Anthropic’s Claude 3 Haiku.

With 2,720 average tokens per question and 532 requests per minute, the TPM is 2,720 x 532 = 1,447,040, which is well below the maximum limit of 2,000,000 TPM for Anthropic’s Claude 3 Haiku.
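
The arithmetic behind these numbers can be reproduced in a few lines; the business-day and working-hour figures are the assumptions stated above:

# Directional check for the extra large scenario.
questions_per_month = 7_020_000
business_days_per_month = 22
hours_per_day = 10
tokens_per_question = 2_720

rpm = round(questions_per_month / (business_days_per_month * hours_per_day * 60))  # ~532
tpm = rpm * tokens_per_question                                                    # ~1,447,040

print(f"RPM: {rpm}, TPM: {tpm:,}")
# Both stay below the example limits of 1,000 RPM and 2,000,000 TPM for Claude 3 Haiku.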

However, assume that the user questions grow by 50%. The RPM, TPM, or both might then cross the thresholds. In such cases, where the generative AI application needs to cross the On-Demand RPM and TPM thresholds, you should consider the Amazon Bedrock Provisioned Throughput model.

With Amazon Bedrock Provisioned Throughput, cost is based on a per-model-unit basis. Model units are dedicated for the duration you commit to, such as an hourly, 1-month, or 6-month commitment.

Each model unit offers a certain capacity of maximum tokens per minute. Therefore, the number of model units (and the costs) are determined by the input and output TPM.

With Amazon Bedrock Provisioned Throughput, you incur charges per model unit whether you use it or not. Therefore, the Provisioned Throughput model is relatively more expensive than the On-Demand model.

Consider the following cost-optimization tips:

  • Start with the On-Demand model and test for your performance and latency with your choice of LLM. This will deliver the lowest costs.
  • If On-Demand can’t satisfy the desired volume of RPM or TPM, start with Provisioned Throughput with a 1-month subscription during your generative AI application beta period. However, for steady state production, consider a 6-month subscription to lower the Provisioned Throughput costs.
  • If there are shorter peak hours and longer off-peak hours, consider using a Provisioned Throughput hourly model during the peak hours and On-Demand during the off-peak hours. This can minimize your Provisioned Throughput costs.

Factors influencing costs

In this section, we discuss various factors that can influence costs.

Number of questions

Cost grows as the number of questions grow with the On-Demand model, as can be seen in the following figure for annual costs (based on the table discussed earlier).

Input tokens

The main sources of input tokens to the LLM are the system prompt, user prompt, context from the vector database (knowledge base), and context from QnA history, as illustrated in the following figure.

As the size of each component grows, the number of input tokens to the LLM grows, and so does the costs.

Generally, user prompts are relatively small. For example, in the user prompt “What are the performance and cost optimization strategies for Amazon DynamoDB?”, assuming four characters per token, there are approximately 20 tokens.

System prompts can be large (and therefore the costs are higher), especially for multi-shot prompts where multiple examples are provided to get LLM responses with better tone and style. If each example in the system prompt uses 100 tokens and there are three examples, that's 300 tokens, which is considerably larger than the actual user prompt.

Context from the knowledge base tends to be the largest. For example, when the documents are chunked and text embeddings are generated for each chunk, assume that the chunk size is 2,000 characters. Assume that the generative AI application sends three chunks relevant to the user prompt to the LLM. This is 6,000 characters. Assuming four characters per token, this translates to 1,500 tokens. This is much higher compared to a typical user prompt or system prompt.

Context from QnA history can also be high. Assume an average of 20 tokens in the user prompt and 100 tokens in LLM response. Assume that the generative AI application sends a history of three question-answer pairs along with each question. This translates to (20 tokens per question + 100 tokens per response) x 3 question-answer pairs = 360 tokens.
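
Putting these components together gives a rough per-question input-token budget. The numbers below simply restate the examples in this section (the directional table earlier assumed a slightly larger 2,571 input tokens per question):

# Illustrative input-token budget for a single question.
system_prompt_tokens = 300              # three few-shot examples at ~100 tokens each
user_prompt_tokens = 20
knowledge_base_context_tokens = 1_500   # three 2,000-character chunks at ~4 characters per token
qna_history_tokens = (20 + 100) * 3     # three prior question-answer pairs

total_input_tokens = (
    system_prompt_tokens
    + user_prompt_tokens
    + knowledge_base_context_tokens
    + qna_history_tokens
)
print(total_input_tokens)  # 2,180 tokens before any other instructions or metadata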

Consider the following cost-optimization tips:

  • Limit the number of characters per user prompt
  • Test the accuracy of responses with various numbers of chunks and chunk sizes from the vector database before finalizing their values
  • For generative AI applications that need to carry a conversation with a user, test with two, three, four, or five pairs of QnA history and then pick the optimal value

Output tokens

The response from the LLM will depend on the user prompt. In general, the pricing for output tokens is three to five times higher than the pricing for input tokens.

Consider the following cost-optimization tips:

  • Because the output tokens are expensive, consider specifying the maximum response size in your system prompt
  • If some users belong to a group or department that requires higher token limits on the user prompt or LLM response, consider using multiple system prompts in such a way that the generative AI application picks the right system prompt depending on the user

Vector embedding costs

As explained previously, in a RAG application, the data is chunked, and text embeddings are generated and stored in a vector database (knowledge base). The text embeddings are generated by invoking the Amazon Bedrock API with an LLM, such as Amazon Titan Text Embeddings V2. This is independent of the Amazon Bedrock model you choose for inferencing, such as Anthropic’s Claude Haiku or other LLMs.

The pricing to generate text embeddings is based on the number of input tokens. The greater the data, the greater the input tokens, and therefore the higher the costs.

For example, with 25 GB of data, assuming four characters per token, input tokens total 6,711 million. With the Amazon Bedrock On-Demand costs for Amazon Titan Text Embeddings V2 as $0.02 per million tokens, the cost of generating embeddings is $134.22.

However, On-Demand has an RPM limit of 2,000 for Amazon Titan Text Embeddings V2. With 2,000 RPM, it will take 112 hours to embed 25 GB of data. Because this is a one-time job of embedding data, this might be acceptable in most scenarios.

For monthly change rate and new data of 5% (1.25 GB per month), the time required will be 6 hours.
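
The following sketch reproduces this directional math. The roughly 500 tokens per embedding request used for the time estimate is an assumption added here (it is not stated above) that lands close to the 112-hour figure:

# Directional math for embedding 25 GB of text with Titan Text Embeddings V2.
data_gb = 25
chars = data_gb * 1024**3                 # treat 1 byte of text as 1 character
tokens_millions = chars / 4 / 1_000_000   # ~6,711 million tokens at 4 characters per token

cost = tokens_millions * 0.02             # ~$134 at $0.02 per million tokens

# Assumption: ~500 tokens per embedding request.
requests = tokens_millions * 1_000_000 / 500
hours = requests / 2_000 / 60             # ~112 hours at the 2,000 RPM On-Demand limit

print(f"tokens: {tokens_millions:,.0f}M, cost: ${cost:,.2f}, time: {hours:,.0f} hours")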

In rare situations where the actual text data is very high in TBs, Provisioned Throughput will be needed to generate text embeddings. For example, to generate text embeddings for 500 GB in 3, 6, and 9 days, it will be approximately $60,000, $33,000, or $24,000 one-time costs using Provisioned Throughput.

Typically, the actual text inside a file is 5–10 times smaller than the file size reported by Amazon S3 or a file system. Therefore, when you see 100 GB size for all your files that need to be vectorized, there is a high probability that the actual text inside the files will be closer to 10–20 GB.

One way to estimate the text size inside files is with the following steps (a short sketch follows the list):

  1. Pick 5–10 sample representations of the files.
  2. Open the files, copy the content, and enter it into a Word document.
  3. Use the word count feature to identify the text size.
  4. Calculate the ratio of this size with the file system reported size.
  5. Apply this ratio to the total file system to get a directional estimate of actual text size inside all the files.
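
A small sketch of this estimate, with made-up sample numbers standing in for your own measurements:

# Sketch: extrapolate the actual text size from a handful of sample files.
sample_text_mb = 12        # word-count check on the copied content of the sample files
sample_file_mb = 95        # size the file system reports for the same sample files
total_file_system_gb = 100 # size reported for the full corpus

ratio = sample_text_mb / sample_file_mb            # ~0.13 with these made-up numbers
estimated_text_gb = total_file_system_gb * ratio   # ~13 GB of actual text to embed
print(f"ratio: {ratio:.2f}, estimated text: {estimated_text_gb:.0f} GB")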

Vector database costs

AWS offers many vector databases, such as OpenSearch Service, Aurora, Amazon RDS, and MemoryDB. As explained earlier in this post, the vector database plays a critical role in grounding responses to your enterprise data whose vector embeddings are stored in a vector database.

The following are some of the factors that influence the costs of vector database. For the sake of brevity, we consider an OpenSearch Service provisioned cluster as the vector database.

  • Amount of data to be used as the knowledge base – Costs are directly proportional to data size. More data means more vectors. More vectors mean more indexes in a vector database, which in turn requires more memory and therefore higher costs. For best performance, it’s recommended to size the vector database so that all the vectors are stored in memory.
  • Index compression – Vector embeddings can be indexed by HNSW or IVF algorithms. The index can also be compressed. Although compressing the indexes can reduce the memory requirements and costs, it might lose accuracy. Therefore, consider doing extensive testing for accuracy before deciding to use compression variants of HNSW or IVF. For example, for a large text data size of 100 GB, assuming 2,000 bytes of chunk size, 15% overlap, vector dimension count of 512, no upfront Reserved Instance for 3 years, and HNSW algorithm, the approximate costs are $37,000 per year. The corresponding costs with compression using hnsw-fp16 and hnsw-pq are $21,000 and $10,000 per year, respectively.
  • Reserved Instances – Cost is inversely proportional to the number of years you reserve the cluster instance that stores the vector database. For example, in the preceding scenario, an On-Demand instance would cost approximately, $75,000 per year, a no upfront 1-year Reserved Instance would cost $52,000 per year, and a no upfront 3-year Reserved Instance would cost $37,000 per year.

Other factors, such as the number of retrievals from the vector database that you pass as context to the LLM, can influence input tokens and therefore costs. But in general, the preceding factors are the most important cost drivers.

Amazon Bedrock Guardrails

Let’s assume your generative AI virtual assistant is supposed to answer questions related to your products for your customers on your website. How will you avoid users asking off-topic questions such as science, religion, geography, politics, or puzzles? How do you avoid responding to user questions on hate, violence, or race? And how can you detect and redact PII in both questions and responses?

The Amazon Bedrock ApplyGuardrail API can help you solve these problems. Guardrails offer multiple policies such as content filters, denied topics, contextual grounding checks, and sensitive information filters (PII). You can selectively apply these filters to all or a specific portion of data such as user prompt, system prompt, knowledge base context, and LLM responses.

Applying all filters to all data will increase costs. Therefore, you should evaluate carefully which filter you want to apply on what portion of data. For example, if you want PII to be detected or redacted from the LLM response, for 2 million questions per month, approximate costs (based on output tokens mentioned earlier in this post) would be $200 per month. In addition, if your security team wants to detect or redact PII for user questions as well, the total Amazon Bedrock Guardrails costs will be $400 per month.
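
As a sketch, the following Boto3 call applies a guardrail only to the model response; the guardrail ID, version, and sample text are placeholders for a guardrail you have already created:

# Sketch: check only the LLM response to limit guardrail costs.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

llm_response_text = "Sure. Your order is registered to jane@example.com."  # placeholder

result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="gr-example-id",   # placeholder guardrail ID
    guardrailVersion="1",
    source="OUTPUT",                       # use "INPUT" to also screen user prompts
    content=[{"text": {"text": llm_response_text}}],
)

if result["action"] == "GUARDRAIL_INTERVENED":
    final_answer = result["outputs"][0]["text"]   # redacted or blocked response
else:
    final_answer = llm_response_text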

Chunking strategies

As explained earlier in how RAG works, your data is chunked, embeddings are generated for those chunks, and the chunks and embeddings are stored in a vector database. These chunks of data are retrieved later and passed as context along with user questions to the LLM to generate a grounded and relevant response.

The following are different chunking strategies, each of which can influence costs:

  • Standard chunking – In this case, you can specify default chunking, which is approximately 300 tokens, or fixed-size chunking, where you specify the token size (for example, 300 tokens) for each chunk. Larger chunks will increase input tokens and therefore costs.
  • Hierarchical chunking – This strategy is useful when you want to chunk data at smaller sizes (for example, 300 tokens) but send larger pieces of chunks (for example, 1,500 tokens) to the LLM so the LLM has a bigger context to work with while generating responses. Although this can improve accuracy in some cases, this can also increase the costs because of larger chunks of data being sent to the LLM.
  • Semantic chunking – This strategy is useful when you want chunking based on semantic meaning instead of just the token. In this case, a vector embedding is generated for one or three sentences. A sliding window is used to consider the next sentence and embeddings are calculated again to identify whether the next sentence is semantically similar or not. The process continues until you reach an upper limit of tokens (for example, 300 tokens) or you find a sentence that isn’t semantically similar. This boundary defines a chunk. The input token costs to the LLM will be similar to standard chunking (based on a maximum token size) but the accuracy might be better because of chunks having sentences that are semantically similar. However, this will increase the costs of generating vector embeddings because embeddings are generated for each sentence, and then for each chunk. But at the same time, these are one-time costs (and for new or changed data), which might be worth it if the accuracy is comparatively better for your data.
  • Advanced parsing – This is an optional pre-step to your chunking strategy. This is used to identify chunk boundaries, which is especially useful when you have documents with a lot of complex data such as tables, images, and text. Therefore, the costs will be the input and output token costs for the entire data that you want to use for vector embeddings. These costs will be high. Consider using advanced parsing only for those files that have a lot of tables and images.

The following table is a relative cost comparison for various chunking strategies.

Chunking strategy | Standard | Semantic | Hierarchical
Relative inference costs | Low | Medium | High

Conclusion

In this post, we discussed various factors that could impact costs for your generative AI application. This is a rapidly evolving space, and costs for the components we mentioned could change in the future. Consider the costs in this post as a snapshot in time that is based on assumptions and is directionally accurate. If you have any questions, reach out to your AWS account team.

In Part 2, we discuss how to calculate business value and the factors that impact business value.


About the Authors

Vinnie Saini is a Senior Generative AI Specialist Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. With a background in machine learning, she has over 15 years of experience designing and building transformational cloud-based solutions for customers across industries. Her focus has been primarily on scaling AI/ML-based solutions for unparalleled business impact, customized to business needs.

Chandra Reddy is a Senior Manager of a Solutions Architect team at Amazon Web Services (AWS) in Austin, Texas. He and his team help enterprise customers in North America with their AI/ML and generative AI use cases on AWS. He has more than 20 years of experience in software engineering, product management, product marketing, business development, and solution architecture.
