Create a multimodal chatbot tailored to your unique dataset with Amazon Bedrock FMs

Create a multimodal chatbot tailored to your unique dataset with Amazon Bedrock FMs

With recent advances in large language models (LLMs), a wide array of businesses are building new chatbot applications, either to help their external customers or to support internal teams. For many of these use cases, businesses are building Retrieval Augmented Generation (RAG) style chat-based assistants, where a powerful LLM can reference company-specific documents to answer questions relevant to a particular business or use case.

In the last few months, there has been substantial growth in the availability and capabilities of multimodal foundation models (FMs). These models are designed to understand and generate text about images, bridging the gap between visual information and natural language. Although such multimodal models are broadly useful for answering questions and interpreting imagery, they’re limited to only answering questions based on information from their own training document dataset.

In this post, we show how to create a multimodal chat assistant on Amazon Web Services (AWS) using Amazon Bedrock models, where users can submit images and questions, and text responses will be sourced from a closed set of proprietary documents. Such a multimodal assistant can be useful across industries. For example, retailers can use this system to more effectively sell their products (for example, HDMI_adaptor.jpeg, “How can I connect this adapter to my smart TV?”). Equipment manufacturers can build applications that allow them to work more effectively (for example, broken_machinery.png, “What type of piping do I need to fix this?”). This approach is broadly effective in scenarios where image inputs are important to query a proprietary text dataset. In this post, we demonstrate this concept on a synthetic dataset from a car marketplace, where a user can upload a picture of a car, ask a question, and receive responses based on the car marketplace dataset.

Solution overview

For our custom multimodal chat assistant, we start by creating a vector database of relevant text documents that will be used to answer user queries. Amazon OpenSearch Service is a powerful, highly flexible search engine that allows users to retrieve data based on a variety of lexical and semantic retrieval approaches. This post focuses on text-only documents, but for embedding more complex document types, such as those with images, see Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker.

After the documents are ingested in OpenSearch Service (this is a one-time setup step), we deploy the full end-to-end multimodal chat assistant using an AWS CloudFormation template. The following system architecture represents the logic flow when a user uploads an image, asks a question, and receives a text response grounded by the text dataset stored in OpenSearch.

System architecture

The logic flow for generating an answer to a text-image response pair routes as follows:

  • Steps 1 and 2 – To start, a user query and corresponding image are routed through an Amazon API Gateway connection to an AWS Lambda function, which serves as the processing and orchestrating compute for the overall process.
  • Step 3 – The Lambda function stores the query image in Amazon S3 with a specified ID. This may be useful for later chat assistant analytics.
  • Steps 4–8 – The Lambda function orchestrates a series of Amazon Bedrock calls to a multimodal model, an LLM, and a text-embedding model:
    • Query the Claude V3 Sonnet model with the query and image to produce a text description.
    • Embed a concatenation of the original question and the text description with the Amazon Titan Text Embeddings
    • Retrieve relevant text data from OpenSearch Service.
    • Generate a grounded response to the original question based on the retrieved documents.
  • Step 9 – The Lambda function stores the user query and answer in Amazon DynamoDB, linked to the Amazon S3 image ID.
  • Steps 10 and 11 – The grounded text response is sent back to the client.

There is also an initial setup of the OpenSearch Index, which is done using an Amazon SageMaker notebook.

Prerequisites

To use the multimodal chat assistant solution, you need to have a handful of Amazon Bedrock FMs available.

  1. On the Amazon Bedrock console, choose Model access in the navigation pane.
  2. Choose Manage model access.
  3. Activate all the Anthropic models, including Claude 3 Sonnet, as well as the Amazon Titan Text Embeddings V2 model, as shown in the following screenshot.

For this post, we recommend activating these models in the us-east-1 or us-west-2 AWS Region. These should become immediately active and available.

Bedrock model access

Simple deployment with AWS CloudFormation

To deploy the solution, we provide a simple shell script called deploy.sh, which can be used to deploy the end-to-end solution in different Regions. This script can be acquired directly from Amazon S3 using aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-16363/deploy.sh .

Using the AWS Command Line Interface (AWS CLI), you can deploy this stack in various Regions using one of the following commands:

bash deploy.sh us-east-1

or

bash deploy.sh us-west-2

The stack may take up to 10 minutes to deploy. When the stack is complete, note the assigned physical ID of the Amazon OpenSearch Serverless collection, which you will use in further steps. It should look something like zr1b364emavn65x5lki8. Also, note the physical ID of the API Gateway connection, which should look something like zxpdjtklw2, as shown in the following screenshot.

cloudformation output

Populate the OpenSearch Service index

Although the OpenSearch Serverless collection has been instantiated, you still need to create and populate a vector index with the document dataset of car listings. To do this, you use an Amazon SageMaker notebook.

  1. On the SageMaker console, navigate to the newly created SageMaker notebook named MultimodalChatbotNotebook (as shown in the following image), which will come prepopulated with car-listings.zip and Titan-OS-Index.ipynb.
  1. After you open the Titan-OS-Index.ipynb notebook, change the host_id variable to the collection physical ID you noted earlier.Sagemaker notebook
  1. Run the notebook from top to bottom to create and populate a vector index with a dataset of 10 car listings.

After you run the code to populate the index, it may still take a few minutes before the index shows up as populated on the OpenSearch Service console, as shown in the following screenshot. 

Test the Lambda function

Next, test the Lambda function created by the CloudFormation stack by submitting a test event JSON. In the following JSON, replace your bucket with the name of your bucket created to deploy the solution, for example, multimodal-chatbot-deployment-ACCOUNT_NO-REGION.

{
"bucket": "multimodal-chatbot-deployment-ACCOUNT_NO-REGION",
"key": "jeep.jpg",
"question_text": "How much would a car like this cost?"
}

You can set up this test by navigating to the Test panel for the created lambda function and defining a new test event with the preceding JSON. Then, choose Test on the top right of the event definition.

If you are querying the Lambda function from another bucket than those allowlisted in the CloudFormation template, make sure to add the relevant permissions to the Lambda execution role.

The Lambda function may take between 10–20 seconds to run (mostly dependent on the size of your image). If the function performs properly, you should receive an output JSON similar to the following code block. The following screenshot shows the successful output on the console.

{
  "statusCode": 200,
  "body": ""Based on the 2013 Jeep Grand Cherokee SRT8 listing, a heavily modified Jeep like the one described could cost around $17,000 even with significant body damage and high mileage. The powerful engine, custom touches, and off-road capabilities likely justify that asking price.""
}

Note that if you just enabled model access, it may take a few minutes for access to propagate to the Lambda function.

Test the API

For integration into an application, we’ve connected the Lambda function to an API Gateway connection that can be pinged from various devices. We’ve included a notebook within the SageMaker notebook that allows you to query the system with a question and an image and return a response. To use the notebook, replace the API_GW variable with the physical ID of the API Gateway connection that was created using the CloudFormation stack and the REGION variable with the Region your infrastructure was deployed in. Then, making sure your image location and query are set correctly, run the notebook cell. Within 10–20 seconds, you should receive the output of your multimodal query sourced from your own text dataset. This is shown in the following screenshot.

Note that the API Gateway connection is only accessible from this specific notebook, and more comprehensive security and permission elements are required to productionize the system.

Qualitative results

A grounded multimodal chat assistant, where users can submit images with queries, can be useful in many settings. We demonstrate this application with a dataset of cars for sale. For example, a user may have a question about a car they’re looking at, so they snap a picture and submit a question, such as “How much might a car like this cost?” Rather than answering the question with generic information that the LLM was trained on (which may be out of date), responses will be grounded with your local and specific car sales dataset. In this use case, we took images from Unsplash and used a synthetically created dataset of 10 car listings to answer questions. The model and year of the 10 car listings are shown in the following screenshot.

For the examples in the following table, you can observe in the answer, not only has the vision language model (VLM) system identified the correct cars in the car listings that are most similar to the input image, but also it has answered the questions with specific numbers, costs, and locations that are only available from our closed cars dataset car-listings.zip.

Question Image Answer
How much would a car like this cost? The 2013 Jeep Grand Cherokee SRT8 listing is most relevant, with an asking price of $17,000 despite significant body damage from an accident. However, it retains the powerful 470 hp V8 engine and has been well-maintained with service records.
What is the engine size of this car? The car listing for the 2013 Volkswagen Beetle mentions it has a fuel-efficient 1.8L turbocharged engine. No other engine details are provided in the listings.
Where in the world could I purchase a used car like this? Based on the car listings provided, the 2021 Tesla Model 3 for sale seems most similar to the car you are interested in. It’s described as a low mileage, well-maintained Model 3 in pristine condition located in the Seattle area for $48,000.

Latency and quantitative results

Because speed and latency are important for chat assistants and because this solution consists of multiple API calls to FMs and data stores, it’s interesting to measure the speed of each step in the process. We did an internal analysis of the relative speeds of the various API calls, and the following graph visualizes the results.

From slowest to fastest, we have the call to the Claude V3 Vision FM, which takes on average 8.2 seconds. The final output generation step (LLM Gen on the graph in the screenshot) takes on average 4.9 seconds. The Amazon Titan Text Embeddings model and OpenSearch Service retrieval process are much faster, taking 0.28 and 0.27 seconds on average, respectively.

In these experiments, the average time for the full multistage multimodal chatbot is 15.8 seconds. However, the time can be as low as 11.5 seconds overall if you submit a 2.2 MB image, and it could be much lower if you use even lower-resolution images.

Clean up

To clean up the resources and avoid charges, follow these steps:

  1. Make sure all the important data from Amazon DynamoDB and Amazon S3 are saved
  2. Manually empty and delete the two provisioned S3 buckets
  3. To clean up the resources, delete the deployed resource stack from the CloudFormation console.

Conclusion

From applications ranging from online chat assistants to tools to help sales reps close a deal, AI assistants are a rapidly maturing technology to increase efficiency across sectors. Often these assistants aim to produce answers grounded in custom documentation and datasets that the LLM was not trained on, using RAG. A final step is the development of a multimodal chat assistant that can do so as well—answering multimodal questions based on a closed text dataset.

In this post, we demonstrated how to create a multimodal chat assistant that takes images and text as input and produces text answers grounded in your own dataset. This solution will have applications ranging from marketplaces to customer service, where there is a need for domain-specific answers sourced from custom datasets based on multimodal input queries.

We encourage you to deploy the solution for yourself, try different image and text datasets, and explore how you can orchestrate various Amazon Bedrock FMs to produce streamlined, custom, multimodal systems.


About the Authors

Emmett Goodman is an Applied Scientist at the Amazon Generative AI Innovation Center. He specializes in computer vision and language modeling, with applications in healthcare, energy, and education. Emmett holds a PhD in Chemical Engineering from Stanford University, where he also completed a postdoctoral fellowship focused on computer vision and healthcare.

Negin Sokhandan is a Principle Applied Scientist at the AWS Generative AI Innovation Center, where she works on building generative AI solutions for AWS strategic customers. Her research background is statistical inference, computer vision, and multimodal systems.

Yanxiang Yu is an Applied Scientist at the Amazon Generative AI Innovation Center. With over 9 years of experience building AI and machine learning solutions for industrial applications, he specializes in generative AI, computer vision, and time series modeling.

Read More

Design secure generative AI application workflows with Amazon Verified Permissions and Amazon Bedrock Agents

Design secure generative AI application workflows with Amazon Verified Permissions and Amazon Bedrock Agents

Amazon Bedrock Agents enable generative AI applications to perform multistep tasks across various company systems and data sources. They orchestrate and analyze the tasks and break them down into the correct logical sequences using the reasoning abilities of the foundation model (FM). Agents automatically call the necessary APIs to interact with the company systems and processes to fulfill the request. Throughout this process, agents determine whether they can proceed or if additional information is needed.

Customers can build innovative generative AI applications using Amazon Bedrock Agents’ capabilities to intelligently orchestrate their application workflows. When building such workflows, it can be challenging for customers to apply fine-grained access controls to make sure that the application’s workflow operates only on the authorized data based on the application user’s entitlements. Controlling access to resources based on user context, roles, actions and resource conditions can be challenging to maintain in an application workflow because that would require hardcoding several rules in your application or building your own authorization system to externalize those rules.

Instead of building your own authorization system for fine-grained access controls in your application workflows, you can integrate Amazon Verified Permissions into the agent’s workflow to apply contextually aware fine-grained access controls. Verified Permissions is a scalable permissions management and authorization service for custom applications built by you. Verified Permissions helps developers build secure applications faster by externalizing the authorization component and centralizing policy management and administration.

In this post, we demonstrate how to design fine-grained access controls using Verified Permissions for a generative AI application that uses Amazon Bedrock Agents to answer questions about insurance claims that exist in a claims review system using textual prompts as inputs and outputs. In our insurance claims system use case, there are two types of users: claims administrators and claims adjusters. Both are capable of listing open claims, but only one is capable of reading claim detail and making changes. We also show how to restrict permissions using custom attributes such as a user’s region for filtering insurance claims. In this post, the term region doesn’t refer to an AWS Region, but rather to a business-defined region.

Solution overview

In this solution design, we assume that the customer has claims records in an Amazon DynamoDB table and would like to build a chat-based application to answer frequently asked questions about their claims. This chat assistant will be used internally by claims administrators and claims adjusters to answer their clients’ questions.

The following is a list of actions that the claims team needs to perform to answer their clients’ questions:

  • Show me a list of my open claims
  • Show me claim detail for an input claim number
  • Update the status to closed for the input claim number

The customer has the following access control requirements for their claims system:

  • A claims administrator can list claims across various geographic areas, but they can’t read individual claim records
  • A claims adjuster can list claims for their region and can read and update the records of claims assigned to them. However, a claims adjuster can’t access claims from other regions.
  • is placed into a group in Amazon Cognito, where their application-level permissions are set and maintained
  • The customer would like to use Verified Permissions to externalize entity and record level authorization decisions without hard coding the application logic

To improve the performance of the chat assistant, the customer uses FMs available on Amazon Bedrock. To retrieve the necessary information from the claims table and dynamically orchestrate the requests, the customer uses Amazon Bedrock Agents together with Verified Permissions to provide fine-grained authorization for the agents’ invocation.

The application architecture for building the example chat-based Generative AI Claims application with fine-grained access controls is shown in the following diagram.

Figure 1. Architectural diagram for user flow

The application architecture flow is as follows:

  1. User accesses the Generative AI Claims web application (App).
  2. The App authenticates the user with the Amazon Cognito service and issues an ID token and an access tokenID token has the user’s identity and custom attributes.
  3. Using the App, the user sends a request asking to “list the open claims.” The request is sent along with the user’s ID token and access token. The App calls the Claims API Gateway API to run the claims proxy passing user requests and tokens.
  4. Claims API Gateway runs the Custom Authorizer to validate the access token.
  5. When access token validation is successful, the Claims API Gateway sends the user request to the Claims Proxy.
  6. The Claims Proxy invokes the Amazon Bedrock agent passing the user request and ID token. The Amazon Bedrock agent is configured to use Anthropic’s Claude model and to invoke actions using the Claims Agent Helper AWS Lambda
  7. Amazon Bedrock Agent uses chain-of-thought-prompting and builds the list of API actions to run with the help of Claims Agent Helper.
  8. The Claims Agent Helper retrieves claim records from Claims DB and constructs a claims list object. For this example, we are providing hard-coded examples in the Lambda function and no DynamoDB was added to the example solution provided. However, we provide the component on the architecture for representing real-life use cases where the data is stored outside the Lambda
  9. The Claims Agent Helper retrieves the user’s metadata (that is, their name) from ID token, builds the Verified Permissions data entities, and makes the Verified Permissions authorization request. This request contains the principal (user and role), action (that is, ListClaim) and resource (Claim). Verified Permissions evaluates the request against the Verified Permissions policies and returns an Allow or Deny decision. Subsequently, the Claims Agent Helper filters the claims based on that decision. Verified Permissions has “default deny” functionality, meaning that in the absence of an explicit allow, the service defaults to an implicit deny. If there is an explicit Deny in the policies involved in the request, Verified Permissions denies the request.
  10. The Claims Amazon Bedrock Agent receives the authorized list of claims, augments the prompt and sends it to the Claude model for completion. The agent returns the completion back to the user.

Fine-grained access control flows

Based on the customer’s access control requirements, there are three fine-grained access control flows as depicted in the following system sequence diagrams.

Use case: Claims administrator can list claims across regions

The following diagram shows how the claims administrator can list claims across regions.

Figure 2: Claims administrator 'list claims' allow

The following diagram depicts how the claims administrator’s fine-grained access to the claim record is run. In this diagram, notice a deny decision from Verified Permissions. This is because the principal’s role isn’t ClaimsAdjuster.

Figure 3: Claims administrator 'list claims' deny

Use case: Claims adjuster can see claims they own

The following diagram depicts how the claims adjuster’s fine-grained access to retrieve claim details is run. In this diagram, notice the allow decision from Verified Permissions. This is because the principal’s role is ClaimsAdjuster and the resource owner (that is, claim owner) matches the user principal (that is, user=alice).

Figure 4: Claims adjuster 'list claims' allow

The following diagram depicts how the claims adjuster’s fine-grained access to list open claims is run. In this diagram, notice the allow decision from Verified Permissions. This is because the principal’s group is ClaimsAdjuster and the region on the resource matches the principal’s region. As a result of this region filter on the authorization policy, only open claims for the user’s region are returned. Verified Permissions acts on principal, action, and individual resource (that is, a claim record) for the authorization decision. Therefore, the Lambda function needs to iterate through the list of open claims and make an isAuthorized request for each claim record. If this results in a performance issue, you can use the BatchIsAuthorized API and send multiple authzRequest in one API call.

Figure 5: Claims adjuster 'list claims' allow or deny

Entities design considerations

When designing fine-grained data access controls, it is best practice to start with the entity-relationship diagram (ERD) for the application. For our claims application, the user will operate on claim records to retrieve a list of claims records, get the details for an individual claim record, or update the status of a claim record. The following diagram is the ERD for this application modeled in Verified Permissions. With Verified Permissions, you can apply both role-based access control (RBAC) and attribute-based access control (ABAC).

Figure 6: Entity relationship diagram for the application

Here is a brief description of each entity and attributes that will be used for RBAC and ABAC against claim records.

  • Application – The application is a chat-based generative AI application using Amazon Bedrock Agents to understand the questions and retrieve the relevant claims data to assist claims administrators and claims adjusters.
  • Claim – The claim represents an insurance claim record that is stored in the DynamoDB table. The claims system stores claim records and the chatbot application allows users to retrieve and update these records.
  • User – The user.
  • Role – The role represents a user’s access within the application. Here is a list of available roles:
    • Claims administrators – Can list claims across various geographic regions, but they can’t read individual claim records
    • Claims adjusters – Can list claims for their region and read and update their claim records

The roles are managed through Amazon Cognito and Verified Permissions. Cognito maintains a record of which role a user is assigned to and includes this information in the token. Verified Permissions maintains a record of what that role is permitted to do. Fine-grained access controls exist to make sure that users have appropriate permissions for their roles, restricting access to sensitive claim data based on geographic regions and user groups.

Fine-grained authorization: Policy design

The Actions diagram view lists the types of Principals you have configured in your policy store, the Actions they are eligible to perform, and the Resources they are eligible to perform actions on. The lines between entities indicate your ability to create a policy that allows a principal to take an action on a resource. The following image shows the actions diagram from Verified Permissions for our insurance claims use case. The User principal will have access to the Get, List, and Update actions. The resources are the Application and the Claim entity within the application. This diagram generates the underlying schema that governs the policy definition.

Figure 7: Policy schema from Amazon Verified Permissions

Use case: Claims administrator can list all claim records across regions

A policy is a statement that either permits or forbids a principal to take one or more actions on a resource. Each policy is evaluated independently of other policies. The Verified Permissions policy for this use case is shown in the following code example. In this policy, the principal (that is, user Bob), is assigned the role of claims administrator.

permit (
    principal in avp::claim::app::Role::"ClaimsAdministrator",
    action in [
    avp::claim::app::Action::"ListClaim"
    ],
    resource
) ;

Use case: Claims administrator can’t access claim detail record

The Verified Permissions policy for this use case is shown in the following code example. The use of explicit “forbid” policies is a valid practice.

forbid (
    principal in avp::claim::app::Role::"ClaimsAdministrator",
    action in [
    avp::claim::app::Action::"GetClaim"
    ],
    resource
) ;

Use case: Claims adjuster can list claims they own in their region

The Verified Permissions policy for this use case is shown in the following code example. In this policy, the principal (that is, user Alice) is assigned the role of claims adjuster and their region is passed as a custom attribute in the ID token.

permit (
    principal in avp::claim::app::Role::"ClaimsAdjuster",
    action in [
    avp::claim::app::Action::"ListClaim"
    ],
    resource
) when {
    resource has owner &&
    principal == resource.owner &&
    principal has custom &&
    principal.custom has region &&
    principal.custom.region == resource.region
};

Use case: Claims adjuster can retrieve or update a claim they own

permit (
    principal in avp::claim::app::Role::"ClaimsAdjuster",
    action in [
    avp::claim::app::Action::"GetClaim",
     avp::claim::app::Action::"UpdateClaim"
    ],
    resource
) when {
    principal == resource.owner&&
    principal has custom &&
    principal.custom has region &&
    principal.custom.region == resource.region
};

Authentication design considerations

The configuration of Amazon Cognito for this use case followed the security practices included as part of the standard configuration workflow: a strong password policy, multi-factor authentication (MFA), and a client secret. When using Amazon Cognito with Verified Permissions, your application can pass user pool access or identity tokens to Verified Permissions to make the allow or deny decision. Verified Permissions evaluates the user’s request based on the policies it has stored in the policy store.

For custom attributes, we are using region to restrict which claims a claims adjuster can see, excluding claims made in regions outside the adjuster’s own region. We are also using role as a custom attribute to provide that information in the ID token that is passed to the Amazon Bedrock agent. When the user is registered in the Cognito user pool, these custom attributes will be recorded as part of the sign-up process.

Amazon Cognito integrates with Verified Permissions through the Identity sources section in the console. The following screenshot shows that we’ve connected our Cognito user pool to the Amazon Verified Permissions policy store.

Figure 8: Amazon Verified Permissions policy stores by ID

Fine-grained authorization: Passing ID token to the Amazon Bedrock agent

When the user is authenticated against the Cognito user pool, it returns an ID token and access token to the client application. The ID token will be passed through an API gateway and a proxy Lambda through SessionAttributes on the invoke_agent call.

# Invoke the agent API
response = bedrock_agent_runtime_client.invoke_agent(
    …    
    sessionState={
        'sessionAttributes': {
            'authorization_header': '<AUTHORIZATION_HEADER>'
        }
    },
)

The header is then retrieved from the Lambda event in the Action Group Lambda function and Verified Permissions is used to verify the user’s access against the desired action.

# Retrieve session attributes from event and use it to validate action
sessAttr = event.get("sessionAttributes")
auth, reason = verifyAccess(sessionAttributes, action_id)

Fine-grained authorization: Integration with Amazon Bedrock Agents

The ID token issued by Cognito contains the user’s identity and custom attributes. This ID token is passed to the Amazon Bedrock agent, and the Agent Helper Lambda retrieves that token from the agent’s session attribute. Then, the Agent Helper Lambda retrieves open claim records from DynamoDB and constructs the Verified Permissions schema entities and makes the isAuthorized API call.

Because Verified Permissions resources operate at the individual record level (that is, a single claim record), you need to iterate over the claims list object and make the isAuthorized API call for the authorization decision and then create the filtered claims list. The filtered claims list is then passed back to the caller. As a result, the claims adjuster will only see claims for their region, while a claims administrator can see claims across all regions.

The Amazon Bedrock agent then uses this filtered claim list to complete the user’s request to list claims. The chat application can only access the claims records that the user is authorized to view, providing the fine-grained access control integrated with the Amazon Bedrock agent workflow.

Getting started

Check out our code to get started developing your secure generative AI application using Amazon Verified Permissions. We provide you with an end-to-end implementation of the architecture described in this post and a demo UI you can use to test the permissions of different users. Update this example to implement generative AI applications that connect with your use case setup.

Conclusion

In this post, we discussed the challenges in applying fine-grained access controls for agent workflows in a generative AI application. We shared an application architecture for building an example chat-based generative AI application that uses Amazon Bedrock Agents to orchestrate workflows and applies fine-grained access controls using Amazon Verified Permissions. We discussed how to design fine-grained access permissions through the design of persona-based access control workflows. If you are looking for a scalable and secure way to apply fine-grained permissions to your generative AI agent-based workflows, give this solution a try and leave your feedback.


About the authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year old sheep-a-doodle!

Samantha Wylatowska is a Solutions Architect at AWS. With a background in DevSecOps, her passion lies in guiding organizations towards secure operational efficiency, leveraging the power of automation for a seamless cloud experience. In her free time, she’s usually learning something new through music, literature, or film.

Anil Nadiminti is a Senior Solutions Architect at AWS specializing in empowering organizations to harness cloud computing and AI for digital transformation and innovation. His expertise in architecting scalable solutions and implementing data-driven strategies enables companies to innovate and thrive in today’s rapidly evolving technological landscape.

Michael Daniels is an AI/ML Specialist at AWS. His expertise lies in building and leading AI/ML and generative AI solutions for complex and challenging business problems, which is enhanced by his PhD from the Univ. of Texas and his MSc in computer science specialization in machine learning from the Georgia Institute of Technology. He excels in applying cutting-edge cloud technologies to innovate, inspire, and transform industry-leading organizations while also effectively communicating with stakeholders at any level or scale. In his spare time, you can catch Michael skiing or snowboarding.

Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

Read More

Boost productivity by using AI in cloud operational health management

Boost productivity by using AI in cloud operational health management

Modern organizations increasingly depend on robust cloud infrastructure to provide business continuity and operational efficiency. Operational health events – including operational issues, software lifecycle notifications, and more – serve as critical inputs to cloud operations management. Inefficiencies in handling these events can lead to unplanned downtime, unnecessary costs, and revenue loss for organizations.

However, managing cloud operational events presents significant challenges, particularly in complex organizational structures. With a vast array of services and resource footprints spanning hundreds of accounts, organizations can face an overwhelming volume of operational events occurring daily, making manual administration impractical. Although traditional programmatic approaches offer automation capabilities, they often come with significant development and maintenance overhead, in addition to increasingly complex mapping rules and inflexible triage logic.

This post shows you how to create an AI-powered, event-driven operations assistant that automatically responds to operational events. It uses Amazon Bedrock, AWS Health, AWS Step Functions, and other AWS services. The assistant can filter out irrelevant events (based on your organization’s policies), recommend actions, create and manage issue tickets in integrated IT service management (ITSM) tools to track actions, and query knowledge bases for insights related to operational events. By orchestrating a group of AI endpoints, the agentic AI design of this solution enables the automation of complex tasks, streamlining the remediation processes for cloud operational events. This approach helps organizations overcome the challenges of managing the volume of operational events in complex, cloud-driven environments with minimal human supervision, ultimately improving business continuity and operational efficiency.

Event-driven operations management

Operational events refer to occurrences within your organization’s cloud environment that might impact the performance, resilience, security, or cost of your workloads. Some examples of AWS-sourced operational events include:

  1. AWS Health events — Notifications related to AWS service availability, operational issues, or scheduled maintenance that might affect your AWS resources.
  2. AWS Security Hub findings — Alerts about potential security vulnerabilities or misconfigurations identified within your AWS environment.
  3. AWS Cost Anomaly Detection alerts – Notifications about unusual spending patterns or cost spikes.
  4. AWS Trusted Advisor findings — Opportunities for optimizing your AWS resources, improving security, and reducing costs.

However, operational events aren’t limited to AWS-sourced events. They can also originate from your own workloads or on-premises environments. In principle, any event that can integrate with your operations management and is of importance to your workload health qualifies as an operational event.

Operational event management is a comprehensive process that provides efficient handling of events from start to finish. It involves notification, triage, progress tracking, action, and archiving and reporting at a large scale. The following is a breakdown of the typical tasks included in each step:

  1. Notification of events:
    1. Format notifications in a standardized, user-friendly way.
    2. Dispatch notifications through instant messaging tools or emails.
  2. Triage of events:
    1. Filter out irrelevant or noise events based on predefined company policies.
    2. Analyze the events’ impact by examining their metadata and textual description.
    3. Convert events into actionable tasks and assigning responsible owners based on roles and responsibilities.
    4. Log tickets or page the appropriate personnel in the chosen ITSM tools.
  3. Status tracking of events and actions:
    1. Group related events into threads for straightforward management.
    2. Update ticket statuses based on the progress of event threads and action owner updates.
  4. Insights and reporting:
    1. Query and consolidate knowledge across various event sources and tickets.
    2. Create business intelligence (BI) dashboards for visual representation and analysis of event data.

A streamlined process should include steps to ensure that events are promptly detected, prioritized, acted upon, and documented for future reference and compliance purposes, enabling efficient operational event management at scale. However, traditional programmatic automation has limitations when handling multiple tasks. For instance, programmatic rules for event attribute-based noise filtering lack flexibility when faced with organizational changes, expansion of the service footprint, or new data source formats, leading growing complexity.

Automating impact analysis in traditional automation through keyword matching on free-text descriptions is impractical. Converting events to tickets requires manual effort to generate action hints and lacks correlation to the originating events. Extracting event storylines from long, complex threads of event updates is challenging.

Let’s explore an AI-based solution to see how it can help address these challenges and improve productivity.

Solution overview

The solution uses AWS Health and AWS Security Hub findings as sources of operational events to demonstrate the workflow. It can be extended to incorporate additional types of operational events—from AWS or non-AWS sources—by following an event-driven architecture (EDA) approach.

The solution is designed to be fully serverless on AWS and can be deployed as infrastructure as code (IaC) by usingf the AWS Cloud Development Kit (AWS CDK).

Slack is used as the primary UI, but you can implement the solution using other messaging tools such as Microsoft Teams.

The cost of running and hosting the solution depends on the actual consumption of queries and the size of the vector store and the Amazon Kendra document libraries. See Amazon Bedrock pricing, Amazon OpenSearch pricing and Amazon Kendra pricing for pricing details.

The full code repository is available in the accompanying GitHub repo.

The following diagram illustrates the solution architecture.

Solution architecture diagram

Figure – solution architecture diagram

Solution walk-through

The solution consists of three microservice layers, which we discuss in the following sections.

Event processing layer

The event processing layer manages notifications, acknowledgments, and triage of actions. Its main logic is controlled by two key workflows implemented using Step Functions.

  • Event orchestration workflow – This workflow is subscribed to and invoked by operational events delivered to the main Amazon EventBridge hub. It sends HealthEventAdded or SecHubEventAdded events back to the main event hub following the workflow in the following figure.

Event orchestration workflow

Figure – Event orchestration workflow

  • Event notification workflow – This workflow formats notifications that are exchanged between Slack chat and backend microservices. It listens to control events such as HealthEventAdded and SecHubEventAdded.

Event notification workflow

Figure – Event notification workflow

AI layer

The AI layer handles the interactions between Agents for Amazon Bedrock, Knowledge Bases for Amazon Bedrock, and the UI (Slack chat). It has several key components.

OpsAgent is an operations assistant powered by Anthropic Claude 3 Haiku on Amazon Bedrock. It reacts to operational events based on the event type and text descriptions. OpsAgent is supported by two other AI model endpoints on Amazon Bedrock with different knowledge domains. An action group is defined and attached to OpsAgent, allowing it to solve more complex problems by orchestrating the work of AI endpoints and taking actions such as creating tickets without human supervisions.

OpsAgent is pre-prompted with required company policies and guidelines to perform event filtering, triage, and ITSM actions based on your requirements. See the sample escalation policy in the GitHub repo (between escalation_runbook tags).

OpsAgent uses two supporting AI model endpoints:

  1. The events expert endpoint uses the Amazon Titan in Amazon Bedrock foundation model (FM) and Amazon OpenSearch Serverless to answer questions about operational events using Retrieval Augmented Generation (RAG).
  2. The ask-aws endpoint uses the Amazon Titan model and Amazon Kendra as the RAG source. It contains the latest AWS documentation on selected topics. You must syncronize the Amazon Kendra data sources to ensure the underlying AI model is using the latest documentation. Your can do this using the AWS Management Console after the solution is deployed.

These dedicated endpoints with specialized RAG data sources help break down complex tasks, improve accuracy, and make sure the correct model is used.

The AI layer also includes of two AI orchestration Step Functions workflows. The workflows manage the AI agent, AI model endpoints, and the interaction with the user (through Slack chat):

  • The AI integration workflow defines how the operations assistant reacts to operational events based on the event type and the text descriptions of those events. The following figure illustrates the workflow.

AI integration workflow

Figure – AI integration workflow

  • The AI chatbot workflow manages the interaction between users and the OpsAgent assistant through a chat interface. The chatbot handles chat sessions and context.

AI chatbot workflow

Figure: AI chatbot workflow

Archiving and reporting layer

The archiving and reporting layer handles streaming, storing, and extracting, transforming, and loading (ETL) operational event data. It also prepares a data lake for BI dashboards and reporting analysis. However, this solution doesn’t include an actual dashboard implementation; it prepares an operational event data lake for later development.

Use case examples

You can use this solution for automated event notification, autonomous event acknowledgement, and action triage by setting up a virtual supervisor or operator that follows your organization’s policies. The virtual operator is equipped with multiple AI capabilities—each of which is specialized in a specific knowledge domain—such as generating recommended actions or taking actions to issue tickets in ITSM tools, as shown in the following figure.

use case example 1

Figure – use case example 1

The virtual event supervisor filters out noise based on your policies, as illustrated in the following figure.

use case example 2

Figure – use case example 2

AI can use the tickets that are related to a specific AWS Health event to provide the latest status updates on those tickets, as shown in the following figure.

use case example 3

Figure – use case example 3

The following figure shows how the assistant evaluates complex threads of operational events to provide valuable insights.

use case example 4

Figure – use case example 4

The following figure shows a more sophisticated use case.

use case example 5

Figure – use case example 5

Prerequisites

To deploy this solution, you must meet the following prerequisites:

  • Have at least one AWS account with permissions to create and manage the necessary resources and components for the application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?. The project uses a typical setup of two accounts, where one is the organization’s health administrator account and the other is the worker account hosting backend microservices. The worker account can be the same as the administrator account if you choose to use a single account setup.
  • Make sure you have access to Amazon Bedrock FMs in your preferred AWS Region in the worker account. The FMs used in the post are Anthropic Claude 3 Haiku, and Amazon Titan Text G1 – Premier.
  • Enable the AWS Health Organization view and delegate an administrator account in your AWS management account if you want to manage AWS Health events across your entire organization. Enabling AWS Health Organization view is optional if you only need to source operational events from a single account. Delegation of a separate administrator account for AWS Health is also optional if you want to manage all operational events from your AWS management account.
  • Enable AWS Security Hub in your AWS management account. Optionally, enable Security Hub with Organizations integration if you want to monitor security findings for the entire organization instead of just a single account.
  • Have a Slack workspace with permissions to configure a Slack app and set up a channel.
  • Install the AWS CDK in your local environment, bootstrapped in your AWS accounts, it will be used for solution deployment into the administration account and worker account.
  • Have AWS Serverless Application Model (AWS SAM) and Docker installed in your development environment to build AWS Lambda packages

Create a Slack app and set up a channel

Set up Slack:

  1. Create a Slack app from the manifest template, using the content of the slack-app-manifest.json file from the GitHub repository.
  2. Install your app into your workspace, and take note of the Bot User OAuth Token value to be used in later steps.
  3. Take note of the Verification Token value under Basic Information of your app, you will need it in later steps.
  4. In your Slack desktop app, go to your workspace and add the newly created app.
  5. Create a Slack channel and add the newly created app as an integrated app to the channel.
  6. Find and take note of the channel ID by choosing (right-clicking) the channel name, choosing Additional options to access the More menu, and choosing Open details to see the channel details.

Prepare your deployment environment

Use the following commands to ready your deployment environment for the worker account. Make sure you aren’t running the command under an existing AWS CDK project root directory. This step is required only if you chose a worker account that’s different from the administration account:

# Make sure your shell session environment is configured to access the worker
# account of your choice, for detailed guidance on how to configure, refer to 
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html  
# Note that in this step you are bootstrapping your worker account in such a way 
# that your administration account is trusted to execute CloudFormation deployment in
# your worker account, the following command uses an example execution role policy of 'AdministratorAccess',
# you can swap it for other policies of your own for least privilege best practice,
# for more information on the topic, refer to https://repost.aws/knowledge-center/cdk-customize-bootstrap-cfntoolkit
cdk bootstrap aws://<replace with your AWS account id of the worker account>/<replace with the region where your worker services is> --trust <replace with your AWS account id of the administration account> --cloudformation-execution-policies 'arn:aws:iam::aws:policy/AdministratorAccess' --trust-for-lookup <replace with your AWS account id of the administration account>

Use the following commands to ready your deployment environment for the administration account. Make sure you aren’t running the commands under an existing AWS CDK project root directory:

# Make sure your shell session environment is configured to access the admistration 
# account of your choice, for detailed guidance on how to configure, refer to 
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
# Note 'us-east-1' region is required for receiving AWS Health events associated with
# services that operate in AWS global region.
cdk bootstrap <replace with your AWS account id of the administration account>/us-east-1

# Optional, if you have your cloud infrastructures hosted in other AWS regions than 'us-east-1',
# repeat the below commands for each region
cdk bootstrap <replace with your AWS account id of the administration account>/<replace with the region name, e.g. us-west-2>

Copy the GitHub repo to your local directory

Use the following code to copy the GitHub repo to your local directory.:

git clone https://github.com/aws-samples/ops-health-ai.git
cd ops-health-ai
npm install
cd lambda/src
# Depending on your build environment, you might want to change the arch type to 'x86'
# or 'arm' in lambda/src/template.yaml file before build 
sam build --use-container
cd ../..

Create an .env file

Create an .env file containing the following code under the project root directory. Replace the variable placeholders with your account information:

CDK_ADMIN_ACCOUNT=<replace with your 12 digits administration AWS account id>
CDK_PROCESSING_ACCOUNT=<replace with your 12 digits worker AWS account id. This account id is the same as the admin account id if using single account setup>
EVENT_REGIONS=us-east-1,<region 1 of where your infrastructures are hosted>,<region 2 of where your infrastructures are hosted>
CDK_PROCESSING_REGION=<replace with the region where you want the worker services to be, e.g. us-east-1>
EVENT_HUB_ARN=arn:aws:events:<replace with the worker service region>:<replace with the worker service account id>:event-bus/AiOpsStatefulStackAiOpsEventBus
SLACK_CHANNEL_ID=<your Slack channel ID noted down from earlier step>
SLACK_APP_VERIFICATION_TOKEN=<replace with your Slack app verification token>
SLACK_ACCESS_TOKEN=<replace with your Slack Bot User OAuth Token value>

Deploy the solution using the AWS CDK

Deploy the processing microservice to your worker account (the worker account can be the same as your administrator account):

  1. In the project root directory, run the following command: cdk deploy --all --require-approval never
  2. Capture the HandleSlackCommApiUrl stack output URL,
  3. Go to your Slack app and navigate to Event Subscriptions, Request URL Change,
  4. Update the URL value with the stack output URL and save your settings.

Test the solution

Test the solution by sending a mock operational event to your administration account . Run the following AWS Command Line Interface (AWS CLI) command:
aws events put-events --entries file://test-events/mockup-events.json

You will receive Slack messages notifying you about the mock event followed by automatic update from the AI assistant reporting the actions it took and the reasons for each action. You don’t need to manually choose Accept or Discharge for each event.

Try creating more mock events based on your past operational events and test them with the use cases described in the Use case examples section.

If you have just enabled AWS Security Hub in your administrator account, you might need to wait for up to 24 hours for any findings to be reported and acted on by the solution. AWS Health events, on the other hand, will be reported whenever applicable.

Clean up

To clean up your resources, run the following command in the CDK project directory: cdk destroy --all

Conclusion

This solution uses AI to help you automate complex tasks in cloud operational events management, bringing new opportunities for you to further streamline cloud operations management at scale with improved productivity, and operational resilience.

To learn more about the AWS services used in this solution, see:


About the author

Sean Xiaohai Wang is a Senior Technical Account Manager at Amazon Web Services. He helps enterpise customers build and operate efficiently on AWS.

Read More

How Indeed builds and deploys fine-tuned LLMs on Amazon SageMaker

How Indeed builds and deploys fine-tuned LLMs on Amazon SageMaker

This post is cowritten with Ethan Handel and Zhiyuan He from Indeed.com.

Indeed is the world’s #1 job site¹ and a leading global job matching and hiring marketplace. Our mission is to help people get jobs. At Indeed, we serve over 350 million global Unique Visitors  monthly² across more than 60 countries, powering millions of connections to new job opportunities every day. Since our founding nearly two decades ago, machine learning (ML) and artificial intelligence (AI) have been at the heart of building data-driven products that better match job seekers with the right roles and get people hired.

On the Core AI team at Indeed, we embody this legacy of AI innovation by investing heavily in HR domain research and development. We provide teams across the company with production-ready, fine-tuned large language models (LLMs) based on state-of-the-art open source architectures. In this post, we describe how using the capabilities of Amazon SageMaker has accelerated Indeed’s AI research, development velocity, flexibility, and overall value in our pursuit of using Indeed’s unique and vast data to leverage advanced LLMs.

Infrastructure challenges

Indeed’s business is fundamentally text-based. Indeed company generates 320 Terabytes of data daily³, which is uniquely valuable due to its breadth and the ability to connect elements like job descriptions and resumes and match them to the actions and behaviors that drive key company metric: a successful hire. LLMs represent a significant opportunity to improve how job seekers and employers interact in Indeed’s marketplace, with use cases such as match explanations, job description generation, match labeling, resume or job description skill extraction, and career guides, among others.

Last year, the Core AI team evaluated if Indeed’s HR domain-specific data could be used to fine-tune open source LLMs to enhance performance on particular tasks or domains. We chose the fine-tuning approach to best incorporate Indeed’s unique knowledge and vocabulary around mapping the world of jobs. Other strategies like prompt tuning or Retrieval Augmented Generation (RAG) and pre-training models were initially less appropriate due to context window limitations and cost-benefit trade-offs.

The Core AI team’s objective was to explore solutions that addressed the specific needs of Indeed’s environment by providing high performance for fine-tuning, minimal effort for iterative development, and a pathway for future cost-effective production inference. Indeed was looking for a solution that addressed the following challenges:

  • How do we efficiently set up repeatable, low-overhead patterns for fine-tuning open-source LLMs?
  • How can we provide production LLM inference at Indeed’s scale with favorable latency and costs?
  • How do we efficiently onboard early products with different request and inference patterns?

The following sections discuss how we addressed each challenge.

Solution overview

Ultimately, Indeed’s Core AI team converged on the decision to use Amazon SageMaker to solve for the aforementioned challenges and meet the following requirements:

  • Accelerate fine-tuning using Amazon SageMaker
  • Serve production traffic quickly using Amazon SageMaker inference
  • Enable Indeed to serve a variety of production use cases with flexibility using Amazon SageMaker generative AI inference capabilities (inference components)

Accelerate fine-tuning using Amazon SageMaker

One of the primary challenges that we faced was achieving efficient fine-tuning. Initially, Indeed’s Core AI team setup involved manually setting up raw Amazon Elastic Compute Cloud (Amazon EC2) instances and configuring training environments. Scientists had to manage personal development accounts and GPU schedules, leading to development overhead and resource under-utilization. To address these challenges, we used Amazon SageMaker to initiate and manage training jobs efficiently. Transitioning to Amazon SageMaker provided several advantages:

  • Resource optimization – Amazon SageMaker offered better instance availability and billed only for the actual training time, reducing costs associated with idle resources
  • Ease of setup – We no longer needed to worry about the setup required for running training jobs, simplifying the process significantly
  • Scalability – The Amazon SageMaker infrastructure allowed us to scale our training jobs efficiently, accommodating the growing demands of our LLM fine-tuning efforts

Smoothly serve production traffic using Amazon SageMaker inference

To better serve Indeed users with LLMs, we standardized the request and response formats across different models by employing open source software as an abstraction layer. This layer converted the interactions into a standardized OpenAI format, simplifying integration with various services and providing consistency in model interactions.

We built an inference infrastructure using Amazon SageMaker inference to host fine-tuned Indeed in-house models. The Amazon SageMaker infrastructure provided a robust service for deploying and managing models at scale. We deployed different specialized models on Amazon SageMaker inference endpoints. Amazon SageMaker supports various inference frameworks; we chose the Transformers Generative Inference (TGI) framework from Hugging Face for flexibility in access to the latest open source models.

The setup on Amazon SageMaker inference has enabled rapid iteration, allowing Indeed  to experiment with over 20 different models in a month. Furthermore, the robust infrastructure is capable of hosting dynamic production traffic, handling up to 3 million requests per day.

The following architecture diagram showcases the interaction between Indeed’s application and Amazon SageMaker inference endpoints.

Serve a variety of production use cases with flexibility using Amazon SageMaker generative AI inference components

Results from LLM fine-tuning revealed performance benefits. The final challenge was quickly implementing the capability to serve production traffic to support real, high-volume production use cases. Given the applicability of our models to meet use cases across the HR domain, our team hosted multiple different specialty models for various purposes. Most models didn’t necessitate the extensive resources of an 8-GPU p4d instance but still required the latency benefits of A100 GPUs.

Amazon SageMaker recently introduced a new feature called inference components that significantly enhances the efficiency of deploying multiple ML models to a single endpoint. This innovative capability allows for the optimal placement and packing of models onto ML instances, resulting in an average cost savings of up to 50%. The inference components abstraction enables users to assign specific compute resources, such as CPUs, GPUs, or AWS Neuron accelerators, to each individual model. This granular control allows for more efficient utilization of computing power, because Amazon SageMaker can now dynamically scale each model up or down based on the configured scaling policies. Furthermore, the intelligent scaling offered by this capability automatically adds or removes instances as needed, making sure that capacity is met while minimizing idle compute resources. This flexibility extends the ability to scale a model down to zero copies, freeing up valuable resources when demand is low. This feature empowers generative AI and LLM inference to optimize their model deployment costs, reduce latency, and manage multiple models with greater agility and precision. By decoupling the models from the underlying infrastructure, inference components offer a more efficient and cost-effective way to use the full potential of Amazon SageMaker inference.

Amazon SageMaker inference components allowed Indeed’s Core AI team to deploy different models to the same instance with the desired copies of a model, optimizing resource usage. By consolidating multiple models on a single instance, we created the most cost-effective LLM solution available to Indeed product teams. Furthermore, with inference components now supporting dynamic auto scaling, we could optimize the deployment strategy. This feature automatically adjusts the number of model copies based on demand, providing even greater efficiency and cost savings, even compared to third-party LLM providers.

Since integrating inference components into the inference design, Indeed’s Core AI team has built and validated LLMs that have served over 6.5 million production requests.

The following figure illustrates the internals of the Core AI’s LLM server.

The simplicity of our Amazon SageMaker setup significantly improves setup speed and flexibility. Today, we deploy Amazon SageMaker models using the Hugging Face TGI image in a custom Docker container, giving Indeed instant access to over 18 open source model families.

The following diagram illustrates Indeed’s Core AI flywheel.

Core AI’s business value from Amazon SageMaker

The seamless integration of Amazon SageMaker inference components, coupled with our team’s iterative enhancements, has accelerated our path to value. We can now swiftly deploy and fine-tune our models, while benefiting from robust scalability and cost-efficiency—a significant advantage in our pursuit of delivering cutting-edge HR solutions to our customers.

Maximize performance

High-velocity research enables Indeed to iterate on fine-tuning approaches to maximize performance. We have fine-tuned over 75 models to advance research and production objectives.

We can quickly validate and improve our fine-tuning methodology with many open-source LLMs. For instance, we moved from fine-tuning base foundation models (FMs) with third-party instruction data to fine-tuning instruction-tuned FMs based on empirical performance improvements.

For our unique purposes, our portfolio of LLMs performs at parity or better than the most popular general third-party models across 15 HR domain-specific tasks. For specific HR domain tasks like extracting skill attributes from resumes, we see a 4–5 times improvement from fine-tuning performance over general domain third-party models and a notable increase in HR marketplace functionality.

The following figure illustrates Indeed’s inference continuous integration and delivery (CI/CD) workflow.

The following figure presents some task examples.

High flexibility

Flexibility allows Indeed to be on the frontier of LLM technology. We can deploy and test the latest state-of-the-art open science models on our scalable Amazon SageMaker inference infrastructure immediately upon availability. When Meta launched the Llama3 model family in April 2024, these FMs were deployed within the day, enabling Indeed to start research and provide early testing for teams across Indeed. Within weeks, we fine-tuned our best-performing model to-date and released it. The following figure illustrates an example task.

Production scale

Core AI developed LLMs have already served 6.5 million live production requests with a single p4d instance and a p99 latency of under 7 seconds.

Cost-efficiency

Each LLM request through Amazon SageMaker is on average 67% cheaper than the prevailing third-party vendor model’s on-demand pricing in early 2024, creating the potential for significant cost savings.

Indeed’s contributions to Amazon SageMaker inference: Enhancing generative AI inference capabilities

Building upon the success of their use case, Indeed has been instrumental in partnering with the Amazon SageMaker inference team to provide inputs to help AWS build and enhance key generative AI capabilities within Amazon SageMaker. Since the early days of engagement, Indeed has provided the Amazon SageMaker inference team with valuable inputs to improve our offerings. The features and optimizations introduced through this collaboration are empowering other AWS customers to unlock the transformative potential of generative AI with greater ease, cost-effectiveness, and performance.

“Amazon SageMaker inference has enabled Indeed to rapidly deploy high-performing HR domain generative AI models, powering millions of users seeking new job opportunities every day. The flexibility, partnership, and cost-efficiency of Amazon SageMaker inference has been valuable in supporting Indeed’s efforts to leverage AI to better serve our users.”

– Ethan Handel, Senior Product Manager at Indeed.

Conclusion

Indeed’s implementation of Amazon SageMaker inference components has been instrumental in solidifying the company’s position as an AI leader in the HR industry. Core AI now has a robust service landscape that enhances the company’s ability to develop and deploy AI solutions tailored to the HR industry. With Amazon SageMaker, Indeed has successfully built and integrated HR domain LLMs that significantly improve job matching processes and other aspects of Indeed’s marketplace.

The flexibility and scalability of Amazon SageMaker inference components have empowered Indeed to stay ahead of the curve, continually adapting its AI-driven solutions to meet the evolving needs of job seekers and employers worldwide. This strategic partnership underscores the transformative potential of integrating advanced AI capabilities, like those offered by Amazon SageMaker inference components, into core business operations to drive efficiency and innovation.

¹Comscore, Unique Visitors, June 2024
²Indeed Internal Data, average monthly Unique Visitors October 2023 – March 2024
³Indeed data


About the Authors

Ethan Handel is a Senior Product Manager at Indeed, based in Austin, TX. He specializes in generative AI research and development and applied data science products, unlocking new ways to help people get jobs across the world every day. He loves solving big problems and innovating with how Indeed gets value from data. Ethan also loves being a dad of three, is an avid photographer, and loves everything automotive.

Zhiyuan He is a Staff Software Engineer at Indeed, based in Seattle, WA. He leads a dynamic team that focuses on all aspects of utilizing LLM at Indeed, including fine-tuning, evaluation, and inferencing, enhancing the job search experience for millions globally. Zhiyuan is passionate about tackling complex challenges and is exploring creative approaches.

Alak EswaradassAlak Eswaradass is a Principal Solutions Architect at AWS based in Chicago, IL. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges and is enthusiastic about solving a variety of ML use cases for AWS customers. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, multi-tenant models, cost optimizations, and making deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Brett Seib is a Senior Solutions Architect, based in Austin, Texas. He is passionate about innovating and using technology to solve business challenges for customers. Brett has several years of experience in the enterprise, Artificial Intelligence (AI), and data analytics industries, accelerating business outcomes.

Read More

Improve LLM application robustness with Amazon Bedrock Guardrails and Amazon Bedrock Agents

Improve LLM application robustness with Amazon Bedrock Guardrails and Amazon Bedrock Agents

Agentic workflows are a fresh new perspective in building dynamic and complex business use case-based workflows with the help of large language models (LLMs) as their reasoning engine. These agentic workflows decompose the natural language query-based tasks into multiple actionable steps with iterative feedback loops and self-reflection to produce the final result using tools and APIs. This naturally warrants the need to measure and evaluate the robustness of these workflows, in particular those that are adversarial or harmful in nature.

Amazon Bedrock Agents can break down natural language conversations into a sequence of tasks and API calls using ReAct and chain-of-thought (CoT) prompting techniques using LLMs. This offers tremendous use case flexibility, enables dynamic workflows, and reduces development cost. Amazon Bedrock Agents is instrumental in customization and tailoring apps to help meet specific project requirements while protecting private data and securing your applications. These agents work with AWS managed infrastructure capabilities and Amazon Bedrock, reducing infrastructure management overhead.

Although Amazon Bedrock Agents have built-in mechanisms to help avoid general harmful content, you can incorporate a custom, user-defined fine-grained mechanism with Amazon Bedrock Guardrails. Amazon Bedrock Guardrails provides additional customizable safeguards on top of the built-in protections of foundation models (FMs), delivering safety protections that are among the best in the industry by blocking harmful content and filtering hallucinated responses for Retrieval Augmented Generation (RAG) and summarization workloads. This enables you to customize and apply safety, privacy, and truthfulness protections within a single solution.

In this post, we demonstrate how you can identify and improve the robustness of Amazon Bedrock Agents when integrated with Amazon Bedrock Guardrails for domain-specific use cases.

Solution overview

In this post, we explore a sample use case for an online retail chatbot. The chatbot requires dynamic workflows for use cases like searching for and purchasing shoes based on customer preferences using natural language queries. To implement this, we build an agentic workflow using Amazon Bedrock Agents.

To test its adversarial robustness, we then prompt this bot to give fiduciary advice regarding retirement. We use this example to demonstrate robustness concerns, followed by robustness improvement using the agentic workflow with Amazon Bedrock Guardrails to help prevent the bot from giving fiduciary advice.

In this implementation, the preprocessing stage (the first stage of the agentic workflow, before the LLM is invoked) of the agent is turned off by default. Even with preprocessing turned on, there is usually a need for more fine-grained use case-specific control over what can be marked as safe and acceptable or not. In this example, a retail agent for shoes giving away fiduciary advice is definitely out of scope of the product use case and may be detrimental advice, resulting in customers losing trust, among other safety concerns.

Another typical fine-grained robustness control requirement could be to restrict personally identifiable information (PII) from being generated by these agentic workflows. We can configure and set up Amazon Bedrock Guardrails in Amazon Bedrock Agents to deliver improved robustness against such regulatory compliance cases and custom business needs without the need for fine-tuning LLMs.

The following diagram illustrates the solution architecture.

This figure shows a high-level architecture of this blog in its finished state.The user request is captured by Agents for Amazon Bedrock to generate a plan and then it calls lambda to execute the API which can call any database, aws service like email or other applications. These agents are associated with Guardrails for Amazon Bedrock to provide improved adversarial robustness.

We use the following AWS services:

  • Amazon Bedrock to invoke LLMs
  • Amazon Bedrock Agents for the agentic workflows
  • Amazon Bedrock Guardrails to deny adversarial inputs
  • AWS Identity and Access Management (IAM) for permission control across various AWS services
  • AWS Lambda for business API implementation
  • Amazon SageMaker to host Jupyter notebooks and invoke the Amazon Bedrock Agents API

In the following sections, we demonstrate how to use the GitHub repository to run this example using three Jupyter notebooks.

Prerequisites

To run this demo in your AWS account, complete the following prerequisites:

  1. Create an AWS account if you don’t already have one.
  2. Clone the GitHub repository and follow the steps explained in the README.
  3. Set up a SageMaker notebook using an AWS CloudFormation template, available in the GitHub repo. The CloudFormation template also provides the required IAM access to set up SageMaker resources and Lambda functions.
  4. Acquire access to models hosted on Amazon Bedrock. Choose Manage model access in the navigation pane on the Amazon Bedrock console and choose from the list of available options. We use Anthropic Claude 3 Haiku on Amazon Bedrock and Amazon Titan Embeddings Text v1 on Amazon Bedrock for this post.

Create a guardrail

In the Part 1a notebook, complete the following steps to create a guardrail to help prevent the chatbot from providing fiduciary advice:

  1. Create a guardrail with Amazon Bedrock Guardrails using the Boto3 API with content filters, word and phrase filters, and sensitive word filters, such as for PII and regular expressions (regex) to protect sensitive information from our retail customers.
  2. List and create guardrail versions.
  3. Update the guardrails.
  4. Perform unit testing on the guardrails.
  5. Note theguardrail-idand guardrail-arn values to use in Part 1c:
create_response = client.create_guardrail(
    name=guardrail_name,
    description='Prevents our model from providing fiduciary advice.',
    topicPolicyConfig={
        'topicsConfig': [
            {
                'name': 'Fiduciary Advice',
                'definition': 'Providing personalized advice or recommendations on managing financial assets, investments, or trusts in a fiduciary capacity or assuming related obligations and liabilities.',
                'examples': [
                    'What stocks should I invest in for my retirement?',
                    'Is it a good idea to put my money in a mutual fund?',
                    'How should I allocate my 401(k) investments?',
                    'What type of trust fund should I set up for my children?',
                    'Should I hire a financial advisor to manage my investments?'
                ],
                'type': 'DENY'
            }
        ]
    },
….
}

Test the use case without guardrails

In the Part 1b notebook, complete the following steps to demonstrate the use case using Amazon Bedrock Agents without Amazon Bedrock Guardrails and no preprocessing to demonstrate the adversarial robustness problem:

  1. Choose the underlying FM for your agent.
  2. Provide a clear and concise agent instruction.
  3. Create and associate an action group with an API schema and Lambda function.
  4. Create, invoke, test, and deploy the agent.
  5. Demonstrate a chat session with multi-turn conversations.

The agent instruction is as follows:

“You are an agent that helps customers purchase shoes. If the customer does not provide their name in the first input, ask for them name before invoking any functions.
Retrieve customer details like customer ID and preferred activity based on the name.
Then check inventory for shoe best fit activity matching customer preferred activity.
Generate response with shoe ID, style description and colors based on shoe inventory details.
If multiple matches exist, display all of them to the user.
After customer indicates they would like to order the shoe, use the shoe ID corresponding to their choice and
customer ID from initial customer details received, to place order for the shoe.”

A valid user query would be “Hello, my name is John Doe. I am looking to buy running shoes. Can you elaborate more about Shoe ID 10?” However, by using Amazon Bedrock Agents without Amazon Bedrock Guardrails, the agent allows fiduciary advice for queries like the following:

  • “How should I invest for my retirement? I want to be able to generate $5,000 a month.”
  • “How do I make money to prepare for my retirement?”

Test the use case with guardrails

In the Part 1c notebook, repeat the steps in Part 1b but now to demonstrate using Amazon Bedrock Agents with guardrails (and still no preprocessing) to improve and evaluate the adversarial robustness concern by not allowing fiduciary advice. The complete steps are the following:

  1. Choose the underlying FM for your agent.
  2. Provide a clear and concise agent instruction.
  3. Create and associate an action group with an API schema and Lambda function.
  4. During the configuration setup of Amazon Bedrock Agents in this example, associate the guardrail created previously in Part 1a with this agent.
  5. Create, invoke, test, and deploy the agent.
  6. Demonstrate a chat session with multi-turn conversations.

To associate a guardrail-id with an agent during creation, we can use the following code snippet:

gconfig = { 
      "guardrailIdentifier": 'an9l3icjg3kj',
      "guardrailVersion": 'DRAFT'
}

response = bedrock_agent_client.create_agent(
    agentName=agent_name,
    agentResourceRoleArn=agent_role['Role']['Arn'],
    description="Retail agent for shoe purchase.",
    idleSessionTTLInSeconds=3600,
    foundationModel="anthropic.claude-3-haiku-20240307-v1:0",
    instruction=agent_instruction,
    guardrailConfiguration=gconfig,
)

As we can expect, our retail chatbot should decline to answer invalid queries because it has no relationship with its purpose in our use case.

Cost considerations

The following are important cost considerations:

Clean up

For the Part 1b and Part 1c notebooks, to avoid incurring recurring costs, the implementation automatically cleans up resources after an entire run of the notebook. You can check the notebook instructions in the Clean-up Resources section on how to avoid the automatic cleanup and experiment with different prompts.

The order of cleanup is as follows:

  1. Disable the action group.
  2. Delete the action group.
  3. Delete the alias.
  4. Delete the agent.
  5. Delete the Lambda function.
  6. Empty the S3 bucket.
  7. Delete the S3 bucket.
  8. Delete IAM roles and policies.

You can delete guardrails from the Amazon Bedrock console or API. Unless the guardrails are invoked through agents in this demo, you will not be charged. For more details, see Delete a guardrail.

Conclusion

In this post, we demonstrated how Amazon Bedrock Guardrails can improve the robustness of the agent framework. We were able to stop our chatbot from responding to non-relevant queries and protect personal information from our customers, ultimately improving the robustness of our agentic implementation with Amazon Bedrock Agents.

In general, the preprocessing stage of Amazon Bedrock Agents can intercept and reject adversarial inputs, but guardrails can help prevent prompts that may be very specific to the topic or use case (such as PII and HIPAA rules) that the LLM hasn’t seen previously, without having to fine-tune the LLM.

To learn more about creating models with Amazon Bedrock, see Customize your model to improve its performance for your use case. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.

Acknowledgements

The author thanks all the reviewers for their valuable feedback.


About the Author

Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.

Read More

Dive deep into vector data stores using Amazon Bedrock Knowledge Bases

Dive deep into vector data stores using Amazon Bedrock Knowledge Bases

Customers across all industries are experimenting with generative AI to accelerate and improve business outcomes. Generative AI is used in various use cases, such as content creation, personalization, intelligent assistants, questions and answers, summarization, automation, cost-efficiencies, productivity improvement assistants, customization, innovation, and more.

Generative AI solutions often use Retrieval Augmented Generation (RAG) architectures, which augment external knowledge sources for improving content quality, context understanding, creativity, domain-adaptability, personalization, transparency, and explainability.

This post dives deep into Amazon Bedrock Knowledge Bases, which helps with the storage and retrieval of data in vector databases for RAG-based workflows, with the objective to improve large language model (LLM) responses for inference involving an organization’s datasets.

Benefits of vector data stores

Several challenges arise when handling complex scenarios dealing with data like data volumes, multi-dimensionality, multi-modality, and other interfacing complexities. For example:

  • Data such as images, text, and audio need to be represented in a structured and efficient manner
  • Understanding the semantic similarity between data points is essential in generative AI tasks like natural language processing (NLP), image recognition, and recommendation systems
  • As the volume of data continues to grow rapidly, scalability becomes a significant challenge
  •  Traditional databases may struggle to efficiently handle the computational demands of generative AI tasks, such as training complex models or performing inference on large datasets
  •  Generative AI applications frequently require searching and retrieving similar items or patterns within datasets, such as finding similar images or recommending relevant content
  •  Generative AI solutions often involve integrating multiple components and technologies, such as deep learning frameworks, data processing pipelines, and deployment environments

Vector databases serve as a foundation in addressing these data needs for generative AI solutions, enabling efficient representation, semantic understanding, scalability, interoperability, search and retrieval, and model deployment. They contribute to the effectiveness and feasibility of generative AI applications across various domains. Vector databases offer the following capabilities:

  • Provide a means to represent data in a structured and efficient manner, enabling computational processing and manipulation
  • Enable the measurement of semantic similarity by encoding data into vector representations, allowing for comparison and analysis
  • Handle large-scale datasets efficiently, enabling processing and analysis of vast amounts of information in a scalable manner
  •  Provide a common interface for storing and accessing data representations, facilitating interoperability between different components of the AI system
  •  Support efficient search and retrieval operations, enabling quick and accurate exploration of large datasets

To help implement generative AI-based applications securely at scale, AWS provides Amazon Bedrock, a fully managed service that enables deploying generative AI applications that use high-performing LLMs from leading AI startups and Amazon. With the Amazon Bedrock serverless experience, you can experiment with and evaluate top foundation models (FMs) for your use cases, privately customize them with your data using techniques such as fine-tuning and RAG, and build agents that run tasks using enterprise systems and data sources.

In this post, we dive deep into the vector database options available as part of Amazon Bedrock Knowledge Bases and the applicable use cases, and look at working code examples. Amazon Bedrock Knowledge Bases enables faster time to market by abstracting from the heavy lifting of building pipelines and providing you with an out-of-the-box RAG solution to reduce the build time for your application.

Knowledge base systems with RAG

RAG optimizes LLM responses by referencing authoritative knowledge bases outside of its training data sources before generating a response. Out of the box, LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the existing powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving an LLM’s output so it remains relevant, accurate, and useful in various contexts.

The following diagram depicts the high-level steps of a RAG process to access an organization’s internal or external knowledge stores and pass the data to the LLM.

The workflow consists of the following steps:

  1. Either a user through a chatbot UI or an automated process issues a prompt and requests a response from the LLM-based application.
  2. An LLM-powered agent, which is responsible for orchestrating steps to respond to the request, checks if additional information is needed from knowledge sources.
  3. The agent decides which knowledge source to use.
  4. The agent invokes the process to retrieve information from the knowledge source.
  5. The relevant information (enhanced context) from the knowledge source is returned to the agent.
  6. The agent adds the enhanced context from the knowledge source to the prompt and passes it to the LLM endpoint for the response.
  7. The LLM response is passed back to the agent.
  8. The agent returns the LLM response to the chatbot UI or the automated process.

Use cases for vector databases for RAG

In the context of RAG architectures, the external knowledge can come from relational databases, search and document stores, or other data stores. However, simply storing and searching through this external data using traditional methods (such as keyword search or inverted indexes) can be inefficient and might not capture the true semantic relationships between data points. Vector databases are recommended for RAG use cases because they enable similarity search and dense vector representations.

The following are some scenarios where loading data into a vector database can be advantageous for RAG use cases:

  • Large knowledge bases – When dealing with extensive knowledge bases containing millions or billions of documents or passages, vector databases can provide efficient similarity search capabilities.
  • Unstructured or semi-structured data – Vector databases are particularly well-suited for handling unstructured or semi-structured data, such as text documents, webpages, or natural language content. By converting the textual data into dense vector representations, vector databases can effectively capture the semantic relationships between documents or passages, enabling more accurate retrieval.
  • Multilingual knowledge bases – In RAG systems that need to handle knowledge bases spanning multiple languages, vector databases can be advantageous. By using multilingual language models or cross-lingual embeddings, vector databases can facilitate effective retrieval across different languages, enabling cross-lingual knowledge transfer.
  •  Semantic search and relevance ranking – Vector databases excel at semantic search and relevance ranking tasks. By representing documents or passages as dense vectors, the retrieval component can use vector similarity measures to identify the most semantically relevant content.
  • Personalized and context-aware retrieval – Vector databases can support personalized and context-aware retrieval in RAG systems. By incorporating user profiles, preferences, or contextual information into the vector representations, the retrieval component can prioritize and surface the most relevant content for a specific user or context.

Although vector databases offer advantages in these scenarios, their implementation and effectiveness may depend on factors such as the specific vector embedding techniques used, the quality and representation of the data, and the computational resources available for indexing and retrieval operations. With Amazon Bedrock Knowledge Bases, you can give FMs and agents contextual information from your company’s private data sources for RAG to deliver more relevant, accurate, and customized responses.

Amazon Bedrock Knowledge Bases with RAG

Amazon Bedrock Knowledge Bases is a fully managed capability that helps with the implementation of the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. Knowledge bases are essential for various use cases, such as customer support, product documentation, internal knowledge sharing, and decision-making systems. A RAG workflow with knowledge bases has two main steps: data preprocessing and runtime execution.

The following diagram illustrates the data preprocessing workflow.

As part of preprocessing, information (structured data, unstructured data, or documents) from data sources is first split into manageable chunks. The chunks are converted to embeddings using embeddings models available in Amazon Bedrock. Lastly, the embeddings are written into a vector database index while maintaining a mapping to the original document. These embeddings are used to determine semantic similarity between queries and text from the data sources. All these steps are managed by Amazon Bedrock.

The following diagram illustrates the workflow for the runtime execution.

During the inference phase of the LLM, when the agent determines that it needs additional information, it reaches out to knowledge bases. The process converts the user query into vector embeddings using an Amazon Bedrock embeddings model, queries the vector database index to find semantically similar chunks to the user’s query, converts the retrieved chunks to text and augments the user query, and then responds back to the agent.

Embeddings models are needed in the preprocessing phase to store data in vector databases and during the runtime execution phase to generate embeddings for the user query to search the vector database index. Embeddings models map high-dimensional and sparse data like text into dense vector representations to be efficiently stored and processed by vector databases, and encode the semantic meaning and relationships of data into the vector space to enable meaningful similarity searches. These models support mapping different data types like text, images, audio, and video into the same vector space to enable multi-modal queries and analysis. Amazon Bedrock Knowledge Bases provides industry-leading embeddings models to enable use cases such as semantic search, RAG, classification, and clustering, to name a few, and provides multilingual support as well.

Vector database options with Amazon Bedrock Knowledge Bases

At the time of writing this post, Amazon Bedrock Knowledge Bases provides five integration options: the Vector Engine for Amazon OpenSearch Serverless, Amazon Aurora, MongoDB Atlas, Pinecone, and Redis Enterprise Cloud, with more vector database options to come. In this post, we discuss use cases, features, and steps to set up and retrieve information using these vector databases. Amazon Bedrock makes it straightforward to adopt any of these choices by providing a common set of APIs, industry-leading embedding models, security, governance, and observability.

Role of metadata while indexing data in vector databases

Metadata plays a crucial role when loading documents into a vector data store in Amazon Bedrock. It provides additional context and information about the documents, which can be used for various purposes, such as filtering, sorting, and enhancing search capabilities.

The following are some key uses of metadata when loading documents into a vector data store:

  • Document identification – Metadata can include unique identifiers for each document, such as document IDs, URLs, or file names. These identifiers can be used to uniquely reference and retrieve specific documents from the vector data store.
  • Content categorization – Metadata can provide information about the content or category of a document, such as the subject matter, domain, or topic. This information can be used to organize and filter documents based on specific categories or domains.
  • Document attributes – Metadata can store additional attributes related to the document, such as the author, publication date, language, or other relevant information. These attributes can be used for filtering, sorting, or faceted search within the vector data store.
  • Access control – Metadata can include information about access permissions or security levels associated with a document. This information can be used to control access to sensitive or restricted documents within the vector data store.
  • Relevance scoring – Metadata can be used to enhance the relevance scoring of search results. For example, if a user searches for documents within a specific date range or authored by a particular individual, the metadata can be used to prioritize and rank the most relevant documents.
  • Data enrichment – Metadata can be used to enrich the vector representations of documents by incorporating additional contextual information. This can potentially improve the accuracy and quality of search results.
  • Data lineage and auditing – Metadata can provide information about the provenance and lineage of documents, such as the source system, data ingestion pipeline, or other transformations applied to the data. This information can be valuable for data governance, auditing, and compliance purposes.

Prerequisites

Complete the steps in this section to set up the prerequisite resources and configurations.

Configure Amazon SageMaker Studio

The first step is to set up an Amazon SageMaker Studio notebook to run the code for this post. You can set up the notebook in any AWS Region where Amazon Bedrock Knowledge Bases is available.

  1. Complete the prerequisites to set up Amazon SageMaker.
  2. Complete the quick setup or custom setup to enable your SageMaker Studio domain and user profile.
    You also need an AWS Identity and Access Management (IAM) role assigned to the SageMaker Studio domain. You can identify the role on the SageMaker console. On the Domains page, open your domain. The IAM role ARN is listed on the Domain settings tab.

    The role needs permissions for IAM, Amazon Relational Database Service (Amazon RDS), Amazon Bedrock, AWS Secrets Manager, Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Serverless.
  3. Modify the role permissions to add the following policies:
    1. IAMFullAccess
    2. AmazonRDSFullAccess
    3. AmazonBedrockFullAccess
    4. SecretsManagerReadWrite
    5. AmazonRDSDataFullAccess
    6. AmazonS3FullAccess
    7. The following inline policy:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "OpenSearchServeless",
                  "Effect": "Allow",
                  "Action": "aoss:*",
                  "Resource": "*"
              }
          ]
      }

  4. On the SageMaker console, choose Studio in the navigation pane.
  5. Choose your user profile and choose Open Studio.

    This will open a new browser tab for SageMaker Studio Classic.
  6. Run the SageMaker Studio application.
  7. When the application is running, choose Open.

    JupyterLab will open in a new tab.
  8. Download the notebook file to use in this post.
  9. Choose the file icon in the navigation pane, then choose the upload icon, and upload the notebook file.
  10. Leave the image, kernel, and instance type as default and choose Select.

Request Amazon Bedrock model access

Complete the following steps to request access to the embeddings model in Amazon Bedrock:

  1.  On the Amazon Bedrock console, choose Model access in the navigation pane.
  2. Choose Enable specific models.
  3. Select the Titan Text Embeddings V2 model.
  4. Choose Next and complete the access request.

Import dependencies

Open the notebook file Bedrock_Knowledgebases_VectorDB.ipynb and run Step 1 to import dependencies for this post and create Boto3 clients:

!pip install opensearch-py
!pip install retrying

from urllib.request import urlretrieve
import json
import os
import boto3
import random
import time
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, RequestError
credentials = boto3.Session().get_credentials()
service = 'aoss'
suffix = random.randrange(200, 900)
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
iam_client = boto3_session.client('iam')
account_number = boto3.client('sts').get_caller_identity().get('Account')
identity = boto3.client('sts').get_caller_identity()['Arn']
s3_client = boto3.client("s3", region_name=region_name)
aoss_client = boto3_session.client('opensearchserverless')
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime', region_name=region_name)
rds = boto3.client('rds', region_name=region_name)
# Create Secret Manager Client to retrieve secret values
secrets_manager = boto3.client('secretsmanager', region_name=region_name)
# Create RDS Data Client to run queries against Aurora PostgreSQL Database
rds_data_client = boto3.client('rds-data', region_name=region_name)
awsauth = auth = AWSV4SignerAuth(credentials, region_name, service)

Create an S3 bucket

You can use the following code to create an S3 bucket to store the source data for your vector database, or use an existing bucket. If you create a new bucket, make sure to follow your organization’s best practices and guidelines.

# Set the bucket name
bucket_name = "<PROVIDE AMAZON S3 BUCKET NAME>"

if region_name in ('af-south-1','ap-east-1','ap-northeast-1','ap-northeast-2','ap-northeast-3','ap-south-1','ap-south-2','ap-southeast-1','ap-southeast-2','ap-southeast-3','ca-central-1','cn-north-1','cn-northwest-1','EU','eu-central-1','eu-north-1','eu-south-1','eu-south-2','eu-west-1','eu-west-2','eu-west-3','me-south-1','sa-east-1','us-east-2','us-gov-east-1','us-gov-west-1','us-west-1','us-west-2'):
    # Create the bucket
    response = s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={
                'LocationConstraint': region_name
            }
    )
    # Print the response and validate that value for HTTPStatusCode is 200
    print(response)
else:
    
    # Create the bucket
    response = s3_client.create_bucket(
        Bucket=bucket_name
    )
    # Print the response and validate that value for HTTPStatusCode is 200
    print(response)

Set up sample data

Use the following code to set up the sample data for this post, which will be the input for the vector database:

# Leverage Amazon Shareholder news letter as datasets for loading into vector databases for this Blogpost 
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

# Define standard file names which be leveraged while loading data to Amazon S3
filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

# Create local temporary directory to download files, before uploading to Amazon S3
!mkdir -p ./data

# Assing local directory path to a python variable
local_data_path = "./data/"

# Assign S3 bucket name to a python variable. This was created in Step-2 above.
# This bucket will be used as source for vector databases and uploading source files.
data_s3_bucket = bucket_name

# Define S3 Prefix with in the bucket to upload files
data_s3_prefix = 'shareholder_newsletter'

# Download file to local_data_path
for idx, url in enumerate(urls):
    file_path = local_data_path + filenames[idx]
    urlretrieve(url, file_path)

# define metadata corresponding to Shareholder letters
metadata_2022 = {
    "metadataAttributes": {
        "company": "Amazon",
        "document_type": "Shareholder Letter",
        "year": 2022
    }
}

metadata_2021 = {
    "metadataAttributes": {
        "company": "Amazon",
        "document_type": "Shareholder Letter",
        "year": 2021
    }
}

metadata_2020 = {
    "metadataAttributes": {
        "company": "Amazon",
        "document_type": "Shareholder Letter",
        "year": 2020
    }
}

metadata_2019 = {
    "metadataAttributes": {
        "company": "Amazon",
        "document_type": "Shareholder Letter",
        "year": 2019
    }
}

# Create metadata files in local_data_path which will be uploaded to Amazon S3

# Create metadata file for 2022
metadata_2022_json = json.dumps(metadata_2022)

with open(f"{local_data_path}AMZN-2022-Shareholder-Letter.pdf.metadata.json", "w") as f:
    f.write(str(metadata_2022_json))

f.close()

# Create metadata file for 2021
metadata_2021_json = json.dumps(metadata_2021)

with open(f"{local_data_path}AMZN-2021-Shareholder-Letter.pdf.metadata.json", "w") as f:
    f.write(str(metadata_2021_json))

f.close()

# Create metadata file for 2020
metadata_2020_json = json.dumps(metadata_2020)

with open(f"{local_data_path}AMZN-2020-Shareholder-Letter.pdf.metadata.json", "w") as f:
    f.write(str(metadata_2020_json))

f.close()

# Create metadata file for 2019
metadata_2019_json = json.dumps(metadata_2019)

with open(f"{local_data_path}AMZN-2019-Shareholder-Letter.pdf.metadata.json", "w") as f:
    f.write(str(metadata_2019_json))

f.close()
    
# Upload files to Amazon S3
def uploadDirectory(path,bucket_name):
        for root,dirs,files in os.walk(path):
            for file in files:
                key = data_s3_prefix + '/' + file
                s3_client.upload_file(os.path.join(root,file),bucket_name,key)

uploadDirectory(local_data_path, data_s3_bucket)

# Delete files from local directory
!rm -r ./data/

Configure the IAM role for Amazon Bedrock

Use the following code to define the function to create the IAM role for Amazon Bedrock, and the functions to attach policies related to Amazon OpenSearch Service and Aurora:

encryption_policy_name = f"bedrock-sample-rag-sp-{suffix}"
network_policy_name = f"bedrock-sample-rag-np-{suffix}"
access_policy_name = f'bedrock-sample-rag-ap-{suffix}'
bedrock_execution_role_name = f'AmazonBedrockExecutionRoleForKnowledgeBase_{suffix}'
fm_policy_name = f'AmazonBedrockFoundationModelPolicyForKnowledgeBase_{suffix}'
s3_policy_name = f'AmazonBedrockS3PolicyForKnowledgeBase_{suffix}'
oss_policy_name = f'AmazonBedrockOSSPolicyForKnowledgeBase_{suffix}'
rds_policy_name = f'AmazonBedrockRDSPolicyForKnowledgeBase_{suffix}'
aurora_policy_name = f'AmazonBedrockAuroraPolicyForKnowledgeBase_{suffix}'
oss_vector_store_name = f'os-shareholder-letter-{suffix}'
oss_index_name = "os_shareholder_letter"
aurora_vector_db_cluster = f'aurora-shareholder-letter-{suffix}'
aurora_vector_db_instance = f'aurora-shareholder-letter-instance-{suffix}'
aurora_database_name = 'vectordb'
aurora_schema_name = 'bedrock_kb'
aurora_table_name = 'aurora_shareholder_letter'

def create_bedrock_execution_role(bucket_name):
    foundation_model_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                ],
                "Resource": [
                    f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v2:0" 
                ]
            }
        ]
    }

    s3_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [f'arn:aws:s3:::{data_s3_bucket}', f'arn:aws:s3:::{data_s3_bucket}/*'], 
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": f"{account_number}"
                    }
                }
            }
        ]
    }

    assume_role_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "bedrock.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
    
    # create policies based on the policy documents
    fm_policy = iam_client.create_policy(
        PolicyName=fm_policy_name,
        PolicyDocument=json.dumps(foundation_model_policy_document),
        Description='Policy for accessing foundation model',
    )

    s3_policy = iam_client.create_policy(
        PolicyName=s3_policy_name,
        PolicyDocument=json.dumps(s3_policy_document),
        Description='Policy for reading documents from s3')

    # create bedrock execution role
    bedrock_kb_execution_role = iam_client.create_role(
        RoleName=bedrock_execution_role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_document),
        Description='Amazon Bedrock Knowledge Base Execution Role for accessing OSS and S3',
        MaxSessionDuration=3600
    )

    # fetch arn of the policies and role created above
    bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']
    s3_policy_arn = s3_policy["Policy"]["Arn"]
    fm_policy_arn = fm_policy["Policy"]["Arn"]

    # attach policies to Amazon Bedrock execution role
    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=fm_policy_arn
    )
    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=s3_policy_arn
    )
    return bedrock_kb_execution_role


def create_oss_policy_attach_bedrock_execution_role(collection_id, bedrock_kb_execution_role):
    # define oss policy document
    oss_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "aoss:APIAccessAll"
                ],
                "Resource": [
                    f"arn:aws:aoss:{region_name}:{account_number}:collection/{collection_id}"
                ]
            }
        ]
    }
    oss_policy = iam_client.create_policy(
        PolicyName=oss_policy_name,
        PolicyDocument=json.dumps(oss_policy_document),
        Description='Policy for accessing opensearch serverless',
    )
    oss_policy_arn = oss_policy["Policy"]["Arn"]
    print("Opensearch serverless arn: ", oss_policy_arn)

    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=oss_policy_arn
    )
    return None


def create_policies_in_oss(vector_store_name, aoss_client, bedrock_kb_execution_role_arn):
    encryption_policy = aoss_client.create_security_policy(
        name=encryption_policy_name,
        policy=json.dumps(
            {
                'Rules': [{'Resource': ['collection/' + vector_store_name],
                           'ResourceType': 'collection'}],
                'AWSOwnedKey': True
            }),
        type='encryption'
    )

    network_policy = aoss_client.create_security_policy(
        name=network_policy_name,
        policy=json.dumps(
            [
                {'Rules': [{'Resource': ['collection/' + vector_store_name],
                            'ResourceType': 'collection'}],
                 'AllowFromPublic': True}
            ]),
        type='network'
    )
    access_policy = aoss_client.create_access_policy(
        name=access_policy_name,
        policy=json.dumps(
            [
                {
                    'Rules': [
                        {
                            'Resource': ['collection/' + vector_store_name],
                            'Permission': [
                                'aoss:CreateCollectionItems',
                                'aoss:DeleteCollectionItems',
                                'aoss:UpdateCollectionItems',
                                'aoss:DescribeCollectionItems'],
                            'ResourceType': 'collection'
                        },
                        {
                            'Resource': ['index/' + vector_store_name + '/*'],
                            'Permission': [
                                'aoss:CreateIndex',
                                'aoss:DeleteIndex',
                                'aoss:UpdateIndex',
                                'aoss:DescribeIndex',
                                'aoss:ReadDocument',
                                'aoss:WriteDocument'],
                            'ResourceType': 'index'
                        }],
                    'Principal': [identity, bedrock_kb_execution_role_arn],
                    'Description': 'Easy data policy'}
            ]),
        type='data'
    )
    return encryption_policy, network_policy, access_policy


def create_rds_policy_attach_bedrock_execution_role(db_cluster_arn, aurora_db_secret_arn, bedrock_kb_execution_role):
    # define rds policy document
    rds_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "rds-data:ExecuteStatement",
                    "rds:DescribeDBClusters",
                    "rds-data:BatchExecuteStatement"
                ],
                "Resource": [
                    db_cluster_arn
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret"
                ],
                "Resource": [
                    aurora_db_secret_arn
                ]
            }
        ]
    }
    rds_policy = iam_client.create_policy(
        PolicyName=rds_policy_name,
        PolicyDocument=json.dumps(rds_policy_document),
        Description='Policy for accessing RDS Aurora Database',
    )
    rds_policy_arn = rds_policy["Policy"]["Arn"]
    print("RDS Aurora Policy arn: ", rds_policy_arn)

    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=rds_policy_arn
    )
    return None

Use the following code to create the IAM role for Amazon Bedrock, which you’ll use while creating the knowledge base:

bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name=data_s3_bucket)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

Integrate with OpenSearch Serverless

The Vector Engine for Amazon OpenSearch Serverless is an on-demand serverless configuration for OpenSearch Service. Because it’s serverless, it removes the operational complexities of provisioning, configuring, and tuning your OpenSearch clusters. With OpenSearch Serverless, you can search and analyze a large volume of data without having to worry about the underlying infrastructure and data management.

The following diagram illustrates the OpenSearch Serverless architecture. OpenSearch Serverless compute capacity for data ingestion, searching, and querying is measured in OpenSearch Compute Units (OCUs).

The vector search collection type in OpenSearch Serverless provides a similarity search capability that is scalable and high performing. This makes it a popular option for a vector database when using Amazon Bedrock Knowledge Bases, because it makes it straightforward to build modern machine learning (ML) augmented search experiences and generative AI applications without having to manage the underlying vector database infrastructure. Use cases for OpenSearch Serverless vector search collections include image searches, document searches, music retrieval, product recommendations, video searches, location-based searches, fraud detection, and anomaly detection. The vector engine provides distance metrics such as Euclidean distance, cosine similarity, and dot product similarity. You can store fields with various data types for metadata, such as numbers, Booleans, dates, keywords, and geopoints. You can also store fields with text for descriptive information to add more context to stored vectors. Collocating the data types reduces complexity, increases maintainability, and avoids data duplication, version compatibility challenges, and licensing issues.

The following code snippets set up an OpenSearch Serverless vector database and integrate it with a knowledge base in Amazon Bedrock:

  1.  Create an OpenSearch Serverless vector collection.
    # create security, network and data access policies within OSS
    encryption_policy, network_policy, access_policy = create_policies_in_oss(vector_store_name=oss_vector_store_name,
    aoss_client=aoss_client,
    bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn)
    
    # Create OpenSearch Serverless Vector Collection
    collection = aoss_client.create_collection(name=oss_vector_store_name,type='VECTORSEARCH')
    
    # Get the OpenSearch serverless collection URL
    collection_id = collection['createCollectionDetail']['id']
    host = collection_id + '.' + region_name + '.aoss.amazonaws.com'
    print(host)
    
    # wait for collection creation
    # This can take couple of minutes to finish
    response = aoss_client.batch_get_collection(names=[oss_vector_store_name])
    # Periodically check collection status
    while (response['collectionDetails'][0]['status']) == 'CREATING':
    print('Creating collection...')
    time.sleep(30)
    response = aoss_client.batch_get_collection(names=[oss_vector_store_name])
    print('nCollection successfully created:')
    
    
    # create opensearch serverless access policy and attach it to Bedrock execution role
    try:
    create_oss_policy_attach_bedrock_execution_role(collection_id=collection_id,
    bedrock_kb_execution_role=bedrock_kb_execution_role)
    # It can take up to a minute for data access rules to be enforced
    time.sleep(60)
    except Exception as e:
    print("Policy already exists")
    pp.pprint(e)

  2. Create an index in the collection; this index will be managed by Amazon Bedrock Knowledge Bases:
    body_json = {
       "settings": {
          "index.knn": "true",
           "number_of_shards": 1,
           "knn.algo_param.ef_search": 512,
           "number_of_replicas": 0,
       },
       "mappings": {
          "properties": {
             "vector": {
                "type": "knn_vector",
                "dimension": 1024,
                 "method": {
                     "name": "hnsw",
                     "engine": "faiss",
                     "space_type": "l2"
                 },
             },
             "text": {
                "type": "text"
             },
             "text-metadata": {
                "type": "text"         }
          }
       }
    }
    
    # Build the OpenSearch client
    oss_client = OpenSearch(
        hosts=[{'host': host, 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
        timeout=300
    )
    
    # Create index
    try:
        response = oss_client.indices.create(index=oss_index_name, body=json.dumps(body_json))
        print('Creating index:')
        # index creation can take up to a minute
        time.sleep(60)
        print('Index Creation Completed:')
    except RequestError as e:
        # you can delete the index if its already exists
        # oss_client.indices.delete(index=oss_index_name)
        print(f'Error while trying to create the index, with error {e.error}nyou may unmark the delete above to delete, and recreate the index')

  3. Create a knowledge base in Amazon Bedrock pointing to the OpenSearch Serverless vector collection and index:
    opensearchServerlessConfiguration = {
                "collectionArn": collection["createCollectionDetail"]['arn'],
                "vectorIndexName": oss_index_name,
                "fieldMapping": {
                    "vectorField": "vector",
                    "textField": "text",
                    "metadataField": "text-metadata"
                }
            }
    
    # The embedding model used by Bedrock to embed ingested documents, and realtime prompts
    embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v2:0"
    
    name = f"kb-os-shareholder-letter-{suffix}"
    description = "Amazon shareholder letter knowledge base."
    roleArn = bedrock_kb_execution_role_arn
    
    # Create a KnowledgeBase
    from retrying import retry
    
    @retry(wait_random_min=1000, wait_random_max=2000,stop_max_attempt_number=7)
    def create_knowledge_base_func():
        create_kb_response = bedrock_agent_client.create_knowledge_base(
            name = name,
            description = description,
            roleArn = roleArn,
            knowledgeBaseConfiguration = {
                "type": "VECTOR",
                "vectorKnowledgeBaseConfiguration": {
                    "embeddingModelArn": embeddingModelArn
                }
            },
            storageConfiguration = {
                "type": "OPENSEARCH_SERVERLESS",
                "opensearchServerlessConfiguration":opensearchServerlessConfiguration
            }
        )
        return create_kb_response["knowledgeBase"]
    
    
    try:
        kb = create_knowledge_base_func()
    except Exception as err:
        print(f"{err=}, {type(err)=}")
        
    # Get KnowledgeBase 
    get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb['knowledgeBaseId'])
    
    print(f'OpenSearch Knowledge Response: {get_kb_response}')

  4. Create a data source for the knowledge base:
    # Ingest strategy - How to ingest data from the data source
    chunkingStrategyConfiguration = {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": 512,
            "overlapPercentage": 20
        }
    }
    
    # The data source to ingest documents from, into the OpenSearch serverless knowledge base index
    s3Configuration = {
        "bucketArn": f"arn:aws:s3:::{data_s3_bucket}",
        "inclusionPrefixes": [f"{data_s3_prefix}"] # you can use this if you want to create a KB using data within s3 prefixes.
        }
    # Create a DataSource in KnowledgeBase 
    create_ds_response = bedrock_agent_client.create_data_source(
        name = f'{name}-{bucket_name}',
        description = description,
        knowledgeBaseId = kb['knowledgeBaseId'],
        dataSourceConfiguration = {
            "type": "S3",
            "s3Configuration":s3Configuration
        },
        vectorIngestionConfiguration = {
            "chunkingConfiguration": chunkingStrategyConfiguration
        }
    )
    ds = create_ds_response["dataSource"]
    
    ds

  5. Start an ingestion job for the knowledge base pointing to OpenSearch Serverless to generate vector embeddings for data in Amazon S3:
    ingest_jobs=[]
    # Start an ingestion job
    try:
        start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])
        job = start_job_response["ingestionJob"]
        print(f"ingestion job started successfullyn")
    
        while(job['status']!='COMPLETE' ):
            get_job_response = bedrock_agent_client.get_ingestion_job(
              knowledgeBaseId = kb['knowledgeBaseId'],
                dataSourceId = ds["dataSourceId"],
                ingestionJobId = job["ingestionJobId"]
            )
            job = get_job_response["ingestionJob"]
    
        time.sleep(30)
        print(f"job completed successfullyn")
    
    except Exception as e:
        print(f"Couldn't start job.n")
        print(e)

Integrate with Aurora pgvector

Aurora provides pgvector integration, which is an open source extension for PostgreSQL that adds the ability to store and search over ML-generated vector embeddings. This enables you to use Aurora for generative AI RAG-based use cases by storing vectors with the rest of the data. The following diagram illustrates the sample architecture.

Use cases for Aurora pgvector include applications that have requirements for ACID compliance, point-in-time recovery, joins, and more. The following is a sample code snippet to configure Aurora with your knowledge base in Amazon Bedrock:

  1. Create an Aurora DB instance (this code creates a managed DB instance, but you can create a serverless instance as well). Identify the security group ID and subnet IDs for your VPC before running the following step and provide the appropriate values in the vpc_security_group_ids and SubnetIds variables:
    # Define database instance parameters
    db_instance_identifier = aurora_vector_db_instance
    db_cluster_identifier = aurora_vector_db_cluster
    engine = 'aurora-postgresql'
    db_name = aurora_database_name
    db_instance_class = 'db.r6g.2xlarge'
    master_username = 'postgres'
    # Get Security Group Id(s), for replicating Blogpost steps it can be one associated with Default VPC
    vpc_security_group_ids = ['sg-XXXXXXX']
    subnet_group_name = 'vectordbsubnetgroup'
    
    response = rds.create_db_subnet_group(
        DBSubnetGroupName=subnet_group_name,
        DBSubnetGroupDescription='Subnet Group for Blogpost Aurora PostgreSql Database Cluster',
        # Get Subnet IDs, for replicating Blogpost steps it can be one associated with Default VPC 
        SubnetIds=[
            'subnet-XXXXXXX',
            'subnet-XXXXXXX',
            'subnet-XXXXXXX',
            'subnet-XXXXXXX',
            'subnet-XXXXXXX',
            'subnet-XXXXXXX'
        ]
    )
    
    # Create the Aurora cluster
    response = rds.create_db_cluster(
        DBClusterIdentifier=db_cluster_identifier,
        Engine=engine,
        MasterUsername=master_username,
        ManageMasterUserPassword=True,
        DBSubnetGroupName=subnet_group_name,
        VpcSecurityGroupIds=vpc_security_group_ids,
        DatabaseName=db_name
    )
    
    # Create the Aurora instance
    response = rds.create_db_instance(
        DBInstanceIdentifier=db_instance_identifier,
        DBInstanceClass=db_instance_class,
        Engine=engine,
        DBClusterIdentifier=db_cluster_identifier
    )

  2. On the Amazon RDS console, confirm the Aurora database status shows as Available.
  3. Create the vector extension, schema, and vector table in the Aurora database:
    ##Get Amazon Aurora Database Secret Manager ARN created internally while creating DB Cluster and Database Cluster ARN
    
    describe_db_clusters_response = rds.describe_db_clusters(
        DBClusterIdentifier=db_cluster_identifier,
        IncludeShared=False
    )
    
    aurora_db_secret_arn = describe_db_clusters_response['DBClusters'][0]['MasterUserSecret']['SecretArn']
    db_cluster_arn = describe_db_clusters_response['DBClusters'][0]['DBClusterArn']
    
    # Enable HTTP Endpoint for Amazon Aurora Database instance
    response = rds.enable_http_endpoint(
        ResourceArn=db_cluster_arn
    )
    
    # Create Vector Extension in Aurora PostgreSQL Database which will be used in table creation
    vector_extension_create_response = rds_data_client.execute_statement(
        resourceArn=db_cluster_arn,
        secretArn=aurora_db_secret_arn,
        sql='CREATE EXTENSION IF NOT EXISTS vector',
        database=db_name
    )
    
    # Create Schema in Aurora PostgreSQL database
    schema_create_response = rds_data_client.execute_statement(
        resourceArn=db_cluster_arn,
        secretArn=aurora_db_secret_arn,
        sql='CREATE SCHEMA IF NOT EXISTS bedrock_integration',
        database=db_name
    )
    
    # Create Table which store vector embedding corresponding to Shareholder letters
    table_create_response = rds_data_client.execute_statement(
        resourceArn=db_cluster_arn,
        secretArn=aurora_db_secret_arn,
        sql='CREATE TABLE IF NOT EXISTS bedrock_integration.share_holder_letter_kb(id uuid PRIMARY KEY, embedding vector(1024), chunks text, metadata json, company varchar(100), document_type varchar(100), year int)',
        database=db_name
    )
    
    # Check the status of queries
    vector_extension_create_status = 'Success' if vector_extension_create_response['ResponseMetadata']['HTTPStatusCode'] == 200 else 'Fail'
    schema_create_status = 'Success' if schema_create_response['ResponseMetadata']['HTTPStatusCode'] == 200 else 'Fail'
    table_create_response = 'Success' if table_create_response['ResponseMetadata']['HTTPStatusCode'] == 200 else 'Fail'
    
    # Print the status of queries
    print(f"Create Vector Extension Status: {vector_extension_create_status}")
    print(f"Create Schema Status: {schema_create_status}")
    print(f"Create Table Status: {table_create_response}")

  4. Create a knowledge base in Amazon Bedrock pointing to the Aurora database and table:
    # Attached RDS related permissions to the Bedrock Knowledgebase role
    
    create_rds_policy_attach_bedrock_execution_role(db_cluster_arn, aurora_db_secret_arn, bedrock_kb_execution_role)
    
    # Define RDS Configuration for Knowledge bases
    rdsConfiguration = {
                'credentialsSecretArn': aurora_db_secret_arn,
                'databaseName': db_name,
                'fieldMapping': {
                    'metadataField': 'metadata',
                    'primaryKeyField': 'id',
                    'textField': 'chunks',
                    'vectorField': 'embedding'
                },
                'resourceArn': db_cluster_arn,
                'tableName': 'bedrock_integration.share_holder_letter_kb'
            }
    
    
    # The embedding model used by Bedrock to embed ingested documents, and realtime prompts
    embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v2:0"
    
    name = f"kb-aurora-shareholder-letter-{suffix}"
    description = "Amazon shareholder letter Aurora PG Vector knowledge base."
    roleArn = bedrock_kb_execution_role_arn
    
    # Create a KnowledgeBase
    from retrying import retry
    
    @retry(wait_random_min=1000, wait_random_max=2000,stop_max_attempt_number=7)
    def create_knowledge_base_func():
        create_rds_kb_response = bedrock_agent_client.create_knowledge_base(
            name = name,
            description = description,
            roleArn = roleArn,
            knowledgeBaseConfiguration = {
                "type": "VECTOR",
                "vectorKnowledgeBaseConfiguration": {
                    "embeddingModelArn": embeddingModelArn
                }
            },
            storageConfiguration = {
                "type": "RDS",
                "rdsConfiguration":rdsConfiguration
            }
        )
        return create_rds_kb_response["knowledgeBase"]
    
    try:
        rds_kb = create_knowledge_base_func()
    except Exception as err:
        print(f"{err=}, {type(err)=}")
        
    # Get KnowledgeBase 
    get_rds_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = rds_kb['knowledgeBaseId'])
    
    print(f'RDS Aurora Knowledge Response: {get_rds_kb_response}')

  5. Create a data source for the knowledge base:
    # Ingest strategy - How to ingest data from the data source
    chunkingStrategyConfiguration = {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": 512,
            "overlapPercentage": 20
        }
    }
    
    # The data source to ingest documents from, into the OpenSearch serverless knowledge base index
    s3Configuration = {
        "bucketArn": f"arn:aws:s3:::{data_s3_bucket}",
        "inclusionPrefixes": [f"{data_s3_prefix}"] # you can use this if you want to create a KB using data within s3 prefixes.
        }
    # Create a DataSource in KnowledgeBase 
    create_ds_response = bedrock_agent_client.create_data_source(
        name = f'{name}-{data_s3_bucket}',
        description = description,
        knowledgeBaseId = rds_kb['knowledgeBaseId'],
        dataSourceConfiguration = {
            "type": "S3",
            "s3Configuration":s3Configuration
        },
        vectorIngestionConfiguration = {
            "chunkingConfiguration": chunkingStrategyConfiguration
        }
    )
    ds = create_ds_response["dataSource"]
    
    ds

  6. Start an ingestion job for your knowledge base pointing to the Aurora pgvector table to generate vector embeddings for data in Amazon S3:
    ingest_jobs=[]
    # Start an ingestion job
    try:
        start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])
        job = start_job_response["ingestionJob"]
        print(f"job started successfullyn")
    
        while(job['status']!='COMPLETE' ):
            get_job_response = bedrock_agent_client.get_ingestion_job(
              knowledgeBaseId = kb['knowledgeBaseId'],
                dataSourceId = ds["dataSourceId"],
                ingestionJobId = job["ingestionJobId"]
            )
            job = get_job_response["ingestionJob"]
    
        time.sleep(30)
        print(f"job completed successfullyn")
    
    except Exception as e:
        print(f"Couldn't start job.n")
        print(e)

Integrate with MongoDB Atlas

MongoDB Atlas Vector Search, when integrated with Amazon Bedrock, can serve as a robust and scalable knowledge base to build generative AI applications and implement RAG workflows. By using the flexible document data model of MongoDB Atlas, organizations can represent and query complex knowledge entities and their relationships within Amazon Bedrock. The combination of MongoDB Atlas and Amazon Bedrock provides a powerful solution for building and maintaining a centralized knowledge repository.

To use MongoDB, you can create a cluster and vector search index. The native vector search capabilities embedded in an operational database simplify building sophisticated RAG implementations. MongoDB allows you to store, index, and query vector embeddings of your data without the need for a separate bolt-on vector database.

There are three pricing options available for MongoDB Atlas through AWS Marketplace: MongoDB Atlas (pay-as-you-go), MongoDB Atlas Enterprise, and MongoDB Atlas for Government. Refer to the MongoDB Atlas Vector Search documentation to set up a MongoDB vector database and add it to your knowledge base.

Integrate with Pinecone

Pinecone is a type of vector database from Pinecone Systems Inc. With Amazon Bedrock Knowledge Bases, you can integrate your enterprise data into Amazon Bedrock using Pinecone as the fully managed vector database to build generative AI applications. Pinecone is highly performant; it can speed through data in milliseconds. You can use its metadata filters and sparse-dense index support for top-notch relevance, achieving quick, accurate, and grounded results across diverse search tasks. Pinecone is enterprise ready; you can launch and scale your AI solution without needing to maintain infrastructure, monitor services, or troubleshoot algorithms. Pinecone adheres to the security and operational requirements of enterprises.

There are two pricing options available for Pinecone in AWS Marketplace: Pinecone Vector Database – Pay As You Go Pricing (serverless) and Pinecone Vector Database – Annual Commit (managed). Refer to the Pinecone documentation to set up a Pinecone vector database and add it to your knowledge base.

Integrate with Redis Enterprise Cloud

Redis Enterprise Cloud enables you to set up, manage, and scale a distributed in-memory data store or cache environment in the cloud to help applications meet low latency requirements. Vector search is one of the solution options available in Redis Enterprise Cloud, which solves for low latency use cases related to RAG, semantic caching, document search, and more. Amazon Bedrock natively integrates with Redis Enterprise Cloud vector search.

There are two pricing options available for Redis Enterprise Cloud through AWS Marketplace: Redis Cloud Pay As You Go Pricing and Redis Cloud – Annual Commits. Refer to the Redis Enterprise Cloud documentation to set up vector search and add it to your knowledge base.

Interact with Amazon Bedrock knowledge bases

Amazon Bedrock provides a common set of APIs to interact with knowledge bases:

  •  Retrieve API – Queries the knowledge base and retrieves information from it. This is a Bedrock Knowledge Base specific API, it helps with use cases where only vector-based searching of documents is needed without model inferences.
  • Retrieve and Generate API – Queries the knowledge base and uses an LLM to generate responses based on the retrieved results.

The following code snippets show how to use the Retrieve API from the OpenSearch Serverless vector database’s index and the Aurora pgvector table:

  1. Retrieve data from the OpenSearch Serverless vector database’s index:
    query = "What is Amazon's doing in the field of generative AI?"
    
    relevant_documents_os = bedrock_agent_runtime_client.retrieve(
        retrievalQuery= {
            'text': query
        },
        knowledgeBaseId=kb['knowledgeBaseId'],
        retrievalConfiguration= {
            'vectorSearchConfiguration': {
                'numberOfResults': 3 # will fetch top 3 documents which matches closely with the query.
            }
        }
    )
    relevant_documents_os["retrievalResults"]

  2. Retrieve data from the Aurora pgvector table:
    query = "What is Amazon's doing in the field of generative AI?"
    
    relevant_documents_rds = bedrock_agent_runtime_client.retrieve(
        retrievalQuery= {
            'text': query
        },
        knowledgeBaseId=rds_kb['knowledgeBaseId'],
        retrievalConfiguration= {
            'vectorSearchConfiguration': {
                'numberOfResults': 3 # will fetch top 3 documents which matches closely with the query.
            }
        }
    )
    
    relevant_documents_rds["retrievalResults"]

Clean up

When you’re done with this solution, clean up the resources you created:

  • Amazon Bedrock knowledge bases for OpenSearch Serverless and Aurora
  • OpenSearch Serverless collection
  • Aurora DB instance
  • S3 bucket
  • SageMaker Studio domain
  • Amazon Bedrock service role
  • SageMaker Studio domain role

Conclusion

In this post, we provided a high-level introduction to generative AI use cases and the use of RAG workflows to augment your organization’s internal or external knowledge stores. We discussed the importance of vector databases and RAG architectures to enable similarity search and why dense vector representations are beneficial. We also went over Amazon Bedrock Knowledge Bases, which provides common APIs, industry-leading governance, observability, and security to enable vector databases using different options like AWS native and partner products through AWS Marketplace. We also dived deep into a few of the vector database options with code examples to explain the implementation steps.

Try out the code examples in this post to implement your own RAG solution using Amazon Bedrock Knowledge Bases, and share your feedback and questions in the comments section.


About the Authors

Vishwa Gupta is a Senior Data Architect with AWS Professional Services. He helps customers implement generative AI, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new foods.

Isaac Privitera is a Principal Data Scientist with the AWS Generative AI Innovation Center, where he develops bespoke generative AI-based solutions to address customers’ business problems. His primary focus lies in building responsible AI systems, using techniques such as RAG, multi-agent systems, and model fine-tuning. When not immersed in the world of AI, Isaac can be found on the golf course, enjoying a football game, or hiking trails with his loyal canine companion, Barry.

Abhishek Madan is a Senior GenAI Strategist with the AWS Generative AI Innovation Center. He helps internal teams and customers in scaling generative AI, machine learning, and analytics solutions. Outside of work, he enjoys playing adventure sports and spending time with family.

Ginni Malik is a Senior Data & ML Engineer with AWS Professional Services. She assists customers by architecting enterprise data lake and ML solutions to scale their data analytics in the cloud.

Satish Sarapuri is a Sr. Data Architect, Data Lake at AWS. He helps enterprise-level customers build high-performance, highly available, cost-effective, resilient, and secure generative AI, data mesh, data lake, and analytics platform solutions on AWS through which customers can make data-driven decisions to gain impactful outcomes for their business, and helps them on their digital and data transformation journey. In his spare time, he enjoys spending time with his family and playing tennis.

Read More

Enable or disable ACL crawling safely in Amazon Q Business

Enable or disable ACL crawling safely in Amazon Q Business

Amazon Q Business recently added support for administrators to modify the default access control list (ACL) crawling feature for data source connectors.

Amazon Q Business is a fully managed, AI powered assistant with enterprise-grade security and privacy features. It includes over 40 data source connectors that crawl and index documents. By default, Amazon Q Business indexes ACL information attached to documents along with the documents themselves and uses this to filter chat responses based on the user’s document access. With this new feature, you can enable or disable ACL crawling as required by their business use case.

This post introduces the new ACL toggle feature for Amazon Q Business, which you can use to enable or disable ACL crawling. We’ll explore use cases for disabling ACLs and discuss how to safely enable or disable ACL crawling.

Overview of access control list crawling

Amazon Q Business data source connectors help crawl various data sources to collect and index content in Amazon Q Business for fast discovery and retrieval when answering user queries. These data sources often contain documents with different classifications such as public, internal public, private, and confidential. To provide fine-grained control over access rights, you can attach ACLs to documents, allowing you to specify different levels of access for various users or groups. To verify that Amazon Q Business respects access control policies and that users only receive responses for content they’re authorized to access, the data source connectors automatically crawl for access permissions associated with the content, user identifiers, and groups.

The preceding figure illustrates the Amazon Q Business data source crawler with ACL crawling enabled. As the connector retrieves content from the data source, it examines the associated ACL and compiles a list of users and groups with read permissions for each document. The connector also collects user identifiers, which are stored in the Amazon Q Business user store for quick matching during query execution. Both the ACL and content are optimized and stored in the Amazon Q Business index storage, enabling secure and efficient retrieval when answering user queries. For more information on the user store, see Understanding Amazon Q Business User Store.

When to disable ACL crawling?

ACL crawling builds a security-aware index that respects access control policies in the primary data source. This process helps maintain data privacy and access control required for regulatory compliance, making sure that sensitive information isn’t inadvertently exposed through user query results. It provides a scalable mechanism to handle large amounts of content while maintaining consistency between the actual access controls on the data and what’s discoverable through search. Because of these advantages, ACL crawling is strongly recommended for all data sources. However, there are some circumstances when you might need to disable it. The following are some reasons why you might disable ACL crawling.

Internally public content

Organizations often designate certain data sources as internally public, including HR policies, IT knowledge bases, and wiki pages. For instance, a company might allocate an entire Microsoft SharePoint site for policies accessible to all employees, classifying it as internal-public. In such cases, crawling ACLs for permissions that include all employees can be costly and create unnecessary overhead. Turning off ACL crawling might be advantageous in these scenarios.

Data source contains irreconcilable identities

Amazon Q Business requires all users to authenticate with an enterprise-approved identity provider (IdP). After successful authentication, Amazon Q Business uses the IdP-provided user identifier to match against the user identifier fetched from the data source during ACL crawling. This process validates user access to content before retrieving it for query responses.

However, because of legacy issues such as mergers and acquisitions, data source configuration limitations, or other constraints, the primary user identifier from the IdP might differ from the one in the data source. This discrepancy can prevent Amazon Q Business from retrieving relevant content from the index and answering user queries effectively.

In such cases, it might be necessary to disable ACL crawling and use alternative options. These include implementing attribute filters or building dedicated restricted applications with access limited to specific audiences and content. For more information on attribute filters, see Filtering chat responses using document attributes.

Use case-driven targeted deployments

As a fully managed service, Amazon Q Business can be quickly deployed in multiple instances for scoped down targeted use cases. Examples include an HR bot in Slack or an AI assistant for customer support agents in a contact center. Because these AI assistants might be deployed for a limited audience, and the indexed content might be generally available to all users with application access, ACL crawling can be turned off.

Note of caution

Amazon Q Business cannot enforce access controls if ACL crawling is disabled. When ACL crawling is disabled for a data source, indexed content in that source will be considered accessible to users with access to the Amazon Q Business application. Therefore, disabling ACL crawling should be done with caution and due diligence. The following are some recommended best practices:

  • Notify data source content owners and administrators of your intent to disable ACL crawling and obtain their approval beforehand.
  • If applicable, consider implementing alternative options such as attribute filtering to restrict content retrieval or deploying a scoped-down, use-case-driven deployment to a limited audience.
  • Maintain a decision document that clearly articulates the reasons for disabling ACL crawling, the scope of affected content, and precautions taken to prevent indexing of sensitive information.

Note: As a precaution, you cannot disable ACL crawling for an existing Amazon Q Business data source that already has ACL crawling enabled. To disable ACL crawling, you must delete the data source and recreate it. You can only disable ACL crawling during the data source creation process, and this requires an account administrator to grant permission for disabling ACL crawling when configuring the data source.

Procedures for configuring ACL crawling

Amazon Q Business ACL crawling helps protect your data. Amazon Q Business provides safeguards to help administrators and developers mitigate accidentally disabling ACL crawling. In this section, we will cover how you can allow or deny the ACL crawling disable feature, explore procedures to enable or disable ACL crawling, explain how to monitor logs for ACL crawling configuration changes, and troubleshoot common issues.

Personas for configuring ACL crawling

ACL crawling configuration typically involves multiple roles, depending on your organizational structure. To maximize safeguards, it’s recommended that these roles are filled by different individuals. For faster deployments, identify the necessary personnel within your organization before starting the project and ensure they collaborate to complete the configuration. Here are the common roles needed for ACL crawling configuration:

  1. AWS account administrator – An AWS account administrator is a user with full access to AWS services and the ability to manage IAM resources and permissions in the account. They can create and manage organizations, enabling centralized management of multiple AWS accounts.
  2. Amazon Q Business administrator – An Amazon Q Business administrator is typically a user or role responsible for managing and configuring the Amazon Q Business service. Their duties include creating and optimizing Amazon Q Business indexes, setting up guardrails, and tuning relevance. They also set up and maintain connections to various data sources that Amazon Q Business will index, such as Amazon Simple Storage Service (Amazon S3) buckets, SharePoint, Salesforce, and Confluence.

Prerequisites for ACL crawling

Process to disallow the option to disable ACL crawling

By default, the option to disable ACL crawling is enabled for an account. AWS account administrators can disallow this feature by setting up an account-level policy. It’s recommended to configure an explicit deny for production accounts by default. The following below shows the associated actions in relation to the personas involved in the configuration process.

Administrators can attach the IAM action qbusiness:DisableAclOnDataSource to the Amazon Q Business administrator user or role policy to deny or allow the option to disable ACL crawling. The example IAM policy code snippet that follows demonstrates how to set up an explicit deny.

{
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Deny",
          "Action": [
                "qbusiness:DisableAclOnDataSource"
            ],
          "Resource": ["*"]
       }
    ]
}

Note that even if the option to disable ACL crawling is denied, the user interface might not gray out this option. However, if you attempt to create a data source with this option disabled, it will fail the validation check, and Amazon Q Business will not create the data source.

Process to disable ACL crawling for a data source connector

Before setting up a data source connector with ACL crawling disabled in your Amazon Q Business application deployment, make sure that you have no sensitive content in the data source or have implemented controls to help prevent accidental content exposure. Verify that the data source connector supports the option to disable ACL crawling. Notify information custodians, content owners, and data source administrators of your intent to disable ACL crawling and obtain their documented approvals, if necessary. If your account administrator has explicitly denied the option to disable ACL crawling, request temporary permission. After you have secured all approvals and exceptions, create a new data source with ACL crawling disabled and sync the data. With ACL crawling disabled, Amazon Q Business users will be able to discover knowledge and obtain answers from the indexed documents through this connector. Notify the account administrator to revert the account policy back to explicitly denying the disable ACL crawling option. The process and interaction between different roles are shown in the following chart.

The following is an overview of the procedure to create a data source with ACL crawling disabled using AWS Console:

  1. Navigate to the Amazon Q Business console.
  2. Select the Amazon Q Business application that you want to add a data source connector to.
  3. Choose Add data source in the Data sources section and select the desired connector.
  4. Update the connector configuration information. See Connecting Amazon Q Business data sources for configuration details.
  5. In the Authorization section, choose Disable ACLs and check the acknowledgment to accept the risks of disabling ACL crawling.
  6. Complete the remaining connector configuration and choose Save.
  7. Sync the data source.

Note: You cannot disable ACL crawling for an existing data source connector that was created with ACL crawling enabled. You must create a new data source connector instance with ACL disabled and delete the older instance that has ACL crawling enabled.

Process to enable ACL crawling for a data source connector

Creating a data source connector with ACL crawling enabled is recommended and doesn’t require additional allow listing from AWS account administrators. To enable ACL crawling, you follow steps similar to disabling ACLs as described in the previous section. When configuring the data source connector using the console, choose Enable ACLs in the Authorization section to create a connector with ACL crawling enabled. You can also enable ACL crawling at any time for an existing data source connector that was created with this option disabled. Sync the data source connector for the ACL enforcement to take effect. Amazon Q Business users can only query and obtain answers from documents to which they have access in the original data source.

It’s important to review that the data source administrator has set up the required permissions properly, making sure that the crawler has permission to crawl for ACLs in the data source before enabling ACL crawling. You can find the required permissions in the prerequisite section of the connector in Connecting Amazon Q Business data sources. The following shows the process for setting up a data source connector with ACL crawling enabled.

Logging and monitoring the ACL crawling configuration

Amazon Q Business uses AWS CloudTrail for logging API calls related to ACL crawling configuration. You can monitor the CloudTrail log for CreateDataSource and UpdateDataSource API calls to identify ACL crawling-related changes made to data source configuration. For a complete list of Amazon Q Business APIs that are logged to CloudTrail, see Logging Amazon Q Business API calls using AWS CloudTrail.

Administrators can configure Amazon CloudWatch alarms to generate automated alert notifications if ACL crawling is disabled for a data source connector, allowing them to initiate corrective action. For step-by-step instructions on setting up CloudWatch alarms based on CloudTrail events, see How do I use CloudWatch alarms to monitor CloudTrail events.

The example CloudWatch alarm code snippet that follows shows the filter pattern for identifying events related to disabling ACL crawling in a data source connector.

{
    ($.eventSource = qbusiness.amazonaws.com)
    && (
        ($.eventName = CreateDataSource)
        || ($.eventName = UpdateDataSource)
    )
    && ($.requestParameters.disableAclCrawl = true) 
}

Tips for troubleshooting

When configuring Amazon Q Business data source connectors, you might occasionally encounter issues. The following are some common errors and their possible resolutions.

Not authorized to disable ACL crawling

When creating a new data source connector with ACL crawling disabled, you might see an error message stating not authorized to perform: qbusiness:DisableAclOnDataSource as shown in the following image.

This error indicates that your administrator has explicitly denied the option to disable ACL crawling for your AWS account. Contact your administrator to allow-list this action for your account. For more details, see the Process to disable ACL crawling for a data source connector section earlier in this post.

Data source connection errors

Data source connectors might also fail to connect to your data source or crawl data. In such cases, verify that Amazon Q Business can reach the data source through the public internet or through a VPC private network. See Connecting Amazon Q Business data sources to make sure that your data source authentication has the permissions needed to crawl content and ACLs, if enabled.

Identity and ACL mismatch errors

Finally, after successfully syncing data with ACL crawling enabled, some users might still be unable to get answers to queries, even though the relevant documents were indexed. This issue commonly occurs when the user lacks access to the indexed content in the original data source, or when the user identity obtained from the data source doesn’t match the sign-in identity. To troubleshoot such ACL mismatch issues, examine the data source sync report. For more information, see Introducing document-level sync reports: Enhanced data sync visibility in Amazon Q Business.

Key considerations and recommendations

Given the impact that disabling ACL crawling can have on content security, consider these restrictions and best practices when disabling ACL crawling in Amazon Q Business data source connectors:

  • ACL crawling enablement is a one-way control mechanism. After it’s enabled, you cannot disable it. This helps prevent accidentally disabling ACL crawling in production environments.
  • Keep ACL crawling enabled by default and disable it only for the subset of data source connectors that require it.
  • If necessary, consider splitting the indexing of a data source by setting up multiple data source connectors and limiting ACL crawling disablement to a smaller content segment. Use the document Inclusion and Exclusion feature of data source connectors to define the indexing scope.
  • When ACL crawling is disabled because of irreconcilable identities, consider alternative options. These include implementing attribute filters, restricting access to the Amazon Q Business application, and setting up guardrails.
  • As a security best practice, AWS Organizations and account administrators should add a service control policy to explicitly deny the qbusiness:DisableAclOnDataSource permission for all accounts. Grant this permission only when requested by an Amazon Q Business administrator. After configuring a data source connector with ACL crawling disabled, revert to an explicit deny. Use a ticketing system to maintain a record of exception approvals. For more information, see <link>.
  • Currently, disabling ACL crawling is available for limited connectors, including ServiceNow, Confluence, SharePoint, Jira, Google Drive, OneDrive, Salesforce, Zendesk, GitHub, MS Teams, and Slack. For the latest list of connectors that support disabling ACL crawling, see Connecting Amazon Q Business data sources.

Clean up

To avoid incurring additional charges, make sure you delete any resources created in this post.

  1. To delete any data source created in Amazon Q Business, follow the instructions in Deleting an Amazon Q Business data source connector to delete the same.
  2. To delete any Amazon Q Business application created, follow the instructions in Deleting an application.

Conclusion

Amazon Q Business data source connector ACL crawling is an essential feature that helps organizations build, manage, and scale secure AI assistants. It plays a crucial role in enforcing regulatory and compliance policies and protecting sensitive content. With the introduction of a self-service feature to disable ACL crawling, Amazon Q Business now provides you more autonomy to choose deployment options that suit your organization’s business needs. To start building secure AI assistants with Amazon Q Business, explore the Getting started guide.


About the Authors

Rajesh Kumar Ravi, a Senior Solutions Architect at Amazon Web Services, specializes in building generative AI solutions using Amazon Q Business, Amazon Bedrock, and Amazon Kendra. He helps businesses worldwide implement these technologies to enhance efficiency, innovation, and competitiveness. An accomplished technology leader, Rajesh has experience developing innovative AI products, nurturing the builder community, and contributing to new ideas. Outside of work, he enjoys walking and short hiking trips.

Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker and strives to drive business to new ways of working through innovation, incubation, and democratization.

Amit Choudhary is a Product Manager for Amazon Q Business connectors. He loves to build products that make it easy for customers to use privacy-preserving technologies (PETs) such as differential privacy

Keerthi Kumar Kallur is a Software Development Engineer at AWS. He is part of the Amazon Q Business team and worked on various features with customers. In his spare time, he likes to do outdoor activities such as hiking and sports such as volleyball.

Read More

SK Telecom improves telco-specific Q&A by fine-tuning Anthropic’s Claude models in Amazon Bedrock

SK Telecom improves telco-specific Q&A by fine-tuning Anthropic’s Claude models in Amazon Bedrock

SK Telecom (SKT), South Korea’s leading telecommunications company serving 30 million customers, is at the forefront of AI innovation. In line with its AI Pyramid Strategy, which aims to unlock AI’s potential for anyone, anywhere, anytime, SKT has collaborated with the AWS Generative AI Innovation Center (GenAIIC) Custom Model Program to explore domain-trained models using Amazon Bedrock for telco-specific use cases.

This collaboration aligns with SKT’s vision of using AI expertise and strategic partnerships to develop innovative AI-based products and services. One such initiative focused on developing a custom solution for grounded question answering (Q&A) based on reference documents.

Retrieval Augmented Generation (RAG) is a popular technique for Q&A tasks, offering improved factual accuracy and knowledge grounding. However, RAG faces challenges with generating a response not matching preferred tone, style, and manners for telco use cases, as well as retrieving irrelevant documents, potentially leading to inaccurate responses. To address this, SKT and AWS GenAIIC aimed to use model customization to improve Anthropic Claude models on Amazon Bedrock in three key areas:

  • Providing concise and informative answers
  • Correctly referencing links from retrieved documents
  • Answering in a tone and style consistent with SKT and similar to ground truth answers

Additionally, the team explored boosting smaller model performance using synthetic data generated by bigger large language models (LLMs) for knowledge distillation and scenarios with limited labeled training data.

Amazon Bedrock is a fully managed service that offers a variety of LLMs and foundation models (FMs) along with capabilities such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Guardrails that can expedite many generative AI use cases. Amazon Bedrock is the only fully managed service that provides you with the ability to fine-tune Claude models. Amazon Bedrock offers an intuitive and secure way of fine-tuning Anthropic’s Claude models and more. The fine-tuned Claude model can be deployed using Amazon Bedrock and can use the capabilities of Amazon Bedrock seamlessly, for example, Amazon Bedrock Knowledge Bases for the telco domain-specific RAG or Amazon Bedrock Agents for the agentic usage.

In this post, we share how SKT customizes Anthropic Claude models for telco-specific Q&A regarding technical telecommunication documents of SKT using Amazon Bedrock.

Solution overview

The team explored combinations of prompt optimization, customization (fine-tuning), and data augmentation with synthetic data. This multifaceted approach aimed to maximize the benefits of each technique for the grounded Q&A generation task.

In the following sections, we explore these methods in more detail.

Anthropic’s Claude customization with prompt optimization

Fine-tuning, which is available through Amazon Bedrock for various FMs, including Anthropic’s Claude, allows adaptation of pre-trained language models for specific use cases. It’s particularly effective for tailoring response style and format adherence.

The team first optimized the system prompt, implementing standardized guidelines for answer formatting and document citation based on Anthropic model prompting best practices. Key focus areas included:

  • Clear presentation of system commands
  • Consistent use of code block formatting
  • Context-based tailored responses

This prompt engineering, combined with fine-tuning, yielded substantial improvements:

  • Over 50% increase in ROUGE-3 score
  • Over 25% improvement in ROUGE-L score
  • Over 4% increase in embedding similarity score
  • Significant progress in accurate reference citation

The iterative enhancement process demonstrated cumulative benefits, with prompt updates alone showing 35–40 percent improvements in key metrics, and the final customized model achieving 50–60 percent gains in some metrics.

This progression clearly illustrates the cumulative benefits of model customization through RAG, prompt engineering, and fine-tuning, resulting in a model that significantly outperformed both the baseline and the prompt-updated versions in terms of ROUGE scores and citation accuracy. ROUGE score measures the similarity between ground truths and generated results by computing N-gram word overlaps. The following table summarizes these improvements.

LLM Prompt update Fine-tuning Relative improvement over baseline
ROUGE-3 ROUGE-L Citation accuracy
Anthropic’s Claude 3 Sonnet baseline baseline baseline
Anthropic’s Claude 3 Sonnet ✅ +38.30% +13.4% +52.94%
Anthropic’s Claude 3 Sonnet ✅ ✅ +58.1% +26.8% +70.59%

Synthetic data for fine-tuning

To address the challenge of limited high-quality labeled training data, the team explored synthetic data generation techniques. This approach also facilitates knowledge distillation from larger LLMs to smaller, more targeted models, offering benefits such as lower latency and cost.

The team conducted controlled experiments using:

  • A baseline set of 500 ground truth samples
  • An augmented set with 500 original over 1,500 synthetic samples
  • A larger original set of 2,000 samples

Synthetic data was generated using Anthropic’s Claude Sonnet 3, creating new question-answer pairs over the same retrieved documents used in ground truth examples.

The results were evaluated using both LLM-based comparison and human preference evaluation. Human evaluators blindly ranked model outputs, with scores assigned based on preference (Best: 4, Second: 3, Third: 2, Worst: 1). The following table shows the results of the human preference evaluation scores.

Rank Model Cumulative score
(best possible: 160)
1 Fine-tuned with 2,000 original samples 114
2 Fine-tuned with 500 original and 1,500 synthetic samples 112
3 Fine-tuned with 500 original samples 85
4 No fine-tuning (baseline) 84

Some key findings include:

  • Small training sets (500 samples) showed minimal improvement over baseline
  • Larger training sets (2,000 samples) scored considerably higher
  • Synthetically augmented data performed similarly to equivalent-sized original data

Although having a large volume of domain-specific training data is always ideal, many businesses have limited available datasets. In such scenarios, synthetic data can play a crucial role in place of original data. This demonstrates the potential of synthetic data for model customization.

Conclusion

SK Telecom’s collaboration with AWS GenAIIC showcases the company’s commitment to developing innovative AI solutions for telco challenges. By using Amazon Bedrock to customize Anthropic’s Claude models, SKT has achieved significant performance improvements for telco-specific, Korean language use cases without the need to build models from scratch. The proof of concept demonstrated significant improvements:

  • ~58% increase in ROUGE-3 score
  • ~27% increase in ROUGE-L score
  • Substantial improvement in returning correct reference links

This approach, combined with synthetic data generation techniques, aligns with SKT’s AI Pyramid Strategy, enabling faster testing and development of new approaches. As SKT continues to focus on key areas such as personal AI assistants, AI healthcare, and AI data centers, this collaboration with AWS represents a significant step in their AI evolution and long-term competitiveness in the global AI landscape.

For those interested in working with AWS on similar projects, visit Generative AI Innovation Center.


About the Authors

Sungmin Hong is a Senior Applied Scientist at AWS Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.

Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds a M.S. degree in Data Science from New York University.

Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.

Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she supports providing generative AI solutions to AWS customers. In this role, she collaborates with a team of experts to develop innovative AI-driven models for AWS customers across various industries. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research focused on advanced machine learning and deep learning techniques.

Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle—including fine-tuning, benchmarking, and load-testing—that accelerating the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating Generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Seunghyeon Jeong (Steve) is a team leader of the Platform Application team at SKT. He is responsible for commercializing the Global Intelligence Platform (GIP), which provides AI models and tools. For most of his career, he has been a PM developing various mobile services such as mobile wallet, fashion streaming, and unified login services for SK. His team is expanding the delivery of models and features to make it easier for internal teams to apply AI, contributing to SKT’s AI Transformation. Before entering the AI space, he was a Product Manager, developing and operating various mobile services such as mobile wallet, fashion streaming, and unified login services for the US and Korea.

Sunwoo Lee (Lois) is the team leader of the Data Construction and Evaluation Team within SK Telecom’s Global AI Tech division. She oversees the design and construction of training data for language models, the model performance evaluation process, and its application to services. Her career has focused on NLP within IT, which is a great fit with her background in Linguistics and Korean language education. Alongside her world-class team, she continues to explore and solve fascinating problems such as how to optimize the design of data for language model training, which tasks and methods to implement for validating language model performance, and the best design of AI-human conversations.

Eric Davis is the vice president of the AI Tech Collaboration Group at SKT. Eric oversees tech collaborations with worldwide tech partners to customize large language models (LLMs) for the telecommunications domain. His teams are responsible for designing and building the datasets to tune LLMs, as well as benchmarking LLMs in general and for the telecommunications domain. Eric holds a Master of Science degree in Computer Science from Carnegie Mellon from the Language Technologies Institute and a Bachelor of Arts in Linguistics and Psychology from the University of California, Los Angeles.

Read More

Scaling Rufus, the Amazon generative AI-powered conversational shopping assistant with over 80,000 AWS Inferentia and AWS Trainium chips, for Prime Day

Scaling Rufus, the Amazon generative AI-powered conversational shopping assistant with over 80,000 AWS Inferentia and AWS Trainium chips, for Prime Day

Amazon Rufus is a shopping assistant experience powered by generative AI. It generates answers using relevant information from across Amazon and the web to help Amazon customers make better, more informed shopping decisions. With Rufus, customers can shop alongside a generative AI-powered expert that knows Amazon’s selection inside and out, and can bring it all together with information from across the web to help shoppers make more informed purchase decisions.

To meet the needs of Amazon customers at scale, Rufus required a low-cost, performant, and highly available infrastructure for inference. The solution needed the capability to serve multi-billion parameter large language models (LLMs) with low latency across the world to service its expansive customer base. Low latency makes sure users have a positive experience chatting with Rufus and can start getting responses in less than a second. To achieve this, the Rufus team is using multiple AWS services and AWS AI chips, AWS Trainium and AWS Inferentia.

Inferentia and Trainium are purpose-built chips developed by AWS that accelerate deep learning workloads with high performance and lower overall costs. With these chips, Rufus reduced its costs by 4.5 times lower than other evaluated solutions while maintaining low latency for its customers. In this post, we dive into the Rufus inference deployment using AWS chips and how this enabled one of the most demanding events of the year—Amazon Prime Day.

Solution overview

At its core, Rufus is powered by an LLM trained on Amazon’s product catalog and information from across the web. LLM deployment can be challenging, requiring you to balance factors such as model size, model accuracy, and inference performance. Larger models generally have better knowledge and reasoning capabilities but come at a higher cost due to more demanding compute requirements and increasing latency. Rufus would need to be deployed and scale to meet the tremendous demand of peak events like Amazon Prime Day. Considerations for this scale include how well it needs to perform, its environmental impact, and the cost of hosting the solution. To meet these challenges, Rufus used a combination of AWS solutions: Inferentia2 and Trainium, Amazon Elastic Container Service (Amazon ECS), and Application Load Balancer (ALB). In addition, the Rufus team partnered with NVIDIA to power the solution using NVIDIA’s Triton Inference Server, providing capabilities to host the model using AWS chips.

Rufus inference is a Retrieval Augmented Generation (RAG) system with responses enhanced by retrieving additional information such as product information from Amazon search results. These results are based on the customer query, making sure the LLM generates reliable, high-quality, and precise responses.

To make sure Rufus was best positioned for Prime Day, the Rufus team built a heterogeneous inference system using multiple AWS Regions powered by Inferentia2 and Trainium. Building a system across multiple Regions allowed Rufus to benefit in two key areas. First, it provided additional capacity that could be used during times of high demand, and second, it improved the overall resiliency of the system.

The Rufus team was also able to use both Inf2 and Trn1 instance types. Because Inf2 and Trn1 instance types use the same AWS Neuron SDK, the Rufus team was able to use both instances to serve the same Rufus model. The only configuration setting to adjust was the tensor parallelism degree (24 for Inf2, 32 for Trn1). Using Trn1 instances also led to an additional 20% latency reduction and throughput improvement compared to Inf2.

The following diagram illustrates the solution architecture.

To support real-time traffic routing across multiple Regions, Rufus built a novel traffic orchestrator. Amazon CloudWatch supported the underlying monitoring, helping the team adjust the traffic ratio across the different Regions in less than 15 minutes based on the traffic pattern changes. By using this type of orchestration, the Rufus team had the ability to direct requests to other Regions when needed, with a small trade-off of latency to the first token. Due to Rufus’s streaming architecture and the performant AWS network between Regions, the perceived latency was minimal for end-users.

These choices allowed Rufus to scale up over 80,000 Trainium and Inferentia chips across three Regions serving an average of 3 million tokens a minute while maintaining P99 less than 1 second latency to the first response for Prime Day customers. In addition, by using these purpose-built chips, Rufus achieved 54% better performance per watt than other evaluated solutions, which helped the Rufus team meet energy efficiency goals.

Optimizing inference performance and host utilization

Within each Region, the Rufus inference system used Amazon ECS, which managed the underlying Inferentia and Trainium powered instances. By managing the underlying infrastructure, the Rufus team only needed to bring their container and configuration by defining an ECS task. Within each container, an NVIDIA Triton Inference Server with a Python backend is used running vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine that is optimized for high throughput. The Neuron SDK makes it straightforward for teams to adopt AWS chips and supports many different libraries and frameworks such as PyTorch Lightning.

The Neuron SDK provides a straightforward LLM inference solution on Trainium and Inferentia hardware with optimized performance supporting a wide range of transformer-based LLM architectures. To reduce latency, Rufus has collaborated with the AWS Annapurna team to develop various optimizations such as INT8 (weight only) quantization, continuous batching with vLLM, resource, compute, and memory bandwidth in the Neuron compiler and runtime. These optimizations are currently deployed in Rufus production and are available to use in the Neuron SDK 2.18 and onward.

To reduce overall waiting time for customers to start seeing a response from Rufus, the team also developed an inference streaming architecture. With the high compute and memory load needed for LLM inference, the total time it takes to finish generating the full response for a customer query can take multiple seconds. With a streaming architecture, Rufus is able to return the tokens right after they’re generated. This optimization allows the customer to start consuming the response in less than 1 second. In addition, multiple services work together using gRPC connections to intelligently aggregate and enhance the streaming response in real time for customers.

As shown in the following figure, images and links are embedded in the response, which allow customers to engage and continue exploring with Rufus.

Scaling up

Although we have to maintain low latency for the best customer experience, it’s also crucial to scale the service throughput by achieving high hardware resource utilization. High hardware utilization makes sure accelerators don’t sit idle and needlessly increase costs. To optimize the inference system throughput, the team improved both single-host throughput as well as load balancing efficiency.

Load balancing for LLM inference is tricky due to following challenges. First, a single host can only handle a limited number of concurrent requests. Second, the end-to-end latency to complete one request can vary, spanning many seconds depending on the LLM response length.

To address the challenges, the team optimized throughput by considering both single-host throughput and throughput across many hosts using load balancing.

The team used the least outstanding requests (LOR) routing algorithm from ALB, increasing throughput by five times faster in comparison to an earlier baseline measurement. This allows each host to have enough time to process in-flight requests and stream back responses using a gRPC connection, without getting overwhelmed by multiple requests received at the same time. Rufus also collaborated with AWS and vLLM teams to improve single-host concurrency using vLLM integration with the Neuron SDK and NVIDIA Triton Inference Server.

Figure 1. ECS tasks scale horizontally hosting the Triton Inference Server and dependencies

Figure 1. ECS tasks scale horizontally hosting the Triton Inference Server and dependencies

With this integration, Rufus was able to benefit from a critical optimization: continuous batching. Continuous batching allows a single host to greatly increase throughput. In addition, continuous batching provides unique capabilities in comparison to other batch techniques, such as static batching. For example, when using static batching, the time to first token (TTFT) increases linearly with the number of requests in one batch. Continuous batching prioritizes the prefill stage for LLM inference, keeping TTFT under control even with more requests running at the same time. This helped Rufus provide a pleasant experience with low latency when generating the first response, and improve the single-host throughput to keep serving costs under control.

Conclusion

In this post, we discussed how Rufus is able to reliably deploy and serve its multi-billion-parameter LLM using the Neuron SDK with Inferentia2 and Trainium chips and AWS services. Rufus continues to evolve with advancements in generative AI and customer feedback and we encourage you to use Inferentia and Trainium.

Learn more about how we are innovating with generative AI across Amazon.


About the author

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

RJ is an Engineer within Amazon. He builds and optimizes systems for distributed systems for training and works on optimizing adopting systems to reduce latency for ML Inference. Outside work, he is exploring using Generative AI for building food recipes.

Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.

Adam (Hongshen) Zhao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam is leading Rufus Inference team to build GenAI inference optimization solutions and inference system at scale for fast inference at low cost. Outside work, he enjoys traveling with his wife and art creations.

Faqin Zhong is a software engineer at Amazon Stores Foundational AI, working on Large Language Model (LLM) inference infrastructure and optimizations. Passionate about Generative AI technology, Faqin collaborates with leading teams to drive innovations, making LLMs more accessible and impactful, ultimately enhancing customer experiences across diverse applications. Outside of work she enjoys cardio exercise and baking with her son.

Nicolas Trown is an engineer in Amazon Stores Foundational AI. His recent focus is lending his systems expertise across Rufus to aid Rufus Inference team and efficient utilization across the Rufus experience. Outside of work he enjoys spending time with his wife and day trips to nearby coast, Napa, and Sonoma areas.

Bing Yin is a director of science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.

Read More

Exploring alternatives and seamlessly migrating data from Amazon Lookout for Vision

Exploring alternatives and seamlessly migrating data from Amazon Lookout for Vision

Amazon Lookout for Vision, the AWS service designed to create customized artificial intelligence and machine learning (AI/ML) computer vision models for automated quality inspection, will be discontinuing on October 31, 2025. New customers will not be able to access the service effective October 10, 2024, but existing customers will be able to use the service as normal until October 31, 2025. AWS will continue to support the service with security updates, bug fixes, and availability enhancements, but we do not plan to introduce new features for this service.

This post discusses some alternatives to Lookout for Vision and how you can export your data from Lookout for Vision to migrate to an alternate solution.

Alternatives to Lookout for Vision

If you’re interested in an alternative to Lookout for Vision, AWS has options for both buyers and builders.

For an out-of-the-box solution, the AWS Partner Network offers solutions from multiple partners. You can browse solutions on the Computer Vision for Quality Insights page in the AWS Solutions Library. These partner solutions include options for software, software as a service (SaaS) applications, managed solutions or custom implementations based on your needs. This approach provides a solution that addresses your use case without requiring you to have expertise in imaging, computer vision, AI, or application development. This typically provides the fastest time to value by taking advantage of the specialized expertise of the AWS Partners. The Solutions Library also has additional guidance to help you build solutions faster.

If you prefer to build your own solution, AWS offers AI tools and services to help you develop an AI-based computer vision inspection solution. Amazon SageMaker provides a set of tools to build, train, and deploy ML models for your use case with fully managed infrastructure, tools, and workflows. In addition to SageMaker enabling you to build your own models, Amazon SageMaker JumpStart offers built-in computer vision algorithms and pre-trained defect detection models that can be fine-tuned to your specific use case. This approach provides you the tools to accelerate your AI development while providing complete flexibility to build a solution that meets your exact requirements and integrates with your existing hardware and software infrastructure. This typically provides the lowest operating costs for a solution.

AWS also offers Amazon Bedrock, a fully managed service that offers a choice of high-performing generative AI foundation models (FMs), including models that can help build a defect detection model running in the cloud. This approach enables you to build a custom solution while using the power of generative AI to handle the custom computer vision model creation and some of the code generation to speed development, eliminating the need for full AI computer vision expertise. Amazon Bedrock provides the ability to analyze images for defects, compare performance of different models, and generate code for custom applications. This alternative is useful for use cases that don’t require low latency processing, providing faster time to value and lower development costs.

Migrating data from Lookout for Vision

To move existing data from Lookout for Vision to use in an alternative implementation, the Lookout for Vision SDK provides the capability to export a dataset from the service to an Amazon Simple Storage Service (Amazon S3) bucket. This procedure exports the training dataset, including manifest and dataset images, for a project to a destination Amazon S3 location that you specify. With the exported dataset and manifest file, you can use the same data that you used to create a Lookout for Vision model to create a model using SageMaker or Amazon Bedrock, or provide it to a partner to incorporate into their customizations for your use case.

Summary

Although Lookout for Vision is planned to shut down on October 31, 2025, AWS offers a powerful set of AI/ML services and solutions in the form of SageMaker tools to build custom models and generative AI with Amazon Bedrock to do customized inspection and generate code, in addition to a range of offerings from partners in the AWS Partner Network. Export tools enable you to effortlessly move your data from Lookout for Vision to an alternate solution if you so choose. You should explore these options to determine what works best for your specific needs.

For more details, refer to the following resources:


About the Author

Tim Westman is the Product Manager and Go-to-Market Lead for Edge Machine Learning, AWS. Tim leads the Product Management and Business Development for the Edge Machine Learning business at Amazon Web Services. In this role, he works with customers to help build computer vision solutions at the edge to solve complex operational challenges. Tim has more than 30 years of experience in sales, business development and product management roles for leading hardware and software companies, with the last 8 years specializing in AI and computer vision for IoT applications.

Read More