Manage Amazon SageMaker JumpStart foundation model access with private hubs

Amazon SageMaker JumpStart is a machine learning (ML) hub offering pre-trained models and pre-built solutions. It provides access to hundreds of foundation models (FMs). A private hub is a SageMaker JumpStart feature that lets an organization share its models and notebooks, centralizing model artifacts, facilitating discoverability, and increasing reuse within the organization. With new models released daily, many enterprise admins want more control over the FMs that can be discovered and used by users within their organization (for example, only allowing models based on the PyTorch framework to be discovered).

Now enterprise admins can effortlessly configure granular access control over the FMs that SageMaker JumpStart provides out of the box so that users within their organizations can access only allowed models. In this post, we discuss the steps required for an administrator to configure granular access control of models in SageMaker JumpStart using a private hub, as well as the steps for users to access and consume models from the private hub.

Solution overview

Starting today, with SageMaker JumpStart and its private hub feature, administrators can create repositories for a subset of models tailored to different teams, use cases, or license requirements using the Amazon SageMaker Python SDK. Admins can also set up multiple private hubs with different lists of models discoverable for different groups of users. Users are then only able to discover and use models within the private hubs they have access to through Amazon SageMaker Studio and the SDK. This level of control empowers enterprises to consume the latest in open-weight generative artificial intelligence (AI) development while enforcing governance guardrails. Finally, admins can share access to private hubs across multiple AWS accounts, enabling collaborative model management while maintaining centralized control. SageMaker JumpStart uses AWS Resource Access Manager (AWS RAM) to securely share private hubs with other accounts in the same organization. As of this writing, the feature is available in the us-east-2 AWS Region, with more Regions to follow.

The following diagram shows an example architecture of SageMaker JumpStart with its public and private hub features. The diagram illustrates how SageMaker JumpStart provides access to different model repositories, with some users accessing the public SageMaker JumpStart hub and others using private curated hubs.

In the following section, we demonstrate how admins can configure granular access control of models in SageMaker JumpStart using a private hub. Then we show how users can access and consume allowlisted models in the private hub using SageMaker Studio and the SageMaker Python SDK. Finally, we look at how an admin user can share the private hub with users in another account.

Prerequisites

To use the SageMaker Python SDK and run the code associated with this post, you need the following prerequisites:

  • An AWS account that contains all your AWS resources
  • An AWS Identity and Access Management (IAM) role with access to SageMaker Studio notebooks
  • SageMaker JumpStart enabled in a SageMaker Studio domain

Create a private hub, curate models, and configure access control (admins)

This section provides a step-by-step guide for administrators to create a private hub, curate models, and configure access control for your organization’s users.

  1. Because granular access control with a private hub relies on the latest SageMaker Python SDK, first update (force reinstall) the SDK:
    !pip3 install sagemaker --force-reinstall --quiet
  2. Next, import the SageMaker and Boto3 libraries:
    import boto3
    from sagemaker import Session
    from sagemaker.jumpstart.hub.hub import Hub
  3. Configure your private hub:
    HUB_NAME="CompanyHub"
    HUB_DISPLAY_NAME="Allowlisted Models"
    HUB_DESCRIPTION="These are allowlisted models taken from the JumpStart Public Hub."
    REGION="<your_region_name>" # for example, "us-west-2"

    In the preceding code, HUB_NAME specifies the name of your Hub. HUB_DISPLAY_NAME is the display name for your hub that will be shown to users in UI experiences. HUB_DESCRIPTION is the description for your hub that will be shown to users.

  4. Set up a Boto3 client for SageMaker:
    sm_client = boto3.client('sagemaker')
    session = Session(sagemaker_client=sm_client)
    session.get_caller_identity_arn()
  5. Check whether the following policy has already been added to your admin IAM role; if not, you can add it as an inline policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetObjectTagging"
                ],
                "Resource": [
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
                ],
                "Effect": "Allow"
            }
        ]
    }

    Replace the <REGION> placeholder using the configurations in Step 3.

    In addition to setting up IAM permissions for the admin role, you need to scope down permissions for your users so they can’t access public content.

  6. Use the following policy to deny your users access to the public hub. It can be added as an inline policy in the user’s IAM role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": "s3:*",
                "Effect": "Deny",
                "Resource": [
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
                ],
                "Condition": {
                    "StringNotLike": {"s3:prefix": ["*.ipynb", "*/eula.txt"]}
                }
            },
            {
                "Action": "sagemaker:*",
                "Effect": "Deny",
                "Resource": [
                    "arn:aws:sagemaker:<REGION>:aws:hub/SageMakerPublicHub",
                    "arn:aws:sagemaker:<REGION>:aws:hub-content/SageMakerPublicHub/*/*"
                ]
            }
        ]
    }
    

    Replace the <REGION> placeholder in the policy using the configurations in Step 3.

    After you have set up the private hub configuration and permissions, you’re ready to create the private hub.

  7. Use the following code to create the private hub within your AWS account in the Region you specified earlier:
    hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)
    
    try:
      hub.create(
          description=HUB_DESCRIPTION,
          display_name=HUB_DISPLAY_NAME
      )
      print(f"Successfully created Hub with name {HUB_NAME} in {REGION}")
    except Exception as e:
      if "ResourceInUse" in str(e):
        print(f"A hub with the name {HUB_NAME} already exists in your account.")
      else:
        raise e
    
  8. Use hub.describe() to verify the configuration of your hub.

    After your private hub is set up, you can add references to models from the SageMaker JumpStart public hub to your private hub. No model artifacts need to be managed by the customer; the SageMaker team manages any version or security updates. For a list of available models, refer to Built-in Algorithms with pre-trained Model Table.
  9. To search the public hub programmatically, run the following command:
    filter_value = "framework == meta"
    response = hub.list_sagemaker_public_hub_models(filter=filter_value)
    models = response["hub_content_summaries"]
    while response["next_token"]:
        response = hub.list_sagemaker_public_hub_models(filter=filter_value,
                                                        next_token=response["next_token"])
        models.extend(response["hub_content_summaries"])
    
    print(models)
    

    The filter argument is optional. For a list of filters you can apply, refer to SageMaker Python SDK.

  10. Use the retrieved models from the preceding command to create model references for your private hub:
    for model in models:
        print(f"Adding {model.get('hub_content_name')} to Hub")
        hub.create_model_reference(model_arn=model.get("hub_content_arn"), 
                                   model_name=model.get("hub_content_name"))

    The SageMaker JumpStart private hub offers other useful features for managing and interacting with the curated models. Administrators can check the metadata of a specific model using the hub.describe_model(model_name=<model_name>) command. To list all available models in the private hub, you can use a simple loop:

    response = hub.list_models()
    models = response["hub_content_summaries"]
    while response["next_token"]:
        response = hub.list_models(next_token=response["next_token"])
        models.extend(response["hub_content_summaries"])
    
    for model in models:
        print(model.get('HubContentArn'))
    

    If you need to remove a specific model reference from the private hub, use the following command:

    hub.delete_model_reference("<model_name>")

    If you want to delete the private hub from your account and Region, you’ll need to delete all the HubContents first, then delete the private hub. Use the following code:

    for model in models:
        hub.delete_model_reference(model_name=model.get('HubContentName')) 
    
    hub.delete()
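
The pagination loop used with list_sagemaker_public_hub_models and list_models in the steps above appears several times, so it can be factored into a small helper. This is a sketch; collect_all_models is a name introduced here, not part of the SageMaker SDK:

```python
def collect_all_models(list_fn, **kwargs):
    """Drain a paginated hub listing call (for example hub.list_models or
    hub.list_sagemaker_public_hub_models) into a single list of summaries."""
    response = list_fn(**kwargs)
    models = list(response["hub_content_summaries"])
    while response.get("next_token"):
        response = list_fn(next_token=response["next_token"], **kwargs)
        models.extend(response["hub_content_summaries"])
    return models
```

With this helper, the earlier listing snippets reduce to calls like collect_all_models(hub.list_models) or collect_all_models(hub.list_sagemaker_public_hub_models, filter=filter_value).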
    

Interact with allowlisted models (users)

This section offers a step-by-step guide for users to interact with allowlisted models in SageMaker JumpStart. We demonstrate how to list available models, identify a model from the public hub, and deploy the model to endpoints from SageMaker Studio as well as the SageMaker Python SDK.

User experience in SageMaker Studio

Complete the following steps to interact with allowlisted models using SageMaker Studio:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane or in the Prebuilt and automated solutions section.
  2. Choose one of the model hubs you have access to. If you have access to multiple hubs, you’ll see a list of hubs, as shown in the following screenshot.
    If you have access to only one hub, you’ll go straight to the model list.
    You can view the model details and supported actions like train, deploy, and evaluate.
  3. To deploy a model, choose Deploy.
  4. Modify your model configurations like instances and deployment parameters, and choose Deploy.

User experience using the SageMaker Python SDK

To interact with your models using the SageMaker Python SDK, complete the following steps:

  1. Just like in the admin process, the first step is to force reinstall the SageMaker Python SDK:
    !pip3 install sagemaker --force-reinstall --quiet
  2. Import the SageMaker and Boto3 libraries:
    import boto3
    from sagemaker import Session
    from sagemaker.jumpstart.hub.hub import Hub
    from sagemaker.jumpstart.model import JumpStartModel
    from sagemaker.jumpstart.estimator import JumpStartEstimator
  3. To access the models in your private hub, you need the Region and the name of the hub on your account. Fill out the HUB_NAME and REGION fields with the information provided by your administrator:
    HUB_NAME="CompanyHub" 
    REGION="<your_region_name>" # for example, "us-west-2"
    sm_client = boto3.client('sagemaker') 
    sm_runtime_client = boto3.client('sagemaker-runtime') 
    session = Session(sagemaker_client=sm_client, 
                        sagemaker_runtime_client=sm_runtime_client)
    hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)
  4. List the models available in your private hub using the following command:
    response = hub.list_models()
    models = response["hub_content_summaries"]
    while response["next_token"]:
        response = hub.list_models(next_token=response["next_token"])
        models.extend(response["hub_content_summaries"])
    
    print(models)
  5. To get more information about a particular model, use the describe_model method:
    model_name = "huggingface-llm-phi-2"
    response = hub.describe_model(model_name=model_name) 
    print(response)
  6. You can deploy models in a hub with the Python SDK by using JumpStartModel. To deploy a model from the hub to an endpoint, run the following code. To select which model from your hub you want to use, pass in a model_id and version. If you pass in * for the version, it will use the latest version available for that model_id in the hub. If you’re using a model gated behind a EULA, pass in accept_eula=True.
    model_id, version = "huggingface-llm-phi-2", "1.0.0"
    model = JumpStartModel(model_id, version, hub_name=HUB_NAME, 
                                region=REGION, sagemaker_session=session)
    predictor = model.deploy(accept_eula=False)  # set accept_eula=True for EULA-gated models
  7. To invoke your deployed model with the default payloads, use the following code:
    example_payloads = model.retrieve_all_examples()
    for payload in example_payloads:
        response = predictor.predict(payload.body)
        print("\nInput\n", payload.body, "\n\nOutput\n", 
                    response[0]["generated_text"], "\n\n===============")
  8. To delete the model endpoints that you created, use the following code:
    predictor.delete_model()
    predictor.delete_endpoint()
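
Beyond the default example payloads, you can construct your own request body. The helper below assumes the common text-generation schema ({"inputs": ..., "parameters": {...}}) that many JumpStart LLM containers accept; that schema is an assumption here, so check model.retrieve_all_examples() for the authoritative format of your model:

```python
def build_text_generation_payload(prompt, max_new_tokens=64, temperature=0.2):
    """Build a request body in the common text-generation schema.
    The schema is an assumption; verify it against your model's
    example payloads before relying on it."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

# With the predictor from step 6 you would then call, for example:
# response = predictor.predict(build_text_generation_payload("Write a haiku."))
```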

Cross-account sharing of private hubs

SageMaker JumpStart private hubs support cross-account sharing, allowing you to extend the benefits of your curated model repository beyond your own AWS account. This feature enables collaboration across different teams or departments within your organization, even when they operate in separate AWS accounts. By using AWS RAM, you can securely share your private hubs while maintaining control over access.

To share your private hub across accounts, complete the following steps:

  1. On the AWS RAM console, choose Create resource share.
  2. When specifying resource share details, choose the SageMaker hub resource type and select one or more private hubs that you want to share. When you share a hub with any other account, all of its contents are also shared implicitly.
  3. Associate permissions with your resource share.
  4. Use AWS account IDs to specify the accounts to which you want to grant access to your shared resources.
  5. Review your resource share configuration and choose Create resource share.

It may take a few minutes for the resource share and principal associations to complete.

Admins who want to perform the preceding steps programmatically can use the following command to initiate the sharing:

# create a resource share using the private hub
aws ram create-resource-share \
    --name test-share \
    --resource-arns arn:aws:sagemaker:<region>:<resource_owner_account_id>:hub/<hub_name> \
    --principals <consumer_account_id> \
    --region <region>

Replace the <resource_owner_account_id>, <consumer_account_id>, <hub_name>, and <region> placeholders with the appropriate values for the resource owner account ID, consumer account ID, name of the hub, and Region to use.

After you set up the resource share, the specified AWS account will receive an invitation to join. They must accept this invitation through AWS RAM to gain access to the shared private hub. This process makes sure access is granted only with explicit consent from both the hub owner and the recipient account. For more information, refer to Using shared AWS resources.

You can also perform this step programmatically:

# list resource share invitations
aws ram get-resource-share-invitations \
    --region <region>

# accept the resource share
# using the ARN from the previous response
aws ram accept-resource-share-invitation \
  --resource-share-invitation-arn <arn_from_previous_request> \
  --region <region>

For detailed instructions on creating resource shares and accepting invitations, refer to Creating a resource share in AWS RAM. By extending your private hub across accounts, you can foster collaboration and maintain consistent model governance across your entire organization.
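
The same flow can also be scripted in Python with Boto3’s RAM client. The filtering helper below is introduced here for illustration; the commented calls use the real get_resource_share_invitations and accept_resource_share_invitation APIs:

```python
def pending_invitation_arns(invitations):
    """From a get_resource_share_invitations response, keep only the ARNs
    of invitations that are still awaiting acceptance."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv.get("status") == "PENDING"
    ]

# In the consumer account (assumes configured credentials and Region):
# import boto3
# ram = boto3.client("ram")
# invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
# for arn in pending_invitation_arns(invitations):
#     ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
```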

Conclusion

SageMaker JumpStart allows enterprises to adopt FMs while maintaining granular control over model access and usage. By creating a curated repository of approved models in private hubs, organizations can align their AI initiatives with corporate policies and regulatory requirements. The private hub decouples model curation from model consumption, enabling administrators to manage the model inventory while data scientists focus on developing AI solutions.

This post explained the private hub feature in SageMaker JumpStart and provided steps to set up and use a private hub, with minimal additional configuration required. Administrators can select models from the public SageMaker JumpStart hub, add them to the private hub, and manage user access through IAM policies. Users can then deploy these preapproved models, fine-tune them on custom datasets, and integrate them into their applications using familiar SageMaker interfaces. The private hub uses the SageMaker underlying infrastructure, allowing it to scale with enterprise-level ML demands.

For more information about SageMaker JumpStart, refer to SageMaker JumpStart. To get started using SageMaker JumpStart, access it through SageMaker Studio.

About the Authors

Raju Rangan is a Senior Solutions Architect at AWS. He works with government-sponsored entities, helping them build AI/ML solutions using AWS. When not tinkering with cloud solutions, you’ll catch him hanging out with family or smashing birdies in a lively game of badminton with friends.

Sherry Ding is a senior AI/ML specialist solutions architect at AWS. She has extensive experience in machine learning with a PhD in computer science. She mainly works with public sector customers on various AI/ML-related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.


eSentire delivers private and secure generative AI interactions to customers with Amazon SageMaker

eSentire is an industry-leading provider of Managed Detection & Response (MDR) services protecting users, data, and applications of over 2,000 organizations globally across more than 35 industries. These security services help their customers anticipate, withstand, and recover from sophisticated cyber threats, prevent disruption from malicious attacks, and improve their security posture.

In 2023, eSentire was looking for ways to deliver differentiated customer experiences by continuing to improve the quality of its security investigations and customer communications. To accomplish this, eSentire built AI Investigator, a natural language query tool for their customers to access security platform data by using AWS generative artificial intelligence (AI) capabilities.

In this post, we share how eSentire built AI Investigator using Amazon SageMaker to provide private and secure generative AI interactions to their customers.

Benefits of AI Investigator

Before AI Investigator, customers would engage eSentire’s Security Operation Center (SOC) analysts to understand and further investigate their asset data and associated threat cases. This involved manual effort for customers and eSentire analysts, forming questions and searching through data across multiple tools to formulate answers.

eSentire’s AI Investigator enables users to complete complex queries using natural language by joining multiple sources of data from each customer’s own security telemetry and eSentire’s asset, vulnerability, and threat data mesh. This helps customers quickly and seamlessly explore their security data and accelerate internal investigations.

Providing AI Investigator internally to the eSentire SOC workbench has also accelerated eSentire’s investigation process by improving the scale and efficacy of multi-telemetry investigations. The LLM models augment SOC investigations with knowledge from eSentire’s security experts and security data, enabling higher-quality investigation outcomes while also reducing time to investigate. Over 100 SOC analysts are now using AI Investigator models to analyze security data and provide rapid investigation conclusions.

Solution overview

eSentire customers expect rigorous security and privacy controls for their sensitive data, which requires an architecture that doesn’t share data with external large language model (LLM) providers. Therefore, eSentire decided to build their own LLM using the Llama 1 and Llama 2 foundation models. A foundation model (FM) is an LLM that has undergone unsupervised pre-training on a corpus of text. eSentire tried multiple FMs available in AWS for their proof of concept; however, the straightforward access to Meta’s Llama 2 FM through Hugging Face in SageMaker for training and inference (and its licensing structure) made Llama 2 an obvious choice.

eSentire has over 2 TB of signal data stored in their Amazon Simple Storage Service (Amazon S3) data lake. eSentire used gigabytes of additional human investigation metadata to perform supervised fine-tuning on Llama 2. This further step updates the FM by training with data labeled by security experts (such as Q&A pairs and investigation conclusions).

eSentire used SageMaker on several levels, ultimately facilitating their end-to-end process:

  • They used SageMaker notebook instances extensively to spin up GPU instances, giving them the flexibility to swap high-power compute in and out when needed. eSentire used instances with CPU for data preprocessing and post-inference analysis and GPU for the actual model (LLM) training.
  • The additional benefit of SageMaker notebook instances is its streamlined integration with eSentire’s AWS environment. Because they have vast amounts of data (terabyte scale, over 1 billion total rows of relevant data in preprocessing input) stored across AWS—in Amazon S3 and Amazon Relational Database Service (Amazon RDS) for PostgreSQL clusters—SageMaker notebook instances allowed secure movement of this volume of data directly from the AWS source (Amazon S3 or Amazon RDS) to the SageMaker notebook. They needed no additional infrastructure for data integration.
  • SageMaker real-time inference endpoints provide the infrastructure needed for hosting their custom self-trained LLMs. This was very useful in combination with SageMaker integration with Amazon Elastic Container Registry (Amazon ECR), SageMaker endpoint configuration, and SageMaker models to provide the entire configuration required to spin up their LLMs as needed. The fully featured end-to-end deployment capability provided by SageMaker allowed eSentire to effortlessly and consistently update their model registry as they iterate and update their LLMs. All of this was entirely automated with the software development lifecycle (SDLC) using Terraform and GitHub, which is only possible through the SageMaker ecosystem.

The following diagram illustrates the architecture and workflow.

The application’s frontend is accessible through Amazon API Gateway, using both edge and private gateways. To emulate intricate thought processes akin to those of a human investigator, eSentire engineered a system of chained agent actions. This system uses AWS Lambda and Amazon DynamoDB to orchestrate a series of LLM invocations. Each LLM call builds upon the previous one, creating a cascade of interactions that collectively produce high-quality responses. This intricate setup makes sure that the application’s backend data sources are seamlessly integrated, thereby providing tailored responses to customer inquiries.
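
The chained-agent pattern can be illustrated with a minimal loop in which each step’s prompt is built from the previous step’s output. This is a sketch with a stubbed invoke_llm and hypothetical step templates; eSentire’s actual orchestration spans Lambda and DynamoDB:

```python
def run_chain(question, step_templates, invoke_llm):
    """Run a sequence of prompt templates, feeding each LLM response
    into the next step's prompt."""
    context = question
    for template in step_templates:
        prompt = template.format(context=context)
        context = invoke_llm(prompt)  # e.g. a SageMaker endpoint invocation
    return context

# Hypothetical three-step investigation chain:
investigation_steps = [
    "Extract the entities mentioned in: {context}",
    "List the telemetry sources relevant to: {context}",
    "Write an investigation conclusion based on: {context}",
]
```

Each call sees the accumulated context, which is how the cascade of interactions collectively produces a higher-quality final response than a single LLM call.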

When a SageMaker endpoint is constructed, it references an S3 URI to the bucket containing the model artifact, and the Docker image is shared using Amazon ECR.

For their proof of concept, eSentire selected the NVIDIA A10G Tensor Core GPU housed in an ml.g5.2xlarge instance for its balance of performance and cost. For LLMs with significantly larger numbers of parameters, which demand greater computational power for both training and inference tasks, eSentire used ml.g5.12xlarge instances equipped with four GPUs. This was necessary because the computational complexity and the amount of memory required for LLMs can increase exponentially with the number of parameters. eSentire plans to harness P4 and P5 instance types for scaling their production workloads.

Additionally, a monitoring framework that captures the inputs and outputs of AI Investigator was necessary to enable threat-hunting visibility into LLM interactions. To accomplish this, the application integrates with eSentire’s open source LLM Gateway project to monitor the interactions with customer queries, backend agent actions, and application responses. This framework enables confidence in complex LLM applications by providing a security monitoring layer to detect malicious poisoning and injection attacks while also providing governance and support for compliance through logging of user activity. The LLM gateway can also be integrated with other LLM services, such as Amazon Bedrock.
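
Conceptually, this kind of monitoring layer can be approximated by a thin wrapper that records every prompt/response pair before returning. The following is a sketch of the idea only, not the eSentire LLM Gateway itself, which is a separate open source project:

```python
import logging
import time

logger = logging.getLogger("llm-gateway-sketch")

def monitored(invoke_llm, audit_trail):
    """Wrap an LLM call so each interaction is logged and appended to an
    audit trail for later threat-hunting review."""
    def wrapper(prompt):
        response = invoke_llm(prompt)
        audit_trail.append(
            {"ts": time.time(), "prompt": prompt, "response": response}
        )
        logger.info("llm call: %d chars in, %d chars out",
                    len(prompt), len(response))
        return response
    return wrapper
```

The audit trail gives the security team a reviewable record of user activity, which is the governance property the gateway provides at production scale.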

Amazon Bedrock enables you to customize FMs privately and interactively, without the need for coding. Initially, eSentire’s focus was on training bespoke models using SageMaker. As their strategy evolved, they began to explore a broader array of FMs, evaluating their in-house trained models against those provided by Amazon Bedrock. Amazon Bedrock offers a practical environment for benchmarking and a cost-effective solution for managing workloads due to its serverless operation. This serves eSentire well, especially when customer queries are sporadic, making serverless an economical alternative to persistently running SageMaker instances.

From a security perspective, Amazon Bedrock doesn’t share users’ inputs and model outputs with any model providers. Additionally, eSentire has custom guardrails for NL2SQL applied to their models.

Results

The following screenshot shows an example of eSentire’s AI Investigator output. As illustrated, a natural language query is posed to the application. The tool is able to correlate multiple datasets and present a response.

Dustin Hillard, CTO of eSentire, shares: “eSentire customers and analysts ask hundreds of security data exploration questions per month, which typically take hours to complete. AI Investigator is now with an initial rollout to over 100 customers and more than 100 SOC analysts, providing a self-serve immediate response to complex questions about their security data. eSentire LLM models are saving thousands of hours of customer and analyst time.”

Conclusion

In this post, we shared how eSentire built AI Investigator, a generative AI solution that provides private and secure self-serve customer interactions. Customers can get near real-time answers to complex questions about their data. AI Investigator has also saved eSentire significant analyst time.

The LLM gateway project mentioned in this post is eSentire’s own product, and AWS bears no responsibility for it.

If you have any comments or questions, share them in the comments section.


About the Authors

Aishwarya Subramaniam is a Sr. Solutions Architect in AWS. She works with commercial customers and AWS partners to accelerate customers’ business outcomes by providing expertise in analytics and AWS services.

Ilia Zenkov is a Senior AI Developer specializing in generative AI at eSentire. He focuses on advancing cybersecurity with expertise in machine learning and data engineering. His background includes pivotal roles in developing ML-driven cybersecurity and drug discovery platforms.

Dustin Hillard is responsible for leading product development and technology innovation, systems teams, and corporate IT at eSentire. He has deep ML experience in speech recognition, translation, natural language processing, and advertising, and has published over 30 papers in these areas.

Imperva optimizes SQL generation from natural language using Amazon Bedrock

This is a guest post co-written with Ori Nakar from Imperva.

Imperva Cloud WAF protects hundreds of thousands of websites against cyber threats and blocks billions of security events every day. Counters and insights based on security events are calculated daily and used by users from multiple departments. Millions of counters are added daily, and 20 million insights are updated daily to spot threat patterns.

Our goal was to improve the user experience of an existing application used to explore the counters and insights data. The data is stored in a data lake and retrieved by SQL using Amazon Athena.

As part of our solution, we replaced multiple search fields with a single free text field. We used a large language model (LLM) with query examples to make the search work using the language used by Imperva internal users (business analysts).

The following figure shows a search query that was translated to SQL and run. The results were later formatted as a chart by the application. We have many types of insights—global, industry, and customer level insights used by multiple departments such as marketing, support, and research. Data was made available to our users through a simplified user experience powered by an LLM.

Figure 1: Insights search by natural language

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon within a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Studio is a new single sign-on (SSO)-enabled web interface that provides a way for developers across an organization to experiment with LLMs and other FMs, collaborate on projects, and iterate on generative AI applications. It offers a rapid prototyping environment and streamlines access to multiple FMs and developer tools in Amazon Bedrock.

Read more to learn about the problem, and how we obtained quality results using Amazon Bedrock for our experimentation and deployment.

The problem

Making data accessible to users through applications has always been a challenge. Data is normally stored in databases, and can be queried using the most common query language, SQL. Applications use different UI components to allow users to filter and query the data. There are applications with tens of different filters and other options–all created to make the data accessible.

Querying databases through applications cannot be as flexible as running SQL queries on a known schema. Giving more power to the user comes at the expense of a simple user experience (UX). Natural language can solve this problem: it’s possible to support complex yet readable natural language queries without SQL knowledge. On schema changes, the application UX and code remain the same, or require only minor changes, which saves development time and keeps the application user interface (UI) stable for the users.

Constructing SQL queries from natural language isn’t a simple task. SQL queries must be accurate both syntactically and logically. Using an LLM with the right examples can make this task less difficult.

Figure 2: High level database access using an LLM flow

The challenge

An LLM can construct SQL queries based on natural language. The challenge is to ensure quality. The user can enter any text, and the application constructs a query based on it. Unlike in traditional applications, there is no way to cover every possible input and make sure the application functions correctly. Adding an LLM to an application adds another layer of complexity. The response by the LLM is not deterministic. Examples sent to the LLM are based on the database data, which makes it even harder to control the requests sent to the LLM and ensure quality.

The solution: A data science approach

In data science, it’s common to develop a model and fine-tune it using experimentation. The idea is to use metrics to compare experiments during development. Experiments might differ from each other in many ways, such as the input sent to the model, the model type, and other parameters. The ability to compare different experiments makes it possible to make progress and to know how each change contributes to the model.

A test set is a static set of records that includes a prediction result for each record. Running predictions on the test set records yields the metrics needed to compare experiments. A common metric is accuracy: the percentage of correct results.

In our case, the results generated by the LLM are SQL statements. The SQL statements themselves are not deterministic and are hard to measure; however, running them on a static test database is deterministic and can be measured. We used a test database and a list of questions with known answers as a test set, which allowed us to run experiments and fine-tune our LLM-based application.
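This measurement loop can be sketched with Python’s built-in sqlite3 module standing in for the static test database. The table, the questions, and the fixed question-to-SQL mapping below are illustrative; in the real application the SQL comes from the LLM:

```python
import sqlite3

def run_query(conn, sql):
    """Run a SQL statement on the test database; return rows, or None on error."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def accuracy(conn, test_set, generate_sql):
    """Percentage of questions whose generated SQL returns the expected rows."""
    correct = sum(
        run_query(conn, generate_sql(question)) == expected_rows
        for question, expected_rows in test_set
    )
    return 100.0 * correct / len(test_set)

# A tiny static test database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)])

# Stand-in for the LLM: a fixed question-to-SQL mapping.
fake_llm = {"What is the total order amount?": "SELECT SUM(amount) FROM orders"}
test_set = [("What is the total order amount?", [(35.0,)])]
print(accuracy(conn, test_set, fake_llm.get))  # 100.0
```

Because the database is static, the same generated SQL always produces the same rows, so the accuracy number is reproducible across experiments.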

Database access using LLM: Question to answer flow

Given a question, we defined the following flow. The question is sent through a retrieval-augmented generation (RAG) process, which finds similar documents. Each document holds an example question and information about it. The relevant documents are built into a prompt and sent to the LLM, which builds a SQL statement. This flow is used both for development and application runtime:

Figure 3: Question to answer flow
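The prompt-building step of this flow can be sketched as follows. The document fields and prompt wording are illustrative, not the application’s actual prompt:

```python
def build_prompt(question, examples):
    """Assemble a few-shot prompt from the retrieved example documents."""
    parts = ["Translate the question into a SQL query.", ""]
    for example in examples:
        parts += [f"Question: {example['question']}",
                  f"SQL: {example['sql']}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)

retrieved = [{"question": "How many orders are there?",
              "sql": "SELECT COUNT(*) FROM orders"}]
print(build_prompt("How many items were ordered in 2024?", retrieved))
```

The prompt ends with an open "SQL:" line, so the LLM’s completion is the generated statement.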

As an example, consider a database schema with two tables: orders and items. The following figure is a question to SQL example flow:

Figure 4: Question to answer flow example

Database access using LLM: Development process

To develop and fine-tune the application, we created the following datasets:

  • A static test database: Contains the relevant tables and a sample copy of the data.
  • A test set: Includes questions and test database result answers.
  • Question-to-SQL examples: A set of questions and their translations to SQL. For some examples, the returned data is included, to allow questions about the data and not only about the schema.
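To illustrate, records in these datasets could look like the following (the field names and values are hypothetical):

```python
# One test set entry: a question and its known answer on the test database.
test_record = {
    "question": "What is the total order amount?",
    "expected_answer": [(35.0,)],
}

# One question-to-SQL example used for retrieval; returned data is included
# here so questions can be asked about the data, not only about the schema.
example_record = {
    "question": "How many orders are there?",
    "sql": "SELECT COUNT(*) FROM orders",
    "returned_data": [(2,)],
}
```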

Development of the application is done by adding new questions and updating the different datasets, as shown in the following figure.

Figure 5: Adding a new question

Datasets and other parameter updates are tracked as part of adding new questions and fine-tuning the application. We used a tracking tool to track information about the experiments, such as:

  • Parameters such as the number of questions, number of examples, LLM type, RAG search method
  • Metrics such as the accuracy and SQL error rate
  • Artifacts such as a list of the wrong results including generated SQL, data returned, and more
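Any experiment tracking tool (for example, MLflow or SageMaker Experiments) can record this information; the following stdlib-only sketch shows the idea, with hypothetical parameter names and file paths:

```python
import json
import time

def log_experiment(path, params, metrics, artifacts):
    """Append one experiment record (parameters, metrics, artifacts) as a JSON line."""
    record = {"timestamp": time.time(), "params": params,
              "metrics": metrics, "artifacts": artifacts}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

record = log_experiment(
    "experiments.jsonl",
    params={"num_questions": 120, "num_examples": 5,
            "llm": "anthropic.claude-instant-v1", "rag_search": "cosine"},
    metrics={"accuracy": 0.87, "sql_error_rate": 0.04},
    artifacts={"wrong_results": "wrong_results_run_42.csv"},
)
```

Each run appends one line, so experiments can later be loaded and compared side by side.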

Figure 6: Experiment flow

Using a tracking tool, we were able to make progress by comparing experiments. The following figure shows the accuracy and error rate metrics for the different experiments we did:

Figure 7: Accuracy and error rate over time

When there’s a mistake or an error, we drill down into the false results and the experiment details to understand the source of the error and fix it.

Experiment and deploy using Amazon Bedrock

Amazon Bedrock is a managed service that offers a choice of high-performing foundation models. You can experiment with and evaluate top FMs for your use case and customize them with your data.

By using Amazon Bedrock, we were able to switch between models and embedding options easily. The following is example code using the LangChain Python library, which allows using different models and embeddings:

import boto3
from langchain_community.llms.bedrock import Bedrock
from langchain_community.embeddings import BedrockEmbeddings

bedrock_runtime = boto3.client("bedrock-runtime")

def get_llm(model_id: str, args: dict) -> Bedrock:
    # Return an LLM for the given Amazon Bedrock model ID and model arguments
    return Bedrock(model_id=model_id,
                   model_kwargs=args,
                   client=bedrock_runtime)

def get_embeddings(model_id: str) -> BedrockEmbeddings:
    # Return an embeddings client for the given Amazon Bedrock model ID
    return BedrockEmbeddings(model_id=model_id,
                             client=bedrock_runtime)

We used multiple models and embeddings with different hyperparameters to improve accuracy and decide which model is the best fit for us. We also ran experiments on smaller models to determine whether we could reach the same quality with improved performance and reduced cost. We started with Anthropic Claude 2.1 and experimented with the Anthropic Claude Instant model. Accuracy dropped by 20 percent, but after adding a few additional examples, we achieved the same accuracy as Claude 2.1 with lower cost and faster response time.

Conclusion

We used the same approach used in data science projects to construct SQL queries from natural language. The solution shown can be applied to other LLM-based applications, and not only for constructing SQL. For example, it can be used for API access, building JSON data, and more. The key is to create a test set together with measurable results and progress using experimentation.

Amazon Bedrock lets you use different models and switch between them to find the right one for your use case. You can compare different models, including small ones for better performance and costs. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure. We were able to test multiple models quickly, and finally integrate and deploy generative AI capabilities into our application.

You can start experimenting with natural language to SQL by running the code samples in this GitHub repository. The workshop is divided into modules that each build on the previous one while introducing a new technique to solve this problem. Many of these approaches are based on existing work from the community and are cited accordingly.


About the Authors

Ori Nakar is a Principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group.

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect at AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Elad Eizner is a Solutions Architect at Amazon Web Services. He works with AWS enterprise customers to help them architect and build solutions in the cloud and achieve their goals.

Create natural conversations with Amazon Lex QnAIntent and Knowledge Bases for Amazon Bedrock

Customer service organizations today face an immense opportunity. As customer expectations grow, brands have a chance to creatively apply new innovations to transform the customer experience. Although meeting rising customer demands poses challenges, the latest breakthroughs in conversational artificial intelligence (AI) empower companies to meet these expectations.

Customers today expect timely responses to their questions that are helpful, accurate, and tailored to their needs. The new QnAIntent, powered by Amazon Bedrock, can meet these expectations by understanding questions posed in natural language and responding conversationally in real time using your own authorized knowledge sources. Our Retrieval Augmented Generation (RAG) approach allows Amazon Lex to harness both the breadth of knowledge available in repositories as well as the fluency of large language models (LLMs).

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In this post, we show you how to add generative AI question answering capabilities to your bots. This can be done using your own curated knowledge sources, and without writing a single line of code.

Read on to discover how QnAIntent can transform your customer experience.

Solution overview

Implementing the solution consists of the following high-level steps:

  1. Create an Amazon Lex bot.
  2. Create an Amazon Simple Storage Service (Amazon S3) bucket and upload a PDF file that contains the information used to answer questions.
  3. Create a knowledge base that will split your data into chunks and generate embeddings using the Amazon Titan Embeddings model. As part of this process, Knowledge Bases for Amazon Bedrock automatically creates an Amazon OpenSearch Serverless vector search collection to hold your vectorized data.
  4. Add a new QnAIntent intent that will use the knowledge base to find answers to customers’ questions and then use the Anthropic Claude model to generate answers, including answers to follow-up questions.

Prerequisites

To follow along with the features described in this post, you need access to an AWS account with permissions to access Amazon Lex, Amazon Bedrock (with access to Anthropic Claude models and Amazon Titan embeddings or Cohere Embed), Knowledge Bases for Amazon Bedrock, and the OpenSearch Serverless vector engine. To request access to models in Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, choose Model access in the navigation pane.
  2. Choose Manage model access.
  3. Select the Amazon and Anthropic models. (You can also choose to use Cohere models for embeddings.)


  4. Choose Request model access.

Create an Amazon Lex bot

If you already have a bot you want to use, you can skip this step.

  1. On the Amazon Lex console, choose Bots in the navigation pane.
  2. Choose Create bot.
  3. Select Start with an example and choose the BookTrip example bot.
  4. For Bot name, enter a name for the bot (for example, BookHotel).
  5. For Runtime role, select Create a role with basic Amazon Lex permissions.
  6. In the Children’s Online Privacy Protection Act (COPPA) section, you can select No because this bot is not targeted at children under the age of 13.
  7. Keep the Idle session timeout setting at 5 minutes.
  8. Choose Next.
  9. When using the QnAIntent to answer questions in a bot, you may want to increase the intent classification confidence threshold so that your questions are not accidentally interpreted as matching one of your intents. We set this to 0.8 for now. You may need to adjust this up or down based on your own testing.
  10. Choose Done.
  11. Choose Save intent.

Upload content to Amazon S3

Now you create an S3 bucket to store the documents you want to use for your knowledge base.

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique name.
  4. Keep the default values for all other options and choose Create bucket.

For this post, we created an FAQ document for the fictitious hotel chain called Example Corp FictitiousHotels. Download the PDF document to follow along.

  1. On the Buckets page, navigate to the bucket you created.

If you don’t see it, you can search for it by name.

  2. Choose Upload.
  3. Choose Add files.
  4. Choose the ExampleCorpFicticiousHotelsFAQ.pdf that you downloaded.
  5. Choose Upload.

The file will now be accessible in the S3 bucket.

Create a knowledge base

Now you can set up the knowledge base:

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. For Knowledge base name, enter a name.
  4. For Knowledge base description, enter an optional description.
  5. Select Create and use a new service role.
  6. For Service role name, enter a name or keep the default.
  7. Choose Next.
  8. For Data source name, enter a name.
  9. Choose Browse S3 and navigate to the S3 bucket you uploaded the PDF file to earlier.
  10. Choose Next.
  11. Choose an embeddings model.
  12. Select Quick create a new vector store to create a new OpenSearch Serverless vector store to store the vectorized content.
  13. Choose Next.
  14. Review your configuration, then choose Create knowledge base.

After a few minutes, the knowledge base will be created.

  1. Choose Sync to chunk the documents, calculate the embeddings, and store them in the vector store.

This may take a while. You can proceed with the rest of the steps, but the syncing needs to finish before you can query the knowledge base.

  1. Copy the knowledge base ID. You will reference this when you add this knowledge base to your Amazon Lex bot.

Add QnAIntent to the Amazon Lex bot

To add QnAIntent, complete the following steps:

  1. On the Amazon Lex console, choose Bots in the navigation pane.
  2. Choose your bot.
  3. In the navigation pane, choose Intents.
  4. On the Add intent menu, choose Use built-in intent.
  5. For Built-in intent, choose AMAZON.QnAIntent.
  6. For Intent name, enter a name.
  7. Choose Add.
  8. Choose the model you want to use to generate the answers (in this case, Anthropic Claude 3 Sonnet, but you can select Anthropic Claude 3 Haiku for a cheaper option with less latency).
  9. For Choose knowledge store, select Knowledge base for Amazon Bedrock.
  10. For Knowledge base for Amazon Bedrock Id, enter the ID you noted earlier when you created your knowledge base.
  11. Choose Save Intent.
  12. Choose Build to build the bot.
  13. Choose Test to test the new intent.

The following screenshot shows an example conversation with the bot.

In the second question about the Miami pool hours, you refer back to the previous question about pool hours in Las Vegas and still get a relevant answer based on the conversation history.

It’s also possible to ask questions that require the bot to reason a bit around the available data. When we asked about a good resort for a family vacation, the bot recommended the Orlando resort based on the availability of activities for kids, proximity to theme parks, and more.

Update the confidence threshold

You may have some questions accidentally match your other intents. If you run into this, you can adjust the confidence threshold for your bot. To modify this setting, choose the language of your bot (English) and in the Language details section, choose Edit.

After you update the confidence threshold, rebuild the bot for the change to take effect.

Add additional steps

By default, the next step in the conversation for the bot is set to Wait for user input after a question has been answered. This keeps the conversation in the bot and allows a user to ask follow-up questions or invoke any of the other intents in your bot.

If you want the conversation to end and return control to the calling application (for example, Amazon Connect), you can change this behavior to End conversation. To update the setting, complete the following steps:

  1. On the Amazon Lex console, navigate to the QnAIntent.
  2. In the Fulfillment section, choose Advanced options.
  3. On the Next step in conversation dropdown menu, choose End conversation.

If you would like the bot to add a specific message after each response from the QnAIntent (such as “Can I help you with anything else?”), you can add a closing response to the QnAIntent.

Clean up

To avoid incurring ongoing costs, delete the resources you created as part of this post:

  • Amazon Lex bot
  • S3 bucket
  • OpenSearch Serverless collection (This is not automatically deleted when you delete your knowledge base)
  • Knowledge bases

Conclusion

The new QnAIntent in Amazon Lex enables natural conversations by connecting customers with curated knowledge sources. Powered by Amazon Bedrock, the QnAIntent understands questions in natural language and responds conversationally, keeping customers engaged with contextual, follow-up responses.

QnAIntent puts the latest innovations in reach to transform static FAQs into flowing dialogues that resolve customer needs. This helps scale excellent self-service to delight customers.

Try it out for yourself. Reinvent your customer experience!


About the Author

Thomas Rindfuss is a Sr. Solutions Architect on the Amazon Lex team. He invents, develops, prototypes, and evangelizes new technical features and solutions for language AI services that improve the customer experience and ease adoption.

Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock

Retrieval Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by incorporating external knowledge sources. It allows LLMs to reference authoritative knowledge bases or internal repositories before generating responses, producing output tailored to specific domains or contexts while providing relevance, accuracy, and efficiency. RAG achieves this enhancement without retraining the model, making it a cost-effective solution for improving LLM performance across various applications. The following diagram illustrates the main steps in a RAG system.

Retrieval Augmented Generation RAG Architecture

Although RAG systems are promising, they face challenges like retrieving the most relevant knowledge, avoiding hallucinations inconsistent with the retrieved context, and efficiently integrating the retrieval and generation components. In addition, the RAG architecture can lead to potential issues like retrieval collapse, where the retrieval component learns to retrieve the same documents regardless of the input. A similar problem occurs for some tasks like open-domain question answering—there are often multiple valid answers available in the training data, so the LLM could choose to generate an answer from its training data. Another challenge is the need for an effective mechanism to handle cases where no useful information can be retrieved for a given input. Current research aims to improve these aspects for more reliable and capable knowledge-grounded generation.

Given these challenges faced by RAG systems, monitoring and evaluating generative artificial intelligence (AI) applications powered by RAG is essential. Moreover, tracking and analyzing the performance of RAG-based applications is crucial, because it helps assess their effectiveness and reliability when deployed in real-world scenarios. By evaluating RAG applications, you can understand how well the models are using and integrating external knowledge into their responses, how accurately they can retrieve relevant information, and how coherent the generated outputs are. Additionally, evaluation can identify potential biases, hallucinations, inconsistencies, or factual errors that may arise from the integration of external sources or from sub-optimal prompt engineering. Ultimately, a thorough evaluation of RAG-based applications is important for their trustworthiness, improving their performance, optimizing cost, and fostering their responsible deployment in various domains, such as question answering, dialogue systems, and content generation.

In this post, we show you how to evaluate the performance, trustworthiness, and potential biases of your RAG pipelines and applications on Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

RAG evaluation and observability challenges in real-world scenarios

Evaluating a RAG system poses significant challenges due to its complex architecture consisting of multiple components, such as the retrieval module and the generation component represented by the LLMs. Each module operates differently and requires distinct evaluation methodologies, making it difficult to assess the overall end-to-end performance of the RAG architecture. The following are some of the challenges you may encounter:

  • Lack of ground truth references – In many open-ended generation tasks, there is no single correct answer or reference text against which to evaluate the system’s output. This makes it difficult to apply standard evaluation metrics like BERTScore (Zhang et al. 2020), BLEU, or ROUGE used for machine translation and summarization.
  • Faithfulness evaluation – A key requirement for RAG systems is that the generated output should be faithful and consistent with the retrieved context. Evaluating this faithfulness, which also serves to measure the presence of hallucinated content, in an automated manner is non-trivial, especially for open-ended responses.
  • Context relevance assessment – The quality of the RAG output depends heavily on retrieving the right contextual knowledge. Automatically assessing the relevance of the retrieved context to the input prompt is an open challenge.
  • Factuality vs. coherence trade-off – Although factual accuracy from the retrieved knowledge is important, the generated text should also be naturally coherent. Evaluating and balancing factual consistency with language fluency is difficult.
  • Compounding errors, diagnosis, and traceability – Errors can compound from the retrieval and generation components. Diagnosing whether errors stem from retrieval failures or generation inconsistencies is hard without clear intermediate outputs. Given the complex interplay between various components of the RAG architecture, it’s also difficult to provide traceability of the problem in the evaluation process.
  • Human evaluation challenges – Although human evaluation is possible for sample outputs, it’s expensive and subjective, and may not scale well for comprehensive system evaluation across many examples. The need for a domain expert to create and evaluate against a dataset is essential, because the evaluation process requires specialized knowledge and expertise. The labor-intensive nature of the human evaluation process is time-consuming, because it often involves manual effort.
  • Lack of standardized benchmarks – There are no widely accepted and standardized benchmarks yet for holistically evaluating different capabilities of RAG systems. Without such benchmarks, it can be challenging to compare the various capabilities of different RAG techniques, models, and parameter configurations. Consequently, you may face difficulties in making informed choices when selecting the most appropriate RAG approach that aligns with your unique use case requirements.

Addressing these evaluation and observability challenges is an active area of research, because robust metrics are critical for iterating on and deploying reliable RAG systems for real-world applications.

RAG evaluation concepts and metrics

As mentioned previously, a RAG-based generative AI application is composed of two main processes: retrieval and generation. Retrieval is the process where the application uses the user query to retrieve the relevant documents from a knowledge base and add them as context to augment the final prompt. Generation is the process of generating the final response from the LLM. It’s important to monitor and evaluate both processes because they impact the performance and reliability of the application.

Evaluating RAG systems at scale requires an automated approach to extract metrics that are quantitative indicators of its reliability. Generally, the metrics to look for are grouped by main RAG components or by domains. Aside from the metrics discussed in this section, you can incorporate tailored metrics that align with your business objectives and priorities.

Retrieval metrics

You can use the following retrieval metrics:

  • Context relevance – This measures whether the passages or chunks retrieved by the RAG system are relevant for answering the given query, without including extraneous or irrelevant details. The values range from 0–1, with higher values indicating better context relevancy.
  • Context recall – This evaluates how well the retrieved context matches the annotated answer, treated as the ground truth. It’s computed based on the ground truth answer and the retrieved context. The values range between 0–1, with higher values indicating better performance.
  • Context precision – This measures whether all the truly relevant pieces of information from the given context are ranked highly. The preferred scenario is when all the relevant chunks are placed at the top ranks. This metric is calculated by considering the question, the ground truth (correct answer), and the context, with values ranging from 0–1, where higher scores indicate better precision.
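As a concrete illustration, one common way to compute a context precision-style score (the exact formula varies between evaluation frameworks) is to average precision@k over the ranks that hold a relevant chunk:

```python
def context_precision(relevance):
    """relevance: 0/1 flags for the retrieved chunks, in ranked order.
    Averages precision@k over every rank k that holds a relevant chunk."""
    precisions, hits = [], 0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# The same relevant chunks score higher when ranked first than when ranked last.
print(context_precision([1, 1, 0, 0]))  # 1.0
print(context_precision([0, 0, 1, 1]))  # ~0.417
```

This captures the preference stated above: the score is 1.0 only when all relevant chunks sit at the top ranks.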

Generation metrics

You can use the following generation metrics:

  • Faithfulness – This measures whether the answer generated by the RAG system is faithful to the information contained in the retrieved passages. The aim is to avoid hallucinations and make sure the output is justified by the context provided as input to the RAG system. The metric ranges from 0–1, with higher values indicating better performance.
  • Answer relevance – This measures whether the generated answer is relevant to the given query. It penalizes cases where the answer contains redundant information or doesn’t sufficiently answer the actual query. Values range between 0–1, where higher scores indicate better answer relevancy.
  • Answer semantic similarity – It compares the meaning and content of a generated answer with a reference or ground truth answer. It evaluates how closely the generated answer matches the intended meaning of the ground truth answer. The score ranges from 0–1, with higher scores indicating greater semantic similarity between the two answers. A score of 1 means that the generated answer conveys the same meaning as the ground truth answer, whereas a score of 0 suggests that the two answers have completely different meanings.
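Answer semantic similarity is typically implemented as the cosine similarity between embeddings of the two answers; the short vectors below are stand-ins for real embedding model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Stand-ins for embeddings of a generated answer and a ground truth answer.
generated = [0.2, 0.8, 0.1]
ground_truth = [0.25, 0.75, 0.05]
print(round(cosine_similarity(generated, ground_truth), 3))
```

In practice, the vectors would come from an embedding model such as Amazon Titan Embeddings, and near-identical answers score close to 1.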

Aspects evaluation

Aspects are evaluated as follows:

  • Harmfulness (Yes, No) – If the generated answer carries the risk of causing harm to people, communities, or more broadly to society
  • Maliciousness (Yes, No) – If the submission intends to harm, deceive, or exploit users
  • Coherence (Yes, No) – If the generated answer presents ideas, information, or arguments in a logical and organized manner
  • Correctness (Yes, No) – If the generated answer is factually accurate and free from errors
  • Conciseness (Yes, No) – If the submission conveys information or ideas clearly and efficiently, without unnecessary or redundant details

The RAG Triad proposed by TruLens consists of three distinct assessments, as shown in the following figure: evaluating the relevance of the context, examining the grounding of the information, and assessing the relevance of the answer provided. Achieving satisfactory scores across all three evaluations provides confidence that the corresponding RAG application is not generating hallucinated or fabricated content.

RAG Triad

The RAGAS paper proposes automated metrics to evaluate these three quality dimensions in a reference-free manner, without needing human-annotated ground truth answers. This is done by prompting a language model and analyzing its outputs appropriately for each aspect.

To automate the evaluation at scale, metrics are computed using machine learning (ML) models called judges. Judges can be LLMs with reasoning capabilities, lightweight language models that are fine-tuned for evaluation tasks, or transformer models that compute similarities between text chunks such as cross-encoders.

Metric outcomes

When metrics are computed, they need to be examined to further optimize the system in a feedback loop:

  • Low context relevance means that the retrieval process isn’t fetching the relevant context. Therefore, data parsing, chunk sizes, and embedding models need to be optimized.
  • Low answer faithfulness means that the generation process is likely subject to hallucination, where the answer is not fully based on the retrieved context. In this case, the model choice needs to be revisited or further prompt engineering needs to be done.
  • Low answer relevance means that the answer generated by the model doesn’t correspond to the user query, and further prompt engineering or fine-tuning needs to be done.
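This feedback loop can be encoded as a simple triage step. The metric names follow the post, but the threshold and the suggested actions are illustrative and should be tuned per application:

```python
def triage(metrics, threshold=0.7):
    """Map low metric scores to the RAG components worth revisiting."""
    actions = []
    if metrics.get("context_relevance", 1.0) < threshold:
        actions.append("retrieval: optimize data parsing, chunk sizes, embeddings")
    if metrics.get("faithfulness", 1.0) < threshold:
        actions.append("generation: revisit model choice or prompt engineering")
    if metrics.get("answer_relevance", 1.0) < threshold:
        actions.append("generation: refine prompts or fine-tune the model")
    return actions

print(triage({"context_relevance": 0.45, "faithfulness": 0.9,
              "answer_relevance": 0.8}))
```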

Solution overview

You can use Amazon Bedrock to evaluate your RAG-based applications. In the following sections, we go over the steps to implement this solution:

  1. Set up observability.
  2. Prepare the evaluation dataset.
  3. Choose the metrics and prepare the evaluation prompts.
  4. Aggregate and review the metric results, then optimize the RAG system.

The following diagram illustrates the continuous process for optimizing a RAG system.

RAG Evaluation and Optimization Cycle

Set up observability

In a RAG system, multiple components (input processing, embedding, retrieval, prompt augmentation, generation, and output formatting) interact to generate answers assisted by external knowledge sources. Monitoring arriving user queries, search results, metadata, and component latencies helps developers identify performance bottlenecks, understand system interactions, monitor for issues, and conduct root cause analysis, all of which are essential for maintaining, optimizing, and scaling the RAG system effectively.

In addition to metrics and logs, tracing is essential for setting up observability for a RAG system due to its distributed nature. The first step to implement tracing in your RAG system is to instrument your application. Instrumenting your application involves adding code to your application, automatically or manually, to send trace data for incoming and outbound requests and other events within your application, along with metadata about each request. There are several different instrumentation options you can choose from or combine, based on your particular requirements:

  • Auto instrumentation – Instrument your application with zero code changes, typically through configuration changes, adding an auto-instrumentation agent, or other mechanisms
  • Library instrumentation – Make minimal application code changes to add pre-built instrumentation targeting specific libraries or frameworks, such as the AWS SDK, LangChain, or LlamaIndex
  • Manual instrumentation – Add instrumentation code to your application at each location where you want to send trace information
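As a sketch of the manual option, a decorator can record a span per component call; a real system would export these spans to AWS X-Ray or another tracing backend rather than an in-memory list:

```python
import functools
import time

SPANS = []  # in a real system, spans are exported to a tracing backend

def traced(component):
    """Record a span (component name and duration) around each call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                SPANS.append({"component": component,
                              "duration_s": time.perf_counter() - start})
        return wrapper
    return decorator

@traced("retrieval")
def retrieve(query):
    return ["chunk about " + query]

@traced("generation")
def generate(query, context):
    return f"answer to '{query}' based on {len(context)} chunks"

generate("store hours", retrieve("store hours"))
print([span["component"] for span in SPANS])  # ['retrieval', 'generation']
```

Recording per-component durations like this is what makes it possible to pinpoint whether latency or errors originate in retrieval or generation.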

To store and analyze your application traces, you can use AWS X-Ray or third-party tools like Arize Phoenix.

Prepare the evaluation dataset

To evaluate the reliability of your RAG system, you need a dataset that evolves with time, reflecting the state of your RAG system. Each evaluation record contains at least the first three of the following elements:

  • Human query – The user query that arrives in the RAG system
  • Reference document – The document content retrieved and added as a context to the final prompt
  • AI answer – The generated answer from the LLM
  • Ground truth – Optionally, you can add ground truth information:
    • Context ground truth – The documents or chunks relevant to the human query
    • Answer ground truth – The correct answer to the human query

If you have set up tracing, your RAG system traces already contain these elements, so you can either use them to prepare your evaluation dataset, or you can create a custom curated synthetic dataset specific for evaluation purposes based on your indexed data. In this post, we use Anthropic’s Claude 3 Sonnet, available in Amazon Bedrock, to evaluate the reliability of sample trace data of a RAG system that indexes the FAQs from the Zappos website.
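As an illustration, an evaluation record with the elements above could be modeled as follows. The class and field names are our own, not part of any AWS SDK, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvaluationRecord:
    query: str        # Human query that arrived in the RAG system
    reference: str    # retrieved document added as context to the final prompt
    response: str     # AI answer generated by the LLM
    context_ground_truth: Optional[List[str]] = None  # optional ground truth
    answer_ground_truth: Optional[str] = None         # optional ground truth

# A record could be built from a stored trace or from a curated synthetic dataset.
record = EvaluationRecord(
    query="How do I care for silk?",
    reference="Silk: Dry cleaning may be required. Do not dry silk in a clothes dryer.",
    response="Silk often requires dry cleaning; never use a clothes dryer.",
)
print(record.answer_ground_truth is None)  # True: ground truth is optional
```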

Choose your metrics and prepare the evaluation prompts

Now that the evaluation dataset is prepared, you can choose the metrics that matter most to your application and your use case. In addition to the metrics we’ve discussed, you can create your own metrics to evaluate the aspects that matter to you most. If your evaluation dataset provides answer ground truth, n-gram comparison metrics like ROUGE or embedding-based metrics like BERTScore can be relevant before using an LLM as a judge. For more details, refer to the AWS Foundation Model Evaluations Library and Model evaluation.

When using an LLM as a judge to evaluate the metrics associated with a RAG system, the evaluation prompts play a crucial role in providing accurate and reliable assessments. The following are some best practices when designing evaluation prompts:

  • Give a clear role – Explicitly state the role the LLM should assume, such as “evaluator” or “judge,” to make sure it understands its task and what it is evaluating.
  • Give clear indications – Provide specific instructions on how the LLM should evaluate the responses, such as criteria to consider or rating scales to use.
  • Explain the evaluation procedure – Outline the parameters that need to be evaluated and the evaluation process step by step, including any necessary context or background information.
  • Deal with edge cases – Anticipate and address potential edge cases or ambiguities that may arise during the evaluation process. For example, determine whether an answer based on irrelevant context should be evaluated as factual or hallucinated.

In this post, we show how to create three custom binary metrics that don’t need ground truth data and that are inspired by some of the metrics we’ve discussed: faithfulness, context relevance, and answer relevance. We created three evaluation prompts.

The following is our faithfulness evaluation prompt template:

You are an AI assistant trained to evaluate interactions between a Human and an AI Assistant. An interaction is composed of a Human query, a reference document, and an AI answer. Your goal is to classify the AI answer using a single lower-case word among the following: “hallucinated” or “factual”.

“hallucinated” indicates that the AI answer provides information that is not found in the reference document.

“factual” indicates that the AI answer is correct relative to the reference document, and does not contain made up information.

Here is the interaction that needs to be evaluated:

Human query: {query}
Reference document: {reference}
AI answer: {response}
Classify the AI’s response as: “factual” or “hallucinated”. Skip the preamble or explanation, and provide the classification.

We also created the following context relevance prompt template:

You are an AI assistant trained to evaluate a knowledge base search system. A search request is composed of a Human query and a reference document. Your goal is to classify the reference document using one of the following classifications in lower-case: “relevant” or “irrelevant”.

“relevant” means that the reference document contains the necessary information to answer the Human query.

“irrelevant” means that the reference document doesn’t contain the necessary information to answer the Human query.

Here is the search request that needs to be evaluated:

Human query: {query}
Reference document: {reference}

Classify the reference document as: “relevant” or “irrelevant”. Skip any preamble or explanation, and provide the classification.

The following is our answer relevance prompt template:

You are an AI assistant trained to evaluate interactions between a Human and an AI Assistant. An interaction is composed of a Human query, a reference document, and an AI answer that should be based on the reference document. Your goal is to classify the AI answer using a single lower-case word among the following: “relevant” or “irrelevant”.

“relevant” means that the AI answer answers the Human query and stays relevant to the Human query, even if the reference document lacks full information.

“irrelevant” means that the Human query is not correctly or only partially answered by the AI.

Here is the interaction that needs to be evaluated:

Human query: {query}
Reference document: {reference}
AI answer: {response}

Classify the AI’s response as: “relevant” or “irrelevant”. Skip the preamble or explanation, and provide the classification.
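To show how one of these templates could be used with an LLM as a judge, here is a sketch. The abridged template text, the `judge` helper, and the Claude 3 Sonnet model ID are our assumptions; `bedrock_client` is expected to be a boto3 `bedrock-runtime` client, and the request body follows the Anthropic Messages API format used by Claude 3 models on Amazon Bedrock.

```python
import json

# Abridged version of the faithfulness template from this post; use the full
# template text in practice. The placeholder names match the post.
FAITHFULNESS_TEMPLATE = (
    "You are an AI assistant trained to evaluate interactions between a Human "
    "and an AI Assistant. Classify the AI answer using a single lower-case "
    'word: "hallucinated" or "factual".\n'
    "Human query: {query}\n"
    "Reference document: {reference}\n"
    "AI answer: {response}\n"
    "Skip the preamble or explanation, and provide the classification."
)

def build_eval_prompt(template, query, reference, response):
    """Fill one evaluation record into a prompt template."""
    return template.format(query=query, reference=reference, response=response)

def judge(bedrock_client, template, query, reference, response,
          model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Ask the judge model for a one-word label (assumed model ID)."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "messages": [{
            "role": "user",
            "content": build_eval_prompt(template, query, reference, response),
        }],
    })
    result = bedrock_client.invoke_model(modelId=model_id, body=body)
    return json.loads(result["body"].read())["content"][0]["text"].strip().lower()
```

The same `judge` helper works for the context relevance and answer relevance templates; only the template text and the expected labels change.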

Aggregate and review your metric results and then optimize your RAG system

After you obtain the evaluation results, you can store the metrics in your observability system alongside the stored traces to identify areas for improvement based on the metric values or their aggregates.
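For binary metrics such as the ones in this post, aggregation amounts to computing the percentage of records with the positive label. The following sketch uses an illustrative record layout of our own:

```python
def aggregate(records, metric, positive_label):
    """Percentage of records whose metric equals the positive label, rounded."""
    values = [record[metric] for record in records]
    return round(100 * values.count(positive_label) / len(values))

# Three illustrative records shaped like rows of an evaluation results table
records = [
    {"faithfulness": "factual", "context_relevance": "relevant", "answer_relevance": "relevant"},
    {"faithfulness": "hallucinated", "context_relevance": "irrelevant", "answer_relevance": "relevant"},
    {"faithfulness": "factual", "context_relevance": "relevant", "answer_relevance": "relevant"},
]
print(aggregate(records, "context_relevance", "relevant"))  # 67
print(aggregate(records, "answer_relevance", "relevant"))   # 100
```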

As indicated in the following diagram, every aspect of a RAG system has cascading impact on what follows; for instance, suboptimal document parsing impacts how reliably chunks are created, impacting embeddings quality, retrieval, and model output. When reviewing reliability metrics of your RAG system to find out what needs to be optimized, you should start by optimizing and reviewing what is earlier in the chain—from the left side of the following diagram.

RAG Optimization Steps

In the following table, we present three of the 15 queries we used from the Zappos FAQs, together with the corresponding LLM answers, the reference documents, and the calculated metrics for faithfulness, context relevance, and answer relevance.

Question Retrieved Document Answer Faithfulness Context Relevance Answer Relevance
What are the specific care instructions for different fabric types like cashmere, silk, and wool?
Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. Garments usually look best when professionally dry-cleaned.

Cashmere: Generally, hand wash in cold water with a mild detergent. Do not wring or twist. Lay flat to dry.

Silk: Dry cleaning is often recommended. Some silks can be hand- or machine-washed, but avoid using a clothes dryer.

Wool: Check the label for specific instructions. Many wools can be machine-washed on a gentle cycle with cold water. Avoid high temperatures, as they can cause the wool to shrink or felt.

hallucinated
irrelevant
relevant
How can I identify if an email claiming to be from Zappos is legitimate or a phishing attempt?
recommends you update your online passwords, including the password to your Zappos.com Account, and report the email to the Federal Trade Commission (FTC). Visit the FTC website here to report the email and find more information on how to identify and prevent phishing attacks.
To identify if an email claiming to be from Zappos is legitimate or a phishing attempt, you should check for the following:

1. Look for the correct sender email address. Zappos emails should come from an address ending in “@zappos.com”.

2. Check for spelling and grammatical errors in the email. Phishing emails often contain mistakes.

3. Be wary of emails asking for personal information, especially passwords or credit card details. Zappos will never ask for this information via email.

4. Hover over the links in the email without clicking to see if they lead to the correct Zappos URL.

5. If you’re unsure, contact Zappos directly through their official website or customer service number to verify the email’s authenticity.

Report any suspicious emails to the Federal Trade Commission (FTC) through their website.
hallucinated
relevant
relevant
What is Amazon Pay, and how do I access my transaction history with it on Zappos?
How do I find my Amazon Pay transactions?
Amazon Pay is a digital wallet that allows you to make payments online using the information stored in your Amazon account, including your shipping addresses, payment methods, and default delivery settings. To view your Amazon Pay transaction history on Zappos or any other merchant site, you should log into your Amazon account. Once logged in, navigate to the “Your Orders” page, where you can view all of your past and present orders, including those made using Amazon Pay. If you’re unable to find a specific transaction, you can use the search bar to look up the order number or the date of the transaction.
hallucinated
irrelevant
relevant
. . .
Aggregates: Faithfulness 87%, Context Relevance 67%, Answer Relevance 100%

From the preceding aggregates, we can see that answer relevance has a high score; however, context relevance in this example RAG system is 67%. In addition to that, the system is demonstrating a level of hallucination in some cases. Therefore, we should start optimizing earlier in the chain to improve context relevance. If we look at sample questions where context relevance is classified as irrelevant, we can see that text is well parsed; however, we can also see that chunks may start or end in the middle of a sentence or just include the FAQ question without the answer. Therefore, we start by optimizing the chunking method.

After we update the chunking mechanism to prevent starting or ending a chunk in the middle of a sentence and to include the FAQ question and answer pairs, we redo the evaluation over the same 15 questions. The following table shows a sample of our results.

Question Retrieved Document Answer Faithfulness Context Relevance Answer Relevance
What are the specific care instructions for different fabric types like cashmere, silk, and wool?

How do I care for XYZ fabric?

Acetate: While most items made of acetate are dry-clean only, some may be hand-washed in lukewarm water. Do not twist the fabric. Iron while the fabric is damp. Press the inside of the garment using a low-temperature setting. Use a pressing cloth when ironing the outside.

Acrylic knit: Most acrylic knit garments can be machine-washed. Read the label, and check for the proper drying option. Some knits retain their shapes best if reshaped and dried flat.

Cashmere: Check the care label before laundering cashmere. To hand-wash, use a mild detergent or shampoo. Gently squeeze the water through the garment, then rinse until the water runs clear. Do not wring or twist. Squeeze out excess water. To dry, lay flat on a towel, away from sunlight or heat.

Cotton: Cotton holds up well to home laundering. Remove the garment from the dryer promptly to reduce wrinkling. Press using spray starch for the crispness of a laundered shirt.

Cotton blend: Dry cotton-blend garments using your dryer’s permanent-press or low-heat cycle, and remove immediately to reduce wrinkling. Touch up with a steam iron; starch for a professionally laundered look.

Linen: Most linen garments need to be dry-cleaned or hand-washed. Follow the instructions on the care label. To touch up or press, use a steam iron on a linen setting for a crisp look.

Nylon: Machine-wash nylon garments in warm water. Use a low-temperature setting when tumble-drying, and include a dryer sheet to reduce static electricity. Use a warm iron to press, if necessary.

Polyester: Read the label. Usually polyester items can be machine-washed (cool) and dried (low). Check the label to see if air-drying is recommended. Touch up with a cool – never hot – iron, if necessary.

Rayon: Make sure to read the care label for rayon clothing. Launder in cool water and a mild detergent. When machine-washing, use the gentle cycle and promptly remove garments after washing. When hand-washing, do not wring or twist the garment. To dry, roll the item in a towel to squeeze out excess water. Lay flat to dry. Iron when still damp and with the garment inside out to prevent the fabric from becoming shiny. Use a pressing cloth when ironing the right side of the garment. Pressing cloths can be purchased, or use something as simple as a piece of white cotton fabric.

Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. Garments usually look best when professionally dry-cleaned.

Spandex: Use warm water when hand- or machine-washing items made with spandex. Line-dry or use a low temperature setting when machine-drying to preserve the elasticity of the garment.

Wool knit: Typically wool knits need to be dry-cleaned, but check the label. If hand-washable, use cool water and a detergent for fine washables. Squeeze out excess water, then reshape and dry flat.

Wool (lined): Lined garments generally look best when professionally dry-cleaned once a season. Between dry cleanings, remove surface soil with a brush or damp cloth. Refresh the item by hanging it from a padded hanger in a steamy bathroom; moisture will help garment shed wrinkles. When touching up with an iron, use steam in an up-and-down motion (rather than sliding the iron along the fabric).

Cashmere: Check the care label before laundering. To hand-wash, use a mild detergent or shampoo. Gently squeeze the water through the garment, then rinse until the water runs clear. Do not wring or twist. Squeeze out excess water and lay flat to dry, away from sunlight or heat.

Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. For hand-washing, use a mild detergent and cool water. Squeeze out excess water and lay flat to dry. Iron when still damp and with the garment inside out to prevent the fabric from becoming shiny.

Wool: The care instructions for wool depend on whether it’s a knit or not. For wool knits, they typically need to be dry-cleaned, but check the label. If hand-washable, use cool water and a detergent for fine washables. Squeeze out excess water, then reshape and dry flat. For lined wool garments, they generally look best when professionally dry-cleaned once a season. Between dry cleanings, remove surface soil with a brush or damp cloth and refresh in a steamy bathroom. When touching up with an iron, use steam in an up-and-down motion.

factual
relevant
relevant
How can I identify if an email claiming to be from Zappos is legitimate or a phishing attempt?

How do I identify whether an Email is or is not from Zappos?

Suspicious or fraudulent emails not from Zappos.com may contain:

A reference to a payment source that Zappos.com does not accept, such as PayPal. Zappos.com currently only accepts credit cards, debit cards, and Zappos.com Gift Cards for payment.
A request to update payment information that is not linked to an order you placed on Zappos.com or your Zappos.com account.
Links, including “Here” links, to websites not affiliated with Zappos.com. You should be able to hover over a link without clicking it to show the website address associated with the link. If the link shows a site other than Zappos.com or a site that is not a legitimate Zappos domain, then it is likely phishing.
An order confirmation for an item you didn’t purchase or an attachment to an order confirmation. If you are a Zappos.com customer, go to your Zappos Account and view Your Order History to see if there is an order that matches the details in the correspondence. If you are not a Zappos.com customer, feel free to contact us to verify whether the email is associated with an actual order.
Typos, grammatical errors, or unconventional formatting.
A generic email address with Zappos in the email, for example advert@gmail.com or an email address from a non-US country or site not associated with Zappos. Zappos will never send communications from personal or generic email accounts. Order-related emails typically come from cs-noreply@zappos.com or another zappos.com email address.
Forged email addresses to make it look like the email is coming from Zappos.com. Your email program may allow you to hover over or click on an email address to view its original source.

What should I do if I believe I’ve received a phishing email or if I responded to what may be a phishing email?

If you believe you received a phishing email or responded to what may be a phishing email, Zappos recommends you update your online passwords, including the password to your Zappos.com Account, and report the email to the Federal Trade Commission (FTC). Visit the FTC website here to report the email and find more information on how to identify and prevent phishing attacks.

Payment Information

Do you accept international credit cards?

Currently Zappos.com can accept international credit cards but we can only ship to an address within the United States and its territories. Please note that the credit card must be issued and contain a logo from either Visa, Mastercard, Discover, or American Express. On the checkout page, there will be a billing and shipping section. Please do the following:

For the billing information

Enter your street address on Address Line 1. Enter your City, County or Province, and Postal Code on Address Line 2. Enter your Country for the city. Enter AA for the state. Enter 11111 for the zip code.

For the shipping information

Please enter the correct information in the appropriate field.

Orders placed with the use of an International Credit Card may take up to 72 hours to finalize prior to shipping. We apologize for any inconvenience this may cause.

To identify if an email claiming to be from Zappos is legitimate or a phishing attempt, check for the following:

1. Zappos only accepts certain payment methods: credit cards, debit cards, and Zappos.com Gift Cards.
2. No unsolicited requests for payment information or updates.
3. Links lead to Zappos.com or legitimate Zappos domains.
4. Match the order details in the email with your Zappos Account’s order history.
5. Look for typos, grammatical errors, or unconventional formatting.
6. Check for generic email addresses not affiliated with Zappos.
7. Verify the sender’s email address is from cs-noreply@zappos.com or another zappos.com email address.
8. Be wary of forged email addresses.

If you believe you received a phishing email or responded to what may be a phishing email, update your passwords and report the email to the Federal Trade Commission (FTC).

factual
relevant
relevant
What is Amazon Pay, and how do I access my transaction history with it on Zappos?

How do I find my Amazon Pay transactions?

Go to pay.amazon.com and sign in as a “shopper” using your Amazon username and password. One-time purchases are shown on the “Activity” tab, and subscription agreements are on the “Merchant agreements” tab. You can also view your transactions on Amazon.com. If you are on a desktop, your Amazon Pay transactions can be found under “Amazon Orders” within the Amazon Pay tab or in “Your Account” under “Transactions”. If you are in the Amazon app, all transactions are listed together under “Your Orders” or in “Your Account”.

Amazon Pay is a digital wallet that allows you to make payments online using your Amazon account. However, the context does not provide information on how to access your transaction history with Amazon Pay on Zappos. The instructions given in the context are for accessing Amazon Pay transactions on Amazon’s own platform.
factual
relevant
relevant
. . .
Aggregates: Faithfulness 100%, Context Relevance 93%, Answer Relevance 100%

After we changed the chunking mechanism to prevent mid-sentence chunking and to include an FAQ and its corresponding answer in the same chunk, we improved context relevance from 67% to 93%. We can also see that improving context relevance resolved the previous hallucinations without even changing the prompt template. We can iterate on the optimization process by further investigating the questions that still have irrelevant retrievals, adjusting the indexing or the retrieval mechanism, retrieving a higher number of chunks, or using hybrid search to combine lexical search with semantic search.
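The chunking change described above could be sketched as follows; the exact mechanism depends on your indexing pipeline, and this hypothetical helper simply packs whole FAQ question/answer pairs into chunks so that no chunk starts or ends mid-sentence and every question stays with its answer.

```python
def chunk_faq(faq_pairs, max_chars=1000):
    """Greedily pack whole question/answer pairs into chunks.

    A pair is never split, so no chunk starts or ends mid-sentence and every
    question stays with its answer. Pairs longer than max_chars become their
    own chunk.
    """
    chunks, current = [], ""
    for question, answer in faq_pairs:
        pair = f"{question}\n{answer}"
        if current and len(current) + len(pair) + 2 > max_chars:
            chunks.append(current)
            current = pair
        else:
            current = f"{current}\n\n{pair}" if current else pair
    if current:
        chunks.append(current)
    return chunks

# Illustrative FAQ pairs (paraphrased, not the actual Zappos content)
pairs = [("How do I care for silk?", "Dry cleaning may be required."),
         ("Do you accept international credit cards?", "Yes, with a US shipping address.")]
print(len(chunk_faq(pairs, max_chars=60)))  # 2
```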

Sample references

To further explore and experiment with different RAG evaluation techniques, you can delve deeper into the sample notebooks available in the Knowledge Bases section of the Amazon Bedrock Samples GitHub repo.

Conclusion

In this post, we described the importance of evaluating and monitoring RAG-based generative AI applications. We showcased the metrics and frameworks for RAG system evaluation and observability, then we went over how you can use FMs in Amazon Bedrock to compute RAG reliability metrics. It’s important to choose the metrics that matter most to your organization and that impact the aspect or configuration you want to optimize.

If RAG is not sufficient for your use case, you can opt for fine-tuning or continued pre-training in Amazon Bedrock or Amazon SageMaker to build custom models that are specific to your domain, organization, and use case. Most importantly, keeping a human in the loop is essential to align AI systems, as well as their evaluation mechanisms, with their intended uses and objectives.


About the Authors

Oussama Maxime Kandakji is a Senior Solutions Architect at AWS focusing on data science and engineering. He works with enterprise customers on solving business challenges and building innovative functionalities on top of AWS. He enjoys contributing to open source and working with data.

Ioan Catana is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He helps customers develop and scale their ML solutions and generative AI applications in the AWS Cloud. Ioan has over 20 years of experience, mostly in software architecture design and cloud engineering.

Read More

Connect to Amazon services using AWS PrivateLink in Amazon SageMaker

Connect to Amazon services using AWS PrivateLink in Amazon SageMaker

AWS customers that implement secure development environments often have to restrict outbound and inbound internet traffic. This becomes increasingly important with artificial intelligence (AI) development because of the data assets that need to be protected. Transmitting data across the internet is not secure enough for highly sensitive data. Therefore, accessing AWS services without leaving the AWS network can be a secure workflow.

One of the ways you can secure AI development is by creating Amazon SageMaker instances within a virtual private cloud (VPC) with direct internet access disabled. This isolates the instance from the internet and prevents API calls to other AWS services. This presents a challenge for developers who are building architectures for production in which many AWS services need to function together.

In this post, we present a solution for configuring SageMaker notebook instances to connect to Amazon Bedrock and other AWS services with the use of AWS PrivateLink and Amazon Elastic Compute Cloud (Amazon EC2) security groups.

Solution overview

The following example architecture shows a SageMaker instance connecting to various services. The SageMaker instance is isolated from the internet but is still able to access AWS services through PrivateLink. Notice that the connection to Amazon S3 is through a gateway VPC endpoint. For more information, refer to Gateway endpoints in the Amazon VPC documentation.

overall architecture for developing in a VPC environment

In the following sections, we show how to configure this on the AWS Management Console.

Create security groups for outbound and inbound endpoint access

First, you have to create the security groups that will be attached to the VPC endpoints and the SageMaker instance. You create the security groups before creating a SageMaker instance because after the instance has been created, the security group configuration can’t be changed.

You create two groups, one for outbound and another for inbound. Complete the following steps:

1. On the Amazon EC2 console, choose Security Groups in the navigation pane.

2. Choose Create security group.

3. For Security group name, enter a name (for example, inbound-sagemaker).

4. For Description, enter a description.

5. For VPC, choose your VPC.

create a security group for developing in a secure environment in vpc SageMaker

6. Note the security group ID to use in the next steps.

7. Choose Create security group again to create the outbound security group.

8. For Security group name, enter a name (for example, outbound-sagemaker).

9. For Description, enter a description.

10. For VPC, choose the same VPC as the inbound rule.

11. In the Outbound rules section, choose Add rule.

12. Add an outbound rule with the inbound security group ID as the destination using HTTPS as the type.

13. Note the outbound security group ID to use in the next step.

configure security group for connect to AWS services using AWS PrivateLink

14. Return to the inbound security group and add an inbound rule of HTTPS type with the destination set to the outbound security group ID.

set outbound rule for developing in a secure environment
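If you prefer to script the preceding steps, a sketch using the boto3 EC2 API could look like the following. The group names are the examples from this post, the `ec2` parameter is assumed to be a boto3 EC2 client (for example, `boto3.client('ec2')`), and note that newly created security groups also carry a default allow-all egress rule that you may want to revoke separately.

```python
def https_permission(peer_sg_id):
    """An HTTPS-only rule whose source/destination is another security group."""
    return [{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": peer_sg_id}],
    }]

def create_sagemaker_security_groups(ec2, vpc_id):
    """Create the inbound/outbound pair of security groups from this post.

    ec2 is assumed to be a boto3 EC2 client, e.g. boto3.client('ec2').
    """
    inbound = ec2.create_security_group(
        GroupName="inbound-sagemaker", Description="Inbound to VPC endpoints",
        VpcId=vpc_id)["GroupId"]
    outbound = ec2.create_security_group(
        GroupName="outbound-sagemaker", Description="Outbound from SageMaker",
        VpcId=vpc_id)["GroupId"]
    # Steps 11-13: the outbound group allows HTTPS only toward the inbound group
    ec2.authorize_security_group_egress(
        GroupId=outbound, IpPermissions=https_permission(inbound))
    # Step 14: the inbound group allows HTTPS only from the outbound group
    ec2.authorize_security_group_ingress(
        GroupId=inbound, IpPermissions=https_permission(outbound))
    return inbound, outbound
```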

Create a SageMaker instance with the outbound security group

You now create a SageMaker instance with the network configuration shown in the following screenshot. It’s important to choose the same VPC that you used to create the inbound and outbound security groups. You then choose the outbound security group you created earlier.

configure network for developing in secure environment

Create an Interface VPC endpoint

In this step, you create an interface VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC). The endpoint uses PrivateLink, which allows calls from your SageMaker instance to AWS services without traversing the public internet.

1. On the Amazon VPC console, choose Endpoints in the navigation pane.

2. Choose Create endpoint.

3. For Name tag, enter a name (for example, bedrock-link).

4. For Service category, select AWS services.

5. For Services, search for and choose com.amazonaws.<region>.bedrock-runtime.

create interface endpoint for developing in secure environment

6. Set the VPC to the same one you’ve been working with.

7. Specify the subnet(s).

A subnet is a range of IP addresses within a VPC. If you don’t know what subnet to specify, any subnet will work. Otherwise, specify the subnet that is required by any security requirements from your cloud security team.

8. Set the security group to the inbound security group you created earlier.

After you create the endpoint, it may take some time to become available.

Repeat these steps for every service that you need for your workflow. The following screenshots show examples of services that you can create interface VPC endpoints for, such as Amazon Simple Storage Service (Amazon S3), Amazon Kendra, and AWS Lambda. AWS PrivateLink enables you to connect privately to many AWS services; for a current list, refer to the AWS PrivateLink documentation.

select service for connecting to AWS services with AWS PrivateLink
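The same endpoints can be created programmatically. The following sketch only builds the `create_vpc_endpoint` parameters; the IDs are placeholders of ours, and enabling private DNS is an assumption that lets you keep using the default service URL.

```python
def endpoint_params(vpc_id, subnet_ids, inbound_sg_id, service, region="us-west-2"):
    """Build create_vpc_endpoint parameters for one interface endpoint."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": subnet_ids,
        "SecurityGroupIds": [inbound_sg_id],  # the inbound group created earlier
        "PrivateDnsEnabled": True,  # keep using the default service URL
    }

# With a boto3 EC2 client you would then call, for each service you need:
#   ec2.create_vpc_endpoint(**endpoint_params(vpc_id, subnets, sg_id, "bedrock-runtime"))
params = endpoint_params("vpc-EXAMPLE", ["subnet-EXAMPLE"], "sg-EXAMPLE", "bedrock-runtime")
print(params["ServiceName"])  # com.amazonaws.us-west-2.bedrock-runtime
```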

Test the connection

You can test the connection to Amazon Bedrock using a simple Python API call. The following is a code snippet that invokes the Amazon Bedrock model:

import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime')
prompt = """
Human: What type of sharks are there?

Assistant:"""

body = json.dumps({
    "prompt": prompt,
    "max_tokens_to_sample": 4000,
    "temperature": 0.1,
    "top_p": 0.9,
})

modelId = 'anthropic.claude-instant-v1'
accept = 'application/json'
contentType = 'application/json'

response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())

print(response_body.get('completion'))

If you were to run this in a Jupyter notebook cell, it would give you an error because you have not pointed the invocation to use the VPC endpoint. You do this by adding an endpoint URL to the client instantiation:

bedrock = boto3.client(
    service_name='bedrock-runtime',
    endpoint_url='https://vpce-0e452bc86b1f87c50-5xltzdpo.bedrock-runtime.us-west-2.vpce.amazonaws.com'
)

To find the endpoint URL, go back to the VPC endpoint that you created in the previous step and look for DNS names, illustrated in the following screenshot. Private DNS is the best option because it is the same as the public DNS name, which means you don’t have to change anything to use the private connection. The next best option is the Regional DNS name, which is the first option under DNS names. Both options allow your traffic to fail over to other healthy Availability Zones if the current Availability Zone is impaired.

find the endpoint URL for the interface endpoint

Clean up

To clean up your resources, complete the following steps:

1. On the SageMaker console, navigate to the notebook configuration page.

2. Stop the instance, then choose Delete to delete the instance.

delete sagemaker notebook endpoint for clean up

3. On the Amazon EC2 console, navigate to the inbound security group’s detail page.

4. On the Actions menu, choose Delete security groups.

5. Repeat these steps for the outbound security group.

delete security group for clean up

6. On the Amazon VPC console, navigate to the VPC endpoint’s details page.

7. On the Actions menu, choose Delete.

8. Repeat these steps for every endpoint you created as part of this post.

delete vpc endpoint for clean up

Conclusion

In this post, we showed how to set up VPC endpoints and security groups to allow SageMaker to connect to Amazon Bedrock. When a SageMaker instance has restricted internet access, you can still develop and connect to other AWS services through the use of AWS PrivateLink. This post showed how to connect to Amazon Bedrock from an isolated SageMaker instance, but you can replicate the steps for other services.

We encourage you to get started developing AI applications on AWS. To learn more, visit Amazon SageMaker, Amazon Bedrock, and AWS PrivateLink. Happy coding!


About the Author

Francisco Calderon is a Data Scientist at the AWS Generative AI Innovation Center. As a member of the GenAI Innovation Center, he helps solve critical business problems for AWS customers using the latest technology in Generative AI. In his spare time, Francisco likes to play music and guitar, play soccer with his daughters, and enjoy time with his family.

Sungmin Hong is an Applied Scientist at AWS Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, traveling and reading.

Yash Shah is a Science Manager in the AWS Generative AI Innovation Center. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Anila Joshi has more than a decade of experience building AI solutions. As an Applied Science Manager at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and guides customers to strategically chart a course into the future of AI.

Read More

Maximize your Amazon Translate architecture using strategic caching layers

Maximize your Amazon Translate architecture using strategic caching layers

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Amazon Translate supports 75 languages and 5,550 language pairs. For the latest list, see the Amazon Translate Developer Guide. A key benefit of Amazon Translate is its speed and scalability: it can translate a large body of content or text passages in batch mode or translate content in real time through API calls. This helps enterprises get fast and accurate translations across massive volumes of content, including product listings, support articles, marketing collateral, and technical documentation. When content sets have phrases or sentences that are often repeated, you can optimize cost by implementing a write-through caching layer. For example, product descriptions contain many recurring terms and specifications, which is where a translation cache can significantly reduce costs. The caching layer stores source content and its translated text. Then, when the same source content needs to be translated again, the cached translation is simply reused instead of paying for a brand-new translation.

In this post, we explain how setting up a cache for frequently accessed translations can benefit organizations that need scalable, multi-language translation across large volumes of content. You’ll learn how to build a simple caching mechanism for Amazon Translate to accelerate turnaround times.

Solution overview

The caching solution uses Amazon DynamoDB to store translations from Amazon Translate. DynamoDB functions as the cache layer. When a translation is required, the application code first checks the cache—the DynamoDB table—to see if the translation is already cached. If a cache hit occurs, the stored translation is read from DynamoDB with no need to call Amazon Translate again.

If the translation isn’t cached in DynamoDB (a cache miss), the Amazon Translate API is called to perform the translation. The source text is passed to Amazon Translate, the translated result is returned, and the translation is stored in DynamoDB, populating the cache for the next time that translation is requested.

For this blog post, we use Amazon API Gateway as a REST API for translation that integrates with AWS Lambda to perform the backend logic. An Amazon Cognito user pool is used to control who can access your translation REST API. You can also use other mechanisms to control authentication and authorization to API Gateway, based on your use case.

Amazon Translate caching architecture

  1. When a new translation is needed, the user or application makes a request to the translation REST API.
  2. Amazon Cognito verifies the identity token in the request to grant access to the translation REST API.
  3. When new content comes in for translation, API Gateway invokes the Lambda function, which checks the Amazon DynamoDB table for an existing translation.
  4. If a match is found, the translation is retrieved from DynamoDB.
  5. If no match is found, the content is sent to Amazon Translate to perform a custom translation using parallel data. The translated content is then stored in DynamoDB along with a new entry for hit rate percentage.

These high-value translations are periodically post-edited by human translators and then added as parallel data for machine translation. This improves the quality of future translations performed by Amazon Translate.

We will use a simple schema in DynamoDB to store the cache entries. Each item will contain the following attributes:

  • src_text: The original source text
  • target_locale: The target language to translate to
  • translated_text: The translated text
  • src_locale: The original source language
  • hash: The primary key of the table

The primary key will be constructed from the src_locale, target_locale, and src_text to uniquely identify cache entries. When retrieving translations, items will be looked up by their primary key.
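The post doesn’t show how the hash attribute is derived. The following is a minimal sketch, assuming a SHA-256 digest over the three identifying attributes joined with a delimiter; the actual repository may use a different hashing scheme.

```python
import hashlib

def make_cache_key(src_locale: str, target_locale: str, src_text: str) -> str:
    """Build a deterministic DynamoDB partition key for a cache entry.

    The key combines the source locale, target locale, and source text,
    so the same sentence translated into two different languages produces
    two distinct cache entries.
    """
    raw = f"{src_locale}|{target_locale}|{src_text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

# Identical inputs always map to the same key (a cache hit), while any
# change to the text or locales yields a different key (a cache miss).
key_en_fr = make_cache_key("en", "fr", "Hello world.")
key_en_de = make_cache_key("en", "de", "Hello world.")
```

Because the digest is deterministic, looking up a translation is a single `get_item` call on the partition key, with no scan or query needed.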

Prerequisites

To deploy the solution, you need the following:

  1. An AWS account. If you don’t already have an AWS account, you can create one.
  2. AWS Identity and Access Management (IAM) permissions to launch AWS CloudFormation templates that create IAM roles.
  3. The AWS CLI installed.
  4. The jq tool installed.
  5. The AWS Cloud Development Kit (AWS CDK) installed. See Getting started with the AWS CDK.
  6. Postman installed and configured on your computer.

Deploy the solution with AWS CDK

We will use AWS CDK to deploy the DynamoDB table for caching translations. CDK allows defining the infrastructure through a familiar programming language such as Python.

  1. Clone the repo from GitHub.
    git clone https://github.com/aws-samples/maximize-translate-architecture-strategic-caching

  2. Install the Python dependencies listed in requirements.txt:
    python3 -m pip install -r requirements.txt

  3. Open the app.py file and replace the AWS account number and AWS Region with yours.
  4. To verify that the AWS CDK is bootstrapped, run cdk bootstrap from the root of the repository:
cdk bootstrap
⏳ Bootstrapping environment aws://<acct#>/<region>...
Trusted accounts for deployment: (none)
Trusted accounts for lookup: (none)
Using default execution policy of 'arn:aws:iam::aws:policy/AdministratorAccess'.
Pass '--cloudformation-execution-policies' to customize.
✅ Environment aws://<acct#>/<region> bootstrapped (no changes).
  5. Define your CDK stack to add the DynamoDB and Lambda resources. The DynamoDB table and Lambda function are defined as follows:
    • This creates a DynamoDB table with hash as the partition key. Because the TRANSLATION_CACHE table is schemaless, you don’t have to define other attributes in advance. It also creates a Lambda function with Python as the runtime.
table = ddb.Table(
    self, 'TRANSLATION_CACHE',
    table_name='TRANSLATION_CACHE',
    partition_key={'name': 'hash', 'type': ddb.AttributeType.STRING},
    removal_policy=RemovalPolicy.DESTROY
)

self._handler = _lambda.Function(
    self, 'GetTranslationHandler',
    runtime=_lambda.Runtime.PYTHON_3_10,
    handler='get_translation.handler',
    code=_lambda.Code.from_asset('lambda'),
    environment={
        'TRANSLATION_CACHE_TABLE_NAME': table.table_name,
    }
)
    • The Lambda function is defined such that it:
      • Parses the request body JSON into a Python dictionary.
      • Extracts the source locale, target locale, and input text from the request.
      • Gets the DynamoDB table name to use for a translation cache from environment variables.
      • Calls generate_translations_with_cache() to translate the text, passing the locales, text, and DynamoDB table name.
      • Returns a 200 response with the translations and processing time in the body.
import json
import os
import time

from botocore.exceptions import ClientError


def handler(event, context):

    print('request: {}'.format(json.dumps(event)))

    request = json.loads(event['body'])
    print("request", request)

    src_locale = request['src_locale']
    target_locale = request['target_locale']
    input_text = request['input_text']
    table_name = os.environ['TRANSLATION_CACHE_TABLE_NAME']

    if table_name == "":
        print("Defaulting table name")
        table_name = "TRANSLATION_CACHE"

    try:
        start = time.perf_counter()
        translations = generate_translations_with_cache(src_locale, target_locale, input_text, table_name)
        end = time.perf_counter()
        time_diff = (end - start)

        translations["processing_seconds"] = time_diff

        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json'
            },
            'body': json.dumps(translations)
        }

    except ClientError as error:

        error = {"error_text": error.response['Error']['Code']}
        return {
            'statusCode': 500,
            'headers': {
                'Content-Type': 'application/json'
            },
            'body': json.dumps(error)
        }

    • The generate_translations_with_cache function divides the input text into separate sentences by splitting on a period (“.”) symbol. It stores each sentence as a separate entry in the DynamoDB table along with its translation. This segmentation into sentences is done so that cached translations can be reused for repeating sentences.
    • In summary, it’s a Lambda function that accepts a translation request, translates the text using a cache, and returns the result with timing information. It uses DynamoDB to cache translations for better performance.
  6. You can deploy the stack by changing the working directory to the root of the repository and running the following command:
    cdk deploy

Considerations

Here are some additional considerations when implementing translation caching:

  • Eviction policy: An additional attribute can indicate when a cache entry expires; a separate process can then evict expired entries.
  • Cache sizing: Determine expected cache size and provision DynamoDB throughput accordingly. Start with on-demand capacity if usage is unpredictable.
  • Cost optimization: Balance caching costs with savings from reducing Amazon Translate usage. Use a short DynamoDB Time-to-Live (TTL) and limit the cache size to minimize overhead.
  • Sensitive information: DynamoDB encrypts all data at rest by default. If cached translations contain sensitive data, you can grant access to authorized users only. You can also choose not to cache data that contains sensitive information.

Customizing translations with parallel data

The translations generated in the translations table can be human-reviewed and used as parallel data to customize the translations. Parallel data consists of examples that show how you want segments of text to be translated. It includes a collection of textual examples in a source language; for each example, it contains the desired translation output in one or more target languages.

This is a great approach for most use cases, but some outliers might require light post-editing by human teams. The post-editing process can help you better understand the needs of your customers by capturing the nuances of local language that can be lost in translation. For businesses and organizations that want to augment the output of Amazon Translate (and other Amazon artificial intelligence (AI) services) with human intelligence, Amazon Augmented AI (Amazon A2I) provides a managed approach to do so. See Designing human review workflows with Amazon Translate and Amazon Augmented AI for more information.

When you add parallel data to a batch translation job, you create an Active Custom Translation job. When you run these jobs, Amazon Translate uses your parallel data at runtime to produce customized machine translation output. It adapts the translation to reflect the style, tone, and word choices that it finds in your parallel data. With parallel data, you can tailor your translations for terms or phrases that are unique to a specific domain, such as life sciences, law, or finance. For more information, see Customizing your translations with parallel data.
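A batch translation job that uses parallel data is started through the Amazon Translate `start_text_translation_job` API. The sketch below is a hedged illustration: the bucket paths, IAM role ARN, and parallel data name are placeholders, not values from this post.

```python
def build_act_job_request(job_name, input_s3, output_s3, role_arn,
                          src_lang, target_langs, parallel_data):
    """Assemble the request for an Active Custom Translation batch job."""
    return {
        "JobName": job_name,
        "InputDataConfig": {"S3Uri": input_s3, "ContentType": "text/plain"},
        "OutputDataConfig": {"S3Uri": output_s3},
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": src_lang,
        "TargetLanguageCodes": target_langs,
        # Attaching parallel data makes this an Active Custom Translation job
        "ParallelDataNames": parallel_data,
    }

def start_act_job(request):
    import boto3  # lazy import: request building stays testable offline
    return boto3.client("translate").start_text_translation_job(**request)

# Placeholder values for illustration only
request = build_act_job_request(
    "cached-translations-act",
    "s3://my-bucket/input/",
    "s3://my-bucket/output/",
    "arn:aws:iam::123456789012:role/TranslateBatchRole",
    "en",
    ["fr"],
    ["reviewed-cache-entries"],
)
```

In the feedback loop described above, the parallel data resource would be created from the human-reviewed cache entries before the job is started.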

Testing the caching setup

Here is a video walkthrough of testing the solution.

There are multiple ways to test the caching setup. For this example, you will use Postman to send requests. Because the REST API is protected by an Amazon Cognito authorizer, you will need to configure Postman to send an authorization token with the API request.

As part of the AWS CDK deployment in the previous step, a Cognito user pool is created with an app client integration. On your AWS CloudFormation console, you can find BaseURL, translateCacheEndpoint, UserPoolID, and ClientID on the CDK stack output section. Copy these into a text editor for use later.

To generate an authorization token from Cognito, the next step is to create a user in the Cognito user pool.

  1. Go to the Amazon Cognito console. Select the user pool that was created by the AWS CDK stack.
  2. Select the Users tab and choose Create User.
  3. Enter the following values and choose Create User.
    1. On Invitation Message verify that Don’t send an invitation is selected.
    2. For Email address, enter test@test.com.
    3. On Temporary password, verify that Set a password is selected.
    4. In Password enter testUser123!.
  4. Now that the user is created, you will use AWS Command Line Interface (CLI) to simulate a sign in for the user. Go to the AWS CloudShell console.
  5. Enter the following commands on the CloudShell terminal by replacing UserPoolID and ClientID from the CloudFormation output of the AWS CDK stack.
export YOUR_POOL_ID=<UserPoolID>

export YOUR_CLIENT_ID=<ClientID>

export Session_ID=$(aws cognito-idp admin-initiate-auth --user-pool-id ${YOUR_POOL_ID} --client-id ${YOUR_CLIENT_ID} --auth-flow ADMIN_NO_SRP_AUTH --auth-parameters 'USERNAME=test@test.com,PASSWORD="testUser123!"' | jq .Session -r)

aws cognito-idp admin-respond-to-auth-challenge --user-pool-id ${YOUR_POOL_ID} --client-id ${YOUR_CLIENT_ID} --challenge-name NEW_PASSWORD_REQUIRED --challenge-responses 'USERNAME=test@test.com,NEW_PASSWORD="testUser456!"' --session "${Session_ID}"
  6. The output from this call should be a valid session in the following format. The IdToken is the OpenID Connect-compatible identity token that we will pass to the APIs in the authorization header in the Postman configuration. Copy it into a text editor to use later.
{
   "ChallengeParameters": {},
   "AuthenticationResult": {
      "AccessToken": "YOU_WILL_SEE_VALID_ACCESS_TOKEN_VALUE_HERE",
      "ExpiresIn": 3600,
      "TokenType": "Bearer",
      "RefreshToken": "YOU_WILL_SEE_VALID_REFRESH_TOKEN_VALUE_HERE",
      "IdToken": "YOU_WILL_SEE_VALID_ID_TOKEN_VALUE_HERE"
   }
}

Now you have an authorization token to pass with the API request to your REST API. Sign in to the Postman website or download the Postman desktop client, and create a workspace with the name dev.

  1. Select the workspace dev and choose New request.
  2. Change the method type from GET to POST.
  3. Paste the <TranslateCacheEndpoint> URL from the CloudFormation output of the AWS CDK stack into the request URL textbox. Append the API path /translate to the URL, as shown in the following figure.

Now set up authorization configuration on Postman so that requests to the translate API are authorized by the Amazon Cognito user pool.

  1. Select the Authorization tab below the request URL in Postman. Select OAuth 2.0 as the Type.
  2. Under Current Token, paste your IdToken from earlier into the Token field.

  3. Select Configure New Token. Under Configuration Options add or select the values that follow. Copy the BaseURL and ClientID from the CloudFormation output of the AWS CDK stack. Leave the remaining fields at the default values.
    • Token Name: token
    • Grant Type: Select Authorization Code
    • Callback URL: Enter https://localhost
    • Auth URL: Enter <BaseURL>/oauth2/authorize
    • Access Token URL: Enter <BaseURL>/oauth2/token
    • ClientID: Enter <ClientID>
    • Scope: Enter openid profile translate-cache/translate
    • Client Authorization: Select Send client credentials in body.

  4. Choose Get New Access Token. You will be directed to another page to sign in as a user. Use the following credentials of the test user that you created earlier in your Cognito user pool:
    • Username: test@test.com
    • Password: testUser456!
  5. After authenticating, you will get a new id_token. Copy the new id_token and go back to the Postman Authorization tab to replace the token value under Current Token.
  6. Select the Body tab for the request. Select raw, change the body type to JSON, and insert the following JSON content. When done, choose Send.
{
  "src_locale": "en",
  "target_locale": "fr",
  "input_text": "Use the Amazon Translate service to translate content from a source language (the language of the input content) to a target language (the language that you select for the translation output). In a batch job, you can translate files from one or more source languages to one or more target languages. For more information about supported languages, see Supported languages and language codes."
}

First translation request to the API

The first request to the API takes more time, because the Lambda function checks the given input text against the DynamoDB database on the initial request. Because this is the first request, it won’t find the input text in the table and will call Amazon Translate to translate the provided text.

Examining the processing_seconds value reveals that this initial request took approximately 2.97 seconds to complete.

Subsequent translations requests to the API

After the first request, the input text and translated output are now stored in the DynamoDB table. On subsequent requests with the same input text, the Lambda function will first check DynamoDB for a cache hit. Because the table now contains the input text from the first request, the Lambda function will find it there and retrieve the translation from DynamoDB instead of calling Amazon Translate again.

Storing requests in a cache allows subsequent requests for the same translation to skip the Amazon Translate call, which is usually the most time-consuming part of the process. Retrieving the translation from DynamoDB is much faster than calling Amazon Translate to translate the text each time.

The second request has a processing time of approximately 0.79 seconds, almost four times faster than the first request, which took 2.97 seconds to complete.

Cache purge

Amazon Translate continuously improves its translation models over time. To benefit from these improvements, you need to periodically purge translations from your DynamoDB cache and fetch fresh translations from Amazon Translate.

DynamoDB provides a Time-to-Live (TTL) feature that can automatically delete items after a specified expiry timestamp. You can use this capability to implement cache purging. When a translation is stored in DynamoDB, a purge_date attribute set to 30 days in the future is added. DynamoDB will automatically delete items shortly after the purge_date timestamp is reached. This ensures cached translations older than 30 days are removed from the table. When these expired entries are accessed again, a cache miss occurs and Amazon Translate is called to retrieve an updated translation.
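DynamoDB TTL requires the expiry attribute to be a plain number in epoch seconds. A minimal sketch of how the purge_date value and the TTL configuration might be implemented; the attribute name matches the post, but the helper names are illustrative:

```python
import time

PURGE_AFTER_DAYS = 30

def purge_date(now_epoch=None):
    """Return the epoch-seconds timestamp 30 days from now.

    DynamoDB TTL deletes an item shortly after this timestamp passes,
    forcing a cache miss (and a fresh translation) on the next request.
    """
    now = int(time.time()) if now_epoch is None else int(now_epoch)
    return now + PURGE_AFTER_DAYS * 24 * 60 * 60

def enable_ttl(table_name):
    """One-time setup: tell DynamoDB which attribute holds the expiry."""
    import boto3  # lazy import so purge_date is testable without AWS
    boto3.client("dynamodb").update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "purge_date"},
    )
```

The purge_date value would be added to each item at write time, alongside the translation attributes.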

The TTL-based cache expiration allows you to efficiently purge older translations on an ongoing basis. This ensures your applications can benefit from the continuous improvements to the machine learning models used by Amazon Translate while minimizing costs by still using caching for repeated translations within a 30-day period.

Clean up

When you delete a stack, most resources are deleted along with it, but that’s not the case for all resources. The DynamoDB table will be retained by default. If you don’t want to retain this table, you can set this in the AWS CDK code by using RemovalPolicy.

Additionally, the Lambda function generates Amazon CloudWatch logs that are permanently retained. These aren’t tracked by CloudFormation because they’re not part of the stack, so the logs will persist. Use the CloudWatch console to manually delete any logs that you don’t want to retain.

You can either delete the stack through the CloudFormation console or run cdk destroy from the root folder.

cdk destroy

Conclusion

The solution outlined in this post provides an effective way to implement a caching layer for Amazon Translate to improve translation performance and reduce costs. Using a cache-aside pattern with DynamoDB allows frequently accessed translations to be served from the cache instead of calling Amazon Translate each time.

The caching architecture is scalable, secure, and cost-optimized. Additional enhancements such as setting TTLs, adding eviction policies, and encrypting cache entries can further customize the architecture to your specific use case.

Translations stored in the cache can also be post-edited and used as parallel data to train Amazon Translate. This creates a feedback loop that continuously improves translation quality over time.

By implementing a caching layer, enterprises can deliver fast, high-quality translations tailored to their business needs at reduced costs. Caching provides a way to scale Amazon Translate efficiently while optimizing performance and cost.


About the authors

Praneeth Reddy Tekula is a Senior Solutions Architect focusing on EdTech at AWS. He provides architectural guidance and best practices to customers in building resilient, secure and scalable systems on AWS. He is passionate about observability and has a strong networking background.

Reagan Rosario is a Solutions Architect at AWS, specializing in building scalable, highly available, and secure cloud solutions for education technology companies. With over 10 years of experience in software engineering and architecture roles, Reagan loves using his technical knowledge to help AWS customers architect robust cloud solutions that leverage the breadth and depth of AWS.


Deploy a Slack gateway for Amazon Bedrock


In today’s fast-paced digital world, streamlining workflows and boosting productivity are paramount. That’s why we’re thrilled to share an exciting integration that will take your team’s collaboration to new heights. Get ready to unlock the power of generative artificial intelligence (AI) and bring it directly into your Slack workspace.

Imagine the possibilities: Quick and efficient brainstorming sessions, real-time ideation, and even drafting documents or code snippets—all powered by the latest advancements in AI. Say goodbye to context switching and hello to a streamlined, collaborative experience that will supercharge your team’s productivity. Whether you’re leading a dynamic team, working on complex projects, or simply looking to enhance your Slack experience, this integration is a game-changer.

In this post, we show you how to unlock new levels of efficiency and creativity by bringing the power of generative AI directly into your Slack workspace using Amazon Bedrock.

Solution overview

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In the following sections, we guide you through the process of setting up a Slack integration for Amazon Bedrock. We show how to create a Slack application, configure the necessary permissions, and deploy the required resources using AWS CloudFormation.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

  1. The user communicates with the Slack application.
  2. The Slack application sends the event to Amazon API Gateway, which is used in the event subscription.
  3. API Gateway forwards the event to an AWS Lambda function.
  4. The Lambda function invokes Amazon Bedrock with the request, then responds to the user in Slack.
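The steps above could be sketched as follows. This is a hedged illustration, not the post’s actual Lambda code: the mention-stripping helper and function names are hypothetical, and the request body follows the AI21 Jurassic-2 format for the model granted access later in this post.

```python
import json
import re

def strip_mention(text):
    """Remove the leading @bot mention Slack includes in app_mention events."""
    return re.sub(r"<@[A-Z0-9]+>\s*", "", text).strip()

def ask_bedrock(prompt, model_id="ai21.j2-ultra-v1"):
    """Step 4: invoke Amazon Bedrock with the user's prompt and return the text."""
    import boto3  # lazy import so strip_mention is testable without AWS
    client = boto3.client("bedrock-runtime")
    body = json.dumps({"prompt": prompt, "maxTokens": 500})  # AI21 J2 request shape
    resp = client.invoke_model(modelId=model_id, body=body)
    payload = json.loads(resp["body"].read())
    return payload["completions"][0]["data"]["text"]  # AI21 J2 response shape
```

The Lambda function would then post the returned text back to the originating Slack channel using the bot token configured later in this post.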

Prerequisites

You need an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need an existing account with Amazon Bedrock model access provided. If you don’t have model permission, refer to Model access.

Lastly, you need a Slack account and access to create and publish apps to your Slack organization. If you don’t have one, request your company to create a Slack sandbox organization for you to experiment, or go to Slack to create a free Slack account and workspace.

Create a Slack application

The security configuration varies across organizations. To manage your Slack workspace’s settings, reach out to your Slack administrator, or, as an administrator, complete the following steps:

  1. Navigate to the admin section within Slack and choose Build.
    Build new Slack Application
  2. Choose Create New App.
    Create new Slack application
  3. For App Name, enter a name for your app (for this post, we name it BedrockSlackIntegration).
  4. Choose your workspace.
  5. Choose Create App.

    After you create the app, you can configure its permissions.
  6. On the app details page, choose Basic Information in the navigation pane.
  7. Under Add features and functionality, choose Permissions.
    Basic information of application
  8. In the Scopes section, add the scopes im:read, im:write, and chat:write.

On the Basic Information page, Bots and Permissions should now both have a green check mark.

  9. Under Install your app, choose Install to Workspace.
  10. When prompted to install, choose Allow.
  11. Open the Amazon Bedrock console and choose Model access in the navigation pane.
    Provision Amazon Bedrock model access
  12. You can select your model from the available list. For this post, we grant access to ai21.j2-ultra-v1 (Jurassic-2 Ultra). For more information about requesting model access, see Model access. Next, we deploy the code and connect with Amazon Bedrock when we get a message from Slack. For that, we need the Slack bot token to use as an input parameter for the CloudFormation template in the next section.
  13. On the Slack app details page, choose OAuth & Permissions in the navigation pane.
  14. Copy the value for Bot User OAuth Token.
    OAuth and permissions for Slack application

Deploy resources with AWS CloudFormation

Complete the following steps to launch the CloudFormation stack:

  1. For Stack name, use default or enter a name of your choice.
  2. For SlackTokenParam, enter the bot token you copied earlier.
  3. Choose Next.
    Specify CFN stack details
  4. Create your stack and wait a few minutes for deployment to complete.
    AWS CloudFormation stack status
  5. On the Outputs tab, copy the value for SlackBotEndpointOutput to use in the next steps.
    AWS CloudFormation output variables

In the next section, we start integrating Amazon Bedrock with Slack.

Integrate Amazon Bedrock with Slack

After you deploy your CloudFormation stack, complete the following steps:

  1. On the Slack app details page, choose Event Subscriptions in the navigation pane.
  2. Toggle Enable Events on.
    Enable event subscription on Slack application

The event subscription should get automatically verified.

  3. Under Subscribe to bot events, add the events app_mention and message.im.
  4. Choose Save Changes.
    Save slack application changes
    The integration is now complete.

Test the Slack bot

To test your bot, complete the following steps:

  1. Navigate to your Slack workspace.
  2. Create a new group and add the app BedrockSlackIntegration.
  3. Start interacting with the Amazon Bedrock bot using @BedrockSlackIntegration.

Your interaction will look like the following screenshot.

Test your bot through Slack

The bot demonstrated here doesn’t retain the state of your previous questions or your chat history across subsequent messages. However, you can implement this using Amazon DynamoDB. We will cover this in a later blog post.

Summary

In this post, we delved into the seamless integration of Amazon Bedrock with the popular collaboration platform, Slack. The step-by-step guide demonstrated how to establish a direct connection between these two powerful tools, enabling you and your team to harness the full potential of generative AI directly within your Slack workspace. With this integration, you can streamline your workflow and enhance productivity, making it effortless to tap into the cutting-edge capabilities of generative AI. Whether you’re seeking to generate content, analyze data, or explore innovative ideas, this integration empowers you to do it all without leaving the familiar Slack environment.

You can further empower your team by deploying a Slack gateway for Amazon Q Business, the generative AI assistant that empowers employees based on knowledge and data in your enterprise systems. To learn more about how to use generative AI with AWS services, see Generative AI on AWS.


About the Authors

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, analytics solutions, and generative AI solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Andrew Ang is a Senior ML Engineer with the AWS Generative AI Innovation Center, where he helps customers ideate and implement generative AI proof of concept projects. Outside of work, he enjoys playing squash and watching travel and food vlogs.

John Losito is an Associate Cloud Infrastructure Architect with AWS Professional Services, where he helps customers craft automation scripts using the AWS CDK or Terraform to efficiently deploy and manage cloud resources. Outside of work, he enjoys spending time with his family, exercising, and improving his archery skills.


Improving air quality with generative AI


As of this writing, Ghana ranks as the 27th most polluted country in the world, facing significant challenges due to air pollution. Recognizing the crucial role of air quality monitoring, many African countries, including Ghana, are adopting low-cost air quality sensors.

The Sensor Evaluation and Training Centre for West Africa (Afri-SET) aims to use technology to address these challenges. Afri-SET engages with air quality sensor manufacturers, providing crucial evaluations tailored to the African context. Through evaluations of sensors and informed decision-making support, Afri-SET empowers governments and civil society for effective air quality management.

On December 6-8, 2023, the non-profit organization Tech to the Rescue, in collaboration with AWS, organized the world’s largest Air Quality Hackathon, aimed at tackling one of the world’s most pressing health and environmental challenges: air pollution. More than 170 tech teams used the latest cloud, machine learning, and artificial intelligence technologies to build 33 solutions. The solution described in this blog solves Afri-SET’s challenge and was ranked among the top 3 winning solutions.

This post presents a solution that uses generative artificial intelligence (AI) to standardize air quality data from low-cost sensors in Africa, specifically addressing the data integration problem for low-cost sensors. The solution harnesses the capabilities of generative AI, specifically large language models (LLMs), to address the challenges posed by diverse sensor data and automatically generate Python functions based on various data formats. The fundamental objective is to build a manufacturer-agnostic database, leveraging generative AI’s ability to standardize sensor outputs, synchronize data, and facilitate precise corrections.

Current challenges

Afri-SET currently merges data from numerous sources, employing a bespoke approach for each of the sensor manufacturers. This manual synchronization process, hindered by disparate data formats, is resource-intensive, limiting the potential for widespread data orchestration. The platform, although functional, deals with CSV and JSON files containing hundreds of thousands of rows from various manufacturers, demanding substantial effort for data ingestion.

The objective is to automate data integration from various sensor manufacturers for Accra, Ghana, paving the way for scalability across West Africa. Despite the challenges, Afri-SET, with limited resources, envisions a comprehensive data management solution for stakeholders seeking sensor hosting on their platform, aiming to deliver accurate data from low-cost sensors. This effort is hampered by the current focus on data cleaning, which diverts valuable skills away from building ML models for sensor calibration. Additionally, they aim to report corrected data from low-cost sensors, which requires information beyond specific pollutants.

The solution had the following requirements:

  • Cloud hosting – The solution must reside on the cloud, ensuring scalability and accessibility.
  • Automated data ingestion – An automated system is essential for recognizing and synchronizing new (unseen), diverse data formats with minimal human intervention.
  • Format flexibility – The solution should accommodate both CSV and JSON inputs and be flexible on the formatting (any reasonable column names, units of measure, any nested structure, or malformed CSV such as missing columns or extra columns)
  • Golden copy preservation – Retaining an untouched copy of the data is imperative for reference and validation purposes.
  • Cost-effective – The solution should only invoke LLM to generate reusable code on an as-needed basis instead of manipulating the data directly to be as cost-effective as possible.

The goal was to build a one-click solution that takes different data structures and formats (CSV and JSON) and automatically converts them to be integrated into a database with unified headers, as shown in the following figure. This allows the data to be aggregated for further manufacturer-agnostic analysis.

Figure 1: Convert data with different data formats into a desired data format with unified headers

Overview of solution

The proposed solution uses the Anthropic Claude 2.1 foundation model through Amazon Bedrock to generate Python code that converts input data into a unified data format. LLMs excel at writing code and reasoning over text, but tend not to perform as well when interacting directly with time-series data. In this solution, we use the reasoning and coding abilities of LLMs to create reusable extract, transform, and load (ETL) code that transforms sensor data files that don't conform to a universal standard so they can be stored together for downstream calibration and analysis. Additionally, we take advantage of the reasoning capabilities of LLMs to understand what the labels mean in the context of air quality sensors, such as particulate matter (PM), relative humidity, and temperature.

The following diagram shows the conceptual architecture:

Figure 2: The AWS reference architecture and the workflow for data transformation with Amazon Bedrock

Solution walkthrough

The solution reads raw data files (CSV and JSON files) from Amazon Simple Storage Service (Amazon S3) (Step 1) and checks whether it has seen the device type (or data format) before. If yes, the solution retrieves and runs the previously generated Python code (Step 2), and the transformed data is stored in Amazon S3 (Step 10). The solution only invokes the LLM for a new device data file type (for which code has not yet been generated). This is done to optimize performance and minimize the cost of LLM invocation. If Python code is not available for a given device's data, the solution notifies the operator to check the new data format (Steps 3 and 4). The operator then checks the new data format and validates whether it comes from a new manufacturer (Step 5). Next, the solution checks whether the file is CSV or JSON. If it is a CSV file, the data can be directly converted to a Pandas data frame by a Python function without LLM invocation. If it is a JSON file, the LLM is invoked to generate a Python function that creates a Pandas data frame from the JSON payload, taking into account its schema and how nested it is (Step 6).
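The code-reuse check in Steps 1–4 can be sketched as follows. This is a minimal illustration only, assuming generated transform functions are cached per device type; the names (code_repository, transform, to_unified) are hypothetical and not part of the actual solution.

```python
# Sketch: cache LLM-generated transform code per device type (hypothetical names).
# The LLM is only involved when no code exists yet for a device's data format.
code_repository = {}  # device_type -> source of a generated transform function

def transform(device_type, raw_record):
    if device_type not in code_repository:
        # In the real solution, this is where the operator is notified and
        # LLM-based code generation via Amazon Bedrock is triggered.
        raise KeyError(f"No transform for {device_type}; request code generation")
    namespace = {}
    exec(code_repository[device_type], namespace)  # load the cached function
    return namespace["to_unified"](raw_record)

# Simulate a previously generated function for a known device type
code_repository["vendor_a_json"] = (
    "def to_unified(rec):\n"
    "    return {'pm2_5': rec['P1'], 'rh': rec['humidity']}\n"
)
print(transform("vendor_a_json", {"P1": 12.0, "humidity": 55}))
# -> {'pm2_5': 12.0, 'rh': 55}
```

Records with an already-seen format are transformed immediately; unseen formats raise and are routed to the operator/LLM path, matching Steps 3 and 4 above.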

We invoke the LLM to generate Python functions that manipulate the data with three different prompts (input string):

  1. The first invocation (Step 6) generates a Python function that converts a JSON file to a Pandas data frame. JSON files from manufacturers have different schemas. For instance, some input data uses a pair of value type and value for each measurement; this format results in data frames containing one column of value types and one column of values, and such columns need to be pivoted.
  2. The second invocation (Step 7) determines if the data needs to be pivoted and generates a Python function for pivoting if needed. Another issue of the input data is that the same air quality measurement can have different names from different manufacturers; for example, “P1” and “PM1” are for the same type of measurement.
  3. The third invocation (Step 8) focuses on data cleaning. It generates a Python function to convert data frames to a common data format. The Python function may include steps for unifying column names for the same type of measurement and dropping columns.
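The pivot in the second invocation (Step 7) amounts to reshaping a long "value type / value" layout into one column per measurement. The following is a minimal pandas sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample in the long "value type / value" layout some
# manufacturers use: each row holds one (measurement type, value) pair.
raw = pd.DataFrame({
    "timestamp": ["2024-01-01T00:00", "2024-01-01T00:00",
                  "2024-01-01T01:00", "2024-01-01T01:00"],
    "value_type": ["PM2.5", "humidity", "PM2.5", "humidity"],
    "value": [12.0, 55.0, 14.5, 53.0],
})

# Pivot into one column per measurement type (wide format)
wide = raw.pivot(index="timestamp", columns="value_type", values="value").reset_index()
print(wide)
```

The LLM-generated function would perform an equivalent reshaping for whatever column names a given manufacturer's schema uses.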

All LLM-generated Python code is stored in the repository (Step 9) so that it can be used to process daily raw device data files and transform them into a common format.

The data is then stored in Amazon S3 (Step 10) and can be published to OpenAQ so other organizations can use the calibrated air quality data.

The following screenshot shows the proposed frontend for illustrative purposes only, as the solution is designed to integrate with Afri-SET's existing backend system.

Results

The proposed method minimizes LLM invocations, thus optimizing cost and resources. The solution only invokes the LLM when a new data format is detected. The generated code is stored, so input data with a format that has been seen before can reuse the code for data processing.

A human-in-the-loop mechanism safeguards data ingestion. This happens only when a new data format is detected to avoid overburdening scarce Afri-SET resources. Having a human-in-the-loop to validate each data transformation step is optional.

Automatic code generation reduces data engineering work from months to days. Afri-SET can use this solution to automatically generate Python code, based on the format of input data. The output data is transformed to a standardized format and stored in a single location in Amazon S3 in Parquet format, a columnar and efficient storage format. If useful, it can be further extended to a data lake platform that uses AWS Glue (a serverless data integration service for data preparation) and Amazon Athena (a serverless and interactive analytics service) to analyze and visualize data. With AWS Glue custom connectors, it’s effortless to transfer data between Amazon S3 and other applications. Additionally, this is a no-code experience for Afri-SET’s software engineer to effortlessly build their data pipelines.

Conclusion

This solution allows for easy data integration to help expand cost-effective air quality monitoring. It offers data-driven and informed legislation, fostering community empowerment and encouraging innovation.

This initiative, aimed at gathering precise data, is a significant step towards a cleaner and healthier environment. We believe that AWS technology can help address poor air quality through technical solutions similar to the one described here. If you want to prototype similar solutions, apply to the AWS Health Equity initiative.

As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.


About the authors

Sandra Topic is an Environmental Equity Leader at AWS. In this role, she leverages her engineering background to find new ways to use technology for solving the world’s “To Do list” and drive positive social impact. Sandra’s journey includes social entrepreneurship and leading sustainability and AI efforts in tech companies.

Qiong (Jo) Zhang, PhD, is a Senior Partner Solutions Architect at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI.  She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.

Gabriel Verreault is a Senior Partner Solutions Architect at AWS for the Industrial Manufacturing segment. Gabriel works with AWS partners to define, build, and evangelize solutions around Smart Manufacturing, Sustainability and AI/ML. Gabriel also has expertise in industrial data platforms, predictive maintenance, and combining AI/ML with industrial workloads.

Venkatavaradhan (Venkat) Viswanathan is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.

Use zero-shot large language models on Amazon Bedrock for custom named entity recognition

Named entity recognition (NER) is the process of extracting information of interest, called entities, from structured or unstructured text. Manually identifying all mentions of specific types of information in documents is extremely time-consuming and labor-intensive. Some examples include extracting players and positions in an NFL game summary, products mentioned in an AWS keynote transcript, or key names from an article on a favorite tech company. This process must be repeated for every new document and entity type, making it impractical for processing large volumes of documents at scale. With more access to vast amounts of reports, books, articles, journals, and research papers than ever before, swiftly identifying desired information in large bodies of text is becoming invaluable.

Traditional neural network models like RNNs and LSTMs and more modern transformer-based models like BERT for NER require costly fine-tuning on labeled data for every custom entity type. This makes adopting and scaling these approaches burdensome for many applications. However, new capabilities of large language models (LLMs) enable high-accuracy NER across diverse entity types without the need for entity-specific fine-tuning. By using the model’s broad linguistic understanding, you can perform NER on the fly for any specified entity type. This capability is called zero-shot NER and enables the rapid deployment of NER across documents and many other use cases. This ability to extract specified entity mentions without costly tuning unlocks scalable entity extraction and downstream document understanding.

In this post, we cover the end-to-end process of using LLMs on Amazon Bedrock for the NER use case. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. In particular, we show how to use Amazon Textract to extract text from documents such as PDFs or image files, and use the extracted text along with user-defined custom entities as input to Amazon Bedrock to conduct zero-shot NER. We also touch on the usefulness of text truncation for prompts using Amazon Comprehend, along with the challenges, opportunities, and future work with LLMs and NER.

Solution overview

In this solution, we implement zero-shot NER with LLMs using the following key services:

  • Amazon Textract – Extracts textual information from the input document.
  • Amazon Comprehend (optional) – Identifies predefined entities such as names of people, dates, and numeric values. You can use this feature to limit the context over which the entities of interest are detected.
  • Amazon Bedrock – Calls an LLM to identify entities of interest from the given context.

The following diagram illustrates the solution architecture.

The main inputs are the document image and target entities. The objective is to find values of the target entities within the document. If the truncation path is chosen, the pipeline uses Amazon Comprehend to reduce the context. The output of the LLM is postprocessed to generate the output as entity-value pairs.

For example, if given the AWS Wikipedia page as the input document, and the target entities as AWS service names and geographic locations, then the desired output format would be as follows:

  • AWS service names: <all AWS service names mentioned in the Wikipedia page>
  • Geographic locations: <all geographic location names within the Wikipedia page>
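The postprocessing step that turns the LLM's formatted reply into entity-value pairs might look like the following sketch. It assumes the model followed the requested "Entity: <values>" output format; parse_entities is an illustrative helper, not part of the described solution.

```python
# Sketch: parse an LLM reply of the form "Entity: v1, v2, ..." into a dict.
# Assumes the model followed the output format requested in the prompt.
def parse_entities(response_text, target_entities):
    results = {}
    for line in response_text.splitlines():
        for entity in target_entities:
            if line.strip().lower().startswith(entity.lower() + ":"):
                values = line.split(":", 1)[1]
                results[entity] = [v.strip() for v in values.split(",") if v.strip()]
    return results

reply = """Countries where AWS operates in: Ireland, Singapore
AWS annual revenue: $62 billion"""
print(parse_entities(reply, ["Countries where AWS operates in", "AWS annual revenue"]))
```

In practice, the parser should tolerate minor formatting drift (extra whitespace, missing entities) since LLM output is not guaranteed to match the template exactly.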

In the following sections, we describe the three main modules to accomplish this task. For this post, we used Amazon SageMaker notebooks with ml.t3.medium instances along with Amazon Textract, Amazon Comprehend, and Amazon Bedrock.

Extract context

Context is the information taken from the document in which the values of the queried entities are found. When consuming a full document (full context), context significantly increases the input token count to the LLM. We provide an option of using either the entire document or local context around relevant parts of the document, as defined by the user.

First, we extract context from the entire document using Amazon Textract. The code below uses the amazon-textract-caller library as a wrapper for the Textract API calls. You need to install the library first:

python -m pip install amazon-textract-caller

Then, for a single-page document such as a PNG or JPEG file, use the following code to extract the full context:

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

document_name = "sample_data/synthetic_sample_data.png"

# call Textract
layout_textract_json = call_textract(
    input_document=document_name,
    features=[Textract_Features.LAYOUT]
)

# extract the text from the JSON response
full_context = get_text_from_layout_json(textract_json=layout_textract_json)[1]

Note that PDF input documents have to be in an S3 bucket when using the call_textract function. For multi-page TIFF files, make sure to set force_async_api=True.

Truncate context (optional)

When the user-defined custom entities to be extracted are sparse compared to the full context, we provide an option to identify relevant local context and then look for the custom entities within the local context. To do so, we use generic entity extraction with Amazon Comprehend. This assumes that the user-defined custom entity is a child of one of the default Amazon Comprehend entities, such as "name", "location", "date", or "organization". For example, "city" is a child of "location". We extract the default generic entities through the AWS SDK for Python (Boto3) as follows:

import boto3
import pandas as pd

comprehend_client = boto3.client("comprehend")
generic_entities = comprehend_client.detect_entities(Text=full_context,
                                                     LanguageCode="en")
df_entities = pd.DataFrame.from_dict(generic_entities["Entities"])

It outputs a list of dictionaries containing the entity as “Type”, the value as “Text”, along with other information such as “Score”, “BeginOffset”, and “EndOffset”. For more details, see DetectEntities. The following is an example output of Amazon Comprehend entity extraction, which provides the extracted generic entity-value pairs and location of the value within the text.

{
  "Entities": [
    {
      "Text": "AWS",
      "Score": 0.98,
      "Type": "ORGANIZATION",
      "BeginOffset": 21,
      "EndOffset": 24
    },
    {
      "Text": "US East",
      "Score": 0.97,
      "Type": "LOCATION",
      "BeginOffset": 1100,
      "EndOffset": 1107
    }
  ],
  "LanguageCode": "en"
}

The extracted list of generic entities may be more exhaustive than the queried entities, so a filtering step is necessary. For example, a queried entity is “AWS revenue” and generic entities contain “quantity”, “location”, “person”, and so on. To only retain the relevant generic entity, we define the mapping and apply the filter as follows:

query_entities = ['XX']
user_defined_map = {'XX': 'QUANTITY', 'YY': 'PERSON'}
entities_to_keep = [v for k,v in user_defined_map.items() if k in query_entities]
df_filtered = df_entities.loc[df_entities['Type'].isin(entities_to_keep)]

After we identify a subset of generic entity-value pairs, we want to preserve the local context around each pair and mask out everything else. We do this by applying a buffer to “BeginOffset” and “EndOffset” to add extra context around the offsets identified by Amazon Comprehend:

StrBuff, EndBuff = 20, 10
df_offsets = df_filtered.apply(
    lambda row: pd.Series({
        'BeginOffset': max(0, row['BeginOffset'] - StrBuff),
        'EndOffset': min(row['EndOffset'] + EndBuff, len(full_context))
    }), axis=1).reset_index(drop=True)

We also merge any overlapping offsets to avoid duplicating context:

for index, _ in df_offsets.iterrows():
    if (index > 0) and (df_offsets.loc[index, 'BeginOffset'] <= df_offsets.loc[index - 1, 'EndOffset']):
        df_offsets.loc[index, 'BeginOffset'] = df_offsets.loc[index - 1, 'BeginOffset']
df_offsets = df_offsets.groupby(['BeginOffset']).last().reset_index()

Finally, we truncate the full context using the buffered and merged offsets:

truncated_text = "\n".join([full_context[row['BeginOffset']:row['EndOffset']] for _, row in df_offsets.iterrows()])

An additional step for truncation is to use the Amazon Textract Layout feature to narrow the context to a relevant text block within the document. Layout is a new Amazon Textract feature that enables you to extract layout elements such as paragraphs, titles, lists, headers, footers, and more from documents. After a relevant text block has been identified, this can be followed by the buffer offset truncation we mentioned.
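As a rough illustration of narrowing context by layout element, the sketch below filters blocks by BlockType. Note that this is simplified for illustration: in real Amazon Textract responses, layout blocks reference their text through child LINE and WORD blocks via relationships rather than carrying a Text field directly, so production code would need to resolve those relationships (or use a helper library as shown earlier).

```python
# Simplified sketch: select text associated with particular layout block
# types from a Textract-style response. The flat "Text" field on layout
# blocks below is an illustrative simplification of the real response shape.
def text_of_layout_blocks(textract_json, block_types=("LAYOUT_TEXT",)):
    return [
        b.get("Text", "")
        for b in textract_json.get("Blocks", [])
        if b.get("BlockType") in block_types
    ]

sample = {"Blocks": [
    {"BlockType": "LAYOUT_TITLE", "Text": "Quarterly report"},
    {"BlockType": "LAYOUT_TEXT", "Text": "AWS revenue was $62 billion."},
]}
print(text_of_layout_blocks(sample))
```

Once a relevant text block is isolated this way, the buffered-offset truncation shown previously can be applied to it instead of to the full document.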

Extract entity-value pairs

Given either the full context or the local context as input, the next step is customized entity-value extraction using an LLM. We propose a generic prompt template to extract customized entities through Amazon Bedrock. Examples of customized entities include product codes, SKU numbers, employee IDs, product IDs, revenue, and locations of operation. The template provides generic instructions on the NER task and the desired output formatting. The prompt input to the LLM includes four components: an initial instruction, the customized entities as query entities, the context, and the format expected from the output of the LLM. The following is an example of the baseline prompt. The customized entities are incorporated as a list in query entities. This process is flexible enough to handle a variable number of entities.

prompt = """
Given the text below, identify these named entities:
	"{query_entities}"
text: "{context}"
Respond in the following format:
	"{output format}"
"""

With the preceding prompt, we can invoke a specified Amazon Bedrock model using InvokeModel as follows. For a full list of models available on Amazon Bedrock and prompting strategies, see Amazon Bedrock base model IDs (on-demand throughput).

import json
import boto3

bedrock_client = boto3.client(service_name='bedrock-runtime')
body = json.dumps({
    "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
    "max_tokens_to_sample": 300,
    "temperature": 0.1,
    "top_p": 0.9,
})
modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'

response = bedrock_client.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())
print(response_body.get('completion'))

Although the overall solution described here is intended for both unstructured data (such as documents and emails) and structured data (such as tables), another method to conduct entity extraction on structured data is by using the Amazon Textract Queries feature. When provided a query, Amazon Textract can extract entities using queries or custom queries by specifying natural language questions. For more information, see Specify and extract information from documents using the new Queries feature in Amazon Textract.

Use case

To demonstrate an example use case, we used Anthropic Claude-V2 on Amazon Bedrock to generate some text about AWS (as shown in the following figure), saved it as an image to simulate a scanned document, and then used the proposed solution to identify some entities within the text. Because this example was generated by an LLM, the content may not be completely accurate. We used the following prompt to generate the text: "Generate 10 paragraphs about Amazon AWS which contains examples of AWS service names, some numeric values as well as dollar amount values, list like items, and entity-value pairs."

Let’s extract values for the following target entities:

  • Countries where AWS operates
  • AWS annual revenue

As shown in the solution architecture, the image is first sent to Amazon Textract to extract the contents as text. Then there are two options:

  • No truncation – You can use the whole text along with the target entities to create a prompt for the LLM
  • With truncation – You can use Amazon Comprehend to detect generic entities, identify candidate positions of the target entities, and truncate the text to the proximities of the entities

In this example, we ask Amazon Comprehend to identify "location" and "quantity" entities, and we postprocess the output to restrict the text to the neighborhood of identified entities. In the following figure, the "location" entities and context around them are highlighted in purple, and the "quantity" entities and context around them are highlighted in yellow. Because the highlighted text is the only text that persists after truncation, this approach can reduce the number of input tokens to the LLM and ultimately save cost. In this example, with truncation and total buffer size of 30, the input token count reduces by almost 50%. Because the LLM cost is a function of number of input tokens and output tokens, the cost due to input tokens is reduced by almost 50%. See Amazon Bedrock Pricing for more details.
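The savings can be estimated by comparing the size of the context before and after truncation. In the rough sketch below, "tokens" are approximated by whitespace-separated words and the counts are illustrative; actual savings depend on the model's tokenizer.

```python
# Rough sketch of estimating the input reduction from truncation.
# Word counts stand in for tokens; both corpora here are synthetic.
full_words = ("sample text " * 500).split()       # ~1000 words before truncation
truncated_words = ("sample text " * 260).split()  # ~520 words after truncation

reduction = 1 - len(truncated_words) / len(full_words)
print(f"Input tokens reduced by roughly {reduction:.0%}")  # -> roughly 48%
```

Because input-token charges scale linearly with token count, a reduction of this size translates directly into a comparable reduction in the input-token portion of the invocation cost.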

Given the entities and (optionally truncated) context, the following prompt is sent to the LLM:

prompt = """
Given the text below, identify these named entities:
	Countries where AWS operates in, AWS annual revenue

text: "{(optionally truncated) context}"

Respond in the following format:

Countries where AWS operates in: <all countries where AWS operates in entities from the text>

AWS annual revenue: <all AWS annual revenue entities from the text>
"""

The following table shows the response of Anthropic Claude-V2 on Amazon Bedrock for different text inputs (again, the document used as input was generated by an LLM and may not be completely accurate). The LLM can still generate the correct response even after removing almost 50% of the context.

Input text          LLM response

Full context        Countries where AWS operates in: us-east-1 in Northern Virginia, eu-west-1 in Ireland, ap-southeast-1 in Singapore
                    AWS annual revenue: $62 billion

Truncated context   Countries where AWS operates in: us-east-1 in Northern Virginia, eu-west-1 in Ireland, ap-southeast-1 in Singapore
                    AWS annual revenue: $62 billion in annual revenue

Conclusion

In this post, we discussed the potential for LLMs to conduct NER without being specifically fine-tuned to do so. You can use this pipeline to extract information from structured and unstructured text documents at scale. In addition, the optional truncation modality has the potential to reduce the size of your documents, decreasing an LLM’s token input while maintaining comparable performance to using the full document. Although zero-shot LLMs have proved to be capable of conducting NER, we believe experimenting with few-shot LLMs is also worth exploring. For more information on how you can start your LLM journey on AWS, refer to the Amazon Bedrock User Guide.


About the Authors

Sujitha Martin is an Applied Scientist in the Generative AI Innovation Center (GAIIC). Her expertise is in building machine learning solutions involving computer vision and natural language processing for various industry verticals. In particular, she has extensive experience working on human-centered situational awareness and knowledge infused learning for highly autonomous systems.

Matthew Rhodes is a Data Scientist working in the Generative AI Innovation Center (GAIIC). He specializes in building machine learning pipelines that involve concepts such as natural language processing and computer vision.

Amin Tajgardoon is an Applied Scientist in the Generative AI Innovation Center (GAIIC). He has an extensive background in computer science and machine learning. In particular, Amin’s focus has been on deep learning and forecasting, prediction explanation methods, model drift detection, probabilistic generative models, and applications of AI in the healthcare domain.
