Manage Amazon SageMaker JumpStart foundation model access with private hubs

Amazon SageMaker JumpStart is a machine learning (ML) hub offering pre-trained models and pre-built solutions. It provides access to hundreds of foundation models (FMs). A private hub is a feature in SageMaker JumpStart that allows an organization to share its models and notebooks, centralizing model artifacts, facilitating discoverability, and increasing reuse within the organization. With new models released daily, many enterprise admins want more control over the FMs that can be discovered and used by users within their organization (for example, only allowing models based on the PyTorch framework to be discovered).

Now enterprise admins can effortlessly configure granular access control over the FMs that SageMaker JumpStart provides out of the box so that only allowed models can be accessed by users within their organizations. In this post, we discuss the steps required for an administrator to configure granular access control of models in SageMaker JumpStart using a private hub, as well as the steps for users to access and consume models from the private hub.

Solution overview

Starting today, with SageMaker JumpStart and its private hub feature, administrators can create repositories for a subset of models tailored to different teams, use cases, or license requirements using the Amazon SageMaker Python SDK. Admins can also set up multiple private hubs with different lists of models discoverable for different groups of users. Users are then only able to discover and use models within the private hubs they have access to through Amazon SageMaker Studio and the SDK. This level of control empowers enterprises to consume the latest in open-weight generative artificial intelligence (AI) development while enforcing governance guardrails. Finally, admins can share access to private hubs across multiple AWS accounts, enabling collaborative model management while maintaining centralized control. SageMaker JumpStart uses AWS Resource Access Manager (AWS RAM) to securely share private hubs with other accounts in the same organization. The new feature is available in the us-east-2 AWS Region as of this writing, and will be available in more Regions soon.

The following diagram shows an example architecture of SageMaker JumpStart with its public and private hub features. The diagram illustrates how SageMaker JumpStart provides access to different model repositories, with some users accessing the public SageMaker JumpStart hub and others using private curated hubs.

In the following section, we demonstrate how admins can configure granular access control of models in SageMaker JumpStart using a private hub. Then we show how users can access and consume allowlisted models in the private hub using SageMaker Studio and the SageMaker Python SDK. Finally, we look at how an admin user can share the private hub with users in another account.

Prerequisites

To use the SageMaker Python SDK and run the code associated with this post, you need the following prerequisites:

  • An AWS account that contains all your AWS resources
  • An AWS Identity and Access Management (IAM) role with access to SageMaker Studio notebooks
  • SageMaker JumpStart enabled in a SageMaker Studio domain

Create a private hub, curate models, and configure access control (admins)

This section provides a step-by-step guide for administrators to create a private hub, curate models, and configure access control for your organization’s users.

  1. The granular model access control feature is integrated in the latest SageMaker Python SDK, so first update the SDK:
    !pip3 install sagemaker --force-reinstall --quiet
  2. Next, import the SageMaker and Boto3 libraries:
    import boto3
    from sagemaker import Session
    from sagemaker.jumpstart.hub.hub import Hub
  3. Configure your private hub:
    HUB_NAME="CompanyHub"
    HUB_DISPLAY_NAME="Allowlisted Models"
    HUB_DESCRIPTION="These are allowlisted models taken from the JumpStart Public Hub."
    REGION="<your_region_name>" # for example, "us-west-2"

    In the preceding code, HUB_NAME specifies the name of your hub, HUB_DISPLAY_NAME is the display name shown to users in UI experiences, and HUB_DESCRIPTION is the description shown to users.

  4. Set up a Boto3 client for SageMaker:
    sm_client = boto3.client('sagemaker')
    session = Session(sagemaker_client=sm_client)
    session.get_caller_identity_arn()
  5. Check whether the following policy has already been added to your admin IAM role; if not, you can add it as an inline policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetObjectTagging"
                ],
                "Resource": [
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
                ],
                "Effect": "Allow"
            }
        ]
    }

    Replace the <REGION> placeholder with the Region you configured in Step 3.

    In addition to setting up IAM permissions for the admin role, you need to scope down permissions for your users so they can’t access public content.

  6. Use the following policy to deny your users access to the public hub. It can be added as an inline policy in the user’s IAM role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": "s3:*",
                "Effect": "Deny",
                "Resource": [
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                    "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
                ],
                "Condition": {
                    "StringNotLike": {"s3:prefix": ["*.ipynb", "*/eula.txt"]}
                }
            },
            {
                "Action": "sagemaker:*",
                "Effect": "Deny",
                "Resource": [
                    "arn:aws:sagemaker:<REGION>:aws:hub/SageMakerPublicHub",
                    "arn:aws:sagemaker:<REGION>:aws:hub-content/SageMakerPublicHub/*/*"
                ]
            }
        ]
    }
    

    Replace the <REGION> placeholder in the policy with the Region you configured in Step 3.
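
    If you manage IAM with code rather than the console, you can attach either of the preceding inline policies with the boto3 IAM client. The following is a minimal sketch for the admin policy from Step 5; the role name and policy name are placeholders for illustration, and the same call works for the user deny policy in this step:

    import json
    import boto3

    iam_client = boto3.client("iam")

    # Placeholder role name; use your admin (or user) role as appropriate
    role_name = "<your_role_name>"

    # Inline policy document from Step 5, with the Region filled in from the
    # REGION variable configured in Step 3
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:GetObjectTagging"],
                "Resource": [
                    f"arn:aws:s3:::jumpstart-cache-prod-{REGION}",
                    f"arn:aws:s3:::jumpstart-cache-prod-{REGION}/*",
                ],
                "Effect": "Allow",
            }
        ],
    }

    # Attach the document to the role as an inline policy
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="JumpStartPrivateHubAccess",
        PolicyDocument=json.dumps(policy_document),
    )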

    After you have set up the private hub configuration and permissions, you’re ready to create the private hub.

  7. Use the following code to create the private hub within your AWS account in the Region you specified earlier:
    hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)
    
    try:
      hub.create(
          description=HUB_DESCRIPTION,
          display_name=HUB_DISPLAY_NAME
      )
      print(f"Successfully created Hub with name {HUB_NAME} in {REGION}")
    except Exception as e:
      if "ResourceInUse" in str(e):
        print(f"A hub with the name {HUB_NAME} already exists in your account.")
      else:
        raise e
    
  8. Use hub.describe() to verify the configuration of your hub. After your private hub is set up, you can add references to models from the SageMaker JumpStart public hub to your private hub. No model artifacts need to be managed by the customer; the SageMaker team manages any version or security updates. For a list of available models, refer to Built-in Algorithms with pre-trained Model Table.
  9. To search programmatically, run the following command:
    filter_value = "framework == meta"
    response = hub.list_sagemaker_public_hub_models(filter=filter_value)
    models = response["hub_content_summaries"]
    while response["next_token"]:
        response = hub.list_sagemaker_public_hub_models(filter=filter_value,
                                                        next_token=response["next_token"])
        models.extend(response["hub_content_summaries"])
    
    print(models)
    

    The filter argument is optional. For a list of filters you can apply, refer to SageMaker Python SDK.

  10. Use the retrieved models from the preceding command to create model references for your private hub:
    for model in models:
        print(f"Adding {model.get('hub_content_name')} to Hub")
        hub.create_model_reference(model_arn=model.get("hub_content_arn"), 
                                   model_name=model.get("hub_content_name"))

    The SageMaker JumpStart private hub offers other useful features for managing and interacting with the curated models. Administrators can check the metadata of a specific model using the hub.describe_model(model_name=<model_name>) command. To list all available models in the private hub, you can use a simple loop:

    response = hub.list_models()
    models = response["hub_content_summaries"]
    while response["next_token"]:
        response = hub.list_models(next_token=response["next_token"])
        models.extend(response["hub_content_summaries"])
    
    for model in models:
        print(model.get('HubContentArn'))
    

    If you need to remove a specific model reference from the private hub, use the following command:

    hub.delete_model_reference("<model_name>")

    If you want to delete the private hub from your account and Region, you’ll need to delete all the HubContents first, then delete the private hub. Use the following code:

    for model in models:
        hub.delete_model_reference(model_name=model.get('HubContentName')) 
    
    hub.delete()
    

Interact with allowlisted models (users)

This section offers a step-by-step guide for users to interact with allowlisted models in SageMaker JumpStart. We demonstrate how to list available models, identify a model from the public hub, and deploy the model to endpoints from SageMaker Studio as well as the SageMaker Python SDK.

User experience in SageMaker Studio

Complete the following steps to interact with allowlisted models using SageMaker Studio:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane or in the Prebuilt and automated solutions section.
  2. Choose one of the model hubs you have access to. If you have access to multiple hubs, you’ll see a list of hubs, as shown in the following screenshot.
    If you have access to only one hub, you’ll go straight to the model list.
    You can view the model details and supported actions like train, deploy, and evaluate.
  3. To deploy a model, choose Deploy.
  4. Modify your model configurations like instances and deployment parameters, and choose Deploy.

User experience using the SageMaker Python SDK

To interact with your models using the SageMaker Python SDK, complete the following steps:

  1. Just like the admin process, the first step is to force reinstall the SageMaker Python SDK:
    !pip3 install sagemaker --force-reinstall --quiet
  2. Import the SageMaker and Boto3 libraries:
    import boto3
    from sagemaker import Session
    from sagemaker.jumpstart.hub.hub import Hub
    from sagemaker.jumpstart.model import JumpStartModel
    from sagemaker.jumpstart.estimator import JumpStartEstimator
  3. To access the models in your private hub, you need the Region and the name of the hub on your account. Fill out the HUB_NAME and REGION fields with the information provided by your administrator:
    HUB_NAME="CompanyHub" 
    REGION="<your_region_name>" # for example, "us-west-2"
    sm_client = boto3.client('sagemaker') 
    sm_runtime_client = boto3.client('sagemaker-runtime') 
    session = Session(sagemaker_client=sm_client, 
                        sagemaker_runtime_client=sm_runtime_client)
    hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)
  4. List the models available in your private hub using the following command:
    response = hub.list_models()
    models = response["hub_content_summaries"]
    while response["next_token"]:
        response = hub.list_models(next_token=response["next_token"])
        models.extend(response["hub_content_summaries"])
    
    print(models)
  5. To get more information about a particular model, use the describe_model method:
    model_name = "huggingface-llm-phi-2"
    response = hub.describe_model(model_name=model_name) 
    print(response)
  6. You can deploy models in a hub with the Python SDK by using JumpStartModel. To deploy a model from the hub to an endpoint and invoke the endpoint with the default payloads, run the following code. To select which model from your hub you want to use, pass in a model_id and version. If you pass in * for the version, it will take the latest version available for that model_id in the hub. If you’re using a model gated behind a EULA, pass in accept_eula=True.
    model_id, version = "huggingface-llm-phi-2", "1.0.0"
    model = JumpStartModel(model_id, version, hub_name=HUB_NAME, 
                                region=REGION, sagemaker_session=session)
    predictor = model.deploy(accept_eula=False)
  7. To invoke your deployed model with the default payloads, use the following code:
    example_payloads = model.retrieve_all_examples()
    for payload in example_payloads:
        response = predictor.predict(payload.body)
        print("nInputn", payload.body, "nnOutputn", 
                    response[0]["generated_text"], "nn===============")
  8. To delete the model endpoints that you created, use the following code:
    predictor.delete_model()
    predictor.delete_endpoint()
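
As an alternative to predictor.predict, you can invoke a deployed endpoint directly with the sagemaker-runtime client created in Step 3, which is useful when the calling application doesn’t hold a predictor object. The following is a minimal sketch (run it before deleting the endpoint); the payload shape follows the common Hugging Face text generation format, but the exact schema accepted depends on the model container:

import json

endpoint_name = predictor.endpoint_name  # name of the endpoint created by model.deploy()

# Example payload; adjust the fields to what your model container expects
payload = {
    "inputs": "Write a haiku about machine learning.",
    "parameters": {"max_new_tokens": 64},
}

response = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))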

Cross-account sharing of private hubs

SageMaker JumpStart private hubs support cross-account sharing, allowing you to extend the benefits of your curated model repository beyond your own AWS account. This feature enables collaboration across different teams or departments within your organization, even when they operate in separate AWS accounts. By using AWS RAM, you can securely share your private hubs while maintaining control over access.

To share your private hub across accounts, complete the following steps:

  1. On the AWS RAM console, choose Create resource share.
  2. When specifying resource share details, choose the SageMaker hub resource type and select one or more private hubs that you want to share. When you share a hub with any other account, all of its contents are also shared implicitly.
  3. Associate permissions with your resource share.
  4. Use AWS account IDs to specify the accounts to which you want to grant access to your shared resources.
  5. Review your resource share configuration and choose Create resource share.

It may take a few minutes for the resource share and principal associations to complete.

Admins who want to perform the preceding steps programmatically can enter the following command to initiate the sharing:

# create a resource share using the private hub
aws ram create-resource-share \
    --name test-share \
    --resource-arns arn:aws:sagemaker:<region>:<resource_owner_account_id>:hub/<hub_name> \
    --principals <consumer_account_id> \
    --region <region>

Replace the <resource_owner_account_id>, <consumer_account_id>, <hub_name>, and <region> placeholders with the appropriate values for the resource owner account ID, consumer account ID, name of the hub, and Region to use.
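
If you script hub administration in Python, the same share can be created with the boto3 AWS RAM client. The following is a minimal sketch using the same placeholders as the CLI command:

import boto3

ram_client = boto3.client("ram", region_name="<region>")

response = ram_client.create_resource_share(
    name="test-share",
    resourceArns=[
        "arn:aws:sagemaker:<region>:<resource_owner_account_id>:hub/<hub_name>"
    ],
    principals=["<consumer_account_id>"],
)
print(response["resourceShare"]["resourceShareArn"])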

After you set up the resource share, the specified AWS account will receive an invitation to join. They must accept this invitation through AWS RAM to gain access to the shared private hub. This process makes sure access is granted only with explicit consent from both the hub owner and the recipient account. For more information, refer to Using shared AWS resources.

You can also perform this step programmatically:

# list resource share invitations
aws ram get-resource-share-invitations \
    --region <region>

# accept the resource share invitation
# using the ARN from the previous response
aws ram accept-resource-share-invitation \
  --resource-share-invitation-arn <arn_from_previous_request> \
  --region <region>

For detailed instructions on creating resource shares and accepting invitations, refer to Creating a resource share in AWS RAM. By extending your private hub across accounts, you can foster collaboration and maintain consistent model governance across your entire organization.

Conclusion

SageMaker JumpStart allows enterprises to adopt FMs while maintaining granular control over model access and usage. By creating a curated repository of approved models in private hubs, organizations can align their AI initiatives with corporate policies and regulatory requirements. The private hub decouples model curation from model consumption, enabling administrators to manage the model inventory while data scientists focus on developing AI solutions.

This post explained the private hub feature in SageMaker JumpStart and provided steps to set up and use a private hub, with minimal additional configuration required. Administrators can select models from the public SageMaker JumpStart hub, add them to the private hub, and manage user access through IAM policies. Users can then deploy these preapproved models, fine-tune them on custom datasets, and integrate them into their applications using familiar SageMaker interfaces. The private hub uses the underlying SageMaker infrastructure, allowing it to scale with enterprise-level ML demands.

For more information about SageMaker JumpStart, refer to SageMaker JumpStart. To get started using SageMaker JumpStart, access it through SageMaker Studio.

About the Authors

Raju Rangan is a Senior Solutions Architect at AWS. He works with government-sponsored entities, helping them build AI/ML solutions using AWS. When not tinkering with cloud solutions, you’ll catch him hanging out with family or smashing birdies in a lively game of badminton with friends.

Sherry Ding is a senior AI/ML specialist solutions architect at AWS. She has extensive experience in machine learning with a PhD in computer science. She mainly works with public sector customers on various AI/ML-related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.

Read More

eSentire delivers private and secure generative AI interactions to customers with Amazon SageMaker

eSentire is an industry-leading provider of Managed Detection & Response (MDR) services protecting users, data, and applications of over 2,000 organizations globally across more than 35 industries. These security services help their customers anticipate, withstand, and recover from sophisticated cyber threats, prevent disruption from malicious attacks, and improve their security posture.

In 2023, eSentire was looking for ways to deliver differentiated customer experiences by continuing to improve the quality of its security investigations and customer communications. To accomplish this, eSentire built AI Investigator, a natural language query tool for their customers to access security platform data by using AWS generative artificial intelligence (AI) capabilities.

In this post, we share how eSentire built AI Investigator using Amazon SageMaker to provide private and secure generative AI interactions to their customers.

Benefits of AI Investigator

Before AI Investigator, customers would engage eSentire’s Security Operation Center (SOC) analysts to understand and further investigate their asset data and associated threat cases. This involved manual effort for customers and eSentire analysts, forming questions and searching through data across multiple tools to formulate answers.

eSentire’s AI Investigator enables users to complete complex queries using natural language by joining multiple sources of data from each customer’s own security telemetry and eSentire’s asset, vulnerability, and threat data mesh. This helps customers quickly and seamlessly explore their security data and accelerate internal investigations.

Providing AI Investigator internally to the eSentire SOC workbench has also accelerated eSentire’s investigation process by improving the scale and efficacy of multi-telemetry investigations. The LLM models augment SOC investigations with knowledge from eSentire’s security experts and security data, enabling higher-quality investigation outcomes while also reducing time to investigate. Over 100 SOC analysts are now using AI Investigator models to analyze security data and provide rapid investigation conclusions.

Solution overview

eSentire customers expect rigorous security and privacy controls for their sensitive data, which requires an architecture that doesn’t share data with external large language model (LLM) providers. Therefore, eSentire decided to build their own LLM using Llama 1 and Llama 2 foundation models. A foundation model (FM) is an LLM that has undergone unsupervised pre-training on a corpus of text. eSentire tried multiple FMs available in AWS for their proof of concept; however, the straightforward access to Meta’s Llama 2 FM through Hugging Face in SageMaker for training and inference (and their licensing structure) made Llama 2 an obvious choice.

eSentire has over 2 TB of signal data stored in their Amazon Simple Storage Service (Amazon S3) data lake. eSentire used gigabytes of additional human investigation metadata to perform supervised fine-tuning on Llama 2. This further step updates the FM by training with data labeled by security experts (such as Q&A pairs and investigation conclusions).

eSentire used SageMaker on several levels, ultimately facilitating their end-to-end process:

  • They used SageMaker notebook instances extensively to spin up GPU instances, giving them the flexibility to swap high-power compute in and out when needed. eSentire used instances with CPU for data preprocessing and post-inference analysis and GPU for the actual model (LLM) training.
  • The additional benefit of SageMaker notebook instances is their streamlined integration with eSentire’s AWS environment. Because they have vast amounts of data (terabyte scale, over 1 billion total rows of relevant data in preprocessing input) stored across AWS—in Amazon S3 and Amazon Relational Database Service (Amazon RDS) for PostgreSQL clusters—SageMaker notebook instances allowed secure movement of this volume of data directly from the AWS source (Amazon S3 or Amazon RDS) to the SageMaker notebook. They needed no additional infrastructure for data integration.
  • SageMaker real-time inference endpoints provide the infrastructure needed for hosting their custom self-trained LLMs. This was very useful in combination with SageMaker integration with Amazon Elastic Container Registry (Amazon ECR), SageMaker endpoint configuration, and SageMaker models to provide the entire configuration required to spin up their LLMs as needed. The fully featured end-to-end deployment capability provided by SageMaker allowed eSentire to effortlessly and consistently update their model registry as they iterate and update their LLMs. All of this was entirely automated with the software development lifecycle (SDLC) using Terraform and GitHub, which is only possible through the SageMaker ecosystem.

The following diagram visualizes the architecture and workflow.

The application’s frontend is accessible through Amazon API Gateway, using both edge and private gateways. To emulate intricate thought processes akin to those of a human investigator, eSentire engineered a system of chained agent actions. This system uses AWS Lambda and Amazon DynamoDB to orchestrate a series of LLM invocations. Each LLM call builds upon the previous one, creating a cascade of interactions that collectively produce high-quality responses. This intricate setup makes sure that the application’s backend data sources are seamlessly integrated, thereby providing tailored responses to customer inquiries.

When a SageMaker endpoint is constructed, it is given an S3 URI to the bucket containing the model artifact, and the Docker image is shared using Amazon ECR.
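
The post doesn’t include eSentire’s deployment code, but the general pattern for hosting a self-trained model on a SageMaker real-time endpoint can be sketched with the boto3 SageMaker client. The resource names, image URI, artifact path, and execution role below are placeholders for illustration:

import boto3

sm = boto3.client("sagemaker")

model_name = "custom-llm"
endpoint_config_name = "custom-llm-config"
endpoint_name = "custom-llm-endpoint"

# Register the model: inference image from Amazon ECR plus model artifact in Amazon S3
sm.create_model(
    ModelName=model_name,
    PrimaryContainer={
        "Image": "<account_id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>",
        "ModelDataUrl": "s3://<bucket>/<path>/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::<account_id>:role/<sagemaker_execution_role>",
)

# Endpoint configuration: instance type and count for real-time inference
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create the real-time inference endpoint from the configuration
sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)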

For their proof of concept, eSentire selected the Nvidia A10G Tensor Core GPU housed in an ml.g5.2xlarge instance for its balance of performance and cost. For LLMs with significantly larger numbers of parameters, which demand greater computational power for both training and inference tasks, eSentire used ml.g5.12xlarge instances equipped with four GPUs. This was necessary because the computational complexity and the amount of memory required for LLMs can increase exponentially with the number of parameters. eSentire plans to harness P4 and P5 instance types for scaling their production workloads.

Additionally, a monitoring framework that captures the inputs and outputs of AI Investigator was necessary to enable threat hunting visibility to LLM interactions. To accomplish this, the application integrates with an open source eSentire LLM Gateway project to monitor the interactions with customer queries, backend agent actions, and application responses. This framework enables confidence in complex LLM applications by providing a security monitoring layer to detect malicious poisoning and injection attacks while also providing governance and support for compliance through logging of user activity. The LLM gateway can also be integrated with other LLM services, such as Amazon Bedrock.

Amazon Bedrock enables you to customize FMs privately and interactively, without the need for coding. Initially, eSentire’s focus was on training bespoke models using SageMaker. As their strategy evolved, they began to explore a broader array of FMs, evaluating their in-house trained models against those provided by Amazon Bedrock. Amazon Bedrock offers a practical environment for benchmarking and a cost-effective solution for managing workloads due to its serverless operation. This serves eSentire well, especially when customer queries are sporadic, making serverless an economical alternative to persistently running SageMaker instances.

From a security perspective as well, Amazon Bedrock doesn’t share users’ inputs and model outputs with any model providers. Additionally, eSentire has custom guardrails for NL2SQL applied to their models.

Results

The following screenshot shows an example of eSentire’s AI Investigator output. As illustrated, a natural language query is posed to the application. The tool is able to correlate multiple datasets and present a response.

Dustin Hillard, CTO of eSentire, shares: “eSentire customers and analysts ask hundreds of security data exploration questions per month, which typically take hours to complete. AI Investigator is now in an initial rollout to over 100 customers and more than 100 SOC analysts, providing a self-serve immediate response to complex questions about their security data. eSentire LLM models are saving thousands of hours of customer and analyst time.”

Conclusion

In this post, we shared how eSentire built AI Investigator, a generative AI solution that provides private and secure self-serve customer interactions. Customers can get near real-time answers to complex questions about their data. AI Investigator has also saved eSentire significant analyst time.

The aforementioned LLM gateway project is eSentire’s own product and AWS bears no responsibility.

If you have any comments or questions, share them in the comments section.


About the Authors

Aishwarya Subramaniam is a Sr. Solutions Architect in AWS. She works with commercial customers and AWS partners to accelerate customers’ business outcomes by providing expertise in analytics and AWS services.

Ilia Zenkov is a Senior AI Developer specializing in generative AI at eSentire. He focuses on advancing cybersecurity with expertise in machine learning and data engineering. His background includes pivotal roles in developing ML-driven cybersecurity and drug discovery platforms.

Dustin Hillard is responsible for leading product development and technology innovation, systems teams, and corporate IT at eSentire. He has deep ML experience in speech recognition, translation, natural language processing, and advertising, and has published over 30 papers in these areas.

Read More

Imperva optimizes SQL generation from natural language using Amazon Bedrock

This is a guest post co-written with Ori Nakar from Imperva.

Imperva Cloud WAF protects hundreds of thousands of websites against cyber threats and blocks billions of security events every day. Counters and insights based on security events are calculated daily and used by users from multiple departments. Millions of counters are added daily, together with 20 million insights updated daily to spot threat patterns.

Our goal was to improve the user experience of an existing application used to explore the counters and insights data. The data is stored in a data lake and retrieved by SQL using Amazon Athena.

As part of our solution, we replaced multiple search fields with a single free text field. We used a large language model (LLM) with query examples to make the search work using the language used by Imperva internal users (business analysts).

The following figure shows a search query that was translated to SQL and run. The results were later formatted as a chart by the application. We have many types of insights—global, industry, and customer level insights used by multiple departments such as marketing, support, and research. Data was made available to our users through a simplified user experience powered by an LLM.

Figure 1: Insights search by natural language

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon within a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Studio is a new single sign-on (SSO)-enabled web interface that provides a way for developers across an organization to experiment with LLMs and other FMs, collaborate on projects, and iterate on generative AI applications. It offers a rapid prototyping environment and streamlines access to multiple FMs and developer tools in Amazon Bedrock.

Read more to learn about the problem, and how we obtained quality results using Amazon Bedrock for our experimentation and deployment.

The problem

Making data accessible to users through applications has always been a challenge. Data is normally stored in databases, and can be queried using the most common query language, SQL. Applications use different UI components to allow users to filter and query the data. There are applications with tens of different filters and other options–all created to make the data accessible.

Querying databases through applications cannot be as flexible as running SQL queries on a known schema. Giving more power to the user comes at the expense of a simple user experience (UX). Natural language can solve this problem—it’s possible to support complex yet readable natural language queries without SQL knowledge. On schema changes, the application UX and code remain the same, or require only minor changes, which saves development time and keeps the application user interface (UI) stable for the users.

Constructing SQL queries from natural language isn’t a simple task. SQL queries must be accurate both syntactically and logically. Using an LLM with the right examples can make this task less difficult.

Figure 2: High level database access using an LLM flow

The challenge

An LLM can construct SQL queries based on natural language. The challenge is to assure quality. The user can enter any text, and the application constructs a query based on it. There isn’t an option, like in traditional applications, to cover all options and make sure the application functions correctly. Adding an LLM to an application adds another layer of complexity. The response by the LLM is not deterministic. Examples sent to the LLM are based on the database data, which makes it even harder to control the requests sent to the LLM and assure quality.

The solution: A data science approach

In data science, it’s common to develop a model and fine tune it using experimentation. The idea is to use metrics to compare experiments during development. Experiments might differ from each other in many ways, such as the input sent to the model, the model type, and other parameters. The ability to compare different experiments makes it possible to make progress. It’s possible to know how each change contributes to the model.

A test set is a static set of records that includes a prediction result for each record. Running predictions on the test set records produces the metrics needed to compare experiments. A common metric is accuracy, which is the percentage of correct results.

In our case, the results generated by the LLM are SQL statements. The SQL statements generated by the LLM are not deterministic and are hard to measure; however, running SQL statements on a static test database is deterministic and can be measured. We used a test database and a list of questions with known answers as a test set. This allowed us to run experiments and fine-tune our LLM-based application.
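
The post doesn’t include Imperva’s evaluation code, but the idea can be sketched in a few lines: run each generated SQL statement against the static test database and compare its result set with the expected answer. The following illustration uses SQLite and made-up table and column names:

import sqlite3

def run_query(conn, sql):
    """Run a SQL statement on the test database; return rows, or None on a SQL error."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def evaluate_test_set(test_set, generate_sql, conn):
    """test_set: list of (question, expected_rows); generate_sql: question -> SQL string."""
    correct = errors = 0
    for question, expected_rows in test_set:
        rows = run_query(conn, generate_sql(question))
        if rows is None:
            errors += 1
        elif sorted(rows) == sorted(expected_rows):
            correct += 1
    n = len(test_set)
    return {"accuracy": correct / n, "sql_error_rate": errors / n}

# Tiny in-memory test database and a one-question test set
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])
test_set = [("How many orders are there?", [(2,)])]

print(evaluate_test_set(test_set, lambda q: "SELECT COUNT(*) FROM orders", conn))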

Database access using LLM: Question to answer flow

Given a question, we defined the following flow. The question is sent through a retrieval-augmented generation (RAG) process, which finds similar documents. Each document holds an example question and information about it. The relevant documents are built into a prompt and sent to the LLM, which builds a SQL statement. This flow is used both for development and application runtime:

Figure 3: Question to answer flow

As an example, consider a database schema with two tables: orders and items. The following figure is a question to SQL example flow:

Figure 4: Question to answer flow example

Database access using LLM: Development process

To develop and fine-tune the application, we created the following datasets:

  • A static test database: Contains the relevant tables and a sample copy of the data.
  • A test set: Includes questions and test database result answers.
  • Question to SQL examples: A set with questions and their translation to SQL. For some examples, returned data is included to allow asking questions about the data, and not only about the schema.

Development of the application is done by adding new questions and updating the different datasets, as shown in the following figure.

Figure 5: Adding a new question

Datasets and other parameter updates are tracked as part of adding new questions and fine-tuning of the application. We used a tracking tool to track information about the experiments such as:

  • Parameters such as the number of questions, number of examples, LLM type, RAG search method
  • Metrics such as accuracy and SQL error rate
  • Artifacts such as a list of the wrong results including generated SQL, data returned, and more

Figure 6: Experiment flow

Using a tracking tool, we were able to make progress by comparing experiments. The following figure shows the accuracy and error rate metrics for the different experiments we did:

Figure 7: Accuracy and error rate over time

When there’s a mistake or an error, we drill down into the false results and the experiment details to understand the source of the error and fix it.

Experiment and deploy using Amazon Bedrock

Amazon Bedrock is a managed service that offers a choice of high-performing foundation models. You can experiment with and evaluate top FMs for your use case and customize them with your data.

By using Amazon Bedrock, we were able to switch between models and embedding options easily. The following is example code using the LangChain Python library, which allows you to use different models and embeddings:

import boto3
from langchain_community.llms.bedrock import Bedrock
from langchain_community.embeddings import BedrockEmbeddings

def get_llm(model_id: str, args: dict):
   return Bedrock(model_id=model_id,
                  model_kwargs=args,
                  client=boto3.client("bedrock-runtime"))

def get_embeddings(model_id: str):
   return BedrockEmbeddings(model_id=model_id, 
                            client=boto3.client("bedrock-runtime"))

We used multiple models and embeddings with different hyperparameters to improve accuracy and decide which model is the best fit for us. We also ran experiments on smaller models, to determine if we could get the same quality with improved performance and reduced costs. We started using Anthropic Claude 2.1 and experimented with the Anthropic Claude Instant model. Accuracy dropped by 20 percent, but after adding a few additional examples, we achieved the same accuracy as Claude 2.1 with lower cost and faster response time.

Conclusion

We used the same approach used in data science projects to construct SQL queries from natural language. The solution shown can be applied to other LLM-based applications, and not only for constructing SQL. For example, it can be used for API access, building JSON data, and more. The key is to create a test set together with measurable results and progress using experimentation.

Amazon Bedrock lets you use different models and switch between them to find the right one for your use case. You can compare different models, including small ones for better performance and costs. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure. We were able to test multiple models quickly, and finally integrate and deploy generative AI capabilities into our application.

You can start experimenting with natural language to SQL by running the code samples in this GitHub repository. This workshop is divided into modules that each build on the previous one while introducing a new technique to solve this problem. Many of these approaches are based on existing work from the community and are cited accordingly.


About the Authors

Ori Nakar is a Principal cyber-security researcher, a data engineer, and a data scientist at the Imperva Threat Research group.

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect at AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Elad Eizner is a Solutions Architect at Amazon Web Services. He works with AWS enterprise customers to help them architect and build solutions in the cloud and achieve their goals.

Read More

Create natural conversations with Amazon Lex QnAIntent and Knowledge Bases for Amazon Bedrock

Customer service organizations today face an immense opportunity. As customer expectations grow, brands have a chance to creatively apply new innovations to transform the customer experience. Although meeting rising customer demands poses challenges, the latest breakthroughs in conversational artificial intelligence (AI) empower companies to meet these expectations.

Customers today expect timely responses to their questions that are helpful, accurate, and tailored to their needs. The new QnAIntent, powered by Amazon Bedrock, can meet these expectations by understanding questions posed in natural language and responding conversationally in real time using your own authorized knowledge sources. Our Retrieval Augmented Generation (RAG) approach allows Amazon Lex to harness both the breadth of knowledge available in repositories as well as the fluency of large language models (LLMs).

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In this post, we show you how to add generative AI question answering capabilities to your bots. This can be done using your own curated knowledge sources, and without writing a single line of code.

Read on to discover how QnAIntent can transform your customer experience.

Solution overview

Implementing the solution consists of the following high-level steps:

  1. Create an Amazon Lex bot.
  2. Create an Amazon Simple Storage Service (Amazon S3) bucket and upload a PDF file that contains the information used to answer questions.
  3. Create a knowledge base that will split your data into chunks and generate embeddings using the Amazon Titan Embeddings model. As part of this process, Knowledge Bases for Amazon Bedrock automatically creates an Amazon OpenSearch Serverless vector search collection to hold your vectorized data.
  4. Add a new QnAIntent intent that will use the knowledge base to find answers to customers’ questions and then use the Anthropic Claude model to generate answers, including answers to follow-up questions.

Prerequisites

To follow along with the features described in this post, you need access to an AWS account with permissions to access Amazon Lex, Amazon Bedrock (with access to Anthropic Claude models and Amazon Titan embeddings or Cohere Embed), Knowledge Bases for Amazon Bedrock, and the OpenSearch Serverless vector engine. To request access to models in Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, choose Model access in the navigation pane.
  2. Choose Manage model access.
  3. Select the Amazon and Anthropic models. (You can also choose to use Cohere models for embeddings.)


  4. Choose Request model access.

Create an Amazon Lex bot

If you already have a bot you want to use, you can skip this step.

  1. On the Amazon Lex console, choose Bots in the navigation pane.
  2. Choose Create bot.
  3. Select Start with an example and choose the BookTrip example bot.
  4. For Bot name, enter a name for the bot (for example, BookHotel).
  5. For Runtime role, select Create a role with basic Amazon Lex permissions.
  6. In the Children’s Online Privacy Protection Act (COPPA) section, you can select No because this bot is not targeted at children under the age of 13.
  7. Keep the Idle session timeout setting at 5 minutes.
  8. Choose Next.
  9. When using the QnAIntent to answer questions in a bot, you may want to increase the intent classification confidence threshold so that your questions are not accidentally interpreted as matching one of your intents. We set this to 0.8 for now. You may need to adjust this up or down based on your own testing.
  10. Choose Done.
  11. Choose Save intent.

Upload content to Amazon S3

Now you create an S3 bucket to store the documents you want to use for your knowledge base.

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique name.
  4. Keep the default values for all other options and choose Create bucket.

For this post, we created an FAQ document for the fictitious hotel chain called Example Corp FictitiousHotels. Download the PDF document to follow along.

  1. On the Buckets page, navigate to the bucket you created.

If you don’t see it, you can search for it by name.

  2. Choose Upload.
  3. Choose Add files.
  4. Choose the ExampleCorpFicticiousHotelsFAQ.pdf that you downloaded.
  5. Choose Upload.

The file will now be accessible in the S3 bucket.
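
If you prefer to script this step instead of using the console, a minimal boto3 sketch that creates the bucket and uploads the downloaded PDF might look like the following (the bucket name and Region are placeholders; Regions other than us-east-1 require the location constraint shown):

import boto3

region = "<your_region_name>"
bucket_name = "<your_unique_bucket_name>"

s3 = boto3.client("s3", region_name=region)

# us-east-1 rejects an explicit LocationConstraint; other Regions require it
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# Upload the FAQ document that the knowledge base will ingest
s3.upload_file(
    "ExampleCorpFicticiousHotelsFAQ.pdf",
    bucket_name,
    "ExampleCorpFicticiousHotelsFAQ.pdf",
)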

Create a knowledge base

Now you can set up the knowledge base:

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. For Knowledge base name, enter a name.
  4. For Knowledge base description, enter an optional description.
  5. Select Create and use a new service role.
  6. For Service role name, enter a name or keep the default.
  7. Choose Next.
  8. For Data source name, enter a name.
  9. Choose Browse S3 and navigate to the S3 bucket you uploaded the PDF file to earlier.
  10. Choose Next.
  11. Choose an embeddings model.
  12. Select Quick create a new vector store to create a new OpenSearch Serverless vector store to store the vectorized content.
  13. Choose Next.
  14. Review your configuration, then choose Create knowledge base.

After a few minutes, the knowledge base will have been created.

  1. Choose Sync to chunk the documents, calculate the embeddings, and store them in the vector store.

This may take a while. You can proceed with the rest of the steps, but the syncing needs to finish before you can query the knowledge base.

  1. Copy the knowledge base ID. You will reference this when you add this knowledge base to your Amazon Lex bot.

Add QnAIntent to the Amazon Lex bot

To add QnAIntent, complete the following steps:

  1. On the Amazon Lex console, choose Bots in the navigation pane.
  2. Choose your bot.
  3. In the navigation pane, choose Intents.
  4. On the Add intent menu, choose Use built-in intent.
  5. For Built-in intent, choose AMAZON.QnAIntent.
  6. For Intent name, enter a name.
  7. Choose Add.
  8. Choose the model you want to use to generate the answers (in this case, Anthropic Claude 3 Sonnet, but you can select Anthropic Claude 3 Haiku for a cheaper option with less latency).
  9. For Choose knowledge store, select Knowledge base for Amazon Bedrock.
  10. For Knowledge base for Amazon Bedrock Id, enter the ID you noted earlier when you created your knowledge base.
  11. Choose Save Intent.
  12. Choose Build to build the bot.
  13. Choose Test to test the new intent.

The following screenshot shows an example conversation with the bot.

In the second question about the Miami pool hours, you refer back to the previous question about pool hours in Las Vegas and still get a relevant answer based on the conversation history.

It’s also possible to ask questions that require the bot to reason a bit around the available data. When we asked about a good resort for a family vacation, the bot recommended the Orlando resort based on the availability of activities for kids, proximity to theme parks, and more.
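
You can also exercise the bot programmatically rather than through the console test window. The following is a minimal sketch using the Lex V2 runtime API; the bot ID, alias ID, and locale are placeholders you can find on your bot’s detail page:

import boto3

lex_runtime = boto3.client("lexv2-runtime")

response = lex_runtime.recognize_text(
    botId="<your_bot_id>",
    botAliasId="<your_bot_alias_id>",
    localeId="en_US",
    sessionId="test-session-1",
    text="What are the pool hours in Las Vegas?",
)

# Print the bot's reply messages
for message in response.get("messages", []):
    print(message["content"])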

Update the confidence threshold

You may have some questions accidentally match your other intents. If you run into this, you can adjust the confidence threshold for your bot. To modify this setting, choose the language of your bot (English) and in the Language details section, choose Edit.

After you update the confidence threshold, rebuild the bot for the change to take effect.

Add additional steps

By default, the next step in the conversation for the bot is set to Wait for user input after a question has been answered. This keeps the conversation in the bot and allows a user to ask follow-up questions or invoke any of the other intents in your bot.

If you want the conversation to end and return control to the calling application (for example, Amazon Connect), you can change this behavior to End conversation. To update the setting, complete the following steps:

  1. On the Amazon Lex console, navigate to the QnAIntent.
  2. In the Fulfillment section, choose Advanced options.
  3. On the Next step in conversation dropdown menu, choose End conversation.

If you would like the bot to add a specific message after each response from the QnAIntent (such as “Can I help you with anything else?”), you can add a closing response to the QnAIntent.

Clean up

To avoid incurring ongoing costs, delete the resources you created as part of this post:

  • Amazon Lex bot
  • S3 bucket
  • OpenSearch Serverless collection (This is not automatically deleted when you delete your knowledge base)
  • Knowledge bases

Conclusion

The new QnAIntent in Amazon Lex enables natural conversations by connecting customers with curated knowledge sources. Powered by Amazon Bedrock, the QnAIntent understands questions in natural language and responds conversationally, keeping customers engaged with contextual, follow-up responses.

QnAIntent puts the latest innovations in reach to transform static FAQs into flowing dialogues that resolve customer needs. This helps scale excellent self-service to delight customers.

Try it out for yourself. Reinvent your customer experience!


About the Author

Thomas Rindfuss is a Sr. Solutions Architect on the Amazon Lex team. He invents, develops, prototypes, and evangelizes new technical features and solutions for Language AI services that improve the customer experience and ease adoption.

Read More

Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock

Retrieval Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by incorporating external knowledge sources. It allows LLMs to reference authoritative knowledge bases or internal repositories before generating responses, producing output tailored to specific domains or contexts while providing relevance, accuracy, and efficiency. RAG achieves this enhancement without retraining the model, making it a cost-effective solution for improving LLM performance across various applications. The following diagram illustrates the main steps in a RAG system.

Retrieval Augmented Generation RAG Architecture

Although RAG systems are promising, they face challenges like retrieving the most relevant knowledge, avoiding hallucinations inconsistent with the retrieved context, and efficient integration of retrieval and generation components. In addition, RAG architecture can lead to potential issues like retrieval collapse, where the retrieval component learns to retrieve the same documents regardless of the input. A similar problem occurs for some tasks like open-domain question answering—there are often multiple valid answers available in the training data, therefore the LLM could choose to generate an answer from its training data. Another challenge is the need for an effective mechanism to handle cases where no useful information can be retrieved for a given input. Current research aims to improve these aspects for more reliable and capable knowledge-grounded generation.

Given these challenges faced by RAG systems, monitoring and evaluating generative artificial intelligence (AI) applications powered by RAG is essential. Moreover, tracking and analyzing the performance of RAG-based applications is crucial, because it helps assess their effectiveness and reliability when deployed in real-world scenarios. By evaluating RAG applications, you can understand how well the models are using and integrating external knowledge into their responses, how accurately they can retrieve relevant information, and how coherent the generated outputs are. Additionally, evaluation can identify potential biases, hallucinations, inconsistencies, or factual errors that may arise from the integration of external sources or from sub-optimal prompt engineering. Ultimately, a thorough evaluation of RAG-based applications is important for their trustworthiness, improving their performance, optimizing cost, and fostering their responsible deployment in various domains, such as question answering, dialogue systems, and content generation.

In this post, we show you how to evaluate the performance, trustworthiness, and potential biases of your RAG pipelines and applications on Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

RAG evaluation and observability challenges in real-world scenarios

Evaluating a RAG system poses significant challenges due to its complex architecture consisting of multiple components, such as the retrieval module and the generation component represented by the LLMs. Each module operates differently and requires distinct evaluation methodologies, making it difficult to assess the overall end-to-end performance of the RAG architecture. The following are some of the challenges you may encounter:

  • Lack of ground truth references – In many open-ended generation tasks, there is no single correct answer or reference text against which to evaluate the system’s output. This makes it difficult to apply standard evaluation metrics like BERTScore (Zhang et al. 2020), BLEU, or ROUGE used for machine translation and summarization.
  • Faithfulness evaluation – A key requirement for RAG systems is that the generated output should be faithful and consistent with the retrieved context. Evaluating this faithfulness, which also serves to measure the presence of hallucinated content, in an automated manner is non-trivial, especially for open-ended responses.
  • Context relevance assessment – The quality of the RAG output depends heavily on retrieving the right contextual knowledge. Automatically assessing the relevance of the retrieved context to the input prompt is an open challenge.
  • Factuality vs. coherence trade-off – Although factual accuracy from the retrieved knowledge is important, the generated text should also be naturally coherent. Evaluating and balancing factual consistency with language fluency is difficult.
  • Compounding errors, diagnosis, and traceability – Errors can compound from the retrieval and generation components. Diagnosing whether errors stem from retrieval failures or generation inconsistencies is hard without clear intermediate outputs. Given the complex interplay between various components of the RAG architecture, it’s also difficult to provide traceability of the problem in the evaluation process.
  • Human evaluation challenges – Although human evaluation is possible for sample outputs, it’s expensive and subjective, and may not scale well for comprehensive system evaluation across many examples. The need for a domain expert to create and evaluate against a dataset is essential, because the evaluation process requires specialized knowledge and expertise. The labor-intensive nature of the human evaluation process is time-consuming, because it often involves manual effort.
  • Lack of standardized benchmarks – There are no widely accepted and standardized benchmarks yet for holistically evaluating different capabilities of RAG systems. Without such benchmarks, it can be challenging to compare the various capabilities of different RAG techniques, models, and parameter configurations. Consequently, you may face difficulties in making informed choices when selecting the most appropriate RAG approach that aligns with your unique use case requirements.

Addressing these evaluation and observability challenges is an active area of research, because robust metrics are critical for iterating on and deploying reliable RAG systems for real-world applications.

RAG evaluation concepts and metrics

As mentioned previously, a RAG-based generative AI application is composed of two main processes: retrieval and generation. Retrieval is the process where the application uses the user query to retrieve the relevant documents from a knowledge base and then adds them as context to augment the final prompt. Generation is the process of generating the final response from the LLM. It’s important to monitor and evaluate both processes because they impact the performance and reliability of the application.

Evaluating RAG systems at scale requires an automated approach to extract metrics that are quantitative indicators of its reliability. Generally, the metrics to look for are grouped by main RAG components or by domains. Aside from the metrics discussed in this section, you can incorporate tailored metrics that align with your business objectives and priorities.

Retrieval metrics

You can use the following retrieval metrics:

  • Context relevance – This measures whether the passages or chunks retrieved by the RAG system are relevant for answering the given query, without including extraneous or irrelevant details. The values range from 0–1, with higher values indicating better context relevancy.
  • Context recall – This evaluates how well the retrieved context matches the annotated answer, which is treated as the ground truth. It’s computed based on the ground truth answer and the retrieved context. The values range between 0–1, with higher values indicating better performance.
  • Context precision – This measures whether all the truly relevant pieces of information from the given context are ranked highly. The preferred scenario is when all the relevant chunks are placed at the top ranks. This metric is calculated by considering the question, the ground truth (correct answer), and the context, with values ranging from 0–1, where higher scores indicate better precision.

Generation metrics

You can use the following generation metrics:

  • Faithfulness – This measures whether the answer generated by the RAG system is faithful to the information contained in the retrieved passages. The aim is to avoid hallucinations and make sure the output is justified by the context provided as input to the RAG system. The metric ranges from 0–1, with higher values indicating better performance.
  • Answer relevance – This measures whether the generated answer is relevant to the given query. It penalizes cases where the answer contains redundant information or doesn’t sufficiently answer the actual query. Values range between 0–1, where higher scores indicate better answer relevancy.
  • Answer semantic similarity – It compares the meaning and content of a generated answer with a reference or ground truth answer. It evaluates how closely the generated answer matches the intended meaning of the ground truth answer. The score ranges from 0–1, with higher scores indicating greater semantic similarity between the two answers. A score of 1 means that the generated answer conveys the same meaning as the ground truth answer, whereas a score of 0 suggests that the two answers have completely different meanings.

Aspects evaluation

Aspects are evaluated as follows:

  • Harmfulness (Yes, No) – If the generated answer carries the risk of causing harm to people, communities, or more broadly to society
  • Maliciousness (Yes, No) – If the submission intends to harm, deceive, or exploit users
  • Coherence (Yes, No) – If the generated answer presents ideas, information, or arguments in a logical and organized manner
  • Correctness (Yes, No) – If the generated answer is factually accurate and free from errors
  • Conciseness (Yes, No) – If the submission conveys information or ideas clearly and efficiently, without unnecessary or redundant details

The RAG Triad proposed by TruLens consists of three distinct assessments, as shown in the following figure: evaluating the relevance of the context, examining the grounding of the information, and assessing the relevance of the answer provided. Achieving satisfactory scores across all three evaluations provides confidence that the corresponding RAG application is not generating hallucinated or fabricated content.

RAG Triad

The RAGAS paper proposes automated metrics to evaluate these three quality dimensions in a reference-free manner, without needing human-annotated ground truth answers. This is done by prompting a language model and analyzing its outputs appropriately for each aspect.
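
For illustration, the following is a minimal sketch of how these RAGAS-style metrics can be computed with the open source ragas library over a small evaluation dataset. The module paths, metric names, and required dataset columns are assumptions that may differ between library versions, and ragas needs a judge LLM and embeddings configured (refer to the library documentation):

# A minimal sketch, assuming the ragas library; metric names and dataset
# columns may differ between versions, and a judge LLM must be configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_dataset = Dataset.from_dict({
    "question": ["How do I care for silk?"],
    "contexts": [["Silk: Dry cleaning may be required. Some silks are hand- or machine-washable."]],
    "answer": ["Silk often requires dry cleaning, although some silks can be hand-washed."],
})

results = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy])
print(results)  # per-metric scores between 0 and 1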

To automate the evaluation at scale, metrics are computed using machine learning (ML) models called judges. Judges can be LLMs with reasoning capabilities, lightweight language models that are fine-tuned for evaluation tasks, or transformer models that compute similarities between text chunks such as cross-encoders.

Metric outcomes

When metrics are computed, they need to be examined to further optimize the system in a feedback loop:

  • Low context relevance means that the retrieval process isn’t fetching the relevant context. Therefore, data parsing, chunk sizes, and embedding models need to be optimized.
  • Low answer faithfulness means that the generation process is likely subject to hallucination, where the answer is not fully based on the retrieved context. In this case, the model choice needs to be revisited or further prompt engineering needs to be done.
  • Low answer relevance means that the answer generated by the model doesn’t correspond to the user query, and further prompt engineering or fine-tuning needs to be done.

Solution overview

You can use Amazon Bedrock to evaluate your RAG-based applications. In the following sections, we go over the steps to implement this solution:

  1. Set up observability.
  2. Prepare the evaluation dataset.
  3. Choose the metrics and prepare the evaluation prompts.
  4. Aggregate and review the metric results, then optimize the RAG system.

The following diagram illustrates the continuous process for optimizing a RAG system.

RAG Evaluation and Optimization Cycle

Set up observability

In a RAG system, multiple components (input processing, embedding, retrieval, prompt augmentation, generation, and output formatting) interact to generate answers assisted by external knowledge sources. Monitoring arriving user queries, search results, metadata, and component latencies helps developers identify performance bottlenecks, understand system interactions, monitor for issues, and conduct root cause analysis, all of which are essential for maintaining, optimizing, and scaling the RAG system effectively.

In addition to metrics and logs, tracing is essential for setting up observability for a RAG system due to its distributed nature. The first step to implement tracing in your RAG system is to instrument your application. Instrumenting your application involves adding code to your application, automatically or manually, to send trace data for incoming and outbound requests and other events within your application, along with metadata about each request. There are several different instrumentation options you can choose from or combine, based on your particular requirements:

  • Auto instrumentation – Instrument your application with zero code changes, typically through configuration changes, adding an auto-instrumentation agent, or other mechanisms
  • Library instrumentation – Make minimal application code changes to add pre-built instrumentation targeting specific libraries or frameworks, such as the AWS SDK, LangChain, or LlamaIndex
  • Manual instrumentation – Add instrumentation code to your application at each location where you want to send trace information

To store and analyze your application traces, you can use AWS X-Ray or third-party tools like Arize Phoenix.
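
As an illustration, the following is a minimal sketch of manual instrumentation using the OpenTelemetry Python API. The span names, attributes, and the retrieve and generate placeholders are hypothetical, and the exporter configuration (for example, to AWS X-Ray through the AWS Distro for OpenTelemetry, or to Arize Phoenix) is omitted:

# A minimal manual-instrumentation sketch using the OpenTelemetry Python API.
# Span names, attributes, and the retrieve/generate placeholders are illustrative;
# exporter configuration is omitted.
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

def retrieve(query):          # placeholder retriever for illustration
    return ["example chunk"]

def generate(query, chunks):  # placeholder LLM call for illustration
    return "example answer"

def answer_query(query):
    with tracer.start_as_current_span("rag.request") as request_span:
        request_span.set_attribute("rag.query", query)
        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            chunks = retrieve(query)
            retrieve_span.set_attribute("rag.num_chunks", len(chunks))
        with tracer.start_as_current_span("rag.generate"):
            answer = generate(query, chunks)
        request_span.set_attribute("rag.answer_length", len(answer))
        return answer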

Prepare the evaluation dataset

To evaluate the reliability of your RAG system, you need a dataset that evolves with time, reflecting the state of your RAG system. Each evaluation record contains at least the first three of the following elements:

  • Human query – The user query that arrives in the RAG system
  • Reference document – The document content retrieved and added as a context to the final prompt
  • AI answer – The generated answer from the LLM
  • Ground truth – Optionally, you can add ground truth information:
    • Context ground truth – The documents or chunks relevant to the human query
    • Answer ground truth – The correct answer to the human query

If you have set up tracing, your RAG system traces already contain these elements, so you can either use them to prepare your evaluation dataset, or you can create a curated synthetic dataset specifically for evaluation purposes based on your indexed data. In this post, we use Anthropic’s Claude 3 Sonnet, available in Amazon Bedrock, to evaluate the reliability of sample trace data of a RAG system that indexes the FAQs from the Zappos website.
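
For illustration, one evaluation record pulled from your traces could be represented as a simple dictionary; the field names below are assumptions, and the ground truth fields are optional:

# A minimal sketch of an evaluation record; the field names are illustrative.
evaluation_record = {
    "query": "How can I identify if an email claiming to be from Zappos is legitimate?",  # human query
    "reference": "Order-related emails typically come from cs-noreply@zappos.com ...",    # retrieved context
    "response": "Check the sender address, hover over links before clicking, and ...",    # AI answer
    # Optional ground truth:
    "context_ground_truth": None,
    "answer_ground_truth": None,
}

evaluation_dataset = [evaluation_record]  # one record per traced interaction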

Choose your metrics and prepare the evaluation prompts

Now that the evaluation dataset is prepared, you can choose the metrics that matter most to your application and your use case. In addition to the metrics we’ve discussed, you can create your own metrics to evaluate the aspects that matter to you most. If your evaluation dataset provides answer ground truth, n-gram comparison metrics like ROUGE or embedding-based metrics like BERTScore can be relevant before using an LLM as a judge. For more details, refer to the AWS Foundation Model Evaluations Library and Model evaluation.

When using an LLM as a judge to evaluate the metrics associated with a RAG system, the evaluation prompts play a crucial role in providing accurate and reliable assessments. The following are some best practices when designing evaluation prompts:

  • Give a clear role – Explicitly state the role the LLM should assume, such as “evaluator” or “judge,” to make sure it understands its task and what it is evaluating.
  • Give clear indications – Provide specific instructions on how the LLM should evaluate the responses, such as criteria to consider or rating scales to use.
  • Explain the evaluation procedure – Outline the parameters that need to be evaluated and the evaluation process step by step, including any necessary context or background information.
  • Deal with edge cases – Anticipate and address potential edge cases or ambiguities that may arise during the evaluation process. For example, determine whether an answer that is based on irrelevant context should be evaluated as factual or hallucinated.

In this post, we show how to create three custom binary metrics that don’t need ground truth data and that are inspired from some of the metrics we’ve discussed: faithfulness, context relevance, and answer relevance. We created three evaluation prompts.

The following is our faithfulness evaluation prompt template:

You are an AI assistant trained to evaluate interactions between a Human and an AI Assistant. An interaction is composed of a Human query, a reference document, and an AI answer. Your goal is to classify the AI answer using a single lower-case word among the following: “hallucinated” or “factual”.

“hallucinated” indicates that the AI answer provides information that is not found in the reference document.

“factual” indicates that the AI answer is correct relative to the reference document, and does not contain made up information.

Here is the interaction that needs to be evaluated:

Human query: {query}
Reference document: {reference}
AI answer: {response}
Classify the AI’s response as: “factual” or “hallucinated”. Skip the preamble or explanation, and provide the classification.

We also created the following context relevance prompt template:

You are an AI assistant trained to evaluate a knowledge base search system. A search request is composed of a Human query and a reference document. Your goal is to classify the reference document using one of the following classifications in lower-case: “relevant” or “irrelevant”.

“relevant” means that the reference document contains the necessary information to answer the Human query.

“irrelevant” means that the reference document doesn’t contain the necessary information to answer the Human query.

Here is the search request that needs to be evaluated:

Human query: {query}
Reference document: {reference}

Classify the reference document as: “relevant” or “irrelevant”. Skip any preamble or explanation, and provide the classification.

The following is our answer relevance prompt template:

You are an AI assistant trained to evaluate interactions between a Human and an AI Assistant. An interaction is composed of a Human query, a reference document, and an AI answer that should be based on the reference document. Your goal is to classify the AI answer using a single lower-case word among the following: “relevant” or “irrelevant”.

“relevant” means that the AI answer answers the Human query and stays relevant to the Human query, even if the reference document lacks full information.

“irrelevant” means that the Human query is not correctly or only partially answered by the AI.

Here is the interaction that needs to be evaluated:

Human query: {query}
Reference document: {reference}
AI answer: {response}

Classify the AI’s response as: “relevant” or “irrelevant”. Skip the preamble or explanation, and provide the classification.
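
The following is a minimal sketch of how one of these templates can be sent to Anthropic’s Claude 3 Sonnet in Amazon Bedrock to classify a single evaluation record. The request body follows the Anthropic Messages API format on Amazon Bedrock; adjust the model ID and the template for your Region and the metric you’re computing:

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

# The faithfulness evaluation prompt template shown earlier, abbreviated here,
# with {query}, {reference}, and {response} placeholders.
FAITHFULNESS_TEMPLATE = """You are an AI assistant trained to evaluate interactions ...
Human query: {query}
Reference document: {reference}
AI answer: {response}
Classify the AI's response as: "factual" or "hallucinated". Skip the preamble or explanation, and provide the classification."""

def judge_faithfulness(record):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "temperature": 0,
        "messages": [{"role": "user", "content": FAITHFULNESS_TEMPLATE.format(**record)}],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=body,
    )
    response_body = json.loads(response["body"].read())
    return response_body["content"][0]["text"].strip().lower()  # "factual" or "hallucinated"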

Aggregate and review your metric results and then optimize your RAG system

After you obtain the evaluation results, you can store the metrics in your observability system alongside the stored traces to identify areas for improvement based on the metric values and their aggregates.
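
For example, the binary classifications returned by the judges can be aggregated into the percentages reported later in this post with a few lines of code (a minimal sketch, assuming the judge outputs are collected per query):

# A minimal sketch that turns per-query judge classifications into aggregate percentages.
def aggregate(labels, positive_label):
    return 100.0 * sum(label == positive_label for label in labels) / len(labels)

# Illustrative judge outputs for three queries
faithfulness_labels = ["factual", "hallucinated", "factual"]
context_labels = ["relevant", "irrelevant", "relevant"]
answer_labels = ["relevant", "relevant", "relevant"]

print(f"Faithfulness: {aggregate(faithfulness_labels, 'factual'):.0f}%")
print(f"Context relevance: {aggregate(context_labels, 'relevant'):.0f}%")
print(f"Answer relevance: {aggregate(answer_labels, 'relevant'):.0f}%")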

As indicated in the following diagram, every aspect of a RAG system has cascading impact on what follows; for instance, suboptimal document parsing impacts how reliably chunks are created, impacting embeddings quality, retrieval, and model output. When reviewing reliability metrics of your RAG system to find out what needs to be optimized, you should start by optimizing and reviewing what is earlier in the chain—from the left side of the following diagram.

RAG Optimization Steps

In the following table, we present 3 of the 15 queries we used from the Zappos FAQs, together with the corresponding LLM answers, the reference documents, and the calculated metrics for faithfulness, context relevance, and answer relevance.

Each entry lists the question, the retrieved document, the generated answer, and the faithfulness, context relevance, and answer relevance classifications.
Question: What are the specific care instructions for different fabric types like cashmere, silk, and wool?
Retrieved document: Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. Garments usually look best when professionally dry-cleaned.

Answer:

Cashmere: Generally, hand wash in cold water with a mild detergent. Do not wring or twist. Lay flat to dry.

Silk: Dry cleaning is often recommended. Some silks can be hand- or machine-washed, but avoid using a clothes dryer.

Wool: Check the label for specific instructions. Many wools can be machine-washed on a gentle cycle with cold water. Avoid high temperatures, as they can cause the wool to shrink or felt.

Faithfulness: hallucinated
Context relevance: irrelevant
Answer relevance: relevant
Question: How can I identify if an email claiming to be from Zappos is legitimate or a phishing attempt?
Retrieved document: recommends you update your online passwords, including the password to your Zappos.com Account, and report the email to the Federal Trade Commission (FTC). Visit the FTC website here to report the email and find more information on how to identify and prevent phishing attacks.
Answer:

To identify if an email claiming to be from Zappos is legitimate or a phishing attempt, you should check for the following:

1. Look for the correct sender email address. Zappos emails should come from an address ending in “@zappos.com”.

2. Check for spelling and grammatical errors in the email. Phishing emails often contain mistakes.

3. Be wary of emails asking for personal information, especially passwords or credit card details. Zappos will never ask for this information via email.

4. Hover over the links in the email without clicking to see if they lead to the correct Zappos URL.

5. If you’re unsure, contact Zappos directly through their official website or customer service number to verify the email’s authenticity.

Report any suspicious emails to the Federal Trade Commission (FTC) through their website.
Faithfulness: hallucinated
Context relevance: relevant
Answer relevance: relevant
Question: What is Amazon Pay, and how do I access my transaction history with it on Zappos?
Retrieved document: How do I find my Amazon Pay transactions?
Answer: Amazon Pay is a digital wallet that allows you to make payments online using the information stored in your Amazon account, including your shipping addresses, payment methods, and default delivery settings. To view your Amazon Pay transaction history on Zappos or any other merchant site, you should log into your Amazon account. Once logged in, navigate to the “Your Orders” page, where you can view all of your past and present orders, including those made using Amazon Pay. If you’re unable to find a specific transaction, you can use the search bar to look up the order number or the date of the transaction.
Faithfulness: hallucinated
Context relevance: irrelevant
Answer relevance: relevant
Across all 15 queries, the aggregate scores were: Faithfulness 87%, Context Relevance 67%, Answer Relevance 100%.

From the preceding aggregates, we can see that answer relevance has a high score; however, context relevance in this example RAG system is only 67%. In addition, the system demonstrates a level of hallucination in some cases. Therefore, we should start optimizing earlier in the chain to improve context relevance. If we look at the sample questions where context relevance is classified as irrelevant, we can see that the text is well parsed; however, we can also see that chunks may start or end in the middle of a sentence or include only the FAQ question without its answer. Therefore, we start by optimizing the chunking method.

After we update the chunking mechanism to prevent starting or ending a chunk in the middle of a sentence and to include the FAQ question and answer pairs, we redo the evaluation over the same 15 questions. The following table shows a sample of our results.

Each entry again lists the question, the retrieved document, the generated answer, and the three metric classifications.
Question: What are the specific care instructions for different fabric types like cashmere, silk, and wool?

Retrieved document:

How do I care for XYZ fabric?

Acetate: While most items made of acetate are dry-clean only, some may be hand-washed in lukewarm water. Do not twist the fabric. Iron while the fabric is damp. Press the inside of the garment using a low-temperature setting. Use a pressing cloth when ironing the outside.

Acrylic knit: Most acrylic knit garments can be machine-washed. Read the label, and check for the proper drying option. Some knits retain their shapes best if reshaped and dried flat.

Cashmere: Check the care label before laundering cashmere. To hand-wash, use a mild detergent or shampoo. Gently squeeze the water through the garment, then rinse until the water runs clear. Do not wring or twist. Squeeze out excess water. To dry, lay flat on a towel, away from sunlight or heat.

Cotton: Cotton holds up well to home laundering. Remove the garment from the dryer promptly to reduce wrinkling. Press using spray starch for the crispness of a laundered shirt.

Cotton blend: Dry cotton-blend garments using your dryer’s permanent-press or low-heat cycle, and remove immediately to reduce wrinkling. Touch up with a steam iron; starch for a professionally laundered look.

Linen: Most linen garments need to be dry-cleaned or hand-washed. Follow the instructions on the care label. To touch up or press, use a steam iron on a linen setting for a crisp look.

Nylon: Machine-wash nylon garments in warm water. Use a low-temperature setting when tumble-drying, and include a dryer sheet to reduce static electricity. Use a warm iron to press, if necessary.

Polyester: Read the label. Usually polyester items can be machine-washed (cool) and dried (low). Check the label to see if air-drying is recommended. Touch up with a cool – never hot – iron, if necessary.

Rayon: Make sure to read the care label for rayon clothing. Launder in cool water and a mild detergent. When machine-washing, use the gentle cycle and promptly remove garments after washing. When hand-washing, do not wring or twist the garment. To dry, roll the item in a towel to squeeze out excess water. Lay flat to dry. Iron when still damp and with the garment inside out to prevent the fabric from becoming shiny. Use a pressing cloth when ironing the right side of the garment. Pressing cloths can be purchased, or use something as simple as a piece of white cotton fabric.

Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. Garments usually look best when professionally dry-cleaned.

Spandex: Use warm water when hand- or machine-washing items made with spandex. Line-dry or use a low temperature setting when machine-drying to preserve the elasticity of the garment.

Wool knit: Typically wool knits need to be dry-cleaned, but check the label. If hand-washable, use cool water and a detergent for fine washables. Squeeze out excess water, then reshape and dry flat.

Wool (lined): Lined garments generally look best when professionally dry-cleaned once a season. Between dry cleanings, remove surface soil with a brush or damp cloth. Refresh the item by hanging it from a padded hanger in a steamy bathroom; moisture will help garment shed wrinkles. When touching up with an iron, use steam in an up-and-down motion (rather than sliding the iron along the fabric).

Answer:

Cashmere: Check the care label before laundering. To hand-wash, use a mild detergent or shampoo. Gently squeeze the water through the garment, then rinse until the water runs clear. Do not wring or twist. Squeeze out excess water and lay flat to dry, away from sunlight or heat.

Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. For hand-washing, use a mild detergent and cool water. Squeeze out excess water and lay flat to dry. Iron when still damp and with the garment inside out to prevent the fabric from becoming shiny.

Wool: The care instructions for wool depend on whether it’s a knit or not. For wool knits, they typically need to be dry-cleaned, but check the label. If hand-washable, use cool water and a detergent for fine washables. Squeeze out excess water, then reshape and dry flat. For lined wool garments, they generally look best when professionally dry-cleaned once a season. Between dry cleanings, remove surface soil with a brush or damp cloth and refresh in a steamy bathroom. When touching up with an iron, use steam in an up-and-down motion.

Faithfulness: factual
Context relevance: relevant
Answer relevance: relevant
Question: How can I identify if an email claiming to be from Zappos is legitimate or a phishing attempt?

Retrieved document:

How do I identify whether an Email is or is not from Zappos?

Suspicious or fraudulent emails not from Zappos.com may contain:

A reference to a payment source that Zappos.com does not accept, such as PayPal. Zappos.com currently only accepts credit cards, debit cards, and Zappos.com Gift Cards for payment.
A request to update payment information that is not linked to an order you placed on Zappos.com or your Zappos.com account.
Links, including “Here” links, to websites not affiliated with Zappos.com. You should be able to hover over a link without clicking it to show the website address associated with the link. If the link shows a site other than Zappos.com or a site that is not a legitimate Zappos domain, then it is likely phishing.
An order confirmation for an item you didn’t purchase or an attachment to an order confirmation. If you are a Zappos.com customer, go to your Zappos Account and view Your Order History to see if there is an order that matches the details in the correspondence. If you are not a Zappos.com customer, feel free to contact us to verify whether the email is associated with an actual order.
Typos, grammatical errors, or unconventional formatting.
A generic email address with Zappos in the email, for example advert@gmail.com or an email address from a non-US country or site not associated with Zappos. Zappos will never send communications from personal or generic email accounts. Order-related emails typically come from cs-noreply@zappos.com or another zappos.com email address.
Forged email addresses to make it look like the email is coming from Zappos.com. Your email program may allow you to hover over or click on an email address to view its original source.

What should I do if I believe I’ve received a phishing email or if I responded to what may be a phishing email?

If you believe you received a phishing email or responded to what may be a phishing email, Zappos recommends you update your online passwords, including the password to your Zappos.com Account, and report the email to the Federal Trade Commission (FTC). Visit the FTC website here to report the email and find more information on how to identify and prevent phishing attacks.

Payment Information

Do you accept international credit cards?

Currently Zappos.com can accept international credit cards but we can only ship to an address within the United States and its territories. Please note that the credit card must be issued and contain a logo from either Visa, Mastercard, Discover, or American Express. On the checkout page, there will be a billing and shipping section. Please do the following:

For the billing information

Enter your street address on Address Line 1. Enter your City, County or Province, and Postal Code on Address Line 2. Enter your Country for the city. Enter AA for the state. Enter 11111 for the zip code.

For the shipping information

Please enter the correct information in the appropriate field.

Orders placed with the use of an International Credit Card may take up to 72 hours to finalize prior to shipping. We apologize for any inconvenience this may cause.

Answer:

To identify if an email claiming to be from Zappos is legitimate or a phishing attempt, check for the following:

1. Zappos only accepts certain payment methods: credit cards, debit cards, and Zappos.com Gift Cards.
2. No unsolicited requests for payment information or updates.
3. Links lead to Zappos.com or legitimate Zappos domains.
4. Match the order details in the email with your Zappos Account’s order history.
5. Look for typos, grammatical errors, or unconventional formatting.
6. Check for generic email addresses not affiliated with Zappos.
7. Verify the sender’s email address is from cs-noreply@zappos.com or another zappos.com email address.
8. Be wary of forged email addresses.

If you believe you received a phishing email or responded to what may be a phishing email, update your passwords and report the email to the Federal Trade Commission (FTC).

Faithfulness: factual
Context relevance: relevant
Answer relevance: relevant
Question: What is Amazon Pay, and how do I access my transaction history with it on Zappos?

Retrieved document:

How do I find my Amazon Pay transactions?

Go to pay.amazon.com and sign in as a “shopper” using your Amazon username and password. One-time purchases are shown on the “Activity” tab, and subscription agreements are on the “Merchant agreements” tab. You can also view your transactions on Amazon.com. If you are on a desktop, your Amazon Pay transactions can be found under “Amazon Orders” within the Amazon Pay tab or in “Your Account” under “Transactions”. If you are in the Amazon app, all transactions are listed together under “Your Orders” or in “Your Account”.

Answer: Amazon Pay is a digital wallet that allows you to make payments online using your Amazon account. However, the context does not provide information on how to access your transaction history with Amazon Pay on Zappos. The instructions given in the context are for accessing Amazon Pay transactions on Amazon’s own platform.
Faithfulness: factual
Context relevance: relevant
Answer relevance: relevant
Across all 15 queries, the aggregate scores were: Faithfulness 100%, Context Relevance 93%, Answer Relevance 100%.

After we changed the chunking mechanism to prevent mid-sentence chunking and to include an FAQ question and its corresponding answer in the same chunk, context relevance improved from 67% to 93%. We can also see that improving context relevance resolved the previous hallucinations without even changing the prompt template. We can iterate on the optimization process by further investigating the questions that still have irrelevant retrievals, adjusting the indexing or the retrieval mechanism, choosing a higher number of retrieved chunks, or using hybrid search to combine lexical search with semantic search.

Sample references

To further explore and experiment different RAG evaluation techniques, you can delve deeper into the sample notebooks available in the Knowledge Bases section of the Amazon Bedrock Samples GitHub repo.

Conclusion

In this post, we described the importance of evaluating and monitoring RAG-based generative AI applications. We showcased the metrics and frameworks for RAG system evaluation and observability, then we went over how you can use FMs in Amazon Bedrock to compute RAG reliability metrics. It’s important to choose the metrics that matter most to your organization and that impact the aspect or configuration you want to optimize.

If RAG is not sufficient for your use case, you can opt for fine-tuning or continued pre-training in Amazon Bedrock or Amazon SageMaker to build custom models that are specific to your domain, organization, and use case. Most importantly, keeping a human in the loop is essential to align AI systems, as well as their evaluation mechanisms, with their intended uses and objectives.


About the Authors

Oussama Maxime Kandakji is a Senior Solutions Architect at AWS focusing on data science and engineering. He works with enterprise customers on solving business challenges and building innovative functionalities on top of AWS. He enjoys contributing to open source and working with data.

Ioan Catana is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He helps customers develop and scale their ML solutions and generative AI applications in the AWS Cloud. Ioan has over 20 years of experience, mostly in software architecture design and cloud engineering.

Read More

Connect to Amazon services using AWS PrivateLink in Amazon SageMaker

Connect to Amazon services using AWS PrivateLink in Amazon SageMaker

AWS customers that implement secure development environments often have to restrict outbound and inbound internet traffic. This becomes increasingly important with artificial intelligence (AI) development because of the data assets that need to be protected. Transmitting data across the internet is not secure enough for highly sensitive data. Therefore, accessing AWS services without leaving the AWS network can be a secure workflow.

One of the ways you can secure AI development is by creating Amazon SageMaker instances within a virtual private cloud (VPC) with direct internet access disabled. This isolates the instance from the internet and means that API calls to other AWS services are not possible by default. This presents a challenge for developers who are building architectures for production in which many AWS services need to function together.

In this post, we present a solution for configuring SageMaker notebook instances to connect to Amazon Bedrock and other AWS services with the use of AWS PrivateLink and Amazon Elastic Compute Cloud (Amazon EC2) security groups.

Solution overview

The following example architecture shows a SageMaker instance connecting to various services. The SageMaker instance is isolated from the internet but is still able to access AWS services through PrivateLink. Note that the connection to Amazon S3 is through a gateway VPC endpoint; you can learn more about gateway VPC endpoints in the Amazon VPC documentation.

overall architecture for developing in a VPC environment

In the following sections, we show how to configure this on the AWS Management Console.

Create security groups for outbound and inbound endpoint access

First, you have to create the security groups that will be attached to the VPC endpoints and the SageMaker instance. You create the security groups before creating a SageMaker instance because after the instance has been created, the security group configuration can’t be changed.

You create two groups, one for outbound and another for inbound. Complete the following steps:

1. On the Amazon EC2 console, choose Security Groups in the navigation pane.

2. Choose Create security group.

3. For Security group name, enter a name (for example, inbound-sagemaker).

4. For Description, enter a description.

5. For VPC, choose your VPC.

create a security group for developing in a secure environment in vpc SageMaker

6. Note the security group ID to use in the next steps.

7. Choose Create security group again to create a second security group for outbound access.

8. For Security group name, enter a name (for example, outbound-sagemaker).

9. For Description, enter a description.

10. For VPC, choose the same VPC as the inbound rule.

11. In the Outbound rules section, choose Add rule.

12. Add an outbound rule with the inbound security group ID as the destination using HTTPS as the type.

13. Note the outbound security group ID to use in the next step.

configure security group for connect to AWS services using AWS PrivateLink

14. Return to the inbound security group and add an inbound rule of HTTPS type with the destination set to the outbound security group ID.

set outbound rule for developing in a secure environment
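
If you prefer to script this setup, the following is a minimal boto3 sketch of the same two security groups and their mutual HTTPS rules; the VPC ID and group names are placeholders:

# A minimal boto3 sketch of the preceding console steps; vpc_id and names are placeholders.
import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"  # replace with your VPC ID

inbound_id = ec2.create_security_group(
    GroupName="inbound-sagemaker", Description="Inbound access for VPC endpoints", VpcId=vpc_id
)["GroupId"]
outbound_id = ec2.create_security_group(
    GroupName="outbound-sagemaker", Description="Outbound access for the SageMaker instance", VpcId=vpc_id
)["GroupId"]

# Outbound group: allow HTTPS to the inbound group
ec2.authorize_security_group_egress(
    GroupId=outbound_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": inbound_id}],
    }],
)

# Inbound group: allow HTTPS from the outbound group
ec2.authorize_security_group_ingress(
    GroupId=inbound_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": outbound_id}],
    }],
)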

Create a SageMaker instance with the outbound security group

You now create a SageMaker instance with the network configuration shown in the following screenshot. It’s important to choose the same VPC that you used to create the inbound and outbound security groups. You then choose the outbound security group you created earlier.

configure network for developing in secure environment

Create an Interface VPC endpoint

In this step, you create an Interface VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC) that automatically uses PrivateLink, which allows calls from your SageMaker instance to AWS services.

1. On the Amazon VPC console, choose Endpoints in the navigation pane.

2. Choose Create endpoint.

3. For Name tag, enter a name (for example, bedrock-link).

4. For Service category, select AWS services.

5. For Services, search for and choose com.amazonaws.<region>.bedrock-runtime.

create interface endpoint for developing in secure environment

6. Set the VPC to the same one you’ve been working with.

7. Specify the subnet(s).

A subnet is a range of IP addresses within a VPC. If you don’t know which subnet to specify, any subnet will work. Otherwise, specify the subnet that meets the security requirements of your cloud security team.

8. Set the security group to the inbound security group you created earlier.

After you create the endpoint, it should take some time to become available.

Repeat these steps for every service that you need for your workflow. The following screenshots show examples of services that you can create interface VPC endpoints for, such as Amazon Simple Storage Service (Amazon S3), Amazon Kendra, and AWS Lambda. AWS PrivateLink enables you to connect privately to many AWS services; for the current list, refer to the AWS PrivateLink documentation.

select service for connecting to AWS services with AWS PrivateLink
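
You can also create the interface endpoint programmatically. The following is a minimal boto3 sketch in which the VPC ID, subnet ID, security group ID, and Region are placeholders:

# A minimal boto3 sketch for creating the interface VPC endpoint; IDs and Region are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.bedrock-runtime",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],  # the inbound security group created earlier
    PrivateDnsEnabled=True,
    TagSpecifications=[{
        "ResourceType": "vpc-endpoint",
        "Tags": [{"Key": "Name", "Value": "bedrock-link"}],
    }],
)
print(response["VpcEndpoint"]["VpcEndpointId"])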

Test the connection

You can test the connection to Amazon Bedrock using a simple Python API call. The following is a code snippet that invokes the Amazon Bedrock model:

import boto3
import json

# Create an Amazon Bedrock runtime client; without an endpoint_url, this
# resolves to the public service endpoint
bedrock = boto3.client(service_name='bedrock-runtime')

prompt = """
Human: What type of sharks are there?

Assistant:"""

# Request body for the Anthropic Claude text completions API
body = json.dumps({
    "prompt": prompt,
    "max_tokens_to_sample": 4000,
    "temperature": 0.1,
    "top_p": 0.9,
})

modelId = 'anthropic.claude-instant-v1'
accept = 'application/json'
contentType = 'application/json'

# Invoke the model and read the response body
response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())

print(response_body.get('completion'))

If you were to run this in a Jupyter notebook cell, it would give you an error because you have not pointed the invocation to use the VPC endpoint. You do this by adding an endpoint URL to the client instantiation:

bedrock = boto3.client(
    service_name='bedrock-runtime',
    endpoint_url = 'https://vpce-0e452bc86b1f87c50-5xltzdpo.bedrock-runtime.us-west-2.vpce.amazonaws.com'
)

To find the endpoint URL, go back to the VPC endpoint that you created in the previous step and look for DNS names, as illustrated in the following screenshot. Private DNS is the best option because it is the same as the public DNS name, which means you don’t have to change anything to use the private connection. The next best option is the Regional DNS name, which is the first option under DNS names. Both options allow your traffic to fail over to other healthy Availability Zones (AZs) in case the current AZ is impaired.

find the endpoint URL for the interface endpoint
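
You can also look up the endpoint’s DNS names programmatically instead of copying them from the console (a minimal sketch; the endpoint ID is a placeholder):

# A minimal sketch that prints the DNS names of an interface endpoint; the endpoint ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")

endpoint = ec2.describe_vpc_endpoints(
    VpcEndpointIds=["vpce-0123456789abcdef0"]  # replace with your endpoint ID
)["VpcEndpoints"][0]

for dns_entry in endpoint["DnsEntries"]:
    print(dns_entry["DnsName"])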

Clean up

To clean up your resources, complete the following steps:

1. On the SageMaker console, navigate to the notebook configuration page.

2. Stop the instance, then choose Delete to delete the instance.

delete sagemaker notebook endpoint for clean up

3. On the Amazon EC2 console, navigate to the inbound security group’s detail page.

4. On the Actions menu, choose Delete security groups.

5. Repeat these steps for the outbound security group.

delete security group for clean up

6. On the Amazon VPC console, navigate to the VPC endpoint’s details page.

7. On the Actions menu, choose Delete.

8. Repeat this step for every endpoint you created as part of this post.

delete vpc endpoint for clean up

Conclusion

In this post, we showed how to set up VPC endpoints and security groups to allow SageMaker to connect to Amazon Bedrock. When a SageMaker instance has restricted internet access, you can still develop and connect to other AWS services through the use of AWS PrivateLink. This post showed how to connect to Amazon Bedrock from an isolated SageMaker instance, but you can replicate the steps for other services.

We encourage you to get started developing AI applications on AWS. To learn more, visit Amazon SageMaker, Amazon Bedrock, and AWS PrivateLink. Happy coding!


About the Author

Francisco Calderon is a Data Scientist at the AWS Generative AI Innovation Center. As a member of the GenAI Innovation Center, he helps solve critical business problems for AWS customers using the latest technology in Generative AI. In his spare time, Francisco likes to play music and guitar, play soccer with his daughters, and enjoy time with his family.

Sungmin Hong is an Applied Scientist at the AWS Generative AI Innovation Center, where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, traveling, and reading.

Yash Shah is a Science Manager in the AWS Generative AI Innovation Center. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Anila Joshi has more than a decade of experience building AI solutions. As an Applied Science Manager at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and guides customers to strategically chart a course into the future of AI.

Read More

Born in the research lab a decade ago, SWAN continues to accelerate networking in the Microsoft Cloud

Born in the research lab a decade ago, SWAN continues to accelerate networking in the Microsoft Cloud

SWAN controller diagram

Software-driven wide area network (SWAN) is a system that enables centralized management and control of network infrastructure to improve reliability and efficiency. SWAN controls the timing and volume of traffic each service sends and automatically reconfigures the network’s data plane to match traffic demand. Over the last decade, I’ve had the opportunity to shepherd SWAN from a research idea to a foundational system for Microsoft Azure. I want to share a few thoughts to commemorate this incredible journey.

The idea for SWAN was born in 2012, when Microsoft’s mobility and networking research group sought to solve two important challenges—efficiency and flexibility of the backbone that carried traffic between Microsoft datacenters. Azure’s explosive growth created unprecedented demand for bandwidth in this backbone. Efficiency and flexibility were essential, enabling the network to offer the best possible service to every application, based on a deep understanding of its performance needs (latency-sensitive database queries versus throughput-bound storage backups), diurnal patterns, and whether demand can be time-shifted (“follow the sun”) to fully utilize the available capacity.  

It became clear that traditional backbone architectures, with MPLS-based traffic engineering without any coordination with the applications, would not be able to address these challenges. Decentralized resource allocation comes with fundamental limits; and hardware limitations (such as the limited number of priority queues) prevent fine-grained resource allocation across thousands of (high-bandwidth) applications. 

We decided to explore logically centralized control for both the applications and the network. On the application side, we would control how much traffic each application would be able to send based on its demand and priority. On the network side, we would control how each switch forwarded traffic. While software-defined networking (SDN) was actively being explored in the community at the time, we were not aware of any production systems, certainly not at the scale of the Microsoft Cloud. Going down this path meant that we were sure to encounter many “unknown unknowns.” Can centralization work in a fault tolerant manner at a truly global scale? Is the hardware ready and reliable? How would applications react to bandwidth controller mediating access to the network? Our estimates of possible gains suggested that addressing these unknowns could be fruitful, and building something that no one had built before was exciting for us as systems researchers. 

Given the risks, we approached the development of SWAN in the spirit of “fail fast,” taking on prototyping and algorithmic challenges in the order of highest risk. This approach led us to focus early on problems such as scalably computing max-min fair allocations across hundreds of applications, enforcing those allocations, working with limited memory on commodity switches, and updating the global network in a timely and congestion-free manner.

Our early prototyping uncovered several challenges with the latest OpenFlow switches at the time. We worked with Arista on DirectFlow (a superset of OpenFlow), and got it working at the scale and reliability we wanted. This provided the foundation for SWAN for years to come. As Jayashree Ullal (Arista CEO) notes, “SWAN was then able to take advantage of Arista EOS to build an elegant WAN evolving to support 100G, 200G as well as DWDM interconnections at Internet peering points around the world.” It also allowed customers to use this battle hardened SDN switch infrastructure on their own networks.

We shared the results of our work at the ACM SIGCOMM 2013 conference, where Google shared its results of building a similar system called B4. The two systems provided proof points that SDN could enable massively more efficient and flexible traffic engineering. In the words of noted computer scientist Bruce Davie, they “broke the rule that centralized control could not be done, thus freeing the system from greedy approaches that made only local optimizations.”

The original paper, Achieving High Utilization with Software-Driven WAN, was the start, not the end, of the journey for us. We have since solved many additional challenges, such as a faster solution for approximate max-min fairness, proactive defense against a small number of failures by spreading traffic, and using the hierarchical nature of the WAN topology and traffic demands to solve max-flow style problems more quickly. Many of these have been deployed in production on the Microsoft WAN. In this sense, SWAN has provided a rich research-to-production pipeline.

As I look back, I can proudly say that SWAN has lived up to its promise. Our inter-datacenter WAN now has unprecedented efficiency and flexibility. Of course, not everything went as expected. While we worried about the reliability of centralized control when the controllers become unavailable, and built Paxos-like consensus clusters with redundancy, we didn’t protect against code bugs where all cluster members were simultaneously wrong. Since then, we have developed new mechanisms to counteract this threat.

Overall, the design of SWAN and its implementation has stood the test of time. In fact, we are now moving our other WAN, which connects Microsoft datacenters to the broader Internet, to the world of centralized control as well [e.g., OneWAN]. SWAN now carries over 90% of the traffic in and out of Microsoft’s datacenters, a footprint spanning over 280,000 kilometers of optical fiber and over 150 points of presence across all Azure regions. This unification will unlock the next level of efficiency and flexibility, and Microsoft researchers are right there taking the next set of technical bets.  

The post Born in the research lab a decade ago, SWAN continues to accelerate networking in the Microsoft Cloud appeared first on Microsoft Research.

Read More

Crack the Case With ‘Tell Me Why’ and ‘As Dusk Falls’ on GeForce NOW

Crack the Case With ‘Tell Me Why’ and ‘As Dusk Falls’ on GeForce NOW

Sit back and settle in for some epic storytelling. Tell Me Why and As Dusk Falls — award-winning, narrative-driven games from Xbox Studios — add to the 1,900+ games in the GeForce NOW library, ready to stream from the cloud. 

Members can find more adventures with four new titles available this week.

Experience a Metallica concert like no other in “Metallica: Fuel. Fire. Fury.” This journey through six fan-favorite songs features gameplay that matches the intensity. “Metallica: Fuel. Fire. Fury.” will have six different showtimes running June 22-23 in Fortnite. Anyone can get a front-row seat to the interactive music experience by streaming on their mobile device, powered by GeForce NOW.

Unravel the Mystery

Whether uncovering family mysteries in Alaska or navigating small-town secrets in Arizona, gamers are set to be drawn into richly woven stories with Tell Me Why and As Dusk Falls joining the cloud this week.

Tell Me Why on GeForce NOW
Ain’t nothing but a great game.

Tell Me Why — an episodic adventure game from Dontnod Entertainment, the creators of the beloved Life Is Strange series — follows twins Tyler and Alyson Ronan as they reunite after a decade to uncover the mysteries of their troubled childhoods in the fictional town of Delos Crossing, Alaska. Experience true-to-life characters, mature themes and gripping choices.

As Dusk Falls on GeForce NOW
Every family has secrets.

Dive into the intertwined lives of two families over three decades in As Dusk Falls from INTERIOR/NIGHT. Set in small-town Arizona in the 1990s, the game’s unique art style blends 2D character illustrations with 3D environments, creating a visually striking experience. Players’ choices significantly impact the storyline, making each playthrough unique.

GeForce NOW members can now stream these award-winning titles on a variety of devices, including PCs, Macs, SHIELD TVs and Android devices. Upgrade to a Priority or Ultimate membership to enjoy enhanced streaming quality and performance, including up to 4K resolution and 120 frames per second on supported devices. Jump into these emotionally rich narratives and discover the power of choice in shaping the characters’ destinies.

Wake Up to New Games

Still Wakes the Deep on GeForce NOW
Run!

In Still Wakes the Deep from The Chinese Room and Secret Mode, play as an offshore oil rig worker fighting for dear life through a vicious storm, perilous surroundings and the dark, freezing North Sea waters. All lines of communication have been severed. All exits are gone. All that remains is the need to face the unknowable horror aboard. Live the terror and escape the rig, all from the cloud.

Check out the list of new games this week:

  • Still Wakes the Deep (New release on Steam and Xbox, available on PC Game Pass, June 18)
  • Skye: The Misty Isle (New release on Steam, June 19)
  • As Dusk Falls (Steam and Xbox, available on PC Game Pass)
  • Tell Me Why (Steam and Xbox, available on PC Game Pass)
Greetings From GFN

Make sure to catch #GreetingsFromGFN.

Plus, #GreetingsFromGFN continues on @NVIDIAGFN social media accounts, with members sharing their favorite locations to visit in the cloud.

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity

Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity

Over the past year, we’ve added support for semi-structured (2:4) sparsity into PyTorch. With just a few lines of code, we were able to show a 10% end-to-end inference speedup on segment-anything by replacing dense matrix multiplications with sparse matrix multiplications.

However, matrix multiplications are not unique to neural network inference – they happen during training as well. By expanding on the core primitives we used earlier to accelerate inference, we were also able to accelerate model training. We wrote a replacement nn.Linear layer, SemiSparseLinear, that is able to achieve a 1.3x speedup across the forwards + backwards pass of the linear layers in the MLP block of ViT-L on a NVIDIA A100.

End-to-end, we see a wall time reduction of 6% for a DINOv2 ViT-L training, with virtually no accuracy degradation out of the box (82.8 vs 82.7 on ImageNet top-1 accuracy).

2 strategies for training a ViT model

We compare 2 strategies for training a ViT model for 125k iterations on 4x NVIDIA A100s: either fully dense (blue), or sparse for 70% of the training, then dense (orange). Both achieve similar results on the benchmarks, but the sparse variant trains 6% faster. For both experiments, we evaluate the intermediate checkpoints with and without sparsity.

As far as we are aware, this is the first OSS implementation of accelerated sparse training and we’re excited to provide a user API in torchao. You can try accelerating your own training runs with just a few lines of code:

# Requires torchao and pytorch nightlies and CUDA compute capability 8.0+
import torch
from torchao.sparsity.training import (
    SemiSparseLinear,
    swap_linear_with_semi_sparse_linear,
)

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096)).cuda().half()

# Specify the fully-qualified name of the nn.Linear modules you want to swap.
# For the bare nn.Sequential defined above, the Linear layer's name is "0".
sparse_config = {
    "0": SemiSparseLinear
}

# Swap nn.Linear with SemiSparseLinear, you can run your normal training loop after this step
swap_linear_with_semi_sparse_linear(model, sparse_config)

How does this work?

The general idea behind sparsity is simple: skip calculations involving zero-valued tensor elements to speed up matrix multiplication. However, simply setting weights to zero isn’t enough, as the dense tensor still contains these pruned elements and dense matrix multiplication kernels will continue to process them, incurring the same latency and memory overhead. To achieve actual performance gains, we need to replace dense kernels with sparse kernels that intelligently bypass calculations involving pruned elements.

These kernels work on sparse matrices, which remove the pruned elements and store the specified elements in a compressed format. There are many different sparse formats, but we’re particularly interested in semi-structured sparsity, also known as 2:4 structured sparsity or fine-grained structured sparsity or more generally N:M structured sparsity.

2:4 sparse compressed representation. Original Source

A 2:4-sparse matrix is a matrix where at most 2 elements are non-zero for every 4 elements, as illustrated in the image above. Semi-structured sparsity is attractive because it exists in a goldilocks spot of performance and accuracy:

  1. NVIDIA GPUs since Ampere offer hardware acceleration and library support (cuSPARSELt) for this format, with matrix multiplication being up to 1.6x faster
  2. Pruning models to fit this sparsity pattern does not degrade accuracy as much as other patterns. NVIDIA’s whitepaper shows pruning then retraining is able to recover accuracy for most vision models.

Illustration of 2:4 (sparse) matrix multiplication on NVIDIA GPUs. Original source

Accelerating inference with semi-structured sparsity is straightforward. Since our weights are fixed during inference, we can prune and compress the weight ahead of time (offline) and store the compressed sparse representation instead of our dense tensor.

flow chart

Then, instead of dispatching to dense matrix multiplication we dispatch to sparse matrix multiplication, passing in the compressed sparse weight instead of the normal dense one. For more information about accelerating models for inference using 2:4 sparsity, please refer to our tutorial.
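
As a quick illustration, the following is a minimal sketch that prunes a single nn.Linear weight to the 2:4 pattern and swaps in the compressed representation using torch.sparse.to_sparse_semi_structured. The naive magnitude-based pruning shown here is an assumption for illustration only; it requires a recent PyTorch build and an NVIDIA GPU with 2:4 sparsity support:

# A minimal inference sketch: prune a weight to the 2:4 pattern offline and
# swap in the compressed sparse representation. Requires CUDA compute capability 8.0+.
import torch
from torch.sparse import to_sparse_semi_structured

linear = torch.nn.Linear(1024, 4096).half().cuda()

# Naive 2:4 pruning for illustration: keep the 2 largest-magnitude values in every group of 4
w = linear.weight.detach()
groups = w.reshape(-1, 4)
mask = torch.zeros_like(groups, dtype=torch.bool)
mask.scatter_(1, groups.abs().topk(2, dim=1).indices, True)
pruned = (groups * mask).reshape(w.shape)

# Store the compressed sparse weight; subsequent forward passes dispatch to sparse kernels
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(pruned))

x = torch.rand(128, 1024).half().cuda()
with torch.inference_mode():
    y = linear(x)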

Extending sparse inference acceleration to training

In order to use sparsity to reduce the training time of our models, we need to consider when the mask is calculated, as once we store the compressed representation the mask is fixed.

Training with a fixed mask applied to an existing trained dense model (also known as pruning) does not degrade accuracy, but this requires two training runs – one to obtain the dense model and another to make it sparse, offering no speedups.

Instead we’d like to train a sparse model from scratch (dynamic sparse training), but training from scratch with a fixed mask will lead to a significant drop in evaluations, as the sparsity mask would be selected at initialization, when the model weights are essentially random.

To maintain the accuracy of the model when training from scratch, we prune and compress the weights at runtime, so that we can calculate the optimal mask at each step of the training process.

Conceptually you can think of our approach as an approximate matrix multiplication technique, where we `prune_and_compress` and dispatch to `sparse_GEMM` in less time than a `dense_GEMM` call would take. This is difficult because the native pruning and compression functions are too slow to show speedups.
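
In pseudocode terms, the idea looks like the following sketch, where prune_and_compress and sparse_gemm are hypothetical stand-ins for the runtime sparsification kernel and the cuSPARSELt matrix multiplication described below:

# A conceptual sketch only; prune_and_compress and sparse_gemm are hypothetical
# stand-ins for the runtime sparsification kernel and the cuSPARSELt GEMM.
def semi_sparse_forward(x, W, prune_and_compress, sparse_gemm):
    # Runtime sparsification: choose a 2:4 mask for W at this training step and compress it
    W_compressed, W_t_compressed = prune_and_compress(W)  # W_t_compressed is reused in the backward pass
    # This only pays off if prune_and_compress costs less than (dense GEMM time - sparse GEMM time)
    return sparse_gemm(x, W_compressed), W_t_compressed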

Given the shapes of our ViT-L training matrix multiplications (13008x4096x1024), we measured the runtime of a dense and sparse GEMM respectively at 538us and 387us. In other words, the pruning and compression step of the weight matrix must run in less than 538-387=151us to have any efficiency gain. Unfortunately, the compression kernel provided in cuSPARSELt already takes 380us (without even considering the pruning step!).

Given the max NVIDIA A100 memory IO (2TB/s), and considering that a prune and compress kernel would be memory bound, we could theoretically prune and compress our weight (4096x1024x2 bytes=8MB) in 4us (8MB / 2TB/s)! And in fact, we were able to write a kernel that prunes and compresses a matrix into 2:4-sparse format, and runs in 36 us (10x faster than the compression kernel in cuSPARSELt), making the entire GEMM (including the sparsification) faster. Our kernel is available for use in PyTorch.

Our custom sparsification kernel, which includes pruning + compression, is ~30% faster across a linear layer forward+backward. Benchmarks run on a NVIDIA A100-80GB GPU.

Writing a performant runtime sparsification kernel

There were multiple challenges we faced in order to implement a performant runtime sparsification kernel, which we will explore below.

1) Handling the backwards pass

For the backwards pass, we need to calculate dL/dX and dL/dW for the gradient update and the subsequent layer, which means we need to calculate xW^T and x^TW respectively.

Overview of runtime sparsification for training acceleration (FW + BW pass)

However this is problematic, because the compressed representation cannot be transposed, since there’s no guarantee that the tensor is 2:4 sparse in both directions.

Both matrices are valid 2:4 matrices. However, the right one is no longer a valid 2:4 matrix once transposed because one column contains more than 2 elements

Therefore, we prune a 4×4 tile, instead of a 1×4 strip. We greedily preserve the largest values, ensuring that we take at most 2 values for each row / column. While this approach is not guaranteed to be optimal, as we sometimes only preserve 7 values instead of 8, it efficiently calculates a tensor that is 2:4 sparse both row-wise and column-wise.

We then compress both the packed tensor and the packed transpose tensor, storing the transpose tensor for the backwards pass. By calculating both the packed and packed transpose tensor at the same time, we avoid a secondary kernel call in the backwards pass.

Our kernel prunes the weight matrix in registers and writes the compressed values to global memory. It also prunes Wᵀ at the same time, which is needed for the backward pass, minimizing the memory IO.

There’s some additional transpose trickery needed to handle the backwards pass: the underlying hardware only supports operations where the first matrix is sparse. For weight sparsification during inference, when we need to calculate xWᵀ, we rely on transpose properties to swap the order of the operands.

xWᵀ = (Wxᵀ)ᵀ
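A quick sanity check of this identity in PyTorch (illustrative shapes only, with a dense stand-in for the sparse weight):

```python
import torch

x = torch.randn(128, 64)
W = torch.randn(32, 64)          # stand-in for the 2:4-sparse weight

out_direct  = x @ W.t()          # what we want: xWᵀ
out_swapped = (W @ x.t()).t()    # what the hardware computes, sparse operand first
assert torch.allclose(out_direct, out_swapped, atol=1e-5)
```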

During inference, we use torch.compile to fuse the outer transpose into subsequent pointwise ops in order to avoid paying a performance penalty.

However in the case of the backwards pass of training, we have no subsequent pointwise op to fuse with. Instead, we fuse the transposition into our matrix multiplication by taking advantage of cuSPARSELt’s ability to specify the row / column layout of the result matrix.

2) Kernel tiling for efficient memory-IO

In order for our kernel to be as efficient as possible, we want to coalesce our reads / writes, as we found memory IO to be the main bottleneck. This means that within a CUDA thread, we want to read/write chunks of 128 bytes at a time, so that multiple parallel reads/writes can be coalesced into a single request by the GPU memory controller.

Therefore, instead of each thread handling a single 4×4 tile, which is only 4×4×2 = 32 bytes, we decided that each thread will handle four 4×4 tiles (an 8×8 tile), which allows us to operate on 8×8×2 = 128-byte chunks.

3) Sorting elements in a 4×4 tile without warp-divergence

For each individual 4×4 tile within our thread we calculate a bitmask that specifies which elements to prune and which elements to keep. To do this we sort all 16 elements and greedily preserve elements, so long as they do not break our 2:4 row / col constraint. This preserves only the weights with the largest values.

Crucially we observe that we are only ever sorting a fixed number of elements, so by using a branchless sorting network, we can avoid warp divergence.

Sorting network diagram (taken from Wikipedia). For clarity, the transposed packed tensor and metadata are omitted.

Warp divergence occurs when we have conditional execution inside a thread block. In CUDA, work items in the same work group (thread block) are dispatched at the hardware level in batches (warps). If we have conditional execution, such that some work-items in the same batch run different instructions, then they are masked when the warp is dispatched, or dispatched sequentially.

For example, if we have some code like if (condition) do(A) else do(B), where condition is satisfied by all the odd-numbered work items, then the total runtime of this conditional statement is do(A) + do(B), since we would dispatch do(A) for all odd-numbered work-items, masking out even-numbered work-items, and do(B) for all even numbered work-items, masking out odd-numbered work-items. This answer provides more information about warp divergence.
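The building block of such a branchless sorting network is a compare-exchange that every lane executes unconditionally; here is an illustrative PyTorch sketch (the real kernel does this on registers in CUDA).

```python
import torch

def compare_exchange(v: torch.Tensor, i: int, j: int) -> None:
    # Branchless: min/max always execute, so all lanes run the same instructions.
    hi, lo = torch.maximum(v[i], v[j]), torch.minimum(v[i], v[j])
    v[i], v[j] = hi, lo

# A fixed sequence of compare-exchanges sorts a fixed-size array with no
# data-dependent branches (the standard 5-comparator network for 4 elements).
v = torch.tensor([3., 1., 4., 2.])
for i, j in [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]:
    compare_exchange(v, i, j)
print(v)  # tensor([4., 3., 2., 1.])
```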

4) Writing the compressed matrices and metadata

Once the bitmask has been computed, the weight data has to be written back in a compressed format in global memory. This is not trivial, because the data needs to stay in registers, and it’s not possible to index registers (e.g., C[i++] = a prevents us from storing C in registers). Furthermore, we found that nvcc was using many more registers than we expected, which caused register spilling and impacted global performance. We write this compressed matrix to global memory in column-major format to make the writes more efficient.

We also need to write the cuSPARSELt metadata. This metadata layout is quite similar to the one from the open-source CUTLASS library, and is optimized for being loaded efficiently through shared memory in the GEMM kernel with the PTX ldmatrix instruction.

However, this layout is not optimized to be written efficiently: the first 128 bits of the metadata tensor contain metadata about the first 32 columns of rows 0, 8, 16 and 24. Recall that each thread handles an 8×8 tile, which means that this information is scattered across 16 threads.

We rely on a series of warp-shuffle operations to write the metadata, once for the original representation and once for the transposed one. Fortunately, this data represents less than 10% of the total IO, so we can afford not to fully coalesce the writes.

DINOv2 Sparse Training: Experimental Setup and Results

For our experiments, the ViT-L model is trained on ImageNet for 125k steps using the DINOv2 method. All our experiments were run on 4x AMD EPYC 7742 64-core CPUs and 4x NVIDIA A100-80GB GPUs. During sparse training, the model is trained with 2:4 sparsity for the first part of training, where only half of the weights are active. This sparsity mask on the weights is dynamically recomputed at every step, as the weights are continuously updated during optimization. For the remaining steps, the model is trained densely, producing a final model without 2:4 sparsity (except for the 100% sparse training setup), which is then evaluated.

Training setup | ImageNet-1k log-regression accuracy (%)
0% sparse (125k dense steps, baseline) | 82.8
40% sparse (40k sparse -> 85k dense steps) | 82.9
60% sparse (75k sparse -> 50k dense steps) | 82.8
70% sparse (87.5k sparse -> 37.5k dense steps) | 82.7
80% sparse (100k sparse -> 25k dense steps) | 82.7
90% sparse (112.5k sparse -> 12.5k dense steps) | 82.0
100% sparse (125k sparse steps) | 82.3 (2:4-sparse model)

During the sparse training steps, in the backward pass we obtain a dense gradient for the sparse weights. For the gradient descent to be sound, we should also sparsify this gradient before using it in the optimizer to update the weights. Instead, we use the full dense gradient to update the weights, which we found to work better in practice; this is the STE (Straight-Through Estimator) strategy. In other words, we update all the parameters at every step, even the ones we don’t use.
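A minimal sketch of this STE behavior in plain PyTorch (using a simple row-wise 2:4 magnitude prune as a stand-in for our fused kernel): the forward pass sees the pruned weight, while the backward pass hands the full dense gradient to the optimizer.

```python
import torch

class Sparsify24STE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, W):
        # Simple 2:4 prune along the input dimension (stand-in for the fused kernel).
        out_f, in_f = W.shape
        groups = W.view(out_f, in_f // 4, 4)
        drop = groups.abs().topk(2, dim=-1, largest=False).indices
        mask = torch.ones_like(groups).scatter_(-1, drop, 0.0)
        return (groups * mask).view(out_f, in_f)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: the dense gradient updates every weight

W = torch.randn(8, 16, requires_grad=True)
x = torch.randn(4, 16)
y = x @ Sparsify24STE.apply(W).t()   # forward uses the sparsified weight
y.sum().backward()
print((W.grad != 0).all())           # every parameter receives a gradient
```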

Conclusion and Future Work

In this blog post, we’ve shown how to accelerate neural network training with semi-structured sparsity and explained some of the challenges we faced. We were able to achieve a 6% end-to-end speedup on DINOv2 training with a small 0.1 pp accuracy drop.

There are several areas of expansion for this work:

  • Expansion to new sparsity patterns: Researchers have created new sparsity patterns like V:N:M sparsity that use the underlying semi-structured sparse kernels to allow for more flexibility. This is especially interesting for applying sparsity to LLMs, as 2:4 sparsity degrades accuracy too much, but we have seen some positive results for more general N:M patterns.
  • Performance optimizations for sparse fine-tuning: This post covers sparse training from scratch, but oftentimes we want to fine-tune a foundational model. In this case, a static mask may be sufficient to preserve accuracy which would enable us to make additional performance optimizations.
  • More experiments on pruning strategy: We calculate the mask at each step of the network, but calculating the mask every n steps may yield better training accuracy. Overall, figuring out the best strategy to use semi-structured sparsity during training is an open area of research.
  • Compatibility with fp8: The hardware also supports fp8 semi-structured sparsity (in the 4:8 format instead of 2:4), and this approach should work similarly with fp8 in principle. In practice, we would need to write similar sparsification kernels, and could possibly fuse them with the scaling of the tensors.
  • Activation Sparsity: Efficient sparsification kernels also enable us to sparsify the activations during training. Because the sparsification overhead grows linearly with the sparsified matrix size, setups with large activation tensors compared to the weight tensors could benefit more from activation sparsity than weight sparsity. Furthermore, activations are naturally sparse because of the usage of ReLU or GELU activation functions, reducing accuracy degradation.

If you are interested in these problems, please feel free to open an issue or PR in torchao, a community we’re building for architecture optimization techniques like quantization and sparsity. Additionally, if you have a general interest in sparsity, please reach out in CUDA-MODE (#sparsity).

🎉 PyTorch Docathon H1 2024 Wrap-up 🎉

We are thrilled to announce the successful completion of the H1 2024 PyTorch Docathon! The event was a resounding success, and we want to extend our heartfelt gratitude to all the participants who made it possible. The dedication, expertise, and tireless efforts of our open-source contributors have once again helped us improve the PyTorch documentation.

This Docathon ran from June 4 through June 20 with more than 176 registrants. The energy and enthusiasm were palpable, and participants were ranked by the difficulty of their submissions, which together resulted in over 50 merged pull requests.

We want to give a special shout-out to our top contributors, who went above and beyond during this event. Your dedication and expertise have been invaluable in enhancing the PyTorch documentation and empowering developers worldwide.

Meet the top contributors

For the full list of participants, see here.

As we bring this Docathon to a close, we encourage each and every one of you to stay inspired and keep contributing to PyTorch documentation and code, and pushing the boundaries of what’s possible with PyTorch. Your collective efforts are shaping the landscape of deep learning and fostering innovation in the PyTorch community.

Thank you again for your participation and support. We look forward to seeing what you will achieve next!

Team PyTorch
