Intelligent document processing using Amazon Bedrock and Anthropic Claude

Generative artificial intelligence (AI) not only empowers innovation through ideation, content creation, and enhanced customer service, but also streamlines operations and boosts productivity across various domains. To effectively harness this transformative technology, Amazon Bedrock offers a fully managed service that integrates high-performing foundation models (FMs) from leading AI companies, such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, Mistral AI, and Amazon. By providing access to these advanced models through a single API and supporting the development of generative AI applications with an emphasis on security, privacy, and responsible AI, Amazon Bedrock enables you to use AI to explore new avenues for innovation and improve overall offerings.

Enterprise customers can unlock significant value by harnessing the power of intelligent document processing (IDP) augmented with generative AI. By infusing IDP solutions with generative AI capabilities, organizations can revolutionize their document processing workflows, achieving exceptional levels of automation and reliability. This combination enables advanced document understanding, highly effective structured data extraction, automated document classification, and seamless information retrieval from unstructured text. With these capabilities, organizations can achieve scalable, efficient, and high-value document processing that drives business transformation and competitiveness, ultimately leading to improved productivity, reduced costs, and enhanced decision-making.

In this post, we show how to develop an IDP solution using Anthropic Claude 3 Sonnet on Amazon Bedrock. We demonstrate how to extract data from a scanned document and insert it into a database.

The Anthropic Claude 3 Sonnet model is optimized for speed and efficiency, making it an excellent choice for intelligent tasks—particularly for enterprise workloads. It also possesses sophisticated vision capabilities, demonstrating a strong aptitude for understanding a wide range of visual formats, including photos, charts, graphs, and technical diagrams. Although we demonstrate this solution using the Anthropic Claude 3 Sonnet model, you can alternatively use the Haiku and Opus models if your use case requires them.

Solution overview

The proposed solution uses Amazon Bedrock and the powerful Anthropic Claude 3 Sonnet model to enable IDP capabilities. The architecture consists of several AWS services seamlessly integrated with Amazon Bedrock, enabling efficient and accurate extraction of data from scanned documents.

The following diagram illustrates our solution architecture.

The solution consists of the following steps:

  1. The process begins with scanned documents being uploaded and stored in an Amazon Simple Storage Service (Amazon S3) bucket, which invokes an S3 Event Notification on object upload.
  2. This event invokes an AWS Lambda function, responsible for invoking the Anthropic Claude 3 Sonnet model on Amazon Bedrock.
  3. The Anthropic Claude 3 Sonnet model, with its advanced multimodal capabilities, processes the scanned documents and extracts relevant data in a structured JSON format.
  4. The extracted data from the Anthropic Claude 3 model is sent to an Amazon Simple Queue Service (Amazon SQS) queue. Amazon SQS acts as a buffer, allowing components to send and receive messages reliably without being directly coupled, providing scalability and fault tolerance in the system.
  5. Another Lambda function consumes the messages from the SQS queue, parses the JSON data, and stores the extracted key-value pairs in an Amazon DynamoDB table for retrieval and further processing.

This serverless architecture takes advantage of the scalability and cost-effectiveness of AWS services while harnessing the cutting-edge intelligence of Anthropic Claude 3 Sonnet. By combining the robust infrastructure of AWS with Anthropic’s FMs, this solution enables organizations to streamline their document processing workflows, extract valuable insights, and enhance overall operational efficiency.

The solution uses the following services and features:

  • Amazon Bedrock is a fully managed service that provides access to large language models (LLMs), allowing developers to build and deploy their own customized AI applications.
  • The Anthropic Claude 3 family offers a versatile range of models tailored to meet diverse needs. With three options—Opus, Sonnet, and Haiku—you can choose the perfect balance of intelligence, speed, and cost. These models excel at understanding complex enterprise content, including charts, graphs, technical diagrams, and reports.
  • Amazon DynamoDB is a fully managed, serverless, NoSQL database service.
  • AWS Lambda is a serverless computing service that allows you to run code without provisioning or managing servers.
  • Amazon SQS is a fully managed message queuing service.
  • Amazon S3 is a highly scalable, durable, and secure object storage service.

In this solution, we use the generative AI capabilities in Amazon Bedrock to efficiently extract data. At the time of writing this post, Anthropic Claude 3 Sonnet only accepts images as input. The supported file types are GIF, JPEG, PNG, and WebP. You can choose to save images during the scanning process or convert the PDF to images.
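
If your source documents are PDFs, one option is to convert each page to an image before uploading. The following is a minimal sketch that assumes the third-party pdf2image library (which requires Poppler) and a hypothetical file name; it is not part of the solution code:

from pdf2image import convert_from_path

# Convert each page of a scanned PDF (hypothetical file name) to a JPEG image
pages = convert_from_path("birth_certificate_application.pdf", dpi=300)
for page_number, page in enumerate(pages, start=1):
    # Each page is a PIL image that can be saved and uploaded to Amazon S3
    page.save(f"birth_certificate_application_page_{page_number}.jpeg", "JPEG")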

You can also enhance this solution by implementing human-in-the-loop and model evaluation features. The goal of this post is to demonstrate how you can build an IDP solution using Amazon Bedrock. To use it as a production-scale solution, you should take additional considerations into account, such as testing for edge case scenarios, better exception handling, additional prompting techniques, model fine-tuning, model evaluation, throughput requirements, the number of concurrent requests to be supported, and cost and latency implications.

Prerequisites

You need the following prerequisites before you can proceed with this solution. For this post, we use the us-east-1 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

Use case and dataset

For our example use case, let’s look at a state agency responsible for issuing birth certificates. The agency may receive birth certificate applications through various methods, such as online applications, forms completed at a physical location, and mailed-in completed paper applications. Today, most agencies spend a considerable amount of time and resources to manually extract the application details. The process begins with scanning the application forms, manually extracting the details, and then entering them into an application that eventually stores the data into a database. This process is time-consuming, inefficient, not scalable, and error-prone. Additionally, it adds complexity if the application form is in a different language (such as Spanish).

For this demonstration, we use sample scanned images of birth certificate application forms. These forms don’t contain any real personal data. Two examples are provided: one in English (handwritten) and another in Spanish (printed). Save these images as .jpeg files to your computer. You need them later for testing the solution.

Create an S3 bucket

On the Amazon S3 console, create a new bucket with a unique name (for example, bedrock-claude3-idp-{random characters to make it globally unique}) and leave the other settings as default. Within the bucket, create a folder named images and a sub-folder named birth_certificates.
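
If you prefer to script this step, the following boto3 sketch creates an equivalent bucket and folder structure; the bucket name is a placeholder and must be globally unique:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket_name = "bedrock-claude3-idp-example123"  # placeholder; replace with your unique bucket name

# In us-east-1, create_bucket is called without a LocationConstraint
s3.create_bucket(Bucket=bucket_name)

# S3 folders are zero-byte objects whose keys end with a slash
s3.put_object(Bucket=bucket_name, Key="images/birth_certificates/")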

Create an SQS queue

On the Amazon SQS console, create a queue with the Standard queue type, provide a name (for example, bedrock-idp-extracted-data), and leave the other settings as default.
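
The equivalent boto3 call, if you want to script the queue creation, is shown in the following sketch; note that the queue name matches the one referenced later in the IAM policies:

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
response = sqs.create_queue(QueueName="bedrock-idp-extracted-data")
print(response["QueueUrl"])  # use this value for {SQS URL} in the Lambda function code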

Create a Lambda function to invoke the Amazon Bedrock model

On the Lambda console, create a function (for example, invoke_bedrock_claude3), choose Python 3.12 for the runtime, and leave the remaining settings as default. Later, you configure this function to be invoked every time a new image is uploaded into the S3 bucket. You can download the entire Lambda function code from invoke_bedrock_claude3.py. Replace the contents of the lambda_function.py file with the code from the downloaded file. Make sure to substitute {SQS URL} with the URL of the SQS queue you created earlier, then choose Deploy.

The Lambda function should perform the following actions:

import base64
import json
import boto3

s3 = boto3.client('s3')
sqs = boto3.client('sqs')
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
QUEUE_URL = '{SQS URL}'  # replace with the URL of the SQS queue you created earlier
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

The following code gets the image from the S3 bucket using the get_object method and converts it to base64 data:

image_data = s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read()
base64_image = base64.b64encode(image_data).decode('utf-8')

Prompt engineering is a critical factor in unlocking the full potential of generative AI applications like IDP. Crafting well-structured prompts makes sure that the AI system’s outputs are accurate, relevant, and aligned with your objectives, while mitigating potential risks.

With the Anthropic Claude 3 model integrated into the Amazon Bedrock IDP solution, you can use the model’s impressive visual understanding capabilities to effortlessly extract data from documents. Simply provide the image or document as input, and Anthropic Claude 3 will comprehend its contents, seamlessly extracting the desired information and presenting it in a human-readable format. All Anthropic Claude 3 models are capable of understanding non-English languages such as Spanish, Japanese, and French. In this particular use case, we demonstrate how to translate Spanish application forms into English by providing the appropriate prompt instructions.

However, LLMs like Anthropic Claude 3 can exhibit variability in their response formats. To achieve consistent and structured output, you can tailor your prompts to instruct the model to return the extracted data in a specific format, such as JSON with predefined keys. This approach enhances the interoperability of the model’s output with downstream applications and streamlines data processing workflows.

The following is the prompt with the specific JSON output format:

prompt = """
This image shows a birth certificate application form. 
Please precisely copy all the relevant information from the form.
Leave the field blank if there is no information in corresponding field.
If the image is not a birth certificate application form, simply return an empty JSON object. 
If the application form is not filled, leave the fees attributes blank. 
Translate any non-English text to English. 
Organize and return the extracted data in a JSON format with the following keys:
{
    "applicantDetails":{
        "applicantName": "",
        "dayPhoneNumber": "",
        "address": "",
        "city": "",
        "state": "",
        "zipCode": "",
        "email":""
    },
    "mailingAddress":{
        "mailingAddressApplicantName": "",
        "mailingAddress": "",
        "mailingAddressCity": "",
        "mailingAddressState": "",
        "mailingAddressZipCode": ""
    },
    "relationToApplicant":[""],
    "purposeOfRequest": "",
    
    "BirthCertificateDetails":
    {
        "nameOnBirthCertificate": "",
        "dateOfBirth": "",
        "sex": "",
        "cityOfBirth": "",
        "countyOfBirth": "",
        "mothersMaidenName": "",
        "fathersName": "",
        "mothersPlaceOfBirth": "",
        "fathersPlaceOfBirth": "",
        "parentsMarriedAtBirth": "",
        "numberOfChildrenBornInSCToMother": "",
        "diffNameAtBirth":""
    },
    "fees":{
        "searchFee": "",
        "eachAdditionalCopy": "",
        "expediteFee": "",
        "totalFees": ""
    } 
  }
""" 

Invoke the Anthropic Claude 3 Sonnet model using the Amazon Bedrock API. Pass the prompt and the base64 image data as parameters:

def invoke_claude_3_multimodal(prompt, base64_image_data):
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt,
                    },
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": base64_image_data,
                        },
                    },
                ],
            }
        ],
    }

    try:
        response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(request_body))
        return json.loads(response['body'].read())
    except bedrock.exceptions.ClientError as err:
        print(f"Couldn't invoke Claude 3 Sonnet. Here's why: {err.response['Error']['Code']}: {err.response['Error']['Message']}")
        raise

Send the Amazon Bedrock API response to the SQS queue using the send_message method:

def send_message_to_sqs(message_body):
    try:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message_body))
    except sqs.exceptions.ClientError as e:
        print(f"Error sending message to SQS: {e.response['Error']['Code']}: {e.response['Error']['Message']}")

Next, modify the IAM role of the Lambda function to grant the required permissions:

  1. On the Lambda console, navigate to the function.
  2. On the Configuration tab, choose Permissions in the left pane.
  3. Choose the IAM role (for example, invoke_bedrock_claude3-role-{random chars}).

This opens the role in a new tab.

  1. In the Permissions policies section, choose Add permissions, then choose Create inline policy.
  2. On the Create policy page, switch to the JSON tab in the policy editor.
  3. Enter the policy from the following code block, replacing {AWS Account ID} with your AWS account ID and {S3 Bucket Name} with your S3 bucket name.
  4. Choose Next.
  5. Enter a name for the policy (for example, invoke_bedrock_claude3-role-policy), and choose Create policy.
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "bedrock:InvokeModel",
        "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*"
    }, {
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::{S3 Bucket Name}/*"
    }, {
        "Effect": "Allow",
        "Action": "sqs:SendMessage",
        "Resource": "arn:aws:sqs:us-east-1:{AWS Account ID}:bedrock-idp-extracted-data"
    }]
}

The policy will grant the following permissions:

  • Invoke model access to Amazon Bedrock FMs
  • Retrieve objects from the bedrock-claude3-idp... S3 bucket
  • Send messages to the bedrock-idp-extracted-data SQS queue for processing the extracted data

Additionally, modify the Lambda function’s timeout to 2 minutes. By default, it’s set to 3 seconds.
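
You can change the timeout on the function's Configuration tab under General configuration, or script the change with boto3, as in the following sketch:

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")
lambda_client.update_function_configuration(
    FunctionName="invoke_bedrock_claude3",
    Timeout=120,  # seconds; the default is 3
)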

Create an S3 Event Notification

To create an S3 Event Notification, complete the following steps:

  1. On the Amazon S3 console, open the bedrock-claude3-idp... S3 bucket.
  2. Navigate to Properties, and in the Event notifications section, create an event notification.
  3. Enter a name for Event name (for example, bedrock-claude3-idp-event-notification).
  4. Enter images/birth_certificates/ for the prefix.
  5. For Event Type, select Put in the Object creation section.
  6. For Destination, select Lambda function and choose invoke_bedrock_claude3.
  7. Choose Save changes.
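
If you script your setup, the same notification can be created with boto3. The following sketch assumes the placeholder bucket name used earlier and that Amazon S3 already has permission to invoke the function (for example, granted with the Lambda add_permission API); it also overwrites any existing notification configuration on the bucket:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.put_bucket_notification_configuration(
    Bucket="bedrock-claude3-idp-example123",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "bedrock-claude3-idp-event-notification",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:{AWS Account ID}:function:invoke_bedrock_claude3",
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "images/birth_certificates/"}
                        ]
                    }
                },
            }
        ]
    },
)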

Create a DynamoDB table

To store the extracted data in DynamoDB, you need to create a table. On the DynamoDB console, create a table called birth_certificates with Id as the partition key, and keep the remaining settings as default.
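
The boto3 equivalent is shown in the following sketch; on-demand (pay-per-request) billing keeps the example simple:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
dynamodb.create_table(
    TableName="birth_certificates",
    AttributeDefinitions=[{"AttributeName": "Id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "Id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity
)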

Create a Lambda function to insert records into the DynamoDB table

On the Lambda console, create a Lambda function (for example, insert_into_dynamodb), choose Python 3.12 for the runtime, and leave the remaining settings as default. You can download the entire Lambda function code from insert_into_dynamodb.py. Replace the contents of the lambda_function.py file with the code from the downloaded file and choose Deploy.

The Lambda function should perform the following actions:

Get the message from the SQS queue that contains the response from the Anthropic Claude 3 Sonnet model:

data = json.loads(event['Records'][0]['body'])['content'][0]['text']
event_id = event['Records'][0]['messageId']
data = json.loads(data)

Create objects representing DynamoDB and its table:

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('birth_certificates')

Get the key objects from the JSON data:

applicant_details = data.get('applicantDetails', {})
mailing_address = data.get('mailingAddress', {})
relation_to_applicant = data.get('relationToApplicant', [])
birth_certificate_details = data.get('BirthCertificateDetails', {})
fees = data.get('fees', {})

Insert the extracted data into the DynamoDB table using the put_item() method:

table.put_item(Item={
    'Id': event_id,
    'applicantName': applicant_details.get('applicantName', ''),
    'dayPhoneNumber': applicant_details.get('dayPhoneNumber', ''),
    'address': applicant_details.get('address', ''),
    'city': applicant_details.get('city', ''),
    'state': applicant_details.get('state', ''),
    'zipCode': applicant_details.get('zipCode', ''),
    'email': applicant_details.get('email', ''),
    'mailingAddressApplicantName': mailing_address.get('mailingAddressApplicantName', ''),
    'mailingAddress': mailing_address.get('mailingAddress', ''),
    'mailingAddressCity': mailing_address.get('mailingAddressCity', ''),
    'mailingAddressState': mailing_address.get('mailingAddressState', ''),
    'mailingAddressZipCode': mailing_address.get('mailingAddressZipCode', ''),
    'relationToApplicant': ', '.join(relation_to_applicant),
    'purposeOfRequest': data.get('purposeOfRequest', ''),
    'nameOnBirthCertificate': birth_certificate_details.get('nameOnBirthCertificate', ''),
    'dateOfBirth': birth_certificate_details.get('dateOfBirth', ''),
    'sex': birth_certificate_details.get('sex', ''),
    'cityOfBirth': birth_certificate_details.get('cityOfBirth', ''),
    'countyOfBirth': birth_certificate_details.get('countyOfBirth', ''),
    'mothersMaidenName': birth_certificate_details.get('mothersMaidenName', ''),
    'fathersName': birth_certificate_details.get('fathersName', ''),
    'mothersPlaceOfBirth': birth_certificate_details.get('mothersPlaceOfBirth', ''),
    'fathersPlaceOfBirth': birth_certificate_details.get('fathersPlaceOfBirth', ''),
    'parentsMarriedAtBirth': birth_certificate_details.get('parentsMarriedAtBirth', ''),
    'numberOfChildrenBornInSCToMother': birth_certificate_details.get('numberOfChildrenBornInSCToMother', ''),
    'diffNameAtBirth': birth_certificate_details.get('diffNameAtBirth', ''),
    'searchFee': fees.get('searchFee', ''),
    'eachAdditionalCopy': fees.get('eachAdditionalCopy', ''),
    'expediteFee': fees.get('expediteFee', ''),
    'totalFees': fees.get('totalFees', '')
})

Next, modify the IAM role of the Lambda function to grant the required permissions. Follow the same steps you used to modify the permissions for the invoke_bedrock_claude3 Lambda function, but enter the following JSON as the inline policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "dynamodb:PutItem",
            "Resource": "arn:aws:dynamodb:us-east-1::{AWS Account ID}:table/birth_certificates"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:ReceiveMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "arn:aws:sqs:us-east-1::{AWS Account ID}:bedrock-idp-extracted-data"
        }
    ]
}

Enter a policy name (for example, insert_into_dynamodb-role-policy) and choose Create policy.

The policy will grant the following permissions:

  • Put records into the DynamoDB table
  • Read and delete messages from the SQS queue

Configure the Lambda function trigger for SQS

Complete the following steps to create a trigger for the Lambda function:

  1. On the Amazon SQS console, open the bedrock-idp-extracted-data queue.
  2. On the Lambda triggers tab, choose Configure Lambda function trigger.
  3. Select the insert_into_dynamodb Lambda function and choose Save.
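
Alternatively, you can create the trigger as an event source mapping with boto3, as in the following sketch. A batch size of 1 matches the insert_into_dynamodb code, which processes a single SQS record per invocation:

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:{AWS Account ID}:bedrock-idp-extracted-data",
    FunctionName="insert_into_dynamodb",
    BatchSize=1,  # one SQS message (one extracted document) per invocation
)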

Test the solution

Now that you have created all the necessary resources, permissions, and code, it’s time to test the solution.

Upload the two scanned images that you saved earlier to the images/birth_certificates folder in the S3 bucket. Then open the DynamoDB console and explore the items in the birth_certificates table.

If everything is configured properly, you should see two items in DynamoDB in just a few seconds, as shown in the following screenshots. For the Spanish form, Anthropic Claude 3 automatically translated the keys and labels from Spanish to English based on the prompt.

Troubleshooting

If you don’t see the extracted data in the DynamoDB table, you can investigate the issue:

  • Check CloudWatch logs – Review the Amazon CloudWatch log streams of the Lambda functions involved in the data extraction and ingestion process. Look for any error messages or exceptions that may indicate the root cause of the issue.
  • Identify missing permissions – In many cases, errors can occur due to missing permissions. Confirm that the Lambda functions have the necessary permissions to access the required AWS resources, such as DynamoDB tables, S3 buckets, or other services involved in the solution.
  • Implement a dead-letter queue – In a production-scale solution, it is recommended to implement a dead-letter queue (DLQ) to catch and handle any events or messages that fail to process or encounter errors (a minimal configuration sketch follows this list).
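
The following sketch shows one way to attach a DLQ to the bedrock-idp-extracted-data queue; the DLQ name and the maxReceiveCount of 3 are assumptions for illustration:

import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Create a dead-letter queue and look up its ARN
dlq_url = sqs.create_queue(QueueName="bedrock-idp-extracted-data-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After 3 failed receives, messages are moved from the main queue to the DLQ
main_queue_url = sqs.get_queue_url(QueueName="bedrock-idp-extracted-data")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=main_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "3"}
        )
    },
)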

Clean up

Clean up the resources created as part of this post to avoid incurring ongoing charges:

  1. Delete all the objects from the bedrock-claude3-idp... S3 bucket, then delete the bucket.
  2. Delete the two Lambda functions named invoke_bedrock_claude3 and insert_into_dynamodb.
  3. Delete the SQS queue named bedrock-idp-extracted-data.
  4. Delete the DynamoDB table named birth_certificates.

Example use cases and business value

The generative AI-powered IDP solution demonstrated in this post can benefit organizations across various industries, such as:

  • Government and public sector – Process and extract data from citizen applications, immigration documents, legal contracts, and other government-related forms, enabling faster turnaround times and improved service delivery
  • Healthcare – Extract and organize patient information, medical records, insurance claims, and other health-related documents, improving data accuracy and accessibility for better patient care
  • Finance and banking – Automate the extraction and processing of financial documents, loan applications, tax forms, and regulatory filings, reducing manual effort and increasing operational efficiency
  • Logistics and supply chain – Extract and organize data from shipping documents, invoices, purchase orders, and inventory records, streamlining operations and enhancing supply chain visibility
  • Retail and ecommerce – Automate the extraction and processing of customer orders, product catalogs, and marketing materials, enabling personalized experiences and efficient order fulfillment

By using the power of generative AI and Amazon Bedrock, organizations can unlock the true potential of their data, driving operational excellence, enhancing customer experiences, and fostering continuous innovation.

Conclusion

In this post, we demonstrated how to use Amazon Bedrock and the powerful Anthropic Claude 3 Sonnet model to develop an IDP solution. By harnessing the advanced multimodal capabilities of Anthropic Claude 3, we were able to accurately extract data from scanned documents and store it in a structured format in a DynamoDB table.

Although this solution showcases the potential of generative AI in IDP, it may not be suitable for all IDP use cases. The effectiveness of the solution may vary depending on the complexity and quality of the documents, the amount of training data available, and the specific requirements of the organization.

To further enhance the solution, consider implementing a human-in-the-loop workflow to review and validate the extracted data, especially for mission-critical or sensitive applications. This will provide data accuracy and compliance with regulatory requirements. You can also explore the model evaluation feature in Amazon Bedrock to compare model outputs, and then choose the model best suited for your downstream generative AI applications.

For further exploration and learning, we recommend checking out the following resources:


About the Authors

Govind Palanisamy is a Solutions Architect at AWS, where he helps government agencies migrate and modernize their workloads to improve the citizen experience. He is passionate about technology and transformation, and he helps customers transform their businesses using AI/ML and generative AI-based solutions.

Bharath Gunapati is a Sr. Solutions Architect at AWS, where he helps clinicians, researchers, and staff at academic medical centers adopt and use cloud technologies. He is passionate about technology and the impact it can make on healthcare and research.

Metadata filtering for tabular data with Knowledge Bases for Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation. However, information about a dataset can reside in another dataset, called metadata. Without using metadata, your retrieval process can return unrelated results, thereby decreasing FM accuracy and increasing the cost of tokens in the FM prompt.

On March 27, 2024, Amazon Bedrock announced a key new feature called metadata filtering and also changed the default engine. This change allows you to use metadata fields during the retrieval process. However, the metadata fields need to be configured during the knowledge base ingestion process. Often, you might have tabular data where details about one field are available in another field. Also, you could have a requirement to cite the exact text document or text field to prevent hallucination. In this post, we show you how to use the new metadata filtering feature with Knowledge Bases for Amazon Bedrock for such tabular data.

Solution overview

The solution consists of the following high-level steps:

  1. Prepare data for metadata filtering.
  2. Create and ingest data and metadata into the knowledge base.
  3. Retrieve data from the knowledge base using metadata filtering.

Prepare data for metadata filtering

As of this writing, Knowledge Bases for Amazon Bedrock supports Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise, and MongoDB Atlas as underlying vector store providers. In this post, we create and access an OpenSearch Serverless vector store using the Amazon Bedrock Boto3 SDK. For more details, see Set up a vector index for your knowledge base in a supported vector store.

For this post, we create a knowledge base using the public dataset Food.com – Recipes and Reviews. The following screenshot shows an example of the dataset.

The TotalTime is in ISO 8601 format. You can convert that to minutes using the following logic:

import re

# Function to convert an ISO 8601 duration (for example, PT1H30M) to minutes
def convert_to_minutes(duration):
    hours = 0
    minutes = 0

    # Find hours and minutes using regex
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)
    
    if match:
        if match.group(1):
            hours = int(match.group(1))
        if match.group(2):
            minutes = int(match.group(2))
    
    # Convert total time to minutes
    total_minutes = hours * 60 + minutes
    return total_minutes

df['TotalTimeInMinutes'] = df['TotalTime'].apply(convert_to_minutes)

After converting some of the features like CholesterolContent, SugarContent, and RecipeInstructions, the data frame looks like the following screenshot.

To enable the FM to point to a specific menu with a link (cite the document), we split each row of the tabular data into a single text file, with each file containing RecipeInstructions as the data field and TotalTimeInMinutes, CholesterolContent, and SugarContent as metadata. The metadata should be kept in a separate JSON file with the same name as the data file and .metadata.json added to its name. For example, if the data file name is 100.txt, the metadata file name should be 100.txt.metadata.json. For more details, see Add metadata to your files to allow for filtering. Also, the content in the metadata file should be in the following format:

{
    "metadataAttributes": {
        "${attribute1}": "${value1}",
        "${attribute2}": "${value2}",
        ...
    }
}

For the sake of simplicity, we only process the top 2,000 rows to create the knowledge base.

  1. After you import the necessary libraries, create a local directory using the following Python code:
    import pandas as pd
    import os, json, tqdm, boto3
    
    metafolder = 'multi_file_recipe_data'
    os.mkdir(metafolder)

  2. Iterate over the top 2,000 rows to create data and metadata files to store in the local folder:
    for i in tqdm.trange(2000):
        desc = str(df['RecipeInstructions'][i])
        meta = {
        "metadataAttributes": {
            "Name": str(df['Name'][i]),
            "TotalTimeInMinutes": str(df['TotalTimeInMinutes'][i]),
            "CholesterolContent": str(df['CholesterolContent'][i]),
            "SugarContent": str(df['SugarContent'][i]),
        }
        }
        filename = metafolder+'/' + str(i+1)+ '.txt'
        f = open(filename, 'w')
        f.write(desc)
        f.close()
        metafilename = filename+'.metadata.json'
        with open( metafilename, 'w') as f:
            json.dump(meta, f)
    

  3. Create an Amazon Simple Storage Service (Amazon S3) bucket named recipe-kb and upload the files:
    # Upload data to s3
    s3_client = boto3.client("s3")
    bucket_name = "recipe-kb"
    data_root = metafolder+'/'
    def uploadDirectory(path,bucket_name):
        for root,dirs,files in os.walk(path):
            for file in tqdm.tqdm(files):
                s3_client.upload_file(os.path.join(root,file),bucket_name,file)
    
    uploadDirectory(data_root, bucket_name)

Create and ingest data and metadata into the knowledge base

When the S3 folder is ready, you can create the knowledge base on the Amazon Bedrock console or by using the SDK, as shown in this example notebook.
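
For reference, the following abridged sketch shows the main bedrock-agent SDK calls involved. It assumes the IAM execution role, the OpenSearch Serverless collection, and the vector index already exist (the example notebook walks through creating them); the ARNs, role name, and index name shown are placeholders:

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Assumes the execution role, OpenSearch Serverless collection, and vector index already exist
kb = bedrock_agent.create_knowledge_base(
    name="recipe-kb",
    roleArn="arn:aws:iam::{AWS Account ID}:role/{knowledge base execution role}",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1"
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:{AWS Account ID}:collection/{collection id}",
            "vectorIndexName": "recipe-index",
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "metadata",
            },
        },
    },
)["knowledgeBase"]

# Point the knowledge base at the S3 bucket that holds the data files and .metadata.json files
ds = bedrock_agent.create_data_source(
    knowledgeBaseId=kb["knowledgeBaseId"],
    name="recipe-data",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::recipe-kb"},
    },
)["dataSource"]

# Ingest (chunk, embed, and index) the documents along with their metadata
bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb["knowledgeBaseId"],
    dataSourceId=ds["dataSourceId"],
)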

Retrieve data from the knowledge base using metadata filtering

Now let’s retrieve some data from the knowledge base. For this post, we use Anthropic Claude Sonnet on Amazon Bedrock for our FM, but you can choose from a variety of Amazon Bedrock models. First, you need to set the following variables, where kb_id is the ID of your knowledge base. The knowledge base ID can be found programmatically, as shown in the example notebook, or from the Amazon Bedrock console by navigating to the individual knowledge base, as shown in the following screenshot.

Set the required Amazon Bedrock parameters using the following code:

import boto3
import pprint
from botocore.client import Config
import json

pp = pprint.PrettyPrinter(indent=2)
session = boto3.session.Session()
region = session.region_name
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime', region_name = region)
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config, region_name = region)
kb_id = "EIBBXVFDQP"
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

# retrieve api for fetching only the relevant context.

query = " Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10 "

relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'numberOfResults': 2 
        }
    }
)
pp.pprint(relevant_documents["retrievalResults"])

Without metadata filtering, the retrieval from the knowledge base for the query “Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10” returns two recipes whose preparation durations are 30 and 480 minutes and whose cholesterol contents are 86 and 112.4, respectively. The retrieval therefore doesn’t follow the query accurately.

The following code demonstrates how to use the Retrieve API with the metadata filters set to a cholesterol content less than 10 and minutes of preparation less than 30 for the same query:

def retrieve(query, kbId, numberOfResults=5):
    return bedrock_agent_client.retrieve(
        retrievalQuery={
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults,
                'filter': {
                    'andAll': [
                        {
                            'lessThan': {
                                'key': 'CholesterolContent',
                                'value': 10
                            }
                        },
                        {
                            'lessThan': {
                                'key': 'TotalTimeInMinutes',
                                'value': 30
                            }
                        }
                    ]
                }
            }
        }
    )
query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10" 
response = retrieve(query, kb_id, 2)
retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)

As we can see in the results, out of the two recipes, the preparation times are 27 and 20 minutes, respectively, and the cholesterol contents are both 0. With metadata filtering, we get more accurate results.

The following code shows how to get accurate output using the same metadata filtering with the retrieve_and_generate API. First, we set the prompt, then we set up the API with metadata filtering:

prompt = f"""
Human: You have great knowledge about food, so provide answers to questions by using fact. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Assistant:"""

def retrieve_and_generate(query, kb_id, model_id, numberOfResults=10):
    return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': query,
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'generationConfiguration': {
                    'promptTemplate': {
                        'textPromptTemplate': f"{prompt} $search_results$"
                    }
                },
                'knowledgeBaseId': kb_id,
                'modelArn': model_id,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': 'HYBRID',
                        'filter': {
                            'andAll': [
                                {
                                    'lessThan': {
                                        'key': 'CholesterolContent',
                                        'value': 10
                                    }
                                },
                                {
                                    'lessThan': {
                                        'key': 'TotalTimeInMinutes',
                                        'value': 30
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        }
    )

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
response = retrieve_and_generate(query, kb_id, model_id, numberOfResults=10)
pp.pprint(response['output']['text'])

As we can see in the following output, the model returns a detailed recipe that follows the instructed metadata filtering of less than 30 minutes of preparation time and a cholesterol content less than 10.

Clean up

If you plan to keep using the knowledge base you created for building your RAG application, make sure to comment out the following code. If you only wanted to try out creating the knowledge base using the SDK, delete all the resources that were created, because you will incur costs for storing documents in the OpenSearch Serverless index. See the following code:

bedrock_agent_client.delete_data_source(dataSourceId = ds["dataSourceId"], knowledgeBaseId=kb['knowledgeBaseId'])
bedrock_agent_client.delete_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
oss_client.indices.delete(index=index_name)
aoss_client.delete_collection(id=collection_id)
aoss_client.delete_access_policy(type="data", name=access_policy['accessPolicyDetail']['name'])
aoss_client.delete_security_policy(type="network", name=network_policy['securityPolicyDetail']['name'])
aoss_client.delete_security_policy(type="encryption", name=encryption_policy['securityPolicyDetail']['name'])
# Delete roles and policies
iam_client.delete_role(RoleName=bedrock_kb_execution_role)
iam_client.delete_policy(PolicyArn=policy_arn)

Conclusion

In this post, we explained how to split a large tabular dataset into rows to set up a knowledge base with metadata for each of those records, and how to then retrieve outputs with metadata filtering. We also showed how retrieving results with metadata is more accurate than retrieving results without metadata filtering. Lastly, we showed how to use the result with an FM to get accurate results.

To further explore the capabilities of Knowledge Bases for Amazon Bedrock, refer to the following resources:


About the Author

Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services. He helps customers solve their business problems using generative AI and machine learning.
