A secure approach to generative AI with AWS

Generative artificial intelligence (AI) is transforming the customer experience in industries across the globe. Customers are building generative AI applications using large language models (LLMs) and other foundation models (FMs), which enhance customer experiences, transform operations, improve employee productivity, and create new revenue channels.

FMs and the applications built around them represent extremely valuable investments for our customers. They’re often used with highly sensitive business data, like personal data, compliance data, operational data, and financial information, to optimize the model’s output. The biggest concern we hear from customers as they explore the advantages of generative AI is how to protect their highly sensitive data and investments. Because their data and model weights are incredibly valuable, customers require them to stay protected, secure, and private, whether from their own administrators’ accounts, their customers, vulnerabilities in software running in their own environments, or even their cloud service provider.

At AWS, our top priority is safeguarding the security and confidentiality of our customers’ workloads. We think about security across the three layers of our generative AI stack:

  • Bottom layer – Provides the tools for building and training LLMs and other FMs
  • Middle layer – Provides access to all the models along with tools you need to build and scale generative AI applications
  • Top layer – Includes applications that use LLMs and other FMs to make work stress-free by writing and debugging code, generating content, deriving insights, and taking action

Each layer is important to making generative AI pervasive and transformative.

With the AWS Nitro System, we delivered a first-of-its-kind innovation on behalf of our customers. The Nitro System is an unparalleled computing backbone for AWS, with security and performance at its core. Its specialized hardware and associated firmware are designed to enforce restrictions so that nobody, including anyone in AWS, can access your workloads or data running on your Amazon Elastic Compute Cloud (Amazon EC2) instances. Customers have benefited from this confidentiality and isolation from AWS operators on all Nitro-based EC2 instances since 2017.

By design, there is no mechanism for any Amazon employee to access a Nitro EC2 instance that customers use to run their workloads, or to access data that customers send to a machine learning (ML) accelerator or GPU. This protection applies to all Nitro-based instances, including instances with ML accelerators like AWS Inferentia and AWS Trainium, and instances with GPUs like P4, P5, G5, and G6.

The Nitro System enables Elastic Fabric Adapter (EFA), which uses the AWS-built AWS Scalable Reliable Datagram (SRD) communication protocol for cloud-scale elastic and large-scale distributed training, enabling the only always-encrypted Remote Direct Memory Access (RDMA) capable network. All communication through EFA is encrypted with VPC encryption without incurring any performance penalty.

The design of the Nitro System has been validated by the NCC Group, an independent cybersecurity firm. AWS delivers a high level of protection for customer workloads, and we believe this is the level of security and confidentiality that customers should expect from their cloud provider. This level of protection is so critical that we’ve added it in our AWS Service Terms to provide an additional assurance to all of our customers.

Innovating secure generative AI workloads using AWS industry-leading security capabilities

From day one, AWS AI infrastructure and services have had built-in security and privacy features to give you control over your data. As customers move quickly to implement generative AI in their organizations, you need to know that your data is being handled securely across the AI lifecycle, including data preparation, training, and inferencing. The security of model weights—the parameters that a model learns during training that are critical for its ability to make predictions—is paramount to protecting your data and maintaining model integrity.

This is why it is critical for AWS to continue to innovate on behalf of our customers to raise the bar on security across each layer of the generative AI stack. To do this, we believe that you must have security and confidentiality built in across each layer of the generative AI stack. You need to be able to secure the infrastructure to train LLMs and other FMs, build securely with tools to run LLMs and other FMs, and run applications that use FMs with built-in security and privacy that you can trust.

At AWS, securing AI infrastructure refers to zero access to sensitive AI data, such as AI model weights and data processed with those models, by any unauthorized person, either at the infrastructure operator or at the customer. It’s comprised of three key principles:

  1. Complete isolation of the AI data from the infrastructure operator – The infrastructure operator must have no ability to access customer content and AI data, such as AI model weights and data processed with models.
  2. Ability for customers to isolate AI data from themselves – The infrastructure must provide a mechanism to allow model weights and data to be loaded into hardware, while remaining isolated and inaccessible from customers’ own users and software.
  3. Protected infrastructure communications – The communication between devices in the ML accelerator infrastructure must be protected. All externally accessible links between the devices must be encrypted.

The Nitro System fulfills the first principle of Secure AI Infrastructure by isolating your AI data from AWS operators. The second principle provides you with a way to remove administrative access of your own users and software to your AI data. AWS not only offers you a way to achieve that, but we also made it straightforward and practical by investing in building an integrated solution between AWS Nitro Enclaves and AWS Key Management Service (AWS KMS). With Nitro Enclaves and AWS KMS, you can encrypt your sensitive AI data using keys that you own and control, store that data in a location of your choice, and securely transfer the encrypted data to an isolated compute environment for inferencing. Throughout this entire process, the sensitive AI data is encrypted and isolated from your own users and software on your EC2 instance, and AWS operators cannot access this data. Use cases that have benefited from this flow include running LLM inferencing in an enclave. Until now, Nitro Enclaves have operated only in the CPU, limiting the potential for larger generative AI models and more complex processing.
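As a rough illustration of how a key can be tied to an enclave, the following sketch builds a KMS key policy statement that allows decryption only when the request carries an attestation document for a specific enclave image. The account ID, role name, and image measurement are hypothetical placeholders, and the statement is a minimal starting point under those assumptions rather than a complete or recommended key policy.

```python
import json

# Hypothetical placeholders: replace with your own account ID, the IAM role of
# the EC2 instance that hosts the enclave, and the SHA-384 measurement reported
# when you build your enclave image.
ACCOUNT_ID = "111122223333"
INSTANCE_ROLE = "example-enclave-host-role"
ENCLAVE_IMAGE_SHA384 = "EXAMPLE_ENCLAVE_IMAGE_MEASUREMENT"


def enclave_decrypt_statement(account_id, role_name, image_sha384):
    """Build a key policy statement that allows kms:Decrypt only when the
    request's attestation document matches the given enclave image."""
    return {
        "Sid": "AllowDecryptFromAttestedEnclaveOnly",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/{role_name}"},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {
            "StringEqualsIgnoreCase": {
                "kms:RecipientAttestation:ImageSha384": image_sha384
            }
        },
    }


statement = enclave_decrypt_statement(ACCOUNT_ID, INSTANCE_ROLE, ENCLAVE_IMAGE_SHA384)
print(json.dumps(statement, indent=2))
```

With a condition like this in place, a request made from the parent instance outside the enclave does not satisfy the attestation check, which is how the data stays isolated from your own users and software.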

We announced our plans to extend this Nitro end-to-end encrypted flow to include first-class integration with ML accelerators and GPUs, fulfilling the third principle. You will be able to decrypt and load sensitive AI data into an ML accelerator for processing while providing isolation from your own operators and verified authenticity of the application used for processing the AI data. Through the Nitro System, you can cryptographically validate your applications to AWS KMS and decrypt data only when the necessary checks pass. This enhancement allows AWS to offer end-to-end encryption for your data as it flows through generative AI workloads.

We plan to offer this end-to-end encrypted flow in the upcoming AWS-designed Trainium2 as well as GPU instances based on NVIDIA’s upcoming Blackwell architecture, which both offer secure communications between devices, the third principle of Secure AI Infrastructure. AWS and NVIDIA are collaborating closely to bring a joint solution to market, including NVIDIA’s new NVIDIA Blackwell GPU platform, which couples NVIDIA’s GB200 NVL72 solution with the Nitro System and EFA technologies to provide an industry-leading solution for securely building and deploying next-generation generative AI applications.

Advancing the future of generative AI security

Today, tens of thousands of customers are using AWS to experiment and move transformative generative AI applications into production. Generative AI workloads contain highly valuable and sensitive data that needs protection from both your own operators and the cloud service provider. Customers using AWS Nitro-based EC2 instances have received this level of protection and isolation from AWS operators since 2017, when we launched our innovative Nitro System.

At AWS, we’re continuing that innovation as we invest in building performant and accessible capabilities to make it practical for our customers to secure their generative AI workloads across the three layers of the generative AI stack, so that you can focus on what you do best: building and extending the uses of generative AI to more areas. Learn more here.


About the authors

Anthony Liguori is an AWS VP and Distinguished Engineer for EC2

Colm MacCárthaigh is an AWS VP and Distinguished Engineer for EC2

Cost-effective document classification using the Amazon Titan Multimodal Embeddings Model

Organizations across industries want to categorize and extract insights from high volumes of documents of different formats. Manually processing these documents to classify and extract information remains expensive, error prone, and difficult to scale. Advances in generative artificial intelligence (AI) have given rise to intelligent document processing (IDP) solutions that can automate the document classification, and create a cost-effective classification layer capable of handling diverse, unstructured enterprise documents.

Categorizing documents is an important first step in IDP systems. It helps you determine the next set of actions to take depending on the type of document. For example, during the claims adjudication process, the accounts payable team receives the invoice, whereas the claims department manages the contract or policy documents. Traditional rule engines or ML-based classification can classify the documents, but often reach a limit on the document formats they can handle and on support for dynamically adding new classes of documents. For more information, see Amazon Comprehend document classifier adds layout support for higher accuracy.

In this post, we discuss document classification using the Amazon Titan Multimodal Embeddings model to classify any document types without the need for training.

Amazon Titan Multimodal Embeddings

Amazon recently introduced Titan Multimodal Embeddings in Amazon Bedrock. This model can create embeddings for images and text, enabling the creation of document embeddings to be used in new document classification workflows.

It generates optimized vector representations of documents scanned as images. By encoding both visual and textual components into unified numerical vectors that encapsulate semantic meaning, it enables rapid indexing, powerful contextual search, and accurate classification of documents.

As new document templates and types emerge in business workflows, you can simply invoke the Amazon Bedrock API to dynamically vectorize them and append them to your IDP systems to rapidly enhance document classification capabilities.

Solution overview

Let’s examine the following document classification solution with the Amazon Titan Multimodal Embeddings model. For optimal performance, you should customize the solution to your specific use case and existing IDP pipeline setup.

This solution classifies documents using vector embedding semantic search by matching an input document to an already indexed gallery of documents. We use the following key components:

  • Embeddings – Embeddings are numerical representations of real-world objects that machine learning (ML) and AI systems use to understand complex knowledge domains like humans do.
  • Vector databases – Vector databases are used to store embeddings. Vector databases efficiently index and organize the embeddings, enabling fast retrieval of similar vectors based on distance metrics like Euclidean distance or cosine similarity.
  • Semantic search – Semantic search works by considering the context and meaning of the input query and its relevance to the content being searched. Vector embeddings are an effective way to capture and retain the contextual meaning of text and images. In our solution, when an application wants to perform a semantic search, the search document is first converted into an embedding. The vector database with relevant content is then queried to find the most similar embeddings.
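To make the two distance metrics mentioned above concrete, here is a minimal pure-Python sketch. The three-element vectors are made-up toy values standing in for real embeddings, which for this model have 1,024 dimensions by default.

```python
import math


def euclidean_distance(a, b):
    """L2 distance: smaller values mean the vectors are closer together."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative only, not real model output).
invoice = [1.0, 0.0, 0.0]
similar_invoice = [0.9, 0.1, 0.0]
bank_statement = [0.0, 1.0, 0.0]

print(euclidean_distance(invoice, similar_invoice))
print(euclidean_distance(invoice, bank_statement))
print(cosine_similarity(invoice, similar_invoice))
```

In this toy data the second invoice is closer to the first invoice than the bank statement is under both metrics, which is the property a vector database exploits when it retrieves the nearest stored embedding.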

In the labeling process, a sample set of business documents, such as invoices, bank statements, or prescriptions, is converted into embeddings using the Amazon Titan Multimodal Embeddings model and stored in a vector database against predefined labels. The Amazon Titan Multimodal Embeddings model was trained using the Euclidean L2 algorithm, so for best results the vector database you use should support this algorithm.

The following architecture diagram illustrates how you can use the Amazon Titan Multimodal Embeddings model with documents in an Amazon Simple Storage Service (Amazon S3) bucket for image gallery creation.

The workflow consists of the following steps:

  1. A user or application uploads a sample document image with classification metadata to a document image gallery. An S3 prefix or S3 object metadata can be used to classify gallery images.
  2. An Amazon S3 object notification event invokes the embedding AWS Lambda function.
  3. The Lambda function reads the document image and translates the image into embeddings by calling Amazon Bedrock and using the Amazon Titan Multimodal Embeddings model.
  4. Image embeddings, along with document classification, are stored in the vector database.
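The first three steps of this workflow can be sketched as two small helpers that an embedding Lambda function might use. This is a sketch under stated assumptions: the function names are hypothetical, the label is assumed to be the first segment of the S3 key (one way to use an S3 prefix for classification), and the request body carries only the inputImage field on the assumption that no text is sent. The actual call to Amazon Bedrock and the write to the vector database are left out.

```python
import base64
import json
from urllib.parse import unquote_plus


def parse_s3_event(event):
    """Return a (bucket, key, label) tuple for each record in an S3 event.

    Assumes the classification label is the first path segment of the key,
    for example "Invoices/scan-001.png" has the label "Invoices".
    """
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        label = key.split("/", 1)[0] if "/" in key else None
        results.append((bucket, key, label))
    return results


def build_embedding_request(image_bytes):
    """Serialize a request body for the embeddings model from raw image bytes."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"inputImage": image_b64})


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-gallery-bucket"},
                "object": {"key": "Invoices/march+invoice.png"}}}
    ]
}
print(parse_s3_event(sample_event))
```

Deriving the label from the prefix keeps the gallery self-describing: uploading a sample under a new prefix is enough to introduce a new document class.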

When a new document needs classification, the same embedding model is used to convert the query document into an embedding. Then, a semantic similarity search is performed on the vector database using the query embedding. The label retrieved against the top embedding match will be the classification label for the query document.
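The idea of taking the label of the top match can be illustrated without any vector database at all. The following brute-force sketch compares a query vector against every gallery entry; the function name and the two-dimensional toy vectors are hypothetical, and a real system would delegate this search to the vector database.

```python
def classify_by_nearest(query, gallery):
    """Return (label, squared_l2_distance) of the gallery entry closest to query.

    gallery is a list of (embedding, label) pairs.
    """
    best_label, best_dist = None, float("inf")
    for embedding, label in gallery:
        dist = sum((q - e) ** 2 for q, e in zip(query, embedding))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Toy gallery with made-up two-dimensional "embeddings".
gallery = [
    ([0, 0], "Invoices"),
    ([10, 10], "W4"),
]
print(classify_by_nearest([1, 1], gallery))
```

Because the classifier is just a lookup, adding a new document class means adding labeled vectors to the gallery rather than retraining a model.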

The following architecture diagram illustrates how to use the Amazon Titan Multimodal Embeddings model with documents in an S3 bucket for image classification.

The workflow consists of the following steps:

  1. Documents that require classification are uploaded to an input S3 bucket.
  2. The classification Lambda function receives the Amazon S3 object notification.
  3. The Lambda function translates the image to an embedding by calling the Amazon Bedrock API.
  4. The vector database is searched for a matching document using semantic search. Classification of the matching document is used to classify the input document.
  5. The input document is moved to the target S3 directory or prefix using the classification retrieved from the vector database search.
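Step 5 can be sketched as a small helper that derives the destination key from the retrieved label. The function name and the "classified" output prefix are hypothetical. Amazon S3 has no native move operation, so in practice the move would be a copy to the new key followed by a delete of the original.

```python
import posixpath


def target_key(source_key, label, output_prefix="classified"):
    """Build the destination S3 key for a classified document.

    Spaces are removed from the label so it can serve as a simple folder name.
    """
    filename = posixpath.basename(source_key)
    folder = label.replace(" ", "")
    return posixpath.join(output_prefix, folder, filename)


print(target_key("incoming/scan-001.png", "Bank Statement"))
```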

To help you test the solution with your own documents, we have created an example Python Jupyter notebook, which is available on GitHub.

Prerequisites

To run the notebook, you need an AWS account with appropriate AWS Identity and Access Management (IAM) permissions to call Amazon Bedrock. Additionally, on the Model access page of the Amazon Bedrock console, make sure that access is granted for the Amazon Titan Multimodal Embeddings model.

Implementation

In the following steps, replace each user input placeholder with your own information:

  1. Create the vector database. In this solution, we use an in-memory FAISS database, but you could use an alternative vector database. Amazon Titan’s default dimension size is 1024.
     import faiss

     index = faiss.IndexFlatL2(1024)       # flat index that compares vectors by L2 distance
     indexIDMap = faiss.IndexIDMap(index)  # wrapper that stores a class ID with each vector
  2. After the vector database is created, enumerate over the sample documents, create an embedding of each, and store the embeddings in the vector database.
  3. Test with your documents. Replace the folders in the following code with your own folders that contain known document types:
     DOC_CLASSES: list[str] = ["Closing Disclosure", "Invoices", "Social Security Card", "W4", "Bank Statement"]

     # Each folder is indexed under the position of its class name in DOC_CLASSES.
     getDocumentsandIndex("sampleGallery/ClosingDisclosure", DOC_CLASSES.index("Closing Disclosure"))
     getDocumentsandIndex("sampleGallery/Invoices", DOC_CLASSES.index("Invoices"))
     getDocumentsandIndex("sampleGallery/SSCards", DOC_CLASSES.index("Social Security Card"))
     getDocumentsandIndex("sampleGallery/W4", DOC_CLASSES.index("W4"))
     getDocumentsandIndex("sampleGallery/BankStatements", DOC_CLASSES.index("Bank Statement"))
  4. Using the Boto3 library, call Amazon Bedrock. The variable inputImageB64 is a base64-encoded string representing your document image. The response from Amazon Bedrock contains the embeddings.
     import json

     import boto3

     bedrock = boto3.client(
         service_name='bedrock-runtime',
         region_name='Region'  # replace with your AWS Region
     )

     request_body = {}
     request_body["inputText"] = None  # not using any text
     request_body["inputImage"] = inputImageB64
     body = json.dumps(request_body)
     response = bedrock.invoke_model(
         body=body,
         modelId="amazon.titan-embed-image-v1",
         accept="application/json",
         contentType="application/json")
     response_body = json.loads(response.get("body").read())
  5. Add the embeddings to the vector database, with a class ID that represents a known document type:
     indexIDMap.add_with_ids(embeddings, classID)
  6. With the vector database populated with images (representing our gallery), you can uncover similarities with new documents. For example, the following is the syntax used for search. The argument k=1 tells FAISS to return only the top match.
     indexIDMap.search(embeddings, k=1)

In addition, the Euclidean L2 distance between the image on hand and the found image is also returned. If the image is an exact match, this value would be 0. The larger this value is, the further apart the images are in similarity.
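The search result can be turned into a label with a few lines of code. The following sketch assumes the result has the shape FAISS returns for a single query with k=1, a pair of two-dimensional arrays of distances and IDs, and that an ID of -1 means no match was found. The function name and the distance threshold are hypothetical; a suitable threshold depends on your documents and would need to be chosen empirically. Note that IndexFlatL2 reports squared L2 distances, so thresholds should be set on that scale.

```python
DOC_CLASSES = ["Closing Disclosure", "Invoices", "Social Security Card", "W4", "Bank Statement"]


def label_from_search(distances, ids, max_distance):
    """Map a k=1 search result for one query to (label, distance).

    Returns (None, distance) when no neighbor was found or when the best
    match is further away than max_distance.
    """
    distance = float(distances[0][0])
    class_id = int(ids[0][0])
    if class_id < 0 or distance > max_distance:
        return None, distance
    return DOC_CLASSES[class_id], distance


# Plain nested lists stand in for the arrays a search would return.
print(label_from_search([[0.0]], [[1]], max_distance=0.5))
print(label_from_search([[0.9]], [[3]], max_distance=0.5))
```

Rejecting matches beyond a threshold gives the pipeline a way to flag documents that do not resemble anything in the gallery instead of forcing them into the nearest class.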

Additional considerations

In this section, we discuss additional considerations for using the solution effectively. This includes data privacy, security, integration with existing systems, and cost estimates.

Data privacy and security

The AWS shared responsibility model applies to data protection in Amazon Bedrock. As described in this model, AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud. Customers are responsible for maintaining control over their content that is hosted on this infrastructure. As a customer, you are responsible for the security configuration and management tasks for the AWS services that you use.

Data protection in Amazon Bedrock

Amazon Bedrock doesn’t use customer prompts and continuations to train AWS models and doesn’t share them with third parties. Amazon Bedrock doesn’t store or log customer data in its service logs. Model providers don’t have access to Amazon Bedrock logs or access to customer prompts and continuations. As a result, the images used for generating embeddings through the Amazon Titan Multimodal Embeddings model are not stored, used to train AWS models, or distributed externally. Additionally, other usage data, such as timestamps and logged account IDs, is excluded from model training.

Integration with existing systems

The Amazon Titan Multimodal Embeddings model underwent training with the Euclidean L2 algorithm, so the vector database being used should be compatible with this algorithm.

Cost estimate

At the time of writing this post, as per Amazon Bedrock Pricing for the Amazon Titan Multimodal Embeddings model, the following are the estimated costs using on-demand pricing for this solution:

  • One-time indexing cost – $0.06 for a single run of indexing, assuming a gallery of 1,000 images
  • Classification cost – $6 for 100,000 input images per month
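Both figures follow from the same per-image rate, which can be backed out of the indexing estimate. The sketch below derives that rate from the numbers above and applies it to other volumes; the function and variable names are illustrative, and current pricing should always be checked on the Amazon Bedrock Pricing page.

```python
GALLERY_IMAGES = 1_000
GALLERY_COST_USD = 0.06

# Per-image rate implied by the indexing estimate above.
price_per_image = GALLERY_COST_USD / GALLERY_IMAGES


def embedding_cost(num_images, unit_price=price_per_image):
    """Estimated on-demand cost in USD of embedding num_images images."""
    return num_images * unit_price


monthly_classification = embedding_cost(100_000)
print(f"Implied price per image: ${price_per_image:.5f}")
print(f"Monthly cost for 100,000 images: ${monthly_classification:.2f}")
```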

Clean up

To avoid incurring future charges, delete the resources you created, such as the Amazon SageMaker notebook instance, when not in use.

Conclusion

In this post, we explored how you can use the Amazon Titan Multimodal Embeddings model to build an inexpensive solution for document classification in the IDP workflow. We demonstrated how to create an image gallery of known documents and perform similarity searches with new documents to classify them. We also discussed the benefits of using multimodal image embeddings for document classification, including their ability to handle diverse document types, scalability, and low latency.

As new document templates and types emerge in business workflows, developers can invoke the Amazon Bedrock API to vectorize them dynamically and append them to their IDP systems to rapidly enhance document classification capabilities. This creates an inexpensive, highly scalable classification layer that can handle diverse, unstructured enterprise documents.

Overall, this post provides a roadmap for building an inexpensive solution for document classification in the IDP workflow using Amazon Titan Multimodal Embeddings.

As next steps, check out What is Amazon Bedrock to start using the service. And follow Amazon Bedrock on the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Bedrock.


About the Authors

Sumit Bhati is a Senior Customer Solutions Manager at AWS who specializes in expediting the cloud journey for enterprise customers. Sumit is dedicated to assisting customers through every phase of their cloud adoption, from accelerating migrations to modernizing workloads and facilitating the integration of innovative practices.

David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.

Ravi Avula is a Senior Solutions Architect in AWS focusing on Enterprise Architecture. Ravi has 20 years of experience in software engineering and has held several leadership roles in software engineering and software architecture working in the payments industry.

George Belsian is a Senior Cloud Application Architect at AWS. He is passionate about helping customers accelerate their modernization and cloud adoption journey. In his current role, George works alongside customer teams to strategize, architect, and develop innovative, scalable solutions.

AWS at NVIDIA GTC 2024: Accelerate innovation with generative AI on AWS

AWS was delighted to present to and connect with over 18,000 in-person and 267,000 virtual attendees at NVIDIA GTC, a global artificial intelligence (AI) conference that took place in March 2024 in San Jose, California, returning to a hybrid, in-person experience for the first time since 2019.

AWS has had a long-standing collaboration with NVIDIA for over 13 years. AWS was the first Cloud Service Provider (CSP) to offer NVIDIA GPUs in the public cloud, and remains among the first to deploy NVIDIA’s latest technologies.

Looking back at AWS re:Invent 2023, Jensen Huang, founder and CEO of NVIDIA, chatted with AWS CEO Adam Selipsky on stage, discussing how NVIDIA and AWS are working together to enable millions of developers to access powerful technologies needed to rapidly innovate with generative AI. NVIDIA is known for its cutting-edge accelerators and full-stack solutions that contribute to advancements in AI. The company is combining this expertise with the highly scalable, reliable, and secure AWS Cloud infrastructure to help customers run advanced graphics, machine learning, and generative AI workloads at an accelerated pace.

The collaboration between AWS and NVIDIA further expanded at GTC 2024, with the CEOs from both companies sharing their perspectives on the collaboration and state of AI in a press release:

“The deep collaboration between our two organizations goes back more than 13 years, when together we launched the world’s first GPU cloud instance on AWS, and today we offer the widest range of NVIDIA GPU solutions for customers,” says Adam Selipsky, CEO of AWS. “NVIDIA’s next-generation Grace Blackwell processor marks a significant step forward in generative AI and GPU computing. When combined with AWS’s powerful Elastic Fabric Adapter networking, Amazon EC2 UltraClusters’ hyper-scale clustering, and our unique AWS Nitro System’s advanced virtualization and security capabilities, we make it possible for customers to build and run multi-trillion parameter large language models faster, at massive scale, and more securely than anywhere else. Together, we continue to innovate to make AWS the best place to run NVIDIA GPUs in the cloud.”

“AI is driving breakthroughs at an unprecedented pace, leading to new applications, business models, and innovation across industries,” says Jensen Huang, founder and CEO of NVIDIA. “Our collaboration with AWS is accelerating new generative AI capabilities and providing customers with unprecedented computing power to push the boundaries of what’s possible.”

Joint announcements and keynote

On the first day of the NVIDIA GTC, AWS and NVIDIA made a joint announcement focused on their strategic collaboration to advance generative AI. Huang included the AWS and NVIDIA collaboration on a slide during his keynote, highlighting the following announcements. The GTC keynote had over 21 million views within the first 72 hours.

Media coverage

By March 22, AWS’s announcement with NVIDIA had generated 104 articles mentioning AWS and Amazon. The vast majority of coverage mentioned AWS’s plans to offer Blackwell-based instances. Adam Selipsky appeared on CNBC’s Mad Money to discuss the long-standing collaboration between AWS and NVIDIA, among the many other ways AWS is innovating in generative AI, stating that AWS has been the first to bring many of NVIDIA’s GPUs to the cloud to drive efficiency and scalability for customers.

Project Ceiba has also been a focus in media coverage. Forbes referred to Project Ceiba as the “most exciting” project by AWS and NVIDIA, stating that it “should accelerate the pace of innovation in AI, making it possible to tackle more complex problems, develop more sophisticated models, and achieve previously unattainable breakthroughs.” The Next Platform ran an in-depth piece on Ceiba, stating that “the size and the aggregate compute of Ceiba cluster are both being radically expanded, which will give AWS a very large supercomputer in one of its data centers” and NVIDIA will use it to do AI research, among other things.

Live from GTC

“Live from GTC” was an on-site studio at GTC for invited speakers to have a fireside chat with tech influencers like VentureBeat. Chetan Kapoor, Director of Product Management for Amazon EC2 at AWS, was interviewed by VentureBeat at the Live from GTC studio, where he discussed AWS’s presence and highlighted key announcements at GTC.

The AWS booth and sessions

The AWS booth showcased generative AI services, like the LLMs with Anthropic and Cohere on Amazon Bedrock, PartyRock, Amazon Q, Amazon SageMaker JumpStart, and more. Highlights included:

AWS presence with partners and customers

During GTC, AWS invited 23 partner and customer solution demos to join its booth with either a dedicated demo kiosk or a 30-minute in-booth session. Such partners and customers included Ansys, Anthropic, Articul8, Bria.ai, Cohere, Deci, Deepbrain.AI, Denali Advanced Integration, Ganit, Hugging Face, Lilt, Linker Vision, Mavenir, MCE, Media.Monks, Modular, NVIDIA, Perplexity, Quantiphi, Run.ai, Salesforce, Second Spectrum, and Slalom.

Among them, high-potential early-stage startups in generative AI across the globe were showcased with a dedicated kiosk at the AWS booth. The AWS Startups team works closely with these companies by investing in and supporting their growth, offering resources through programs like AWS Activate.

AWS Generative AI Competency

NVIDIA was one of the 45 launch partners for the new AWS Generative AI Competency program. The Generative AI Center of Excellence for AWS Partners team members were on site at the AWS booth, presenting this program for both existing and potential AWS partners. The program offers valuable resources along with best practices for all AWS partners to build, market, and sell generative AI solutions jointly with AWS.

Additional resources

Watch a video recap of the AWS presence at NVIDIA GTC 2024. For additional resources about the AWS and NVIDIA collaboration, refer to the AWS at NVIDIA GTC 2024 resource hub.


About the Author

Julie Tang is the Senior Global Partner Marketing Manager for Generative AI at Amazon Web Services (AWS), where she collaborates closely with NVIDIA to plan and execute partner marketing initiatives focused on generative AI. Throughout her tenure at AWS, she has held various partner marketing roles, including Global IoT Solutions, AWS Partner Solution Factory, and Sr. Campaign Manager in Americas Field Marketing. Prior to AWS, Julie served as the Marketing Director at Segway. She holds a Master’s degree in Communications Management with a focus on marketing and entertainment management from the University of Southern California, and dual Bachelor’s degrees in Law and Broadcast Journalism from Fudan University.

Build an active learning pipeline for automatic annotation of images with AWS services

This blog post is co-written with Caroline Chung from Veoneer.

Veoneer is a global automotive electronics company and a world leader in automotive electronic safety systems. They offer best-in-class restraint control systems and have delivered over 1 billion electronic control units and crash sensors to car manufacturers globally. The company continues to build on a 70-year history of automotive safety development, specializing in cutting-edge hardware and systems that prevent traffic incidents and mitigate accidents.

Automotive in-cabin sensing (ICS) is an emerging space that uses a combination of several types of sensors such as cameras and radar, and artificial intelligence (AI) and machine learning (ML) based algorithms for enhancing safety and improving riding experience. Building such a system can be a complex task. Developers have to manually annotate large volumes of images for training and testing purposes. This is very time consuming and resource intensive. The turnaround time for such a task is several weeks. Furthermore, companies have to deal with issues such as inconsistent labels due to human errors.

AWS is focused on helping you increase your development speed and lower your costs for building such systems through advanced analytics like ML. Our vision is to use ML for automated annotation, enabling retraining of safety models, and ensuring consistent and reliable performance metrics. In this post, we share how, by collaborating with Amazon’s Worldwide Specialist Organization and the Generative AI Innovation Center, we developed an active learning pipeline for in-cabin image head bounding boxes and key points annotation. The solution reduces cost by over 90%, accelerates the annotation process from weeks to hours in terms of the turnaround time, and enables reusability for similar ML data labeling tasks.

Solution overview

Active learning is an ML approach that involves an iterative process of selecting and annotating the most informative data to train a model. Given a small set of labeled data and a large set of unlabeled data, active learning improves model performance, reduces labeling effort, and integrates human expertise for robust results. In this post, we build an active learning pipeline for image annotations with AWS services.

The following diagram demonstrates the overall framework for our active learning pipeline. The labeling pipeline takes images from an Amazon Simple Storage Service (Amazon S3) bucket and outputs annotated images with the cooperation of ML models and human expertise. The training pipeline preprocesses data and uses them to train ML models. The initial model is set up and trained on a small set of manually labeled data, and will be used in the labeling pipeline. The labeling pipeline and training pipeline can be iterated gradually with more labeled data to enhance the model’s performance.

Auto labeling workflow

In the labeling pipeline, an Amazon S3 Event Notification is invoked when a new batch of images comes into the Unlabeled Datastore S3 bucket, activating the labeling pipeline. The model produces the inference results on the new images. A customized judgement function selects parts of the data based on the inference confidence score or other user-defined functions. This data, with its inference results, is sent for a human labeling job on Amazon SageMaker Ground Truth created by the pipeline. The human labeling process helps annotate the data, and the modified results are combined with the remaining auto annotated data, which can be used later by the training pipeline.
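As a rough illustration of the first step, the following sketch shows how the new image locations could be read from an S3 event notification before the model runs inference. The function name, the bucket name, and the file suffixes are hypothetical and not part of the original pipeline; only the event layout (Records, s3.bucket.name, s3.object.key) follows the standard S3 notification structure.

```python
from urllib.parse import unquote_plus


def extract_new_images(event, suffixes=(".jpeg", ".jpg", ".png")):
    """Return s3:// URIs for image objects listed in an S3 event notification."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(suffixes):
            uris.append(f"s3://{bucket}/{key}")
    return uris


# Hypothetical event with one image and one non-image object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "unlabeled-datastore"},
                "object": {"key": "batch_01/frame+001.jpeg"}}},
        {"s3": {"bucket": {"name": "unlabeled-datastore"},
                "object": {"key": "batch_01/notes.txt"}}},
    ]
}
print(extract_new_images(sample_event))
```

Filtering by suffix keeps non-image uploads from starting an inference run; a real pipeline might instead filter by prefix in the notification configuration.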

Model retraining happens in the training pipeline, where we use the dataset containing the human-labeled data to retrain the model. A manifest file is produced to describe where the files are stored, and the same initial model is retrained on the new data. After retraining, the new model replaces the initial model, and the next iteration of the active learning pipeline starts.

Model deployment

Both the labeling pipeline and training pipeline are deployed on AWS CodePipeline. AWS CodeBuild instances are used for implementation, which is flexible and fast for a small amount of data. When speed is needed, we use Amazon SageMaker endpoints based on the GPU instance to allocate more resources to support and accelerate the process.

The model retraining pipeline can be invoked when there is a new dataset or when the model’s performance needs improvement. One critical task in the retraining pipeline is version control for both the training data and the model. Although AWS services such as Amazon Rekognition have an integrated version control feature, which makes the pipeline straightforward to implement, customized models require metadata logging or additional version control tools.

The entire workflow is implemented using the AWS Cloud Development Kit (AWS CDK) to create necessary AWS components, including the following:

  • Two roles for CodePipeline and SageMaker jobs
  • Two CodePipeline jobs, which orchestrate the workflow
  • Two S3 buckets for the code artifacts of the pipelines
  • One S3 bucket for the labeling job manifest, datasets, and models
  • Preprocessing and postprocessing AWS Lambda functions for the SageMaker Ground Truth labeling jobs

The AWS CDK stacks are highly modularized and reusable across different tasks. The training, inference code, and SageMaker Ground Truth template can be replaced for any similar active learning scenarios.

Model training

Model training includes two tasks: head bounding box annotation and human key points annotation. We introduce them both in this section.

Head bounding box annotation

Head bounding box annotation is a task to predict the location of a bounding box of the human head in an image. We use an Amazon Rekognition Custom Labels model for head bounding box annotations. The following sample notebook provides a step-by-step tutorial on how to train a Rekognition Custom Labels model via SageMaker.

We first need to prepare the data to start the training. We generate a manifest file for the training and a manifest file for the test dataset. A manifest file contains multiple items, each of which is for an image. The following is an example of the manifest file, which includes the image path, size, and annotation information:

{
    "source-ref": "s3://mlsl-sandox/rekognition_images/train/IMS_00000_00_000_000_R2_1900_01_01_00000_compressed_front_tof_amp_000.jpeg",
    "bounding-box-attribute-name": {
        "image_size": [{
                "width": 640,
                "height": 480,
                "depth": 3
            }
        ],
        "annotations": [{
                "class_id": 1,
                "top": 189,
                "left": 209,
                "width": 97,
                "height": 121
            }
        ]
    },
    "bounding-box-attribute-name-metadata": {
        "objects": [{
                "confidence": 1
            }
        ],
        "class-map": {
            "1": "Head"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-04-07T20:04:42",
        "job-name": "testjob"
    }
}
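The following sketch shows one way to generate such an item programmatically. It assumes the usual JSON Lines layout for manifest files (one JSON object per line); the helper name, its arguments, and the example bucket are hypothetical, while the field names mirror the example above.

```python
import json


def manifest_line(image_uri, image_size, boxes, class_map, job_name, created,
                  attribute="bounding-box-attribute-name"):
    """Serialize one image's bounding boxes as a single manifest line."""
    width, height, depth = image_size
    entry = {
        "source-ref": image_uri,
        attribute: {
            "image_size": [{"width": width, "height": height, "depth": depth}],
            "annotations": [
                {"class_id": c, "top": t, "left": l, "width": w, "height": h}
                for (c, t, l, w, h) in boxes
            ],
        },
        f"{attribute}-metadata": {
            # One confidence entry per annotated box.
            "objects": [{"confidence": 1} for _ in boxes],
            "class-map": {str(k): v for k, v in class_map.items()},
            "type": "groundtruth/object-detection",
            "human-annotated": "yes",
            "creation-date": created,
            "job-name": job_name,
        },
    }
    return json.dumps(entry)


line = manifest_line(
    "s3://example-bucket/train/image_000.jpeg",  # hypothetical image location
    (640, 480, 3),
    [(1, 189, 209, 97, 121)],                    # (class_id, top, left, width, height)
    {1: "Head"},
    "testjob",
    "2023-04-07T20:04:42",
)
print(line)
```

Writing one such line per image to a file gives a training or test manifest that can then be uploaded to Amazon S3.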

Using the manifest files, we can load datasets to a Rekognition Custom Labels model for training and testing. We iterated the model with different amounts of training data and tested it on the same 239 unseen images. In this test, the mAP_50 score increased from 0.33 with 114 training images to 0.95 with 957 training images. The following screenshot shows the performance metrics of the final Rekognition Custom Labels model, which yields great performance in terms of F1 score, precision, and recall.

We further tested the model on a withheld dataset that has 1,128 images. The model consistently produces accurate bounding boxes on the unseen data, yielding a high mAP_50 of 94.9%. The following example shows an auto-annotated image with a head bounding box.
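For readers less familiar with the metric, mAP_50 treats a predicted box as a match when its intersection over union (IoU) with the ground truth box is at least 0.5. The following is a minimal sketch of that IoU check, using the (top, left, width, height) layout from the manifest example; the function name and the shifted prediction are illustrative assumptions, not the evaluation code used for the reported numbers.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (top, left, width, height)."""
    top_a, left_a, w_a, h_a = box_a
    top_b, left_b, w_b, h_b = box_b
    # Overlap along each axis, clipped at zero when the boxes do not intersect.
    inter_w = max(0, min(left_a + w_a, left_b + w_b) - max(left_a, left_b))
    inter_h = max(0, min(top_a + h_a, top_b + h_b) - max(top_a, top_b))
    inter = inter_w * inter_h
    union = w_a * h_a + w_b * h_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (189, 209, 97, 121)  # box from the manifest example
prediction = (195, 215, 97, 121)    # hypothetical prediction shifted by 6 pixels
score = iou(ground_truth, prediction)
print(round(score, 3), score >= 0.5)
```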

Key points annotation

Key points annotation produces locations of key points, including eyes, ears, nose, mouth, neck, shoulders, elbows, wrists, hips, and ankles. In addition to the location, this task requires predicting the visibility of each point, for which we designed a novel method.

For key points annotation, we use a YOLOv8 Pose model on SageMaker as the initial model. We first prepare the data for training, including generating label files and a configuration .yaml file following YOLO’s requirements. After preparing the data, we train the model and save artifacts, including the model weights file. With the trained model weights file, we can annotate the new images.

In the training stage, all the labeled points with locations, including visible points and occluded points, are used for training. Therefore, this model by default provides the location and confidence of the prediction. In the following figure, a large confidence threshold (main threshold) near 0.6 separates the points that are visible or occluded from the points that are outside of the camera’s view. However, the confidence does not separate occluded points from visible points, which means the predicted confidence alone is not useful for predicting visibility.

To get the prediction of visibility, we introduce an additional model trained on the dataset containing only visible points, excluding both occluded points and points outside of the camera’s view. The following figure shows the distribution of points with different visibility. Visible points and other points can be separated in the additional model. We can use a threshold (additional threshold) near 0.6 to get the visible points. By combining these two models, we design a method to predict the location and visibility.

A key point is first predicted by the main model with location and main confidence, then we get the additional confidence prediction from the additional model. Its visibility is then classified as follows:

  • Visible, if its main confidence is greater than its main threshold, and its additional confidence is greater than the additional threshold
  • Occluded, if its main confidence is greater than its main threshold, and its additional confidence is less than or equal to the additional threshold
  • Outside of the camera’s view, otherwise
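The rule above can be sketched as a small function. The function name and the string labels are hypothetical, and the default thresholds of 0.6 are taken from the approximate values mentioned earlier; in practice they would be tuned on the confidence distributions.

```python
def classify_visibility(main_conf, additional_conf,
                        main_threshold=0.6, additional_threshold=0.6):
    """Combine the two model confidences into a visibility label."""
    if main_conf > main_threshold:
        # The main model is confident the point is inside the frame;
        # the additional model decides whether it is actually visible.
        if additional_conf > additional_threshold:
            return "visible"
        return "occluded"
    return "outside"


for confidences in [(0.9, 0.8), (0.9, 0.3), (0.2, 0.1)]:
    print(confidences, classify_visibility(*confidences))
```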

An example of key points annotation is demonstrated in the following image, where solid marks are visible points and hollow marks are occluded points. Points outside of the camera’s view are not shown.

Based on the standard object keypoint similarity (OKS) definition on the MS-COCO dataset, our method achieves an mAP_50 of 98.4% on the unseen test dataset. In terms of visibility, the method yields a 79.2% classification accuracy on the same dataset.
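As a reminder of what the metric measures, the COCO-style OKS averages exp(-d_i^2 / (2 s^2 k_i^2)) over the labeled key points, where d_i is the distance between the predicted and labeled point, s^2 is the object area, and k_i is a per-keypoint constant. The following is a minimal sketch of that formula; the function name, the sample coordinates, and the k values are illustrative assumptions, not the constants or evaluation code used for the results above.

```python
import math


def oks(pred, truth, visibility, area, k):
    """Object keypoint similarity between predicted and labeled key points.

    pred, truth: lists of (x, y) points; visibility: flags where 0 means
    the point is unlabeled; area: object scale s**2; k: per-keypoint constants.
    """
    total, labeled = 0.0, 0
    for (px, py), (tx, ty), v, k_i in zip(pred, truth, visibility, k):
        if v > 0:  # only labeled key points contribute
            d2 = (px - tx) ** 2 + (py - ty) ** 2
            total += math.exp(-d2 / (2 * area * k_i ** 2))
            labeled += 1
    return total / labeled if labeled else 0.0


truth = [(100.0, 120.0), (140.0, 118.0), (0.0, 0.0)]
pred = [(102.0, 121.0), (139.0, 119.0), (50.0, 50.0)]
flags = [2, 1, 0]                 # the third point is unlabeled and ignored
print(oks(pred, truth, flags, area=97 * 121, k=[0.05, 0.05, 0.05]))
```

A prediction then counts toward mAP_50 when its OKS with the ground truth is at least 0.5, in the same way IoU is used for bounding boxes.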

Human labeling and retraining

Although the models achieve great performance on test data, they can still make mistakes on new real-world data. Human labeling corrects these mistakes so that retraining can enhance model performance. We designed a judgement function that combines the confidence values output by the ML models for all head bounding boxes or key points into a final score. We use the final score to identify these mistakes and the resulting badly labeled images, which need to be sent to the human labeling process.

In addition to badly labeled images, a small portion of images are randomly chosen for human labeling. These human-labeled images are added into the current version of the training set for retraining, enhancing model performance and overall annotation accuracy.
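The selection logic can be sketched as follows. This is only an illustration under stated assumptions: the post does not specify how confidences are combined, so the sketch takes the minimum confidence per image as the final score, and the function name, threshold, and sampling fraction are hypothetical.

```python
import random


def select_for_human_labeling(confidences, threshold=0.6,
                              random_fraction=0.05, seed=0):
    """Split image IDs into (to_review, auto_accepted).

    confidences maps an image ID to the confidence values of its predictions.
    """
    def final_score(values):
        # Assumption: an image is only as reliable as its weakest prediction;
        # an image with no predictions at all is always flagged.
        return min(values) if values else 0.0

    flagged = sorted(i for i, v in confidences.items()
                     if final_score(v) < threshold)
    remaining = sorted(i for i in confidences if i not in flagged)
    # Random spot checks on images the model appears confident about.
    n_random = min(len(remaining), round(random_fraction * len(remaining)))
    spot_checks = random.Random(seed).sample(remaining, n_random)
    to_review = flagged + sorted(spot_checks)
    auto_accepted = [i for i in remaining if i not in spot_checks]
    return to_review, auto_accepted


scores = {"img_1": [0.95, 0.91], "img_2": [0.92, 0.41],
          "img_3": [], "img_4": [0.88]}
to_review, auto_accepted = select_for_human_labeling(scores, random_fraction=0.5)
print(to_review, auto_accepted)
```

Fixing the random seed makes a labeling batch reproducible, which helps when the same batch needs to be re-created for auditing.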

In the implementation, we use SageMaker Ground Truth for the human labeling process. SageMaker Ground Truth provides a user-friendly and intuitive UI for data labeling. The following screenshot demonstrates a SageMaker Ground Truth labeling job for head bounding box annotation.

The following screenshot demonstrates a SageMaker Ground Truth labeling job for key points annotation.

Cost, speed, and reusability

Cost and speed are the key advantages of our solution compared to human labeling, as shown in the following tables. Using the accelerated GPU SageMaker instance ml.g4dn.xlarge, the total training and inference cost on 100,000 images is 99% less than the cost of human labeling, while the speed is 10–10,000 times faster than human labeling, depending on the task.

The first table summarizes the cost performance metrics.

Model | mAP_50 based on 1,128 test images | Training cost based on 100,000 images | Inference cost based on 100,000 images | Cost reduction compared to human annotation | Inference time based on 100,000 images | Time acceleration compared to human annotation
Rekognition head bounding box | 0.949 | $4 | $22 | 99% less | 5.5 h | Days
Yolo Key points | 0.984 | $27.20 * | $10 | 99.9% less | minutes | Weeks

The following table summarizes performance metrics.

Annotation Task | mAP_50 (%) | Training Cost ($) | Inference Cost ($) | Inference Time
Head Bounding Box | 94.9 | 4 | 22 | 5.5 hours
Key Points | 98.4 | 27 | 10 | 5 minutes

Moreover, our solution provides reusability for similar tasks. Camera perception developments for other systems like advanced driver assist system (ADAS) and in-cabin systems can also adopt our solution.

Summary

In this post, we showed how to build an active learning pipeline for automatic annotation of in-cabin images utilizing AWS services. We demonstrate the power of ML, which enables you to automate and expedite the annotation process, and the flexibility of the framework that uses models either supported by AWS services or customized on SageMaker. With Amazon S3, SageMaker, Lambda, and SageMaker Ground Truth, you can streamline data storage, annotation, training, and deployment, and achieve reusability while reducing costs significantly. By implementing this solution, automotive companies can become more agile and cost-efficient by using ML-based advanced analytics such as automated image annotation.

Get started today and unlock the power of AWS services and machine learning for your automotive in-cabin sensing use cases!


About the Authors

Yanxiang Yu is an Applied Scientist at the Amazon Generative AI Innovation Center. With over 9 years of experience building AI and machine learning solutions for industrial applications, he specializes in generative AI, computer vision, and time series modeling.

Tianyi Mao is an Applied Scientist at AWS based in the Chicago area. He has 5+ years of experience in building machine learning and deep learning solutions and focuses on computer vision and reinforcement learning with human feedback. He enjoys working with customers to understand their challenges and solve them by creating innovative solutions using AWS services.

Yanru Xiao is an Applied Scientist at the Amazon Generative AI Innovation Center, where he builds AI/ML solutions for customers’ real-world business problems. He has worked in several fields, including manufacturing, energy, and agriculture. Yanru obtained his Ph.D. in Computer Science from Old Dominion University.

Paul George is an accomplished product leader with over 15 years of experience in automotive technologies. He is adept at leading product management, strategy, Go-to-Market and systems engineering teams. He has incubated and launched several new sensing and perception products globally. At AWS, he is leading strategy and go-to-market for autonomous vehicle workloads.

Caroline Chung is an engineering manager at Veoneer (acquired by Magna International). She has over 14 years of experience developing sensing and perception systems. She currently leads interior sensing pre-development programs at Magna International, managing a team of computer vision engineers and data scientists.

Knowledge Bases for Amazon Bedrock now supports custom prompts for the RetrieveAndGenerate API and configuration of the maximum number of retrieved results

With Knowledge Bases for Amazon Bedrock, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant, context-specific, and accurate responses without retraining the FMs.

In this post, we discuss two new features of Knowledge Bases for Amazon Bedrock specific to the RetrieveAndGenerate API: configuring the maximum number of results and creating custom prompts with a knowledge base prompt template. You can now choose these as query options alongside the search type.

Overview and benefits of new features

The maximum number of results option gives you control over the number of search results to be retrieved from the vector store and passed to the FM for generating the answer. This allows you to customize the amount of background information provided for generation, thereby giving more context for complex questions or less for simpler questions. It allows you to fetch up to 100 results. This option helps improve the likelihood of relevant context, thereby improving the accuracy and reducing the hallucination of the generated response.

The custom knowledge base prompt template allows you to replace the default prompt template with your own to customize the prompt that’s sent to the model for response generation. This allows you to customize the tone, output format, and behavior of the FM when it responds to a user’s question. With this option, you can fine-tune terminology to better match your industry or domain (such as healthcare or legal). Additionally, you can add custom instructions and examples tailored to your specific workflows.

In the following sections, we explain how you can use these features with either the AWS Management Console or SDK.

Prerequisites

To follow along with these examples, you need to have an existing knowledge base. For instructions to create one, see Create a knowledge base.

Configure the maximum number of results using the console

To use the maximum number of results option using the console, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge bases in the left navigation pane.
  2. Select the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the configuration icon.
  5. Choose Sync data source before you start testing your knowledge base.
  6. Under Configurations, for Search Type, select a search type based on your use case.

For this post, we use hybrid search because it combines semantic and text search to provide greater accuracy. To learn more about hybrid search, see Knowledge Bases for Amazon Bedrock now supports hybrid search.

  7. Expand Maximum number of source chunks and set your maximum number of results.

To demonstrate the value of the new feature, we show examples of how you can increase the accuracy of the generated response. We used the Amazon 10-K document for 2023 as the source data for creating the knowledge base. We use the following query for experimentation: “In what year did Amazon’s annual revenue increase from $245B to $434B?”

The correct response for this query is “Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022,” based on the documents in the knowledge base. We used Claude v2 as the FM to generate the final response based on the contextual information retrieved from the knowledge base. Claude 3 Sonnet and Claude 3 Haiku are also supported as the generation FMs.

We ran another query to demonstrate the comparison of retrieval with different configurations. We used the same input query (“In what year did Amazon’s annual revenue increase from $245B to $434B?”) and set the maximum number of results to 5.

As shown in the following screenshot, the generated response was “Sorry, I am unable to assist you with this request.”

Next, we set the maximum results to 12 and ask the same question. The generated response is “Amazon’s annual revenue increase from $245B in 2019 to $434B in 2022.”

As shown in this example, we are able to retrieve the correct answer based on the number of retrieved results. If you want to learn more about the source attribution that constitutes the final output, choose Show source details to validate the generated answer based on the knowledge base.

Customize a knowledge base prompt template using the console

You can also customize the default prompt with your own prompt based on the use case. To do so on the console, complete the following steps:

  1. Repeat the steps in the previous section to start testing your knowledge base.
  2. Enable Generate responses.
  3. Select the model of your choice for response generation.

We use the Claude v2 model as an example in this post. The Claude 3 Sonnet and Haiku models are also available for generation.

  4. Choose Apply to proceed.

After you choose the model, a new section called Knowledge base prompt template appears under Configurations.

  5. Choose Edit to start customizing the prompt.
  6. Adjust the prompt template to customize how you want to use the retrieved results and generate content.

For this post, we gave a few examples for creating a “Financial Advisor AI system” using Amazon financial reports with custom prompts. For best practices on prompt engineering, refer to Prompt engineering guidelines.

We now customize the default prompt template in several different ways, and observe the responses.

Let’s first try a query with the default prompt. We ask “What was the Amazon’s revenue in 2019 and 2021?” The following shows our results.

From the output, we find that it’s generating the free-form response based on the retrieved knowledge. The citations are also listed for reference.

Let’s say we want to give extra instructions on how to format the generated response, like standardizing it as JSON. We can add these instructions as a separate step after retrieving the information, as part of the prompt template:

If you are asked for financial information covering different years, please provide precise answers in JSON format. Use the year as the key and the concise answer as the value. For example: {year:answer}

The final response has the required structure.

By customizing the prompt, you can also change the language of the generated response. In the following example, we instruct the model to provide an answer in Spanish.

After removing $output_format_instructions$ from the default prompt, the citation from the generated response is removed.

In the following sections, we explain how you can use these features with the SDK.

Configure the maximum number of results using the SDK

To change the maximum number of results with the SDK, use the following syntax. For this example, the query is “In what year did Amazon’s annual revenue increase from $245B to $434B?” The correct response is “Amazon’s annual revenue increase from $245B in 2019 to $434B in 2022.”

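import boto3

# Runtime client for Knowledge Bases for Amazon Bedrock; uses your default AWS credentials
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')

def retrieveAndGenerate(query, kbId, numberOfResults, model_id, region_id):
    # Build the ARN of the foundation model that generates the final answer
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    return bedrock_agent_runtime.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,
                'modelArn': model_arn,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        # Maximum number of chunks to retrieve from the vector store
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': 'SEMANTIC',  # optional
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        },
    )

# Set numberOfResults, model_id, and region_id before making the call
response = retrieveAndGenerate(
    "In what year did Amazon’s annual revenue increase from $245B to $434B?",
    "<knowledge base id>", numberOfResults, model_id, region_id)['output']['text']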

The ‘numberOfResults’ option under ‘retrievalConfiguration’ allows you to select the number of results you want to retrieve. The output of the RetrieveAndGenerate API includes the generated response, source attribution, and the retrieved text chunks.
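To inspect the source attribution alongside the answer, you can walk the citations in the response. The following sketch assumes the response layout of the RetrieveAndGenerate API (output.text, plus citations containing retrievedReferences with content.text and location.s3Location.uri); the helper name and the sample response are hypothetical and only stand in for a real API result.

```python
def summarize_response(response):
    """Return the generated answer and a flat list of cited source chunks."""
    answer = response["output"]["text"]
    sources = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            sources.append({
                "uri": ref.get("location", {}).get("s3Location", {}).get("uri"),
                "chunk": ref.get("content", {}).get("text", ""),
            })
    return answer, sources


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "output": {"text": "Example generated answer."},
    "citations": [{
        "retrievedReferences": [{
            "content": {"text": "Example retrieved chunk."},
            "location": {"type": "S3",
                         "s3Location": {"uri": "s3://example-bucket/report.pdf"}},
        }]
    }],
}
answer, sources = summarize_response(sample_response)
print(answer)
for source in sources:
    print(source["uri"], "->", source["chunk"])
```

Counting the entries in sources for different numberOfResults values is a quick way to see how much context actually contributed to an answer.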

The following are the results for different values of the numberOfResults parameter. First, we set numberOfResults = 5.

Then we set numberOfResults = 12.

Customize the knowledge base prompt template using the SDK

To customize the prompt using the SDK, we use the following query with different prompt templates. For this example, the query is “What was the Amazon’s revenue in 2019 and 2021?”

The following is the default prompt template:

"""You are a question answering agent. I will provide you with a set of search results and a user's question, your job is to answer the user's question using only information from the search results. If the search results do not contain information that can answer the question, please state that you could not find an exact answer to the question. Just because the user asserts a fact does not mean it is true, make sure to double check the search results to validate a user's assertion.
Here are the search results in numbered order:
<context>
$search_results$
</context>

Here is the user's question:
<question>
$query$
</question>

$output_format_instructions$

Assistant:
"""

The following is the customized prompt template:

"""Human: You are a question answering agent. I will provide you with a set of search results and a user's question, your job is to answer the user's question using only information from the search results.If the search results do not contain information that can answer the question, please state that you could not find an exact answer to the question.Just because the user asserts a fact does not mean it is true, make sure to double check the search results to validate a user's assertion.

Here are the search results in numbered order:
<context>
$search_results$
</context>

Here is the user's question:
<question>
$query$
</question>

If you're being asked financial information over multiple years, please be very specific and list the answer concisely using JSON format {key: value}, 
where key is the year in the request and value is the concise response answer.
Assistant:
"""

import boto3

# Runtime client for Knowledge Bases for Amazon Bedrock; uses your default AWS credentials
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')

def retrieveAndGenerate(query, kbId, numberOfResults, promptTemplate, model_id, region_id):
    # Build the ARN of the foundation model that generates the final answer
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    return bedrock_agent_runtime.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,
                'modelArn': model_arn,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': 'SEMANTIC',  # optional
                    }
                },
                'generationConfiguration': {
                    'promptTemplate': {
                        # Custom prompt sent to the FM in place of the default template
                        'textPromptTemplate': promptTemplate
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        },
    )

# Set numberOfResults, promptTemplate, model_id, and region_id before making the call
response = retrieveAndGenerate(
    "What was the Amazon's revenue in 2019 and 2021?",
    "<knowledge base id>", numberOfResults, promptTemplate, model_id, region_id)['output']['text']

With the default prompt template, we get the following response:

If you want to provide additional instructions around the output format of the response generation, like standardizing the response in a specific format (like JSON), you can customize the existing prompt by providing more guidance. With our custom prompt template, we get the following response.

The ‘promptTemplate’ option in ‘generationConfiguration’ allows you to customize the prompt for better control over answer generation.
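When you maintain several prompt variants, it can help to build them from one base template and to check which placeholders each variant still contains, because removing a placeholder changes the output (for example, removing $output_format_instructions$ drops the citations). The following is a minimal sketch; the helper names and the shortened base template are hypothetical, and only the three placeholder names come from the templates shown earlier.

```python
KNOWN_PLACEHOLDERS = ("$search_results$", "$query$", "$output_format_instructions$")


def placeholders_present(template):
    """Report which of the known placeholders appear in a prompt template."""
    return {name: name in template for name in KNOWN_PLACEHOLDERS}


def add_instructions(template, instructions, marker="Assistant:"):
    """Insert extra instructions just before the final marker line."""
    head, sep, tail = template.rpartition(marker)
    if not sep:
        # No marker found: append the instructions at the end instead.
        return f"{template.rstrip()}\n\n{instructions}\n"
    return f"{head.rstrip()}\n\n{instructions}\n\n{sep}{tail}"


base_template = (
    "Here are the search results in numbered order:\n"
    "<context>\n$search_results$\n</context>\n\n"
    "Here is the user's question:\n"
    "<question>\n$query$\n</question>\n\n"
    "Assistant:\n"
)
custom_template = add_instructions(
    base_template,
    "If you are asked for financial information covering different years, "
    "please provide precise answers in JSON format.",
)
print(placeholders_present(custom_template))
```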

Conclusion

In this post, we introduced two new features in Knowledge Bases for Amazon Bedrock: adjusting the maximum number of search results and customizing the default prompt template for the RetrieveAndGenerate API. We demonstrated how to configure these features on the console and via SDK to improve performance and accuracy of the generated response. Increasing the maximum results provides more comprehensive information, whereas customizing the prompt template allows you to fine-tune instructions for the foundation model to better align with specific use cases. These enhancements offer greater flexibility and control, enabling you to deliver tailored experiences for RAG-based applications.

For additional resources to start implementing in your AWS environment, refer to the following:


About the authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Knowledge Bases for Amazon Bedrock now supports metadata filtering to improve retrieval accuracy

At AWS re:Invent 2023, we announced the general availability of Knowledge Bases for Amazon Bedrock. With Knowledge Bases for Amazon Bedrock, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data using a fully managed Retrieval Augmented Generation (RAG) model.

For RAG-based applications, the accuracy of the generated responses from FMs depends on the context provided to the model. Contexts are retrieved from vector stores based on user queries. In the recently released feature for Knowledge Bases for Amazon Bedrock, hybrid search, you can combine semantic search with keyword search. However, in many situations, you may need to retrieve documents created in a defined period or tagged with certain categories. To refine the search results, you can filter based on document metadata to improve retrieval accuracy, which in turn leads to more relevant FM generations aligned with your interests.

In this post, we discuss the new custom metadata filtering feature in Knowledge Bases for Amazon Bedrock, which you can use to improve search results by pre-filtering your retrievals from vector stores.

Metadata filtering overview

Prior to the release of metadata filtering, all semantically relevant chunks up to the pre-set maximum would be returned as context for the FM to use to generate a response. Now, with metadata filters, you can retrieve not just semantically relevant chunks but a well-defined subset of those relevant chunks based on applied metadata filters and associated values.

With this feature, you can now supply a custom metadata file (each up to 10 KB) for each document in the knowledge base. You can apply filters to your retrievals, instructing the vector store to pre-filter based on document metadata and then search for relevant documents. This way, you have control over the retrieved documents, especially if your queries are ambiguous. For example, you can distinguish between legal documents that use similar terms in different contexts, or movies with a similar plot released in different years. In addition to improving accuracy, reducing the number of chunks that are searched brings performance advantages, such as fewer CPU cycles and a lower cost of querying the vector store.

To use the metadata filtering feature, you need to provide metadata files alongside the source data files, each with the same name as its source data file plus a .metadata.json suffix. Metadata values can be strings, numbers, or Booleans. The following is an example of the metadata file content:

{
    "metadataAttributes" : { 
        "tag" : "project EVE",
        "year" :  2016,
        "team": "ninjas"
    }
}
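Before uploading, it can be useful to check each metadata file against the constraints described above. The following sketch is one way to do that; the helper names are hypothetical, and reading "10 KB" as 10 * 1024 bytes is an assumption.

```python
import json

MAX_METADATA_BYTES = 10 * 1024  # assumption: "10 KB" read as 10 * 1024 bytes


def build_metadata(attributes):
    """Validate attribute types and size, then return the metadata file content."""
    for key, value in attributes.items():
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"{key!r} must be a string, number, or Boolean")
    body = json.dumps({"metadataAttributes": attributes}, indent=4)
    if len(body.encode("utf-8")) > MAX_METADATA_BYTES:
        raise ValueError("metadata file exceeds the size limit")
    return body


def metadata_key(source_key):
    """Name the metadata file after its source file plus the expected suffix."""
    return f"{source_key}.metadata.json"


content = build_metadata({"tag": "project EVE", "year": 2016, "team": "ninjas"})
print(metadata_key("docs/project_eve.pdf"))  # hypothetical source file
print(content)
```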

The metadata filtering feature of Knowledge Bases for Amazon Bedrock is available in AWS Regions US East (N. Virginia) and US West (Oregon).

The following are common use cases for metadata filtering:

  • Document chatbot for a software company – This allows users to find product information and troubleshooting guides. Filters on the operating system or application version, for example, can help avoid retrieving obsolete or irrelevant documents.
  • Conversational search of an organization’s application – This allows users to search through documents, kanbans, meeting recording transcripts, and other assets. Using metadata filters on work groups, business units, or project IDs, you can personalize the chat experience and improve collaboration. An example would be, “What is the status of project Sphinx and risks raised,” where users can filter documents for a specific project or source type (such as email or meeting documents).
  • Intelligent search for software developers – This allows developers to look for information about a specific release. Filters on the release version or document type (such as code, API reference, or issue) can help pinpoint relevant documents.

Solution overview

In the following sections, we demonstrate how to prepare a dataset to use as a knowledge base, and then query with metadata filtering. You can query using either the AWS Management Console or SDK.

Prepare a dataset for Knowledge Bases for Amazon Bedrock

For this post, we use a sample dataset about fictional video games to illustrate how to ingest and retrieve metadata using Knowledge Bases for Amazon Bedrock. If you want to follow along in your own AWS account, download the file.

If you want to add metadata to your documents in an existing knowledge base, create the metadata files with the expected filename and schema, then skip to the step to sync your data with the knowledge base to start the incremental ingestion.

In our sample dataset, each game’s document is a separate CSV file (for example, s3://$bucket_name/video_game/$game_id.csv) with the following columns:

title, description, genres, year, publisher, score

Each game’s metadata has the suffix .metadata.json (for example, s3://$bucket_name/video_game/$game_id.csv.metadata.json) with the following schema:

{
  "metadataAttributes": {
    "id": number, 
    "genres": string,
    "year": number,
    "publisher": string,
    "score": number
  }
}
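As a minimal sketch of how such a dataset could be prepared, the following code writes one CSV document and its matching metadata file for a single game. The function name write_game_files and the sample record (including its publisher and score values) are hypothetical and used only for illustration. Uploading the resulting files to Amazon S3 is left out.

```python
import csv
import json
from pathlib import Path

CSV_COLUMNS = ["title", "description", "genres", "year", "publisher", "score"]
METADATA_KEYS = ["id", "genres", "year", "publisher", "score"]


def write_game_files(game: dict, out_dir: Path) -> tuple[Path, Path]:
    """Write <id>.csv and <id>.csv.metadata.json for one game record."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # The document itself: a one-row CSV with the columns listed above.
    doc_path = out_dir / f"{game['id']}.csv"
    with doc_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerow(game)

    # The metadata file uses the document's full file name plus ".metadata.json".
    meta_path = out_dir / f"{doc_path.name}.metadata.json"
    metadata = {"metadataAttributes": {key: game[key] for key in METADATA_KEYS}}
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return doc_path, meta_path


# Hypothetical sample record; publisher and score are made up for illustration.
sample_game = {
    "id": 1,
    "title": "Viking Saga: The Sea Raider",
    "description": "Lead a Viking fleet across the northern seas.",
    "genres": "Strategy",
    "year": 2023,
    "publisher": "neo_tokyo_games",
    "score": 9.1,
}
```

Calling write_game_files(sample_game, Path("video_game")) would produce video_game/1.csv and video_game/1.csv.metadata.json, ready to upload under the same S3 prefix.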

Create a knowledge base for Amazon Bedrock

For instructions to create a new knowledge base, see Create a knowledge base. For this example, we use the following settings:

  • On the Set up data source page, under Chunking strategy, select No chunking, because you’ve already preprocessed the documents in the previous step.
  • In the Embeddings model section, choose Titan G1 Embeddings – Text.
  • In the Vector database section, choose Quick create a new vector store. The metadata filtering feature is available for all supported vector stores.

Synchronize the dataset with the knowledge base

After you create the knowledge base, and your data files and metadata files are in an Amazon Simple Storage Service (Amazon S3) bucket, you can start the incremental ingestion. For instructions, see Sync to ingest your data sources into the knowledge base.
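If you prefer to start the sync from code, the following is a minimal sketch that assumes the bedrock-agent client in Boto3 and its start_ingestion_job and get_ingestion_job operations. The function name sync_data_source, the polling interval, and the set of terminal status values are assumptions made for this sketch; check them against the current API reference before relying on them.

```python
import time

# Status values assumed to mean the ingestion job has stopped running.
TERMINAL_STATUSES = ("COMPLETE", "FAILED", "STOPPED")


def sync_data_source(client, knowledge_base_id: str, data_source_id: str,
                     poll_seconds: float = 10.0) -> str:
    """Start an ingestion job, wait for it to finish, and return its final status."""
    job = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]

    # Poll until the job reaches a terminal status.
    while job["status"] not in TERMINAL_STATUSES:
        time.sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    return job["status"]


# Usage (requires AWS credentials; the IDs are placeholders):
#   import boto3
#   status = sync_data_source(boto3.client("bedrock-agent"), "<kb-id>", "<data-source-id>")
```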

Query with metadata filtering on the Amazon Bedrock console

To use the metadata filtering options on the Amazon Bedrock console, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Choose the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the Configurations icon, then expand Filters.
  5. Enter a condition using the format: key = value (for example, genres = Strategy) and press Enter.
  6. To change the key, value, or operator, choose the condition.
  7. Continue with the remaining conditions (for example, (genres = Strategy AND year >= 2023) OR (score >= 9)).
  8. When finished, enter your query in the message box, then choose Run.

For this post, we enter the query “A strategy game with cool graphic released after 2023.”

Query with metadata filtering using the SDK

To use the SDK, first create the client for the Agents for Amazon Bedrock runtime:

import boto3

bedrock_agent_runtime = boto3.client(
    service_name = "bedrock-agent-runtime"
)

Then construct the filter (the following are some examples):

# genres = Strategy
single_filter= {
    "equals": {
        "key": "genres",
        "value": "Strategy"
    }
}

# genres = Strategy AND year >= 2023
one_group_filter= {
    "andAll": [
        {
            "equals": {
                "key": "genres",
                "value": "Strategy"
            }
        },
        {
            "greaterThanOrEquals": {
                "key": "year",
                "value": 2023
            }
        }
    ]
}

# (genres = Strategy AND year >=2023) OR score >= 9
two_group_filter = {
    "orAll": [
        {
            "andAll": [
                {
                    "equals": {
                        "key": "genres",
                        "value": "Strategy"
                    }
                },
                {
                    "greaterThanOrEquals": {
                        "key": "year",
                        "value": 2023
                    }
                }
            ]
        },
        {
            "greaterThanOrEquals": {
                "key": "score",
                "value": 9
            }
        }
    ]
}
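Nested filters like these are easy to mistype by hand, so you might prefer to compose them with small helper functions. The following is one possible sketch. The helper names condition, and_all, and or_all are hypothetical. The operator names follow the lowerCamelCase spelling used by the API (equals, greaterThanOrEquals, andAll, orAll), and the check that a group contains at least two filters reflects our understanding of the API constraints.

```python
def condition(operator: str, key: str, value) -> dict:
    """Build a single comparison, for example equals or greaterThanOrEquals."""
    return {operator: {"key": key, "value": value}}


def _group(name: str, filters: tuple) -> dict:
    # Assumption: the API expects at least two filters inside a group.
    if len(filters) < 2:
        raise ValueError(f"{name} needs at least two filters")
    return {name: list(filters)}


def and_all(*filters: dict) -> dict:
    """Combine filters so that every one of them must match."""
    return _group("andAll", filters)


def or_all(*filters: dict) -> dict:
    """Combine filters so that at least one of them must match."""
    return _group("orAll", filters)


# (genres = Strategy AND year >= 2023) OR score >= 9
built_filter = or_all(
    and_all(
        condition("equals", "genres", "Strategy"),
        condition("greaterThanOrEquals", "year", 2023),
    ),
    condition("greaterThanOrEquals", "score", 9),
)
```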

Pass the filter to retrievalConfiguration of the Retrieve API or RetrieveAndGenerate API:

retrievalConfiguration={
    "vectorSearchConfiguration": {
        "filter": metadata_filter
    }
}
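To show where this fragment fits, the following sketch assembles a complete request for the retrieve operation of the bedrock-agent-runtime client and extracts the text of each result. The helper names build_retrieve_request and retrieve_texts, the default of five results, and the placeholder knowledge base ID are assumptions for illustration.

```python
def build_retrieve_request(knowledge_base_id: str, query: str,
                           metadata_filter: dict, number_of_results: int = 5) -> dict:
    """Assemble the keyword arguments for the retrieve call."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": number_of_results,
                "filter": metadata_filter,
            }
        },
    }


def retrieve_texts(client, request: dict) -> list:
    """Run the retrieval and return the text of each retrieved chunk."""
    response = client.retrieve(**request)
    return [result["content"]["text"] for result in response["retrievalResults"]]


# Usage (requires AWS credentials; "<kb-id>" is a placeholder):
#   request = build_retrieve_request(
#       "<kb-id>",
#       "A strategy game with cool graphic released after 2023",
#       {"equals": {"key": "genres", "value": "Strategy"}},
#   )
#   texts = retrieve_texts(bedrock_agent_runtime, request)
```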

The following table lists a few responses with different metadata filtering conditions.

Query | Metadata Filtering | Retrieved Documents | Observations
“A strategy game with cool graphic released after 2023” | Off | Viking Saga: The Sea Raider (year: 2023, genres: Strategy); Medieval Castle: Siege and Conquest (year: 2022, genres: Strategy); Fantasy Kingdoms: Chronicles of Eldoria (year: 2023, genres: Strategy); Cybernetic Revolution: Rise of the Machines (year: 2022, genres: Strategy); Steampunk Chronicles: Clockwork Empires (year: 2021, genres: City-Building) | 2/5 games meet the condition (genres = Strategy and year >= 2023)
“A strategy game with cool graphic released after 2023” | On | Viking Saga: The Sea Raider (year: 2023, genres: Strategy); Fantasy Kingdoms: Chronicles of Eldoria (year: 2023, genres: Strategy) | 2/2 games meet the condition (genres = Strategy and year >= 2023)
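To make the effect of the filter concrete, the following sketch evaluates the same condition locally against the metadata of the five games retrieved without filtering. This is only an illustration of the filter semantics, not how the vector store implements pre-filtering. The matches function is hypothetical and supports just the operators used in this post.

```python
import operator

# Only the comparison operators used in this post are covered.
COMPARATORS = {
    "equals": operator.eq,
    "greaterThanOrEquals": operator.ge,
}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if a document's metadata satisfies the filter."""
    (op, arg), = flt.items()  # each filter dict holds exactly one operator
    if op == "andAll":
        return all(matches(metadata, sub) for sub in arg)
    if op == "orAll":
        return any(matches(metadata, sub) for sub in arg)
    return COMPARATORS[op](metadata[arg["key"]], arg["value"])


games = [
    {"title": "Viking Saga: The Sea Raider", "year": 2023, "genres": "Strategy"},
    {"title": "Medieval Castle: Siege and Conquest", "year": 2022, "genres": "Strategy"},
    {"title": "Fantasy Kingdoms: Chronicles of Eldoria", "year": 2023, "genres": "Strategy"},
    {"title": "Cybernetic Revolution: Rise of the Machines", "year": 2022, "genres": "Strategy"},
    {"title": "Steampunk Chronicles: Clockwork Empires", "year": 2021, "genres": "City-Building"},
]

strategy_since_2023 = {
    "andAll": [
        {"equals": {"key": "genres", "value": "Strategy"}},
        {"greaterThanOrEquals": {"key": "year", "value": 2023}},
    ]
}

kept = [game["title"] for game in games if matches(game, strategy_since_2023)]
print(kept)
```

Only the two 2023 strategy titles pass the condition, which matches the documents retrieved when metadata filtering is turned on.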

In addition to custom metadata, you can also filter using S3 prefixes. The source URI is built-in metadata, so you don’t need to provide any metadata files. For example, if you organize the game documents into prefixes by publisher (for example, s3://$bucket_name/video_game/$publisher/$game_id.csv), you can filter for a specific publisher (for example, neo_tokyo_games) using the following syntax:

publisher_filter = {
    "startsWith": {
        "key": "x-amz-bedrock-kb-source-uri",
        "value": "s3://$bucket_name/video_game/neo_tokyo_games/"
    }
}

Clean up

To clean up your resources, complete the following steps:

  1. Delete the knowledge base:
    1. On the Amazon Bedrock console, choose Knowledge bases under Orchestration in the navigation pane.
    2. Choose the knowledge base you created.
    3. Take note of the AWS Identity and Access Management (IAM) service role name in the Knowledge base overview section.
    4. In the Vector database section, take note of the collection ARN.
    5. Choose Delete, then enter delete to confirm.
  2. Delete the vector database:
    1. On the Amazon OpenSearch Service console, choose Collections under Serverless in the navigation pane.
    2. Enter the collection ARN you saved in the search bar.
    3. Select the collection and choose Delete.
    4. Enter confirm in the confirmation prompt, then choose Delete.
  3. Delete the IAM service role:
    1. On the IAM console, choose Roles in the navigation pane.
    2. Search for the role name you noted earlier.
    3. Select the role and choose Delete.
    4. Enter the role name in the confirmation prompt and delete the role.
  4. Delete the sample dataset:
    1. On the Amazon S3 console, navigate to the S3 bucket you used.
    2. Select the prefix and files, then choose Delete.
    3. Enter permanently delete in the confirmation prompt to delete.

Conclusion

In this post, we covered the metadata filtering feature in Knowledge Bases for Amazon Bedrock. You learned how to add custom metadata to documents and use them as filters while retrieving and querying the documents using the Amazon Bedrock console and the SDK. This helps improve context accuracy, making query responses even more relevant while achieving a reduction in cost of querying the vector database.



About the Authors

Corvus Lee is a Senior GenAI Labs Solutions Architect based in London. He is passionate about designing and developing prototypes that use generative AI to solve customer problems. He also keeps up with the latest developments in generative AI and retrieval techniques by applying them to real-world scenarios.

Ahmed Ewis is a Senior Solutions Architect at AWS GenAI Labs, helping customers build generative AI prototypes to solve business problems. When not collaborating with customers, he enjoys playing with his kids and cooking.

Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focusing on customer-obsessed science. When not running experiments and keeping up with the latest developments in GenAI, he loves spending time with his kids.


Build knowledge-powered conversational applications using LlamaIndex and Llama 2-Chat


Unlocking accurate and insightful answers from vast amounts of text is an exciting capability enabled by large language models (LLMs). When building LLM applications, it is often necessary to connect and query external data sources to provide relevant context to the model. One popular approach is using Retrieval Augmented Generation (RAG) to create Q&A systems that comprehend complex information and provide natural responses to queries. RAG allows models to tap into vast knowledge bases and deliver human-like dialogue for applications like chatbots and enterprise search assistants.

In this post, we explore how to harness the power of LlamaIndex, Llama 2-70B-Chat, and LangChain to build powerful Q&A applications. With these state-of-the-art technologies, you can ingest text corpora, index critical knowledge, and generate text that answers users’ questions precisely and clearly.

Llama 2-70B-Chat

Llama 2-70B-Chat is a powerful LLM that competes with leading models. It is pre-trained on two trillion text tokens, and intended by Meta to be used for chat assistance to users. Pre-training data is sourced from publicly available data and concludes as of September 2022, and fine-tuning data concludes July 2023. For more details on the model’s training process, safety considerations, learnings, and intended uses, refer to the paper Llama 2: Open Foundation and Fine-Tuned Chat Models. Llama 2 models are available on Amazon SageMaker JumpStart for a quick and straightforward deployment.

LlamaIndex

LlamaIndex is a data framework that enables building LLM applications. It provides data connectors to ingest your existing data from various sources and formats (PDFs, docs, APIs, SQL, and more). Whether you have data stored in databases or in PDFs, LlamaIndex makes it straightforward to bring that data into use for LLMs. As we demonstrate in this post, LlamaIndex APIs make data access effortless and enable you to create powerful custom LLM applications and workflows.

If you are experimenting and building with LLMs, you are likely familiar with LangChain, which offers a robust framework, simplifying the development and deployment of LLM-powered applications. Similar to LangChain, LlamaIndex offers a number of tools, including data connectors, data indexes, engines, and data agents, as well as application integrations such as tools and observability, tracing, and evaluation. LlamaIndex focuses on bridging the gap between the data and powerful LLMs, streamlining data tasks with user-friendly features. LlamaIndex is specifically designed and optimized for building search and retrieval applications, such as RAG, because it provides a simple interface for querying LLMs and retrieving relevant documents.

Solution overview

In this post, we demonstrate how to create a RAG-based application using LlamaIndex and an LLM. The following diagram shows the step-by-step architecture of this solution outlined in the following sections.

RAG combines information retrieval with natural language generation to produce more insightful responses. When prompted, RAG first searches text corpora to retrieve the most relevant examples to the input. During response generation, the model considers these examples to augment its capabilities. By incorporating relevant retrieved passages, RAG responses tend to be more factual, coherent, and consistent with context compared to basic generative models. This retrieve-generate framework takes advantage of the strengths of both retrieval and generation, helping address issues like repetition and lack of context that can arise from pure autoregressive conversational models. RAG introduces an effective approach for building conversational agents and AI assistants with contextualized, high-quality responses.
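The retrieve-then-generate flow can be sketched in a few lines of plain Python. The following toy example ranks passages by cosine similarity to a query vector and places the top results into a prompt. The two-dimensional vectors, the passages, and the prompt wording are made up for illustration. In the rest of this post, an embedding model produces the vectors and LlamaIndex handles retrieval and prompt construction.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, corpus, k: int = 2) -> list:
    """Return the k passages whose vectors are most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vector, item["vector"]),
                    reverse=True)
    return [item["text"] for item in ranked[:k]]


def build_prompt(question: str, passages: list) -> str:
    """Augment the question with the retrieved passages as context."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy corpus with hand-written vectors (a real system would use an embedding model).
corpus = [
    {"text": "Passage about cloud migration savings.", "vector": [1.0, 0.0]},
    {"text": "Passage about a new retail store.", "vector": [0.0, 1.0]},
    {"text": "Passage about infrastructure costs.", "vector": [0.9, 0.1]},
]

top_passages = retrieve([1.0, 0.0], corpus, k=2)
prompt = build_prompt("How much was saved by migrating?", top_passages)
```

The generation step would then send prompt to the LLM, which answers with the retrieved passages as context.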

Building the solution consists of the following steps:

  1. Set up Amazon SageMaker Studio as the development environment and install the required dependencies.
  2. Deploy an embedding model from the Amazon SageMaker JumpStart hub.
  3. Download press releases to use as our external knowledge base.
  4. Build an index out of the press releases to be able to query and add as additional context to the prompt.
  5. Query the knowledge base.
  6. Build a Q&A application using LlamaIndex and LangChain agents.

All the code in this post is available in the GitHub repo.

Prerequisites

For this example, you need an AWS account with a SageMaker domain and appropriate AWS Identity and Access Management (IAM) permissions. For account setup instructions, see Create an AWS Account. If you don’t already have a SageMaker domain, refer to Amazon SageMaker domain overview to create one. In this post, we use the AmazonSageMakerFullAccess role. It is not recommended that you use this credential in a production environment. Instead, you should create and use a role with least-privilege permissions. You can also explore how you can use Amazon SageMaker Role Manager to build and manage persona-based IAM roles for common machine learning needs directly through the SageMaker console.

Additionally, you need access to a minimum of the following instance sizes:

  • ml.g5.2xlarge for endpoint usage when deploying the Hugging Face GPT-J text embeddings model
  • ml.g5.48xlarge for endpoint usage when deploying the Llama 2-Chat model endpoint

To increase your quota, refer to Requesting a quota increase.

Deploy a GPT-J embedding model using SageMaker JumpStart

This section gives you two options when deploying SageMaker JumpStart models. You can use a code-based deployment using the code provided, or use the SageMaker JumpStart user interface (UI).

Deploy with the SageMaker Python SDK

You can use the SageMaker Python SDK to deploy the LLMs, as shown in the code available in the repository. Complete the following steps:

  1. Set the instance size to use for deploying the embeddings model using instance_type = "ml.g5.2xlarge".
  2. Locate the ID of the model to use for embeddings. In SageMaker JumpStart, it is identified as model_id = "huggingface-textembedding-gpt-j-6b-fp16".
  3. Retrieve the pre-trained model container and deploy it for inference.
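The repository contains the full deployment code. As a rough sketch of these steps, the following assumes the JumpStartModel class from the SageMaker Python SDK, which resolves the model container for you. The function name deploy_embedding_endpoint is hypothetical, and calling it creates a billable endpoint.

```python
EMBEDDING_MODEL_ID = "huggingface-textembedding-gpt-j-6b-fp16"
EMBEDDING_INSTANCE_TYPE = "ml.g5.2xlarge"


def deploy_embedding_endpoint() -> str:
    """Deploy the embeddings model and return the endpoint name (incurs cost)."""
    # Imported here so that this module loads even without the SageMaker SDK.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(
        model_id=EMBEDDING_MODEL_ID,
        instance_type=EMBEDDING_INSTANCE_TYPE,
    )
    predictor = model.deploy()  # blocks until the endpoint is in service
    return predictor.endpoint_name
```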

SageMaker will return the name of the model endpoint and the following message when the embeddings model has been deployed successfully:

Deploy with SageMaker JumpStart in SageMaker Studio

To deploy the model using SageMaker JumpStart in Studio, complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Search for and choose the GPT-J 6B Embedding FP16 model.
  3. Choose Deploy and customize the deployment configuration.
  4. For this example, we need an ml.g5.2xlarge instance, which is the default instance suggested by SageMaker JumpStart.
  5. Choose Deploy again to create the endpoint.

The endpoint will take approximately 5–10 minutes to be in service.

After you have deployed the embeddings model, in order to use the LangChain integration with SageMaker APIs, you need to create a function to handle inputs (raw text) and transform them to embeddings using the model. You do this by creating a class called ContentHandler, which takes a JSON of input data, and returns a JSON of text embeddings: class ContentHandler(EmbeddingsContentHandler).
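The core of such a handler is a pair of transformations between Python objects and the JSON bytes that the endpoint exchanges. The following sketch shows that logic as two standalone functions. In the actual class, they would become the transform_input and transform_output methods. The request key text_inputs and the response key embedding are assumptions about the payload format of the SageMaker JumpStart embeddings model; confirm them against the code in the repository.

```python
import io
import json


def encode_embedding_request(texts: list, model_kwargs: dict = None) -> bytes:
    """Serialize raw texts into the JSON request body sent to the endpoint."""
    payload = {"text_inputs": texts, **(model_kwargs or {})}
    return json.dumps(payload).encode("utf-8")


def decode_embedding_response(body) -> list:
    """Parse the endpoint's streaming response body into a list of vectors."""
    response = json.loads(body.read().decode("utf-8"))
    return response["embedding"]


# Local round trip with a fake response body (no endpoint involved).
request_bytes = encode_embedding_request(["Hi! It's time for the beach"])
fake_body = io.BytesIO(json.dumps({"embedding": [[0.1, 0.2, 0.3]]}).encode("utf-8"))
vectors = decode_embedding_response(fake_body)
```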

Pass the model endpoint name and your content handler instance (emb_content_handler) to SagemakerEndpointEmbeddings to convert the text and return embeddings:

embeddings = SagemakerEndpointEmbeddings(
    endpoint_name='huggingface-textembedding-gpt-j-6b-fp16',
    region_name=aws_region,
    content_handler=emb_content_handler,
)

You can locate the endpoint name in either the output of the SDK or in the deployment details in the SageMaker JumpStart UI.

You can test that the ContentHandler function and endpoint are working as expected by inputting some raw text and running the embeddings.embed_query(text) function. You can use the example provided text = "Hi! It's time for the beach" or try your own text.

Deploy and test Llama 2-Chat using SageMaker JumpStart

Now you can deploy the model that is able to have interactive conversations with your users. In this instance, we choose one of the Llama 2-Chat models, which is identified by the following model ID:

my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-70b-f")

The model needs to be deployed to a real-time endpoint using predictor = my_model.deploy(). SageMaker will return the model’s endpoint name, which you can use for the endpoint_name variable to reference later.

You define a print_dialogue function to send input to the chat model and receive its output response. The payload includes hyperparameters for the model, including the following:

  • max_new_tokens – Refers to the maximum number of tokens that the model can generate in its outputs.
  • top_p – Refers to the cumulative probability of the tokens that can be retained by the model when generating its outputs.
  • temperature – Refers to the randomness of the outputs generated by the model. A temperature greater than 0 increases the level of randomness, whereas a temperature of 0 will generate the most likely tokens.
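To build intuition for temperature and top_p, the following sketch applies both to a made-up set of three token scores. It is a simplified illustration of temperature scaling and nucleus (top-p) filtering in general, not the exact sampling code that the model endpoint runs. The function names are hypothetical.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        # A temperature of 0 is usually handled as greedy decoding instead.
        raise ValueError("temperature must be greater than 0")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs: list, top_p: float) -> dict:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break
    # Renormalize so that the retained probabilities sum to 1.
    return {index: probs[index] / mass for index in kept}


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
probs = softmax_with_temperature(logits, temperature=1.0)
narrow = top_p_filter(probs, top_p=0.1)  # only the top token survives
wide = top_p_filter(probs, top_p=0.9)    # the two most likely tokens survive
```

With a low top_p, sampling collapses to the single most likely token, whereas a higher value leaves more candidates to choose from.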

You should select your hyperparameters based on your use case and test them appropriately. Models such as the Llama family require you to include an additional parameter indicating that you have read and accepted the End User License Agreement (EULA):

response = predictor.predict(payload, custom_attributes='accept_eula=true')

To test the model, replace the content section of the input payload: "content": "what is the recipe of mayonnaise?". You can use your own text values and update the hyperparameters to understand them better.
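As a sketch of what such a payload might look like, the following helper builds the dialog structure with a list of role and content messages under inputs and the hyperparameters under parameters. The function name build_chat_payload and the default hyperparameter values are hypothetical and chosen only for illustration.

```python
def build_chat_payload(user_message: str, system_prompt: str = None,
                       max_new_tokens: int = 512, top_p: float = 0.9,
                       temperature: float = 0.6) -> dict:
    """Build the request payload for a single-turn chat with optional system prompt."""
    dialog = []
    if system_prompt:
        dialog.append({"role": "system", "content": system_prompt})
    dialog.append({"role": "user", "content": user_message})
    return {
        "inputs": [dialog],  # a batch containing one dialog
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


payload = build_chat_payload("what is the recipe of mayonnaise?")

# Sending the payload requires the deployed endpoint:
#   response = predictor.predict(payload, custom_attributes="accept_eula=true")
```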

Similar to the deployment of the embeddings model, you can deploy Llama 2-70B-Chat using the SageMaker JumpStart UI:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Search for and choose the Llama-2-70b-Chat model.
  3. Accept the EULA and choose Deploy, using the default instance again.

Similar to the embedding model, you can use LangChain integration by creating a content handler template for the inputs and outputs of your chat model. In this case, you define the inputs as those coming from a user, and indicate that they are governed by the system prompt. The system prompt informs the model of its role in assisting the user for a particular use case.

This content handler is then passed when invoking the model, in addition to the aforementioned hyperparameters and custom attributes (EULA acceptance). You pass all these attributes using the following code:

llm = SagemakerEndpoint(
        endpoint_name=endpoint_name,
        region_name="us-east-1",
        model_kwargs={"max_new_tokens":500, "top_p": 0.1, "temperature": 0.4, "return_full_text": False},
        content_handler=content_handler,
        endpoint_kwargs = {"CustomAttributes": "accept_eula=true"}
    )

When the endpoint is available, you can test that it is working as expected. You can update llm("what is amazon sagemaker?") with your own text. You also need to define the specific ContentHandler to invoke the LLM using LangChain, as shown in the code and the following code snippet:

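import json

# Older LangChain releases expose this class as langchain.llms.sagemaker_endpoint.
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler

# system_prompt must already be defined: a string describing the assistant's role.


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Wrap the prompt in the dialog format that the chat model expects.
        payload = {
            "inputs": [
                [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt},
                ],
            ],
            "parameters": model_kwargs,
        }
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # output is the streaming response body returned by the endpoint.
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generation"]["content"]


content_handler = ContentHandler()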

Use LlamaIndex to build the RAG

To continue, install LlamaIndex to create the RAG application. You can install LlamaIndex using pip: pip install llama_index

You first need to load your data (knowledge base) onto LlamaIndex for indexing. This involves a few steps:

  1. Choose a data loader:

LlamaIndex provides a number of data connectors available on LlamaHub for common data types like JSON, CSV, and text files, as well as other data sources, allowing you to ingest a variety of datasets. In this post, we use SimpleDirectoryReader to ingest a few PDF files as shown in the code. Our data sample is two Amazon press releases in PDF format in the press releases folder in our code repository. After you load the PDFs, you can see that they have been converted to a list of 11 elements.

Instead of loading the documents directly, you can also convert the Document object into Node objects before sending them to the index. The choice between sending the entire Document object to the index or converting the Document into Node objects before indexing depends on your specific use case and the structure of your data. The nodes approach is generally a good choice for long documents, where you want to break up and retrieve specific parts of a document rather than the entire document. For more information, refer to Documents / Nodes.

  2. Instantiate the loader and load the documents:

This step initializes the loader class and any needed configuration, such as whether to ignore hidden files. For more details, refer to SimpleDirectoryReader.

  3. Call the loader’s load_data method to parse your source files and data and convert them into LlamaIndex Document objects, ready for indexing and querying. You can use the following code to complete the data ingestion and preparation for full-text search using LlamaIndex’s indexing and retrieval capabilities:

docs = SimpleDirectoryReader(input_dir="pressrelease").load_data()

  4. Build the index:

The key feature of LlamaIndex is its ability to construct organized indexes over data, which is represented as documents or nodes. Indexing facilitates efficient querying over the data. We create our index with the default in-memory vector store and with our defined Settings configuration. Settings is a configuration object that provides commonly used resources and settings for indexing and querying operations in a LlamaIndex application. It acts as a singleton, so you can set global configurations while still overriding specific components locally by passing them directly into the interfaces (such as LLMs and embedding models) that use them. When a particular component is not explicitly provided, the LlamaIndex framework falls back to the value defined in the Settings object as a global default. To use our LangChain embedding and LLM models with LlamaIndex and configure Settings, we need to install the integration packages that provide the llama_index.embeddings.langchain and llama_index.llms.langchain modules. We can configure the Settings object as in the following code:

Settings.embed_model = LangchainEmbedding(embeddings)
Settings.llm = LangChainLLM(llm)

By default, VectorStoreIndex uses an in-memory SimpleVectorStore that’s initialized as part of the default storage context. In real-life use cases, you often need to connect to external vector stores such as Amazon OpenSearch Service. For more details, refer to Vector Engine for Amazon OpenSearch Serverless.

index = VectorStoreIndex.from_documents(docs)

Now you can run Q&A over your documents by using the query_engine from LlamaIndex. To do so, pass the index you created earlier for queries and ask your question. The query engine is a generic interface for querying data. It takes a natural language query as input and returns a rich response. The query engine is typically built on top of one or more indexes using retrievers.

query_engine = index.as_query_engine()
print(query_engine.query("Since migrating to AWS in May, how much in operational cost Yellow.ai has reduced?"))

You can see that the RAG solution is able to retrieve the correct answer from the provided documents:

According to the provided information, Yellow.ai has reduced its operational costs by 20% since migrating to AWS in May

Use LangChain tools and agents

The loader is designed to load data into LlamaIndex, or subsequently to be used as a tool in a LangChain agent. This gives you more power and flexibility to use this as part of your application. You start by defining your tool from the LangChain agent class. The function that you pass on to your tool queries the index you built over your documents using LlamaIndex.

tools = [
    Tool(
        name="Pressrelease",
        func=lambda q: str(index.as_query_engine().query(q)),
        description="useful press releases for answering relevant questions",
        return_direct=True,
    ),
]

Then you select the type of agent that you would like to use for your RAG implementation. In this case, you use the chat-zero-shot-react-description agent. With this agent, the LLM will use the available tool (in this scenario, the RAG over the knowledge base) to provide the response. You then initialize the agent by passing your tool, LLM, and agent type:

agent = initialize_agent(tools, llm, agent="chat-zero-shot-react-description", verbose=True)

You can see the agent go through thoughts, actions, and observations, use the tool (in this scenario, querying your indexed documents), and return a result:

'According to the provided press release, Yellow.ai has reduced its operational costs by 20%, driven performance improvements by 15%, and cut infrastructure costs by 10% since migrating to AWS. However, the specific cost savings from the migration are not mentioned in the provided information. It only states that the company has been able to reinvest the savings into innovation and AI research and development.'

You can find the end-to-end implementation code in the accompanying GitHub repo.

Clean up

To avoid unnecessary costs, you can clean up your resources, either via the following code snippets or the SageMaker console.

To use the Boto3 SDK, use the following code to delete the text embedding model endpoint and the text generation model endpoint, as well as the endpoint configurations:

import boto3

client = boto3.client('sagemaker', region_name=aws_region)
# Repeat both calls for the embedding endpoint and the text generation endpoint.
client.delete_endpoint(EndpointName=endpoint_name)
client.delete_endpoint_config(EndpointConfigName=endpoint_configuration)
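If you don’t have the endpoint configuration name at hand, you can look it up from the endpoint before deleting it. The following sketch assumes the describe_endpoint operation of the SageMaker client, which returns the EndpointConfigName of an endpoint. The helper name delete_endpoint_and_config and the endpoint name variables in the usage comment are hypothetical.

```python
def delete_endpoint_and_config(client, endpoint_name: str) -> str:
    """Delete an endpoint and its endpoint configuration; return the config name."""
    # Look up the configuration name first, because it is unavailable after deletion.
    config_name = client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Usage (requires AWS credentials; the endpoint name variables are placeholders):
#   import boto3
#   client = boto3.client("sagemaker", region_name=aws_region)
#   for name in (embedding_endpoint_name, llm_endpoint_name):
#       delete_endpoint_and_config(client, name)
```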

To use the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the embedding and text generation endpoints.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

For use cases focused on search and retrieval, LlamaIndex provides flexible capabilities. It excels at indexing and retrieval for LLMs, making it a powerful tool for deep exploration of data. LlamaIndex enables you to create organized data indexes, use diverse LLMs, augment data for better LLM performance, and query data with natural language.

This post demonstrated some key LlamaIndex concepts and capabilities. We used GPT-J for embedding and Llama 2-Chat as the LLM to build a RAG application, but you could use any suitable model instead. You can explore the comprehensive range of models available on SageMaker JumpStart.

We also showed how LlamaIndex can provide powerful, flexible tools to connect, index, retrieve, and integrate data with other frameworks like LangChain. With LlamaIndex integrations and LangChain, you can build more powerful, versatile, and insightful LLM applications.


About the Authors

Dr. Romina Sharifpour is a Senior Machine Learning and Artificial Intelligence Solutions Architect at Amazon Web Services (AWS). She has spent over 10 years leading the design and implementation of innovative end-to-end solutions enabled by advancements in ML and AI. Romina’s areas of interest are natural language processing, large language models, and MLOps.

Nicole Pinto is an AI/ML Specialist Solutions Architect based in Sydney, Australia. Her background in healthcare and financial services gives her a unique perspective in solving customer problems. She is passionate about enabling customers through machine learning and empowering the next generation of women in STEM.
