Mitigate hallucinations through Retrieval Augmented Generation using Pinecone vector database & Llama-2 from Amazon SageMaker JumpStart

Mitigate hallucinations through Retrieval Augmented Generation using Pinecone vector database & Llama-2 from Amazon SageMaker JumpStart

Despite the seemingly unstoppable adoption of LLMs across industries, they are one component of a broader technology ecosystem that is powering the new AI wave. Many conversational AI use cases require LLMs like Llama 2, Flan T5, and Bloom to respond to user queries. These models rely on parametric knowledge to answer questions. The model learns this knowledge during training and encodes it into the model parameters. In order to update this knowledge, we must retrain the LLM, which takes a lot of time and money.

Fortunately, we can also use source knowledge to inform our LLMs. Source knowledge is information fed into the LLM through an input prompt. One popular approach to providing source knowledge is Retrieval Augmented Generation (RAG). Using RAG, we retrieve relevant information from an external data source and feed that information into the LLM.

In this blog post, we’ll explore how to deploy LLMs such as Llama-2 using Amazon Sagemaker JumpStart and keep our LLMs up to date with relevant information through Retrieval Augmented Generation (RAG) using the Pinecone vector database in order to prevent AI Hallucination.

Retrieval Augmented Generation (RAG) in Amazon SageMaker

Pinecone will handle the retrieval component of RAG, but you need two more critical components: somewhere to run the LLM inference and somewhere to run the embedding model.

Amazon SageMaker Studio an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all machine learning (ML) development. It provides SageMaker JumpStart which is a model hub where users can locate, preview, and launch a particular model in their own SageMaker account. It provides pretrained, publicly available and proprietary models for a wide range of problem types, including Foundation Models.

Amazon SageMaker Studio provides the ideal environment for developing RAG-enabled LLM pipelines. First, using the AWS console, go to Amazon SageMaker & create a SageMaker Studio domain and open a Jupyter Studio notebook.

Prerequisites

Complete the following prerequisite steps:

  1. Set up Amazon SageMaker Studio.
  2. Onboard to an Amazon SageMaker Domain.
  3. Sign up for a free-tier Pinecone Vector Database.
  4. Prerequisite libraries: SageMaker Python SDK, Pinecone Client

Solution Walkthrough

Using SageMaker Studio notebook, we first need install prerequisite libraries:

!pip install -qU sagemaker pinecone-client==2.2.1 ipywidgets==7.0.0 

Deploying an LLM

In this post, we discuss two approaches to deploying an LLM. The first is through the HuggingFaceModel object. You can use this when deploying LLMs (and embedding models) directly from the Hugging Face model hub.

For example, you can create a deployable config for the google/flan-t5-xl model as shown in the following screen capture:

import sagemaker
from sagemaker.huggingface import (
HuggingFaceModel, 
get_huggingface_llm_image_uri
)
role = sagemaker.get_execution_role()
hub_config = {'HF_MODEL_ID':'google/flan-t5-xl', # model_id from hf.co/models
'HF_TASK':'text-generation' # NLP task you want to use for predictions

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2"&)
huggingface_model = HuggingFaceModel(env=hub_config, role=role, # iam role with permissions to create an Endpoint 
image_uri=llm_image
)

When deploying models directly from Hugging Face, initialize the my_model_configuration with the following:

  • An env config tells us which model we want to use and for what task.
  • Our SageMaker execution role gives us permissions to deploy our model.
  • An image_uri is an image config specifically for deploying LLMs from Hugging Face.

Alternatively, SageMaker has a set of models directly compatible with a simpler JumpStartModel object. Many popular LLMs like Llama 2 are supported by this model, which can be initialized as shown in the following screen capture:

import sagemaker 
from sagemaker.jumpstart.model import JumpStartModel 

role = sagemaker.get_execution_role() 

my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-7b-f")

For both versions of my_model, deploy them as shown in the following screen capture:

predictor = my_model.deploy(
    initial_instance_count=1, instance_type="ml.g5.4xlarge", endpoint_name="llama-2-generator")

Querying the pre-trained LLM

With our initialized LLM endpoint, you can begin querying. The format of our queries may vary (particularly between conversational and non-conversational LLMs), but the process is generally the same. For the Hugging Face model, do the following:

# https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/

prompt = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know

ANSWER:

"""

payload = {
    "inputs":  
      [
        [
         {"role": "system", "content": prompt},
         {"role": "user", "content": question},
        ]   
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
out[0]['generation']['content']

You can find the solution in the GitHub repository.

The generated answer we’re receiving here doesn’t make much sense — it is a hallucination.

Providing Additional Context to LLM

Llama 2 attempts to answer our question based solely on internal parametric knowledge. Clearly, the model parameters do not store knowledge of which instances we can with managed spot training in SageMaker.

To answer this question correctly, we must use source knowledge. That is, we give additional information to the LLM via the prompt. Let’s add that information directly as additional context for the model.

context = """Managed Spot Training can be used with all instances
supported in Amazon SageMaker. Managed Spot Training is supported
in all AWS Regions where Amazon SageMaker is currently available."""

prompt_template = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

ANSWER:
"""

text_input = prompt_template.replace("{context}", context).replace("{question}", question)

payload = {
    "inputs":  
      [
        [
         {"role": "system", "content": text_input},
         {"role": "user", "content": question},
        ]   
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}n[Output]: {generated_text}")

[Input]: Which instances can I use with Managed Spot Training in SageMaker?

[Output]:  Based on the given context, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answer is:

All instances supported in Amazon SageMaker.

We now see the correct answer to the question; that was easy! However, a user is unlikely to insert contexts into their prompts, they would already know the answer to their question.

Rather than manually inserting a single context, automatically identify relevant information from a more extensive database of information. For that, you will need Retrieval Augmented Generation.

Retrieval Augmented Generation

With Retrieval Augmented Generation, you can encode a database of information into a vector space where the proximity between vectors represents their relevance/semantic similarity. With this vector space as a knowledge base, you can convert a new user query, encode it into the same vector space, and retrieve the most relevant records previously indexed.

After retrieving these relevant records, select a few of them and include them in the LLM prompt as additional context, providing the LLM with highly relevant source knowledge. This is a two-step process where:

  • Indexing populates the vector index with information from a dataset.
  • Retrieval happens during a query and is where we retrieve relevant information from the vector index.

Both steps require an embedding model to translate our human-readable plain text into semantic vector space. Use the highly efficient MiniLM sentence transformer from Hugging Face as shown in the following screen capture. This model is not an LLM and therefore is not initialized in the same way as our Llama 2 model.

hub_config = {
    "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",  # model_id from hf.co/models
    "HF_TASK": "feature-extraction",
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6",  # transformers version used
    pytorch_version="1.7",  # pytorch version used
    py_version="py36",  # python version of the DLC
)

In the hub_config, specify the model ID as shown in the screen capture above but for the task, use feature-extraction because we are generating vector embeddings not text like our LLM. Following this, initialize the model config with HuggingFaceModel as before, but this time without the LLM image and with some version parameters.

encoder = huggingface_model.deploy(
    initial_instance_count=1, instance_type="ml.t2.large", endpoint_name="minilm-embedding"
)

You can deploy the model again with deploy, using the smaller (CPU only) instance of ml.t2.large. The MiniLM model is tiny, so it does not require a lot of memory and doesn’t need a GPU because it can quickly create embeddings even on a CPU. If preferred, you can run the model faster on GPU.

To create embeddings, use the predict method and pass a list of contexts to encode via the inputs key as shown:

out = encoder.predict({"inputs": ["some text here", "some more text goes here too"]})

Two input contexts are passed, returning two context vector embeddings as shown:

len(out)

2

The embedding dimensionality of the MiniLM model is 384 which means each vector embedding MiniLM outputs should have a dimensionality of 384. However, looking at the length of our embeddings, you will see the following:

len(out[0]), len(out[1])

(8, 8)

Two lists contain eight items each. MiniLM first processes text in a tokenization step. This tokenization transforms our human-readable plain text into a list of model-readable token IDs. In the output features of the model, you can see the token-level embeddings. one of these embeddings shows the expected dimensionality of 384 as shown:

len(out[0][0])

384

Transform these token-level embeddings into document-level embeddings by using the mean values across each vector dimension, as shown in the following illustration.

Mean pooling operation to get a single 384-dimensional vector.

import numpy as np embeddings = np.mean(np.array(out), axis=1)embeddings.shape(2, 384)

With two 384-dimensional vector embeddings, one for each input text. To make our lives easier, wrap the encoding process into a single function as shown in the following screen capture:

from typing import List

def embed_docs(docs: List[str]) -> List[List[float]]:
    out = encoder.predict({"inputs": docs})
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()

Downloading the Dataset

Download the Amazon SageMaker FAQs as the knowledge base to get the data which contains both question and answer columns.

Download the Amazon SageMaker FAQs

When performing the search, look for Answers only, so you can drop the Question column. See notebook for details.

Our dataset and the embedding pipeline are ready. Now all we need is somewhere to store those embeddings.

Indexing

The Pinecone vector database stores vector embeddings and searches them efficiently at scale. To create a database, you will need a free API key from Pinecone.

import pinecone
import os

# add Pinecone API key from app.pinecone.io
api_key = os.environ.get("PINECONE_API_KEY") or "YOUR_API_KEY"
# set Pinecone environment - find next to API key in console
env = os.environ.get("PINECONE_ENVIRONMENT") or "YOUR_ENV"

pinecone.init(api_key=api_key, environment=env)

After you have connected to the Pinecone vector database, create a single vector index (similar to a table in traditional DBs). Name the index retrieval-augmentation-aws and align the index dimension and metric parameters with those required by the embedding model (MiniLM in this case).

import time

index_name = "retrieval-augmentation-aws"

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(name=index_name, dimension=embeddings.shape[1], metric="cosine")
# wait for index to finish initialization
while not pinecone.describe_index(index_name).status["ready"]:
    time.sleep(1)

To begin inserting data, run the following:

from tqdm.auto import tqdm

batch_size = 2  # can increase but needs larger instance size otherwise instance runs out of memory
vector_limit = 1000

answers = df_knowledge[:vector_limit]
index = pinecone.Index(index_name)

for i in tqdm(range(0, len(answers), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(answers))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{"text": text} for text in answers["Answer"][i:i_end]]
    # create embeddings
    texts = answers["Answer"][i:i_end].tolist()
    embeddings = embed_docs(texts)
    # create records list for upsert
    records = zip(ids, embeddings, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

You can begin querying the index with the question from earlier in this post.

# extract embeddings for the questions
query_vec = embed_docs(question)[0]

# query pinecone
res = index.query(query_vec, top_k=1, include_metadata=True)

# show the results
res
{'matches': [{'id': '90',
'metadata': {'text': 'Managed Spot Training can be used with all '
'instances supported in Amazon '
'SageMaker.rn'},
'score': 0.881181657,
'values': []}],
'namespace': ''}

Above output shows that we’re returning relevant contexts to help us answer our question. Since we top_k = 1, index.query returned the top result along side the metadata which reads Managed Spot Training can be used with all instances supported in Amazon.

Augmenting the Prompt

Use the retrieved contexts to augment the prompt and decide on a maximum amount of context to feed into the LLM. Use the 1000 characters limit to iteratively add each returned context to the prompt until you exceed the content length.

Augmenting the Prompt

Augmenting the Prompt

Feed the context_str into the LLM prompt as shown in the following screen capture:

payload = create_payload(question, context_str)
out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}n[Output]: {generated_text}")
[Input]: Which instances can I use with Managed Spot Training in SageMaker?

[Output]:  Based on the context provided, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answer is:


All instances supported in Amazon SageMaker.

The logic works, so wrap it up into a single function to keep things clean.

def rag_query(question: str) -> str:
    # create query vec
    query_vec = embed_docs(question)[0]
    # query pinecone
    res = index.query(query_vec, top_k=5, include_metadata=True)
    # get contexts
    contexts = [match.metadata["text"] for match in res.matches]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # create our retrieval augmented prompt
    payload = create_payload(question, context_str)
    # make prediction
    out = predictor.predict(payload, custom_attributes='accept_eula=true')
    return out[0]["generation"]["content"]

You can now ask questions like those shown in the following:

rag_query("Does SageMaker support spot instances?")

' Yes, Amazon SageMaker supports spot instances for managed spot training. According to the provided context, Managed Spot Training can be used with all instances supported in Amazon SageMaker, and Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.nnTherefore, the answer to your question is:nnYes, SageMaker supports spot instances in all regions where Amazon SageMaker is available.'

Clean up

To stop incurring any unwanted charges, delete the model and endpoint.

encoder.delete_model()

encoder.delete_endpoint()

Conclusion

In this post, we introduced you to RAG with open-access LLMs on SageMaker. We also showed how to deploy Amazon SageMaker Jumpstart models with Llama 2, Hugging Face LLMs with Flan T5, and embedding models with MiniLM.

We implemented a complete end-to-end RAG pipeline using our open-access models and a Pinecone vector index. Using this, we showed how to minimize hallucinations, and keep LLM knowledge up to date, and ultimately enhance the user experience and trust in our systems.

To run this example on your own, clone this GitHub repository and walkthrough the previous steps using the Question Answering notebook on GitHub.


About the authors

Vedant Jain profile pictureVedant Jain is a Sr. AI/ML Specialist, working on strategic Generative AI initiatives. Prior to joining AWS, Vedant has held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, rock climbing, using science to lead a meaningful life & exploring cuisines from around the world.

James Briggs is a Staff Developer Advocate at Pinecone, specializing in vector search and AI/ML. He guides developers and businesses in developing their own GenAI solutions through online education. Prior to Pinecone James worked on AI for small tech startups to established finance corporations. Outside of work, James has a passion for traveling and embracing new adventures, ranging from surfing and scuba to Muay Thai and BJJ.

Xin HuangXin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Read More

Techniques for automatic summarization of documents using language models

Techniques for automatic summarization of documents using language models

Summarization is the technique of condensing sizable information into a compact and meaningful form, and stands as a cornerstone of efficient communication in our information-rich age. In a world full of data, summarizing long texts into brief summaries saves time and helps make informed decisions. Summarization condenses content, saving time and improving clarity by presenting information concisely and coherently. Summarization is invaluable for decision-making and in managing large volumes of content.

Summarization methods have a broad range of applications serving various purposes, such as:

  • News aggregation News aggregation involves summarizing news articles into a newsletter for the media industry
  • Legal document summarization Legal document summarization helps legal professionals extract key legal information from lengthy documents like terms, conditions, and contracts
  • Academic research – Summarization annotates, indexes, condenses, and simplifies important information from academic papers
  • Content curation for blogs and websites – You can create engaging and original content summaries for readers, especially in marketing
  • Financial reports and market analysis – You can extract financial insights from reports and create executive summaries for investor presentations in the finance industry

With the advancements in natural language processing (NLP), language models, and generative AI, summarizing texts of varying lengths has become more accessible. Tools like LangChain, combined with a large language model (LLM) powered by Amazon Bedrock or Amazon SageMaker JumpStart, simplify the implementation process.

This post delves into the following summarization techniques:

  • Extractive summarization using the BERT extractive summarizer
  • Abstractive summarization using specialized summarization models and LLMs
  • Two multi-level summarization techniques:
    • Extractive-abstractive summarization using the extractive-abstractive content summarization strategy (EACSS)
    • Abstractive-abstractive summarization using Map Reduce and Map ReRank

Text Summarization Techniques

The complete code sample is found in the GitHub repo. You can launch this solution in Amazon SageMaker Studio.

Click here to open the AWS console and follow along.

Types of summarizations

There are several techniques to summarize text, which are broadly categorized into two main approaches: extractive and abstractive summarization. Furthermore, multi-level summarization methodologies incorporate a series of steps, combining both extractive and abstractive techniques. These multi-level approaches are advantageous when dealing with text with tokens longer than the limit of an LLM, enabling an understanding of complex narratives.

Extractive summarization

Extractive summarization is a technique used in NLP and text analysis to create a summary by extracting key sentences. Instead of generating new sentences or content as in abstractive summarization, extractive summarization relies on identifying and pulling out the most relevant and informative portions of the original text to create a condensed version.

Extractive summarization, although advantageous in preserving the original content and ensuring high readability by directly pulling important sentences from the source text, has limitations. It lacks creativity, is unable to generate novel sentences, and may overlook nuanced details, potentially missing important information. Moreover, it may produce lengthy summaries, sometimes overwhelming readers with excessive and unwanted information. There are many extractive summarization techniques, such as TextRank and LexRank. In this post, we focus on the BERT extractive summarizer.

BERT extractive summarizer

The BERT extractive summarizer is a type of extractive summarization model that uses the BERT language model to extract the most important sentences from a text. BERT is a pre-trained language model that can be fine-tuned for a variety of tasks, including text summarization. It works by first embedding the sentences in the text using BERT. This produces a vector representation for each sentence that captures its meaning and context. The model then uses a clustering algorithm to group the sentences into clusters. The sentences that are closest to the center of each cluster are selected to form the summary.

Compared with LLMs, the advantage of the BERT extractive summarizer is it’s relatively straightforward to train and deploy the model and it’s more explainable. The disadvantage is the summarization isn’t creative and doesn’t generate sentences. It only selects sentences from the original text. This limits its ability to summarize complex or nuanced texts.

Abstractive summarization

Abstractive summarization is a technique used in NLP and text analysis to create a summary that goes beyond mere extraction of sentences or phrases from the source text. Instead of selecting and reorganizing existing content, abstractive summarization generates new sentences or phrases that capture the core meaning and main ideas of the original text in a more condensed and coherent form. This approach requires the model to understand the content of the text and express it in a way that is not necessarily present in the source material.

Specialized summarization models

These pre-trained natural language models, such as BART and PEGASUS, are specifically tailored for text summarization tasks. They employ encoder-decoder architectures and are smaller in parameters compared to their counterparts. This reduced size allows for ease of fine-tuning and deployment on smaller instances. However, it’s important to note that these summarization models also come with smaller input and output token sizes. Unlike their more general-purpose counterparts, these models are exclusively designed for summarization tasks. As a result, the input required for these models is solely the text that needs to be summarized.

Large language models

A large language model refers to any model that undergoes training on extensive and diverse datasets, typically through self-supervised learning at a large scale, and is capable of being fine-tuned to suit a wide array of specific downstream tasks. These models are larger in parameter size and perform better in tasks. Notably, they feature substantially larger input token sizes, some going up to 100,000, such as Anthropic’s Claude. To use one of these models, AWS offers the fully managed service Amazon Bedrock. If you need more control of the model development lifecycle, you can deploy LLMs through SageMaker.

Given their versatile nature, these models require specific task instructions provided through input text, a practice referred to as prompt engineering. This creative process yields varying outcomes based on the model type and input text. The effectiveness of both the model’s performance and the prompt’s quality significantly influence the final quality of the model’s outputs. The following are some tips when engineering prompts for summarization:

  • Include the text to summarize – Input the text that needs to be summarized. This serves as the source material for the summary.
  • Define the task – Clearly state that the objective is text summarization. For example, “Summarize the following text: [input text].”
  • Provide context – Offer a brief introduction or context for the given text that needs to be summarized. This helps the model understand the content and context. For example, “You are given the following article about Artificial Intelligence and its role in Healthcare: [input text].”
  • Prompt for the summary – Prompt the model to generate a summary of the provided text. Be clear about the desired length or format of the summary. For example, “Please generate a concise summary of the given article on Artificial Intelligence and its role in Healthcare: [input text].”
  • Set constraints or length guidelines – Optionally, guide the length of the summary by specifying a desired word count, sentence count, or character limit. For example, “Please generate a summary that is no longer than 50 words: [input text].”

Effective prompt engineering is critical for ensuring that the generated summaries are accurate, relevant, and aligned with the intended summarization task. Refine the prompt for optimal summarization result with experiments and iterations. After you have established the effectiveness of the prompts, you can reuse them with the use of prompt templates.

Multi-level summarization

Extractive and abstractive summarizations are useful for shorter texts. However, when the input text exceeds the model’s maximum token limit, multi-level summarization becomes necessary. Multi-level summarization involves a combination of various summarization techniques, such as extractive and abstractive methods, to effectively condense longer texts by applying multiple layers of summarization processes. In this section, we discuss two multi-level summarization techniques: extractive-abstractive summarization and abstractive-abstractive summarization.

Extractive-abstractive summarization

Extractive-abstractive summarization works by first generating an extractive summary of the text. Then it uses an abstractive summarization system to refine the extractive summary, making it more concise and informative. This enhances accuracy by providing more informative summaries compared to extractive methods alone.

Extractive-abstractive content summarization strategy

The EACSS technique combines the strengths of two powerful techniques: the BERT extractive summarizer for the extractive phase and LLMs for the abstractive phase, as illustrated in the following diagram.

Extractive Abstractive Text Summarization

EACSS offers several advantages, including the preservation of crucial information, enhanced readability, and adaptability. However, implementing EACSS is computationally expensive and complex. There’s a risk of potential information loss, and the quality of the summarization heavily depends on the performance of the underlying models, making careful model selection and tuning essential for achieving optimal results. Implementation includes the following steps:

  1. The first step is to break down the large document, such as a book, into smaller sections, or chunks. These chunks are defined as sentences, paragraphs, or even chapters, depending on the granularity desired for the summary.
  2. For the extractive phase, we employ the BERT extractive summarizer. This component works by embedding the sentences within each chunk and then employing a clustering algorithm to identify sentences that are closest to the cluster’s centroids. This extractive step helps in preserving the most important and relevant content from each chunk.
  3. Having generated extractive summaries for each chunk, we move on to the abstractive summarization phase. Here, we utilize LLMs known for their ability to generate coherent and contextually relevant summaries. These models take the extracted summaries as input and produce abstractive summaries that capture the essence of the original document while ensuring readability and coherence.

By combining extractive and abstractive summarization techniques, this approach offers an efficient and comprehensive way to summarize lengthy documents such as books. It ensures that important information is extracted while allowing for the generation of concise and human-readable summaries, making it a valuable tool for various applications in the domain of document summarization.

Abstractive-abstractive summarization

Abstractive-abstractive summarization is an approach where abstractive methods are used for both extracting and generating summaries. It offers notable advantages, including enhanced readability, coherence, and the flexibility to adjust summary length and detail. It excels in language generation, allowing for paraphrasing and avoiding redundancy. However, there are drawbacks. For example, it’s computationally expensive and resource intensive, and its quality heavily depends on the effectiveness of the underlying models, which, if not well-trained or versatile, may impact the quality of the generated summaries. Selection of models is crucial to mitigate these challenges and ensure high-quality abstractive summaries. For abstractive-abstractive summarization, we discuss two strategies: Map Reduce and Map ReRank.

Map Reduce using LangChain

This two-step process comprises a Map step and a Reduce step, as illustrated in the following diagram. This technique enables you to summarize an input that is longer than the model’s input token limit.

Abstractive text summarization mapreduce

The process consists of three main steps:

  1. The corpora is split into smaller chunks that fit into the LLM’s token limit.
  2. We use a Map step to individually apply an LLM chain that extracts all the important information from each passage, and its output is used as a new passage. Depending on the size and structure of the corpora, this could be in the form of overarching themes or short summaries.
  3. The Reduce step combines the output passages from the Map step or a Reduce Step such that it fits the token limit and feeds it into the LLM. This process is repeated until the final output is a singular passage.

The advantage of using this technique is that it’s highly scalable and parallelizable. All the processing in each step is independent from each other, which takes advantage of distributed systems or serverless services and lower compute time.

Map ReRank using LangChain

This chain runs an initial prompt on each document that not only tries to complete a task but also gives a score for how certain it is in its answer. The highest scoring response is returned.

This technique is very similar to Map Reduce but with the advantage of requiring fewer overall calls, streamlining the summarization process. However, its limitation lies in its inability to merge information across multiple documents. This restriction makes it most effective in scenarios where a single, straightforward answer is expected from a singular document, making it less suitable for more complex or multifaceted information retrieval tasks that involve multiple sources. Careful consideration of the context and the nature of the data is essential to determine the appropriateness of this method for specific summarization needs.

Cohere ReRank uses a semantic-based reranking system that contextualizes the meaning of a user’s query beyond keyword relevance. It’s used with vector store systems as well as keyword-based search engines, giving it flexibility.

Comparing summarization techniques

Each summarization technique has its own unique advantages and disadvantages:

  • Extractive summarization preserves the original content and ensures high readability but lacks creativity and may produce lengthy summaries.
  • Abstractive summarization, while offering creativity and generating concise, fluent summaries, comes with the risk of unintentional content modification, challenges in language accuracy, and resource-intensive development.
  • Extractive-abstractive multi-level summarization effectively summarizes large documents and provides better flexibility in fine-tuning the extractive part of the models. However, it’s expensive, time consuming, and lacks parallelization, making parameter tuning challenging.
  • Abstractive-abstractive multi-level summarization also effectively summarizes large documents and excels in enhanced readability and coherence. However, it’s computationally expensive and resource intensive, relying heavily on the effectiveness of underlying models.

Careful model selection is crucial to mitigate challenges and ensure high-quality abstractive summaries in this approach. The following table summarizes the capabilities for each type of summarization.

Aspect Extractive Summarization Abstractive Summarization Multi-level Summarization
Generate creative and engaging summaries No Yes Yes
Preserve original content Yes No No
Balance information preservation and creativity No Yes Yes
Suitable for short, objective text (input text length smaller than maximum tokens of the model) Yes Yes No
Effective for longer, complex documents such as books (input text length greater than maximum tokens of the model) No No Yes
Combines extraction and content generation No No Yes

Multi-level summarization techniques are suitable for long and complex documents where the input text length exceeds the token limit of the model. The following table compares these techniques.

Technique Advantages Disadvantages
EACSS (extractive-abstractive) Preserves crucial information, provides the ability to fine-tune the extractive part of the models. Computationally expensive, potential information loss, and lacks parallelization.
Map Reduce (abstractive-abstractive) Scalable and parallelizable, with less compute time. The best technique to generate creative and concise summaries. Memory-intensive process.
Map ReRank (abstractive-abstractive) Streamlined summarization with semantic-based ranking. Limited information merging.

Tips when summarizing text

Consider the following best practices when summarizing text:

  • Be aware of the total token size – Be prepared to split the text if it exceeds the model’s token limits or employ multiple levels of summarization when using LLMs.
  • Be aware of the types and number of data sources – Combining information from multiple sources may require transformations, clear organization, and integration strategies. LangChain Stuff has integration on a wide variety of data sources and document types. It simplifies the process of combining text from different documents and data sources with the use of this technique.
  • Be aware of model specialization – Some models may excel at certain types of content but struggle with others. There may be fine-tuned models that are better suited for your domain of text.
  • Use multi-level summarization for large bodies of text – For texts that exceed the token limits, consider a multi-level summarization approach. Start with a high-level summary to capture the main ideas and then progressively summarize subsections or chapters for more detailed insights.
  • Summarize text by topics – This approach helps maintain a logical flow and reduce information loss, and it prioritizes the retention of crucial information. If you’re using LLMs, craft clear and specific prompts that guide the model to summarize a particular topic instead of the whole body of text.

Conclusion

Summarization stands as a vital tool in our information-rich era, enabling the efficient distillation of extensive information into concise and meaningful forms. It plays a pivotal role in various domains, offering numerous advantages. Summarization saves time by swiftly conveying essential content from lengthy documents, aids decision-making by extracting critical information, and enhances comprehension in education and content curation.

This post provided a comprehensive overview of various summarization techniques, including extractive, abstractive, and multi-level approaches. With tools like LangChain and language models, you can harness the power of summarization to streamline communication, improve decision-making, and unlock the full potential of vast information repositories. The comparison table in this post can help you identify the most suitable summarization techniques for your projects. Additionally, the tips shared in the post serve as valuable guidelines to avoid repetitive errors when experimenting with LLMs for text summarization. This practical advice empowers you to apply the knowledge gained, ensuring successful and efficient summarization in the projects.

References


About the authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Suhas chowdary Jonnalagadda is a Data Scientist at AWS Global Services. He is passionate about helping enterprise customers solve their most complex problems with the power of AI/ML. He has helped customers in transforming their business solutions across diverse industries, including finance, healthcare, banking, ecommerce, media, advertising, and marketing.

Tabby Ward is a Principal Cloud Architect/Strategic Technical Advisor with extensive experience migrating customers and modernizing their application workload and services to AWS. With over 25 years of experience developing and architecting software, she is recognized for her deep-dive ability as well as skillfully earning the trust of customers and partners to design architectures and solutions across multiple tech stacks and cloud providers.

Shyam Desai is a Cloud Engineer for big data and machine learning services at AWS. He supports enterprise-level big data applications and customers using a combination of software engineering expertise with data science. He has extensive knowledge in computer vision and imaging applications for artificial intelligence, as well as biomedical and bioinformatic applications.

Read More

Boosting RAG-based intelligent document assistants using entity extraction, SQL querying, and agents with Amazon Bedrock

Boosting RAG-based intelligent document assistants using entity extraction, SQL querying, and agents with Amazon Bedrock

Conversational AI has come a long way in recent years thanks to the rapid developments in generative AI, especially the performance improvements of large language models (LLMs) introduced by training techniques such as instruction fine-tuning and reinforcement learning from human feedback. When prompted correctly, these models can carry coherent conversations without any task-specific training data. However, they can’t generalize well to enterprise-specific questions because, to generate an answer, they rely on the public data they were exposed to during pre-training. Such data often lacks the specialized knowledge contained in internal documents available in modern businesses, which is typically needed to get accurate answers in domains such as pharmaceutical research, financial investigation, and customer support.

To create AI assistants that are capable of having discussions grounded in specialized enterprise knowledge, we need to connect these powerful but generic LLMs to internal knowledge bases of documents. This method of enriching the LLM generation context with information retrieved from your internal data sources is called Retrieval Augmented Generation (RAG), and produces assistants that are domain specific and more trustworthy, as shown by Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Another driver behind RAG’s popularity is its ease of implementation and the existence of mature vector search solutions, such as those offered by Amazon Kendra (see Amazon Kendra launches Retrieval API) and Amazon OpenSearch Service (see k-Nearest Neighbor (k-NN) search in Amazon OpenSearch Service), among others.

However, the popular RAG design pattern with semantic search can’t answer all types of questions that are possible on documents. This is especially true for questions that require analytical reasoning across multiple documents. For example, imagine that you are planning next year’s strategy of an investment company. One essential step would be to analyze and compare the financial results and potential risks of candidate companies. This task involves answering analytical reasoning questions. For instance, the query “Give me the top 5 companies with the highest revenue in the last 2 years and identify their main risks” requires multiple steps of reasoning, some of which can use semantic search retrieval, whereas others require analytical capabilities.

In this post, we show how to design an intelligent document assistant capable of answering analytical and multi-step reasoning questions in three parts. In Part 1, we review the RAG design pattern and its limitations on analytical questions. Then we introduce you to a more versatile architecture that overcomes these limitations. Part 2 helps you dive deeper into the entity extraction pipeline used to prepare structured data, which is a key ingredient for analytical question answering. Part 3 walks you through how to use Amazon Bedrock LLMs to query that data and build an LLM agent that enhances RAG with analytical capabilities, thereby enabling you to build intelligent document assistants that can answer complex domain-specific questions across multiple documents.

Part 1: RAG limitations and solution overview

In this section, we review the RAG design pattern and discuss its limitations on analytical questions. We also present a more versatile architecture that overcomes these limitations.

Overview of RAG

RAG solutions are inspired by representation learning and semantic search ideas that have been gradually adopted in ranking problems (for example, recommendation and search) and natural language processing (NLP) tasks since 2010.

The popular approach used today is formed of three steps:

  1. An offline batch processing job ingests documents from an input knowledge base, splits them into chunks, creates an embedding for each chunk to represent its semantics using a pre-trained embedding model, such as Amazon Titan embedding models, then uses these embeddings as input to create a semantic search index.
  2. When answering a new question in real time, the input question is converted to an embedding, which is used to search for and extract the most similar chunks of documents using a similarity metric, such as cosine similarity, and an approximate nearest neighbors algorithm. The search precision can also be improved with metadata filtering.
  3. A prompt is constructed from the concatenation of a system message with a context that is formed of the relevant chunks of documents extracted in step 2, and the input question itself. This prompt is then presented to an LLM model to generate the final answer to the question from the context.

With the right underlying embedding model, capable of producing accurate semantic representations of the input document chunks and the input questions, and an efficient semantic search module, this solution is able to answer questions that require retrieving existent information in a database of documents. For example, if you have a service or a product, you could start by indexing its FAQ section or documentation and have an initial conversational AI tailored to your specific offering.

Limitations of RAG based on semantic search

Although RAG is an essential component in modern domain-specific AI assistants and a sensible starting point for building a conversational AI around a specialized knowledge base, it can’t answer questions that require scanning, comparing, and reasoning across all documents in your knowledge base simultaneously, especially when the augmentation is based solely on semantic search.

To understand these limitations, let’s consider again the example of deciding where to invest based on financial reports. If we were to use RAG to converse with these reports, we could ask questions such as “What are the risks that faced company X in 2022,” or “What is the net revenue of company Y in 2022?” For each of these questions, the corresponding embedding vector, which encodes the semantic meaning of the question, is used to retrieve the top-K semantically similar chunks of documents available in the search index. This is typically accomplished by employing an approximate nearest neighbors solution such as FAISS, NMSLIB, pgvector, or others, which strive to strike a balance between retrieval speed and recall to achieve real-time performance while maintaining satisfactory accuracy.

However, the preceding approach can’t accurately answer analytical questions across all documents, such as “What are the top 5 companies with the highest net revenues in 2022?”

This is because semantic search retrieval attempts to find the K most similar chunks of documents to the input question. But because none of the documents contain comprehensive summaries of revenues, it will return chunks of documents that merely contain mentions of “net revenue” and possibly “2022,” without fulfilling the essential condition of focusing on companies with the highest revenue. If we present these retrieval results to an LLM as context to answer the input question, it may formulate a misleading answer or refuse to answer, because the required correct information is missing.

These limitations come by design because semantic search doesn’t conduct a thorough scan of all embedding vectors to find relevant documents. Instead, it uses approximate nearest neighbor methods to maintain reasonable retrieval speed. A key strategy for efficiency in these methods is segmenting the embedding space into groups during indexing. This allows for quickly identifying which groups may contain relevant embeddings during retrieval, without the need for pairwise comparisons. Additionally, even traditional nearest neighbors techniques like KNN, which scan all documents, only compute basic distance metrics and aren’t suitable for the complex comparisons needed for analytical reasoning. Therefore, RAG with semantic search is not tailored for answering questions that involve analytical reasoning across all documents.

To overcome these limitations, we propose a solution that combines RAG with metadata and entity extraction, SQL querying, and LLM agents, as described in the following sections.

Overcoming RAG limitations with metadata, SQL, and LLM agents

Let’s examine more deeply a question on which RAG fails, so that we can trace back the reasoning required to answer it effectively. This analysis should point us towards the right approach that could complement RAG in the overall solution.

Consider the question: “What are the top 5 companies with the highest revenue in 2022?”

To be able to answer this question, we would need to:

  1. Identify the revenue for each company.
  2. Filter down to keep the revenues of 2022 for each of them.
  3. Sort the revenues in descending order.
  4. Slice out the top 5 revenues alongside the company names.

Typically, these analytical operations are done on structured data, using tools such as pandas or SQL engines. If we had access to a SQL table containing the columns company, revenue, and year, we could easily answer our question by running a SQL query, similar to the following example:

SELECT company, revenue FROM table_name WHERE year = 2022 ORDER BY revenue DESC LIMIT 5;

Storing structured metadata in a SQL table that contains information about relevant entities enables you to answer many types of analytical questions by writing the correct SQL query. This is why we complement RAG in our solution with a real-time SQL querying module against a SQL table, populated by metadata extracted in an offline process.

But how can we implement and integrate this approach to an LLM-based conversational AI?

There are three steps to be able to add SQL analytical reasoning:

  • Metadata extraction – Extract metadata from unstructured documents into a SQL table
  • Text to SQL – Formulate SQL queries from input questions accurately using an LLM
  • Tool selection – Identify if a question must be answered using RAG or a SQL query

To implement these steps, first we recognize that information extraction from unstructured documents is a traditional NLP task for which LLMs show promise in achieving high accuracy through zero-shot or few-shot learning. Second, the ability of these models to generate SQL queries from natural language has been proven for years, as seen in the 2020 release of Amazon QuickSight Q. Finally, automatically selecting the right tool for a specific question enhances the user experience and enables answering complex questions through multi-step reasoning. To implement this feature, we delve into LLM agents in a later section.

To summarize, the solution we propose is composed of the following core components:

  • Semantic search retrieval to augment generation context
  • Structured metadata extraction and querying with SQL
  • An agent capable of using the right tools to answer a question

Solution overview

The following diagram depicts a simplified architecture of the solution. It helps you identify and understand the role of the core components and how they interact to implement the full LLM-assistant behavior. The numbering aligns with the order of operations when implementing this solution.

In practice, we implemented this solution as outlined in the following detailed architecture.

For this architecture, we propose an implementation on GitHub, with loosely coupled components where the backend (5), data pipelines (1, 2, 3) and front end (4) can evolve separately. This is to simplify the collaboration across competencies when customizing and improving the solution for production.

Deploy the solution

To install this solution in your AWS account, complete the following steps:

  1. Clone the repository on GitHub.
  2. Install the backend AWS Cloud Development Kit (AWS CDK) app:
    1. Open the backend folder.
    2. Run npm install to install the dependencies.
    3. If you have never used the AWS CDK in the current account and Region, run bootstrapping with npx cdk bootstrap.
    4. Run npx cdk deploy to deploy the stack.
  3. Optionally, run the streamlit-ui as follows:
    1. We recommend cloning this repository into an Amazon SageMaker Studio environment. For more information, refer to Onboard to Amazon SageMaker Domain using Quick setup.
    2. Inside the frontend/streamlit-ui folder, run bash run-streamlit-ui.sh.
    3. Choose the link with the following format to open the demo: https://{domain_id}.studio.{region}.sagemaker.aws/jupyter/default/proxy/{port_number}/.
  4. Finally, you can run the Amazon SageMaker pipeline defined in the data-pipelines/04-sagemaker-pipeline-for-documents-processing.ipynb notebook to process the input PDF documents and prepare the SQL table and the semantic search index used by the LLM assistant.

In the rest of this post, we focus on explaining the most important components and design choices, to hopefully inspire you when designing your own AI assistant on an internal knowledge base. We assume that components 1 and 4 are straightforward to understand, and focus on the core components 2, 3, and 5.

Part 2: Entity extraction pipeline

In this section, we dive deeper into the entity extraction pipeline used to prepare structured data, which is a key ingredient for analytical question answering.

Text extraction

Documents are typically stored in PDF format or as scanned images. They may be formed of simple paragraph layouts or complex tables, and contain digital or handwritten text. To extract information correctly, we need to transform these raw documents into plain text, while preserving their original structure. To do this, you can use Amazon Textract, which is a machine learning (ML) service that provides mature APIs for text, tables, and forms extraction from digital and handwritten inputs.

In component 2, we extract text and tables as follows:

  1. For each document, we call Amazon Textract to extract the text and tables.
  2. We use the following Python script to recreate tables as pandas DataFrames.
  3. We consolidate the results into a single document and insert tables as markdown.

This process is outlined by the following flow diagram and concretely demonstrated in notebooks/03-pdf-document-processing.ipynb.

Entity extraction and querying using LLMs

To answer analytical questions effectively, you need to extract relevant metadata and entities from your document’s knowledge base to an accessible structured data format. We suggest using SQL to store this information and retrieve answers due to its popularity, ease of use, and scalability. This choice also benefits from the proven language models’ ability to generate SQL queries from natural language.

In this section, we dive deeper into the following components that enable analytical questions:

  • A batch process that extracts structured data out of unstructured data using LLMs
  • A real-time module that converts natural language questions to SQL queries and retrieves results from a SQL database

You can extract the relevant metadata to support analytical questions as follows:

  1. Define a JSON schema for information you need to extract, which contains a description of each field and its data type, and includes examples of the expected values.
  2. For each document, prompt an LLM with the JSON schema and ask it to extract the relevant data accurately.
  3. When the document length is beyond the context length, and to reduce the extraction cost with LLMs, you can use semantic search to retrieve and present the relevant chunks of documents to the LLM during extraction.
  4. Parse the JSON output and validate the LLM extraction.
  5. Optionally, back up the results on Amazon S3 as CSV files.
  6. Load into the SQL database for later querying.

This process is managed by the following architecture, where the documents in text format are loaded with a Python script that runs in an Amazon SageMaker Processing job to perform the extraction.

For each group of entities, we dynamically construct a prompt that includes a clear description of the information extraction task, and includes a JSON schema that defines the expected output and includes the relevant document chunks as context. We also add a few examples of input and correct output to improve the extraction performance with few-shot learning. This is demonstrated in notebooks/05-entities-extraction-to-structured-metadata.ipynb.

Part 3: Build an agentic document assistant with Amazon Bedrock

In this section, we demonstrate how to use Amazon Bedrock LLMs to query data and build an LLM agent that enhances RAG with analytical capabilities, thereby enabling you to build intelligent document assistants that can answer complex domain-specific questions across multiple documents. You can refer to the Lambda function on GitHub for the concrete implementation of the agent and tools described in this part.

Formulate SQL queries and answer analytical questions

Now that we have a structured metadata store with the relevant entities extracted and loaded into a SQL database that we can query, the question that remains is how to generate the right SQL query from the input natural language questions?

Modern LLMs are good at generating SQL. For instance, if you request from the Anthropic Claude LLM through Amazon Bedrock to generate a SQL query, you will see plausible answers. However, we need to abide by a few rules when writing the prompt to reach more accurate SQL queries. These rules are especially important for complex queries to reduce hallucination and syntax errors:

  • Describe the task accurately within the prompt
  • Include the schema of the SQL tables within the prompt, while describing each column of the table and specifying its data type
  • Explicitly tell the LLM to only use existing column names and data types
  • Add a few rows of the SQL tables

You could also postprocess the generated SQL query using a linter such as sqlfluff to correct formatting, or a parser such as sqlglot to detect syntax errors and optimize the query. Moreover, when the performance doesn’t meet the requirement, you could provide a few examples within the prompt to steer the model with few-shot learning towards generating more accurate SQL queries.

From an implementation perspective, we use an AWS Lambda function to orchestrate the following process:

  1. Call an Anthropic Claude model in Amazon Bedrock with the input question to get the corresponding SQL query. Here, we use the SQLDatabase class from LangChain to add schema descriptions of relevant SQL tables, and use a custom prompt.
  2. Parse, validate, and run the SQL query against the Amazon Aurora PostgreSQL-Compatible Edition database.

The architecture for this part of the solution is highlighted in the following diagram.

Security considerations to prevent SQL injection attacks

As we enable the AI assistant to query a SQL database, we have to make sure this doesn’t introduce security vulnerabilities. To achieve this, we propose the following security measures to prevent SQL injection attacks:

  • Apply least privilege IAM permissions – Limit the permission of the Lambda function that runs the SQL queries using an AWS Identity and Access Management (IAM) policy and role that follows the least privilege principle. In this case, we grant read-only access.
  • Limit data access – Only provide access to the bare minimum of tables and columns to prevent information disclosure attacks.
  • Add a moderation layer – Introduce a moderation layer that detects prompt injection attempts early on and prevents them from propagating to the rest of the system. It can take the form of rule-based filters, similarity matching against a database of known prompt injection examples, or an ML classifier.

Semantic search retrieval to augment generation context

The solution we propose uses RAG with semantic search in component 3. You can implement this module using knowledge bases for Amazon Bedrock. Additionally, there are a variety of others options to implement RAG, such as the Amazon Kendra Retrieval API, Amazon OpenSearch vector database, and Amazon Aurora PostgreSQL with pgvector, among others. The open source package aws-genai-llm-chatbot demonstrates how to use many of these vector search options to implement an LLM-powered chatbot.

In this solution, because we need both SQL querying and vector search, we decided to use Amazon Aurora PostgreSQL with the pgvector extension, which supports both features. Therefore, we implement the semantic-search RAG component with the following architecture.

The process of answering questions using the preceding architecture is done in two main stages.

First, an offline-batch process, run as a SageMaker Processing job, creates the semantic search index as follows:

  1. Either periodically, or upon receiving new documents, a SageMaker job is run.
  2. It loads the text documents from Amazon S3 and splits them into overlapping chunks.
  3. For each chunk, it uses an Amazon Titan embedding model to generate an embedding vector.
  4. It uses the PGVector class from LangChain to ingest the embeddings, with their document chunks and metadata, into Amazon Aurora PostgreSQL and create a semantic search index on all the embedding vectors.

Second, in real time and for each new question, we construct an answer as follows:

  1. The question is received by the orchestrator that runs on a Lambda function.
  2. The orchestrator embeds the question with the same embedding model.
  3. It retrieves the top-K most relevant documents chunks from the PostgreSQL semantic search index. It optionally uses metadata filtering to improve precision.
  4. These chunks are inserted dynamically in an LLM prompt alongside the input question.
  5. The prompt is presented to Anthropic Claude on Amazon Bedrock, to instruct it to answer the input question based on the available context.
  6. Finally, the generated answer is sent back to the orchestrator.

An agent capable of using tools to reason and act

So far in this post, we have discussed treating questions that require either RAG or analytical reasoning separately. However, many real-world questions demand both capabilities, sometimes over multiple steps of reasoning, in order to reach a final answer. To support these more complex questions, we need to introduce the notion of an agent.

LLM agents, such as the agents for Amazon Bedrock, have emerged recently as a promising solution capable of using LLMs to reason and adapt using the current context and to choose appropriate actions from a list of options, which presents a general problem-solving framework. As discussed in LLM Powered Autonomous Agents, there are multiple prompting strategies and design patterns for LLM agents that support complex reasoning.

One such design pattern is Reason and Act (ReAct), introduced in ReAct: Synergizing Reasoning and Acting in Language Models. In ReAct, the agent takes as input a goal that can be a question, identifies the pieces of information missing to answer it, and proposes iteratively the right tool to gather information based on the available tools’ descriptions. After receiving the answer from a given tool, the LLM reassesses whether it has all the information it needs to fully answer the question. If not, it does another step of reasoning and uses the same or another tool to gather more information, until a final response is ready or a limit is reached.

The following sequence diagram explains how a ReAct agent works toward answering the question “Give me the top 5 companies with the highest revenue in the last 2 years and identify the risks associated with the top one.”

The details of implementing this approach in Python are described in Custom LLM Agent. In our solution, the agent and tools are implemented with the following highlighted partial architecture.

To answer an input question, we use AWS services as follows:

  1. A user inputs their question through a UI, which calls an API on Amazon API Gateway.
  2. API Gateway sends the question to a Lambda function implementing the agent executor.
  3. The agent calls the LLM with a prompt that contains a description of the tools available, the ReAct instruction format, and the input question, and then parses the next action to complete.
  4. The action contains which tool to call and what the action input is.
  5. If the tool to use is SQL, the agent executor calls SQLQA to convert the question to SQL and run it. Then it adds the result to the prompt and calls the LLM again to see if it can answer the original question or if more actions are needed.
  6. Similarly, if the tool to use is semantic search, then the action input is parsed out and used to retrieve from the PostgreSQL semantic search index. It adds the results to the prompt and checks if the LLM is able to answer or needs another action.
  7. After all the information to answer a question is available, the LLM agent formulates a final answer and sends it back to the user.

You can extend the agent with further tools. In the implementation available on GitHub, we demonstrate how you can add a search engine and a calculator as extra tools to the aforementioned SQL engine and semantic search tools. To store the ongoing conversation history, we use an Amazon DynamoDB table.

From our experience so far, we have seen that the following are keys to a successful agent:

  • An underlying LLM capable of reasoning with the ReAct format
  • A clear description of the available tools, when to use them, and a description of their input arguments with, potentially, an example of the input and expected output
  • A clear outline of the ReAct format that the LLM must follow
  • The right tools for solving the business question made available to the LLM agent to use
  • Correctly parsing out the outputs from the LLM agent responses as it reasons

To optimize costs, we recommend caching the most common questions with their answers and updating this cache periodically to reduce calls to the underlying LLM. For instance, you can create a semantic search index with the most common questions as explained previously, and match the new user question against the index first before calling the LLM. To explore other caching options, refer to LLM Caching integrations.

Supporting other formats such as video, image, audio, and 3D files

You can apply the same solution to various types of information, such as images, videos, audio, and 3D design files like CAD or mesh files. This involves using established ML techniques to describe the file content in text, which can then be ingested into the solution that we explored earlier. This approach enables you to conduct QA conversations on these diverse data types. For instance, you can expand your document database by creating textual descriptions of images, videos, or audio content. You can also enhance the metadata table by identifying properties through classification or object detection on elements within these formats. After this extracted data is indexed in either the metadata store or the semantic search index for documents, the overall architecture of the proposed system remains largely consistent.

Conclusion

In this post, we showed how using LLMs with the RAG design pattern is necessary for building a domain-specific AI assistant, but is insufficient to reach the required level of reliability to generate business value. Because of this, we proposed extending the popular RAG design pattern with the concepts of agents and tools, where the flexibility of tools allows us to use both traditional NLP techniques and modern LLM capabilities to enable an AI assistant with more options to seek information and assist users in solving business problems efficiently.

The solution demonstrates the design process towards an LLM assistant able to answer various types of retrieval, analytical reasoning, and multi-step reasoning questions across all of your knowledge base. We also highlighted the importance of thinking backward from the types of questions and tasks that your LLM assistant is expected to help users with. In this case, the design journey led us to an architecture with the three components: semantic search, metadata extraction and SQL querying, and LLM agent and tools, which we think is generic and flexible enough for multiple use cases. We also believe that by getting inspiration from this solution and diving deep into your users’ needs, you will be able to extend this solution further toward what works best for you.


About the authors

Mohamed Ali Jamaoui is a Senior ML Prototyping Architect with 10 years of experience in production machine learning. He enjoys solving business problems with machine learning and software engineering, and helping customers extract business value with ML. As part of AWS EMEA Prototyping and Cloud Engineering, he helps customers build business solutions that leverage innovations in MLOPs, NLP, CV and LLMs.

Giuseppe Hannen is a ProServe Associate Consultant. Giuseppe applies his analytical skills in combination with AI&ML to develop clear and effective solutions for his customers. He loves to come up with simple solutions to complicated problems, especially those that involve the latest technological developments and research.

Laurens ten Cate is a Senior Data Scientist. Laurens works with enterprise customers in EMEA helping them accelerate their business outcomes using AWS AI/ML technologies. He specializes in NLP solutions and focusses on the Supply Chain & Logistics industry. In his free time he enjoys reading and art.

Irina Radu is a Prototyping Engagement Manager, part of AWS EMEA Prototyping and Cloud Engineering. She is helping customers get the best out of the latest tech, innovate faster and think bigger.

Read More

How Q4 Inc. used Amazon Bedrock, RAG, and SQLDatabaseChain to address numerical and structured dataset challenges building their Q&A chatbot

How Q4 Inc. used Amazon Bedrock, RAG, and SQLDatabaseChain to address numerical and structured dataset challenges building their Q&A chatbot

This post is co-written with Stanislav Yeshchenko from Q4 Inc.

Enterprises turn to Retrieval Augmented Generation (RAG) as a mainstream approach to building Q&A chatbots. We continue to see emerging challenges stemming from the nature of the assortment of datasets available. These datasets are often a mix of numerical and text data, at times structured, unstructured, or semi-structured.

Q4 Inc. needed to address some of these challenges in one of their many AI use cases built on AWS. In this post, we discuss a Q&A bot use case that Q4 has implemented, the challenges that numerical and structured datasets presented, and how Q4 concluded that using SQL may be a viable solution. Finally, we take a closer look at how the Q4 team used Amazon Bedrock and SQLDatabaseChain to implement a RAG-based solution with SQL generation.

Use case overview

Q4 Inc., headquartered in Toronto, with offices in New York and London, is a leading capital markets access platform that is transforming how issuers, investors, and sellers efficiently connect, communicate, and engage with each other. The Q4 Platform facilitates interactions across the capital markets through IR website products, virtual events solutions, engagement analytics, investor relations Customer Relationship Management (CRM), shareholder and market analysis, surveillance, and ESG tools.

In today’s fast-paced and data-driven financial landscape, Investor Relations Officers (IROs) play a critical role in fostering communication between a company and its shareholders, analysts, and investors. As part of their daily duties, IROs analyze diverse datasets, including CRM, ownership records, and stock market data. The aggregate of this data is used to generate financial reports, set investor relations goals, and manage communication with existing and potential investors.

To meet the growing demand for efficient and dynamic data retrieval, Q4 aimed to create a chatbot Q&A tool that would provide an intuitive and straightforward method for IROs to access the necessary information they need in a user-friendly format.

The end goal was to create a chatbot that would seamlessly integrate publicly available data, along with proprietary customer-specific Q4 data, while maintaining the highest level of security and data privacy. As for performance, the goal was to maintain a query response time of seconds to ensure a positive experience for end-users.

Financial markets is a regulated industry with high stakes involved. Providing incorrect or outdated information can impact investors’ and shareholders’ trust, in addition to other possible data privacy risks. Understanding the industry and the requirements, Q4 sets data privacy and response accuracy as its guiding principles in evaluating any solution before it can be taken to market.

For the proof of concept, Q4 decided to use a financial ownership dataset. The dataset consists of time series data points representing the number of assets owned; the transaction history between investment institutions, individuals, and public companies; and many more elements.

Because Q4 wanted to ensure it could satisfy all the functional and non-functional requirements we’ve discussed, the project also needed to stay commercially feasible. This was respected throughout the process of deciding on the approach, architecture, choice of technology, and solution-specific elements.

Experimentation and challenges

It was clear from the beginning that to understand a human language question and generate accurate answers, Q4 would need to use large language models (LLMs).

The following are some of the experiments that were conducted by the team, along with the challenges identified and lessons learned:

  • Pre-training – Q4 understood the complexity and challenges that come with pre-training an LLM using its own dataset. It quickly became obvious that this approach is resource intensive with many non-trivial steps, such as data preprocessing, training, and evaluation. In addition to the effort involved, it would be cost prohibitive. Considering the nature of the time series dataset, Q4 also realized that it would have to continuously perform incremental pre-training as new data came in. This would have required a dedicated cross-disciplinary team with expertise in data science, machine learning, and domain knowledge.
  • Fine-tuning – Fine-tuning a pre-trained foundation model (FM) involved using several labeled examples. This approach showed some initial success, but in many cases, model hallucination was a challenge. The model struggled to understand nuanced contextual cues and returned incorrect results.
  • RAG with semantic search – Conventional RAG with semantic search was the last step before moving to SQL generation. The team experimented with using search, semantic search, and embeddings to extract context. During the embeddings experiment, the dataset was converted into embeddings, stored in a vector database, and then matched with the embeddings of the question to extract context. The retrieved context in any of the three experiments was then used to augment the original prompt as an input to the LLM. This approach worked well for text-based content, where the data consists of natural language with words, sentences, and paragraphs. Considering the nature of Q4’s dataset, which is mostly financial data consisting of numbers, financial transactions, stock quotes, and dates, the results in all three cases were suboptimal. Even when using embeddings, the embeddings generated from numbers struggled with similarity ranking, and in many cases led to retrieving incorrect information.

Q4’s conclusion: Generating SQL is the path forward

Considering the challenges faced using conventional RAG methodology, the team started to consider SQL generation. The idea was to use the LLM to first generate a SQL statement from the user question, presented to the LLM in natural language. The generated query is then run against the database to fetch the relevant context. The context is finally used to augment the input prompt for a summarization step.

Q4’s hypothesis was that in order to get higher recall for the retrieval step, specifically for the numerical dataset, they needed to first generate SQL from the user question. This was believed to not only increase accuracy, but also keep the context within the business domain for a given question. For the query generation, and to generate accurate SQL, Q4 needed to make the LLM fully context aware of their dataset structure. This meant the prompt needed to include the database schema, a few sample data rows, and human-readable field explanations for the fields that are not easy to comprehend.

Based on the initial tests, this method showed great results. The LLM equipped with all the necessary information was able to generate the correct SQL, which was then run against the database to retrieve the correct context. After experimenting with the idea, Q4 decided that SQL generation was the way forward to address context extraction challenges for their own specific dataset.

Let’s start with describing the overall solution approach, break it down to its components, and then put the pieces together.

Solution overview

LLMs are large models with billions of parameters that are pre-trained using very large amounts of data from a variety of sources. Due to the breadth of the training datasets, LLMs are expected to have general knowledge in a variety of domains. LLMs are also known for their reasoning abilities, which vary from one model to another. This general behavior can be optimized to a specific domain or industry by further optimizing a foundation model using additional domain-specific pre-training data or by fine-tuning using labeled data. Given the right context, metadata, and instructions, a well-selected general purpose LLM can produce good-quality SQL as long as it has access to the right domain-specific context.

In Q4’s use case, we start with translating the customer question into SQL. We do this by combining the user question, database schema, some sample database rows, and detailed instructions as a prompt to the LLM to generate SQL. After we have the SQL, we can run a validation step if deemed necessary. When we’re happy with the quality of the SQL, we run the query against the database to retrieve the relevant context that we need for the following step. Now that we have the relevant context, we can send the user’s original question, the context retrieved, and a set of instructions back to the LLM to produce a final summarized response. The goal of the last step is to have the LLM summarize the results and provide a contextual and accurate answer that can be then passed along to the user.

The choice of LLM used at every stage of the process highly impacts the accuracy, cost, and performance. Choosing a platform or technology that can allow you the flexibility to switch between LLMs within the same use case (multiple LLM trips for different tasks), or across different use cases, can be beneficial in optimizing the quality of the output, latency, and cost. We address the choice of LLM later in this post.

Solution building blocks

Now that we have highlighted the approach at a high level, let’s dive into the details, starting with the solution building blocks.

Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading companies, including AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon. Amazon Bedrock also offers a broad set of tools that are needed to build generative AI applications, simplify the development process, and maintain privacy and security. In addition, with Amazon Bedrock you can choose from various FM options, and you can further fine-tune the models privately using your own data to align models’ responses with your use case requirements. Amazon Bedrock is fully serverless with no underlying infrastructure to manage extending access to available models through a single API. Lastly, Amazon Bedrock supports several security and privacy requirements, including HIPAA eligibility and GDPR compliance.

In Q4’s solution, we use Amazon Bedrock as a serverless, API-based, multi-foundation model building block. Because we intend to make multiple trips to the LLM within the same use case, based on the task type, we can choose the model that is most optimal for a specific task, be it SQL generation, validation, or summarization.

LangChain

LangChain is an open source integration and orchestration framework with a set of pre-built modules (I/O, retrieval, chains, and agents) that you can use to integrate and orchestrate tasks between FMs, data sources, and tools. The framework facilitates building generative AI applications that require orchestrating multiple steps to produce the desired output, without having to write code from scratch. LangChain supports Amazon Bedrock as a multi-foundation model API.

Specific to Q4’s use case, we use LangChain for coordinating and orchestrating tasks in our workflow, including connecting to data sources and LLMs. This approach has simplified our code because we can use the existing LangChain modules.

SQLDatabaseChain

SQLDatabaseChain is a LangChain chain that can be imported from langchain_experimental. SLDatabaseChain makes it straightforward to create, implement, and run SQL queries, using its effective text-to-SQL conversions and implementations.

In our use case, we use SQLDatabaseChain in the SQL generation, simplifying and orchestrating interactions between the database and the LLM.

The dataset

Our structured dataset can reside in a SQL database, data lake, or data warehouse as long as we have support for SQL. In our solution, we can use any dataset type with SQL support; this should be abstracted from the solution and shouldn’t change the solution in any way.

Implementation details

Now that we have explored the solution approach, solution components, the choice of technology, and tools, we can put the pieces together. The following diagram highlights the end-to-end solution.

End to end Solution Architecture

Let’s walk through the implementation details and the process flow.

Generate the SQL query

To simplify coding, we use existing frameworks. We use LangChain as an orchestration framework. We start with the input stage, where we receive the user question in natural language.

In this first stage, we take this input and generate an equivalent SQL that we can run against the database for context extraction. To generate SQL, we use SQLDatabaseChain, which relies on Amazon Bedrock for access to our desired LLM. With Amazon Bedrock, using a single API, we get access to a number of underlying LLMs and can pick the right one for each LLM trip we make. We first establish a connection to the database and retrieve the required table schema along with some sample rows from the tables we intend to use.

In our testing, we found 2–5 rows of table data to be sufficient to give enough information to the model without adding too much unnecessary overhead. Three rows were just enough to provide context, without overwhelming the model with too much input. In our use case, we started with Anthropic Claude V2. The model is known for its advanced reasoning and articulate contextual responses when provided with the right context and instructions. As part of the instructions, we can include more clarifying details to the LLM. For example, we can describe that column Comp_NAME stands for the company name. We now can construct the prompt by combining the user question as is, the database schema, three sample rows from the table we intend to use, and a set of instructions to generate the required SQL in clean SQL format without comments or additions.

All the input elements combined are considered as the model input prompt. A well-engineered input prompt that is tailored to the model’s preferred syntax highly impacts both the quality and performance of the output. The choice of model to use for a specific task is also important, not only because it impacts the output quality, but also because it has cost and performance implications.

We discuss model selection and prompt engineering and optimization later in this post, but it’s worth noting that for the query generation stage, we noticed that Claude Instant was able to produce comparable results, especially when the user question is well phrased and not as sophisticated. However, Claude V2 produced better results even with more complex and indirect user input. We learned that although in some cases Claude Instant may provide sufficient accuracy at a better latency and price point, our case for query generation was better suited for Claude V2.

Verify the SQL query

Our next step is to verify that the LLM has successfully generated the right query syntax and that the query makes contextual sense considering the database schemas and the example rows provided. For this verification step, we can revert to native query validation within SQLDatabaseChain, or we can run a second trip to the LLM including the query generated along with validation instruction.

If we use an LLM for the validation step, we can use the same LLM as before (Claude V2) or a smaller, more performant LLM for a simpler task, such as Claude Instant. Because we’re using Amazon Bedrock, this should be a very simple adjustment. Using the same API, we can change the model name in our API call, which takes care of the change. It’s important to note that in most cases, a smaller LLM can provide better efficiency in both cost and latency and should be considered—as long as you’re getting the accuracy desired. In our case, testing proved the query generated to be consistently accurate and with the right syntax. Knowing that, we were able to skip this validation step and save on latency and cost.

Run the SQL query

Now that we have the verified SQL query, we can run the SQL query against the database and retrieve the relevant context. This should be a straightforward step.

We take the generated context, provide it to the LLM of our choice with the initial user question and some instruction, and ask the model to generate a contextual and articulate summary. We then present the generated summary to the user as an answer to the initial question, all aligned with the context extracted from our dataset.

For the LLM involved in the summarization step, we can use either Titan Text Express or Claude Instant. They would both present good options for the summarization task.

Application integration

The Q&A chatbot capability is one of Q4’s AI services. To ensure modularity and scalability, Q4 builds AI services as microservices that are accessible to Q4 applications through APIs. This API-based approach enables seamless integration with the Q4 Platform ecosystem and facilitates exposing the AI services’ capabilities to the full suite of platform applications.

The main objective of the AI services is to provide straightforward capabilities for retrieving data from any public or proprietary data source using natural language as input. In addition, the AI services provide additional layers of abstraction to ensure that functional and non-functional requirements, such as data privacy and security are met. The following diagram demonstrates the integration concept.

Application Integration Image

Implementation challenges

In addition to the challenges presented by the nature of the structured, numerical dataset that we discussed earlier, Q4 was faced with a number of other implementation challenges that needed to be addressed.

LLM selection and performance

Selecting the right LLM for the task is crucial because it directly impacts the quality of output as well as the performance (round trip latency). Here are some factors that play into the LLM selection process:

  • Type of LLM – The way the FMs are architected and the initial data the model has been pre-trained on determines the types of tasks the LLM would be good at and how good it will be. For example, a text LLM would be good at text generation and summarization, whereas a text-to-image or image-to-text model would be more geared towards image analytics and generation tasks.
  • LLM size – FM sizes are measured by the number of model parameters a particular model has, typically in billions for modern LLMs. Typically, the larger the model, the more expensive to initially train or subsequently fine-tune. On the other hand, in general, for the same model architecture, the larger the model is, the smarter we expect it to be in performing the type of task it is geared towards.
  • LLM performance – Typically, the larger the model, the more time it takes to generate output, assuming you’re using the same compute and I/O parameters (prompt and output size). In addition, for the same model size, performance is highly impacted by how optimized your prompt is, the size of the I/O tokens, and the clarity and syntax of the prompt. A well-engineered prompt, along with an optimized I/O token size, can improve the model response time.

Therefore, when optimizing your task, consider the following best practices:

  • Choose a model that is suitable for the task at hand
  • Select the smallest model size that can produce the accuracy you’re looking for
  • Optimize your prompt structure and be as specific as possible with the instructions in a way that is easy for the model to understand
  • Use the smallest input prompt that can provide enough instruction and context to produce the accuracy level you’re looking for
  • Limit the output size to the smallest size that can be meaningful for you and satisfy your output requirements

Taking the model selection and performance optimization factors into account, we went to work to optimize our SQL generation use case. After some testing, we noticed that, provided we have the right context and instructions, Claude Instant, with the same prompt data, would produce comparable quality of SQL as Claude V2 at a much better performance and price point. This stands true when the user input is more direct and simpler in nature. For more sophisticated input, Claude V2 was necessary to produce the desired accuracy.

Applying the same logic on the summarization task led us to conclude that using Claude Instant or Titan Text Express would produce the accuracy required at a much better performance point than if we use a larger model such as Claude V2. Titan Text Expressed also offered better price-performance, as we discussed earlier.

The orchestration challenge

We realized that there is a lot to orchestrate before we can get a meaningful output response for the user question. As shown in the solution overview, the process involved multiple database trips and multiple LLM trips that are intertwined. If we were to build from scratch, we would have had to make a significant investment in the undifferentiated heavy lifting just to get the basic code ready. We quickly pivoted to using LangChain as an orchestration framework, taking advantage of the power of the open source community, and reusing existing modules without reinventing the wheel.

The SQL challenge

We also realized that generating SQL is not as simple as context extraction mechanisms like semantic search or using embeddings. We need to first get the database schema and a few sample rows to include in our prompt to the LLM. There is also the SQL validation stage, where we needed to interact with both the database and the LLM. SQLDatabaseChain was the obvious choice of tool. Because it’s part of LangChain, it was straightforward to adapt, and now we can manage the SQL generation and verification assisted with the chain, minimizing the amount of work we needed to do.

Performance challenges

With the use of Claude V2, and after proper prompt engineering (which we discuss in the next section), we were able to produce high-quality SQL. Considering the quality of the SQL generated, we started to look at how much value the validation stage is actually adding. After further analyzing the results, it became clear that the quality of the SQL generated was consistently accurate in a way that made the cost/benefit of adding an SQL validation stage unfavorable. We ended up eliminating the SQL validation stage without negatively impacting the quality of our output and shaved off the SQL validation round trip time.

In addition to optimizing for a more cost- and performance-efficient LLM for the summarization step, we were able to use Titan Text Express to get better performance and cost-efficiency.

Further performance optimization involved fine-tuning the query generation process using efficient prompt engineering techniques. Rather than providing an abundance of tokens, the focus was on providing the least amount of input tokens, in the right syntax that the model is trained to understand, and with the minimal yet optimal set of instructions. We discuss this more in the next section—it’s an important topic that is applicable not only here but also in other use cases.

Prompt engineering and optimization

You can adjust Claude on Amazon Bedrock for various business use cases if the right prompt engineering techniques are employed. Claude mainly acts as a conversational assistant that utilizes a human/assistant format. Claude is trained to fill in text for the assistant role. Given the instructions and prompt completions desired, we can optimize our prompts for Claude using several techniques.

We start with a proper formatted prompt template that gives a valid completion, then we can further optimize the responses experimenting with prompting with various sets of inputs that are representative of real-world data. It’s recommended to get many inputs while developing a prompt template. You can also use separate sets of prompt development data and test data.

Another way to optimize the Claude response is to experiment and iterate by adding rules, instructions, and useful optimizations. From these optimizations, you can view different types of completions by, for example, telling Claude to mention “I don’t know” to prevent hallucinations, thinking step by step, using prompt chaining, giving room to “think” as it generates responses, and double-checking for comprehension and accuracy.

Let’s use our query generation task and discuss some of the techniques we used to optimize our prompt. There were a few core elements that benefited our query generation efforts:

  • Using the proper human/assistant syntax
  • Utilizing XML tags (Claude respects and understands XML tags)
  • Adding clear instructions for the model to prevent hallucination

The following generic example shows how we used the human/assistant syntax, applied XML tags, and added instructions to restrict the output to SQL and instruct the model to say “sorry, I am unable to help” if it can’t produce relevant SQL. The XML tags were used to frame the instructions, additional hints, database schema, additional table explanations, and example rows.

"""Human: You are a SQL expert.
You are tasked to generate a SQL statement from the instruction provided.

<instructions>
Understanding the input question, referencing the database schema, and reviewing
example rows, generate a SQL statement that represents the question.
</instructions>

<database_schema>
"here you can include your table schemas
</database_schema>

<table_description>
"Comp-Nam" stands for Company Name
"Own-Hist" stand for Ownership history
</table_description>

<example_rows>
"here you can insert 2-5 sample database rows"
</example_rows>

<question>
{input}
</question>

<additional_hints>
In your response provide only SQL with no additional comments.
The SQL has to follow the proper database schema.
If the question is unrelated to the database or if you are
unable to generate relevant SQL,
say "sorry, I am unable to help".
Do not make up an answer
Do not answer with anything other than SQL
</additional_hints>

Assistant: """

The final working solution

After we had addressed all the challenges identified during the proof of concept, we had fulfilled all the solution requirements. Q4 was satisfied with the quality of the SQL generated by the LLM. This stands true for simple tasks that required only a WHERE clause to filter the data, and also with more complex tasks that required context-based aggregations with GROUP BY and mathematical functions. The end-to-end latency of the overall solution came within what was defined as acceptable for the use case—single-digit seconds. This was all thanks to the choice of an optimal LLM at every stage, proper prompt engineering, eliminating the SQL verification step, and using an efficient LLM for the summarization step (Titan Text Express or Claude Instant).

It’s worth noting that using Amazon Bedrock as a fully managed service and the ability to have access to a suite of LLMs through the same API allowed for experimentation and seamless switching between LLMs by changing the model name in the API call. With this level of flexibility, Q4 was able to choose the most performant LLM for each LLM call based on the nature of the task, be it query generation, verification, or summarization.

Conclusion

There is no one solution that fits all use cases. In a RAG approach, the quality of the output highly depends on providing the right context. Extracting the right context is key, and every dataset is different with its unique characteristics.

In this post, we demonstrated that for numerical and structured datasets, using SQL to extract the context used for augmentation can lead to more favorable results. We also demonstrated that frameworks like LangChain can minimize the coding effort. Additionally, we discussed the need to be able to switch between LLMs within the same use case in order to achieve the most optimal accuracy, performance, and cost. Finally, we highlighted how Amazon Bedrock, being serverless and with a variety of LLMs under the hood, provides the flexibility needed to build secure, performant, and cost-optimized applications with the least amount of heavy lifting.

Start your journey towards building generative AI-enabled applications by identifying a use case of value to your business. SQL generation, as the Q4 team learned, can be a game changer in building smart applications that integrate with your data stores, unlocking revenue potential.


About the authors

Tamer Soliman is a Senior Solutions Architect at AWS. He helps Independent Software Vendor (ISV) customers innovate, build, and scale on AWS. He has over two decades of industry experience in consulting, training, and professional services. He is a multi patent inventor with three granted patents and his experience spans multiple technology domains including telecom, networking, application integration, AI/ML, and cloud deployments. He specializes in AWS Networking and has a profound passion for machine leaning, AI, and Generative AI.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book – Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning (ML) projects in various domains such as computer vision, natural language processing and generative AI. She helps customers to build, train and deploy large machine learning models at scale. She speaks in internal and external conferences such re:Invent, Women in Manufacturing West, YouTube webinars and GHC 23. In her free time, she likes to go for long runs along the beach.

Stanislav Yeshchenko is a Software Architect at Q4 Inc.. He has over a decade of industry experience in software development and system architecture. His diverse background spanning roles such as Technical Lead and Senior Full Stack Developer, powers his contributions to advancing innovation of the Q4 Platform. Stanislav is dedicated to driving technical innovation and shaping strategic solutions in the field.

Read More

Enable faster training with Amazon SageMaker data parallel library

Enable faster training with Amazon SageMaker data parallel library

Large language model (LLM) training has become increasingly popular over the last year with the release of several publicly available models such as Llama2, Falcon, and StarCoder. Customers are now training LLMs of unprecedented size ranging from 1 billion to over 175 billion parameters. Training these LLMs requires significant compute resources and time as hundreds to thousands of graphics processing units (GPUs) must be used to handle today’s vast training datasets and model sizes. One bottleneck in distributed training can be GPU communication handled by the NVIDIA Collective Communication Library (NCCL). In some large-distributed training jobs, more time can be spent on inter-GPU communication than actual GPU computation. To alleviate the GPU communication bottleneck and enable faster training, Amazon SageMaker is excited to announce an optimized AllGather collective operation as part of the SageMaker distributed data parallel library (SMDDP). AllGather is the most used collective operation in popular memory-efficient data parallelism solutions like DeepSpeed Zero Redundancy Optimizer (ZeRO) and Fully Sharded Data Parallelism (FSDP), and it is the main contributor to GPU communication overhead. In this post, we show a high-level overview of how SMDDP works, how you can enable SMDDP in your Amazon SageMaker training scripts, and the performance improvements you can expect.

Solution overview

Traditional data parallel training involves replicating an entire model across multiple GPUs, with each model training on different shards of data from the dataset. During the backward pass, gradients are averaged among GPU workers so that each model replica is updated with the same gradient values despite them being trained with different data shards. This technique allows much faster training on vast datasets by parallelizing the consumption of training data. However, some of today’s large models (e.g., Llama2 70B) are far too large to fit entirely within GPU memory, which makes traditional data parallelism unusable. To continue reaping the benefits of data parallelism while overcoming limited GPU memory, sharded data parallel solutions such as DeepSpeed ZeRO, PyTorch FSDP, and the Amazon SageMaker model parallelism library have grown in popularity.

In sharded data parallelism, rather than replicating the entire model on GPU workers, the model parameters, gradients, and optimizer states are broken up and distributed (i.e., sharded) across GPUs in the training job. To perform forward and backward pass computation, parameters are gathered from shards on other GPU workers to form one or more model layers. After computation is performed, these layers are then freed from memory to allow for the next set of layers to be gathered. Note that there are variants of sharded data parallelism where only the optimizer states and gradients are sharded, but not the model parameters. AllGather is still used in this type of sharded data parallelism, but only prior to forward pass computation in order to gather model parameters that have been updated by different gradient or optimizer state shards from other GPU workers. Refer to the different DeepSpeed ZeRO stages and the SHARD_GRAD_OP FSDP sharding strategy for more detail.

An AllGather collective operation is performed each time parameters are unsharded—NCCL provides the standard open-source implementation of this routine. As shown in the following, each GPU worker involved in the AllGather starts off with an input buffer and ends up with all of the input buffers from other workers concatenated together. When AllGather is used in sharded data parallelism, the input buffers contain the model parameter shards and the large output buffers contain one or more model layers materialized from the other shards.

Before and after AllGather operation on 4 GPUs

Although NCCL is typically used for AllGather in distributed training, its underlying low-level implementation isn’t tailored to the networking infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) instances, and thus its performance can slow down end-to-end training. The SMDDP library is a collective communication library for NVIDIA GPUs that serves as a drop-in replacement for NCCL and provides better performance for distributed training jobs with PyTorch. Specifically, SMDDP provides an optimized implementation of AllGather for p4d/p4de instance types.

Since collective operations like AllGather block forward and backward pass computation, faster execution of these operations directly translates into shorter end-to-end training time with no side effects on convergence. Other collective operations that’re used less frequently in sharded data parallel training are handled by falling back to NCCL.

Walkthrough

AWS-optimized AllGather

AWS-optimized AllGather uses the following techniques to achieve better performance on AWS infrastructure compared to NCCL:

  1. We move data between instances via Elastic Fabric Adapter (EFA) network with an all-to-all communication pattern. EFA is AWS’s low-latency and high-throughput network solution, and an all-to-all pattern for inter-node network communication is more tailored to the characteristics of EFA and AWS’ network infrastructure by requiring fewer packet hops compared to NCCL’s ring or tree communication pattern.
  2. GDRCopy to coordinate local NVLink and EFA network traffic. GDRCopy is a library that provides low-latency communication between CPU processes and GPU CUDA kernels. With this technology, we’re able to pipeline the intra-node and inter-node data movement.
  3. Reduced usage of GPU streaming multiprocessors to give back more compute power to model kernels. AWS P4d/P4de instances are equipped with NVIDIA A100 GPUs each of which has 108 streaming multiprocessors. While NCCL takes up to 24 streaming multiprocessors to execute collectives, SMDDP Collectives only use up to nine streaming multiprocessors. The saved streaming multiprocessors can be picked up by model compute kernels for quicker execution.

Usage

SMDDP collectives natively integrates with PyTorch through the process group abstraction in the torch.distributed module. A process group defines the interfaces for common collective operations such as AllGather, ReduceScatter, AllReduce, etc. Users can write generic distributed code and then choose the underlying backend, which provides the implementation for these operations based on the compute device used. CPU training jobs often use the gloo or mpi backend while NVIDIA GPUs use the nccl backend.

The SMDDP library comes into the picture by registering itself as a custom backend in the process group abstraction. This is done by the import statement, which is shown in the following code snippets. Then, when selecting the backend for your GPU-based distributed training job, just replace nccl with smddp. The smddp backend abides by the same semantics as the nccl backend and supports the same training scenarios.

DeepSpeed

import smdistributed.dataparallel.torch.torch_smddp
deepspeed.init_distributed(dist_backend="smddp")  # replacing "nccl"

FSDP

import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend="smddp")  # replacing "nccl"

Benchmarks

We benchmarked standalone AllGather performance where the collective operation is run in isolation without any model training. Below is a sample result on 32 p4d instances comparing NCCL and SMDDP AllGather. The X-axis represents the output size of AllGather, and the Y-axis represents the network utilization rate of p4d’s 400 Gbps EFA network. The 4 sub-graphs represent the common communication group patterns where we have 1, 2, 4, and 8 ranks per p4d instance participating in the AllGather operation, respectively.

Network utilization of SMDDP and NCCL AllGather on 32 nodes

These microbenchmarks show that SMDDP outperforms NCCL with two key characteristics:

  1. The peak performance of SMDDP (approximately 90% bandwidth utilization) is higher than that of NCCL (approximately 80% bandwidth utilization) in all configurations.
  2. SMDDP reaches the peak performance at much smaller buffer sizes than NCCL. This particularly improves training speeds for smaller models or when the user sets a small AllGather buffer size in DeepSpeed (where AllGather size need not be equal to layer size).

Model training benchmarks

In large-scale training jobs where GPU communication is a significant bottleneck, SMDDP can markedly improve training speeds, as measured by model TFLOPS/GPU.

Configuration Performance
Model/Training Cluster Sharded Data Parallelism Solution Model TFLOPS/GPU with NCCL Model TFLOPS/GPU with SMDDP % speedup
13B Llama2
Seq length: 4096
Global batch size: 4M tokens
64 p4d.24xlarge nodes (512 NVIDIA A100 GPUs) PyTorch FSDP 97.89 121.85 24.40%
65B GPT-NeoX
Seq length: 2048
Global batch size: 4M tokens
64 p4d.24xlarge nodes (512 NVIDIA A100 GPUs) DeepSpeed ZeRO Stage 3* 99.23 108.66 9.50%

*EleutherAI’s Megatron-DeepSpeed repository was used. Tensor parallelism was also enabled with a tensor-parallel degree of eight.

Note: Model TFLOPS/GPU is based on the Model FLOPS Utilization calculation defined in the paper here and benchmark figures elsewhere may cite hardware TFLOPS/GPU as the performance metric. Hardware TFLOPS/GPU can be approximated as 4/3 x model TFLOPS/GPU.

Conclusion

In this post, we showed you how to significantly speed up sharded data parallel training jobs on Amazon SageMaker with just two lines of code change. Large-scale distributed training is becoming increasingly ubiquitous with the emergence or LLMs, but with this scale comes high costs. By reducing the communication bottleneck between GPUs, SMDDP helps you train faster at scale and save on compute resources. You can find more SMDDP examples with sharded data parallel training in the Amazon SageMaker Examples GitHub repository.


About the Authors

Apoorv Gupta is a Software Development Engineer at AWS, focused on building optimal deep learning systems for AWS infrastructure and hardware. He is interested in distributed computing, deep learning systems, and ML accelerators. Outside of work, Apoorv enjoys traveling, hiking, and video games.

Karan Dhiman is a Software Development Engineer at AWS, based in Toronto, Canada. He is very passionate about the machine learning space and building solutions for accelerating distributed computed workloads.

Ruhan Prasad is a Software Development Engineer at AWS who is working on making distributed deep learning training faster, cheaper, and easier to use on SageMaker. Outside of work, Ruhan enjoys playing tennis, traveling, and cooking.

Zhaoqi Zhu is a Senior Software Development Engineer at AWS, passionate about distributed systems and low level optimizations. He enjoys watching soccer matches while drinking (non-diet) soda.

Read More

Use custom metadata created by Amazon Comprehend to intelligently process insurance claims using Amazon Kendra

Use custom metadata created by Amazon Comprehend to intelligently process insurance claims using Amazon Kendra

Structured data, defined as data following a fixed pattern such as information stored in columns within databases, and unstructured data, which lacks a specific form or pattern like text, images, or social media posts, both continue to grow as they are produced and consumed by various organizations. For instance, according to International Data Corporation (IDC), the world’s data volume is expected to increase tenfold by 2025, with unstructured data accounting for a significant portion. Enterprises may want to add custom metadata like document types (W-2 forms or paystubs), various entity types such as names, organization, and address, in addition to the standard metadata like file type, date created, or size to extend the intelligent search while ingesting the documents. The custom metadata helps organizations and enterprises categorize information in their preferred way. For example, metadata can be used for filtering and searching. Customers can create the custom metadata using Amazon Comprehend, a natural-language processing (NLP) service managed by AWS to extract insights about the content of documents, and ingest it into Amazon Kendra along with their data into the index. Amazon Kendra is a highly accurate and easy-to-use enterprise search service powered by Machine Learning (AWS). The custom metadata can then be used to enrich the content for better filtering and facet capabilities. In Amazon Kendra, facets are scoped views of a set of search results. For example, you can provide search results for cities across the world, where documents are filtered by a specific city with which they are associated. You could also create facets to display results by a specific author.

Insurance companies are burdened with increasing numbers of claims that they must process. Additionally, the complexity of claims processing is also increasing due to the diverse types of insurance documents involved, and custom entities in each of these documents. In this post, we describe a use case for custom content enrichment for insurance providers. The insurance provider receives payout claims from the beneficiary’s attorney for different insurance types, such as home, auto, and life insurance. In this use case, the documents received by the insurance provider do not contain any metadata that allows searching the content based on certain entities and classes. The insurance provider wants to filter Kendra content based on custom entities and classes specific to their business domain. This post illustrates how you can automate and simplify metadata generation using custom models by Amazon Comprehend. The metadata generated can be customized during the ingestion process with Amazon Kendra Custom Document Enrichment (CDE) custom logic.

Let’s look at a few examples of Amazon Kendra search with or without filtering and facets capabilities.

In the following screenshot, Amazon Kendra provides a search result but there is no option to further narrow down the search results by using any filters.

The following screenshot shows Amazon Kendra search results can be filtered by using different facets like Law Firm, Policy Numbers, created by custom metadata to narrow down the search results.

The solution discussed in this post can easily be applied to other businesses/use-cases as well, such as healthcare, manufacturing, and research.

Solution overview

In this proposed solution, we will 1) classify insurance claims submissions into various classes, and 2) retrieve insurance-specific entities from these documents. When this is complete, the document can be routed to the appropriate department or downstream process.

The following diagram outlines the proposed solution architecture.

Amazon Comprehend custom classification API is used to organize your documents into categories (classes) that you define. Custom classification is a two-step process. First, you train a custom classification model (also called a classifier) to recognize the classes that are of interest to you. Then, you use your model to classify any number of document sets.

Amazon Comprehend custom entity recognition feature is used to identify specific entity types (names of insurance company, names of the insurer, policy number) beyond what is available in the generic entity types by default. Building a custom entity recognition model is a more effective approach than using string matching or regular expressions to extract entities from documents. A custom entity recognition model can learn the context where those names are likely to appear. Additionally, string matching will not detect entities that have typos or follow new naming conventions, while this is possible using a custom model.

Before diving deeper, let’s take a moment to explore Amazon Kendra. Amazon Kendra is a highly accurate and easy-to-use enterprise search service powered by machine learning. It allows users to find the information they need within the vast amount of content spread across their organization, ranging from websites and databases to intranet sites. We will first create an Amazon Kendra index to ingest the documents. While ingesting the data, it’s essential to consider the concept of Custom Data Enrichment (CDE). CDE enables you to enhance the search capability by incorporating external knowledge into the search index. For more information, refer to Enriching your documents during ingestion. In this post, the CDE logic invokes the custom APIs of Amazon Comprehend to enrich the documents with identified classes and entities. Finally, we use the Amazon Kendra search page to show how the metadata enhanced the search capability by adding faceting and filtering capabilities.

The high-level steps to implement this solution are as follows:

  1. Train the Amazon Comprehend custom classifier using training data
  2. Train the Amazon Comprehend custom entity recognition using training data
  3. Create the Amazon Comprehend custom classifier and custom entity recognition endpoints
  4. Create and deploy a Lambda function for post extraction enrichment
  5. Create and populate the Amazon Kendra index
  6. Use the extracted entities to filter searches in Amazon Kendra

We have also provided a sample application in the GitHub repo for reference.

Data security and IAM considerations

With security as the top priority, this solution follows the least privilege permissions principle for the services and features used. The IAM role used by Amazon Comprehend custom classification and custom entity recognition has permissions to access the dataset from the test bucket only. The Amazon Kendra service has access to a specific S3 bucket and Lambda function used to call comprehend APIs. The Lambda function has permissions to call the Amazon Comprehend APIs only. For more information, review section 1.2 and 1.3 in the notebook.

We recommend you do the following in a non-production environment prior to implementing the solution in the production environment.

Train the Comprehend custom classifier using training data

Amazon Comprehend Custom Classification supports two data format types for annotation files:

Since our data is already labeled and stored in CSV files, we will use the CSV file format for the annotation file as an example. We have to provide the labeled training data as UTF-8 encoded text in a CSV file. Do not include a header row in the CSV file. Adding a header row in your file may cause runtime errors. An example to the training data CSV file is as follows:

CLASS, Text of document 1
CLASS, Text of document 2

To prepare classifier training data, refer to Preparing classifier training data. For each row in the CSV file, the first column contains one or more class labels. A class label can be any valid UTF-8 string. We recommend using clear class names that don’t overlap in meaning. The name can include white space, and can consist of multiple words connected by underscores or hyphens. Do not leave any space characters before or after the commas that separate the values in a row.

Next, you will train either using Multi-class mode or Multi-label mode. Specifically, in multi-class mode, classification assigns one class for each document, while in multi-label mode, individual classes represent different categories that aren’t mutually exclusive. In our case we will be using the Multi-Class mode for Plain-text models.

You can prepare separate training and testing datasets for Amazon Comprehend custom classifier training and model evaluation. Or, only provide one dataset for both training and testing. Comprehend will automatically select 10% of your provided dataset to use as testing data. In this example, we are providing separate training and testing datasets.

The following example shows a CSV file containing the class names associated with the various documents.

Document format – Type of Insurance, Content of document 1

When the custom classification model is trained, it can capture different classes of insurance on the documents (Home, Auto, or Life insurance).

Train the Amazon Comprehend custom entity recognizer (NER) using training data

The training dataset for Amazon Comprehend Custom Entity Recognition (NER) can be prepared in one of two different ways:

  • Annotations – Provides a data set that contains the annotated entities for mode training
  • Entity lists (plain text only) – Provides a list of entities and their label type (such as “Insurance company names”) and a set of unannotated documents containing those entities for model training

For more information, refer to Preparing entity recognizer training data.

When training a model using entity list, we need to provide two pieces of information: a list of entity names with their associated custom entity types and a collection of unannotated documents in which the entities appear.

Automatic training requires having two types of information: sample documents and the entity list or annotations. Once the recognizer is trained, you can use it to detect custom entities in your documents. You can quickly analyze a small body of text in real time, or you can analyze a large set of documents with an asynchronous job.

You can prepare separate training and testing datasets for Amazon Comprehend custom entity recognizer training and model evaluation. Or provide only one dataset for both training and testing. Amazon Comprehend will automatically select 10% of your provided dataset to use as testing data. In the below example, we specified the training dataset as Documents.S3Uri under InputDataConfig.

The following example shows a CSV file containing the of entities:

Once the custom entities (NER) model is trained, it will be able to extract the various entities like “PAYOUT“, “INSURANCE_COMPANY“, “LAW_FIRM“, “POLICY_HOLDER_NAME“, “POLICY_NUMBER“.

Create the Amazon Comprehend custom classifier and custom entities (NER) endpoints

Amazon Comprehend’s endpoints make your custom models available for real-time classification. After you create an endpoint, you can make changes to it as your business needs evolve. For example, you can monitor your endpoint utilization and apply auto scaling to automatically set endpoint provisioning to fit your capacity needs. You can manage all your endpoints from a single view, and when you no longer need an endpoint, you can delete it to save costs. Amazon Comprehend support both synchronous and asynchronous options, if real-time classification isn’t required for your use case, you can submit a batch job to Amazon Comprehend for asynchronous data classification.

For this use case, you create an endpoint to make your custom model available for real-time analysis.

To meet your text processing needs, you assign inference units to the endpoint, and each unit allows a throughput of 100 characters per second. You can then adjust the throughput up or down.

Create and deploy a Lambda function for post extraction enrichment

The post-extraction Lambda function allows you to implement the logic to process the text extracted by Amazon Kendra from the ingested document. The post-extraction function we configured implements the code to invoke Amazon Comprehend to detect custom entities and custom classifying the documents from the text extracted by Amazon Kendra, and uses them to update the document metadata, which is presented as facets in an Amazon Kendra search. The function code is embedded in the notebook. The PostExtractionLambda code works as follows:

  • Splits the page text into sections that do not exceed the max byte length limit of the comprehend detect_entities API. (See Limits ).
    NOTE the script uses a naive character length splitting algorithm for simplicity – production use cases should implement overlapping or sentence boundary splits, based on UTF8 byte length.
  • For each section of the text, calls the comprehend real-time endpoints for custom entities and custom classifier to detect the following entity types: [“PAYOUT“, “INSURANCE_COMPANY“, “LAW_FIRM“, “POLICY_HOLDER_NAME“, “POLICY_NUMBER“, “INSURANCE_TYPE“].
  • Filters out detected entities that are below the confidence score threshold. We are using 0.50 threshold which means only entities with confidence 50% and more will be used. This can be tuned based on the use case and requirements.
  • Tracks the frequency count of each entity.
  • Selects only the top N (10) unique entities for each page, based on frequency of occurrence.
  • For document classification, the multi-class classifier assigns only one class for each document. In this Lambda function, the documents will be classified as Auto Insurance, Home Insurance, or Life Insurance.
#The function to read the input text and detect entities in it using Comprehend 
def entity_detector(doc_text):
    #List of JSON objects to store entities
    entity_data = dict()
    #List of observed text strings recognized as categories
    category_text = dict()
    #Frequency of each text string
    text_frequency = dict()
    for et in categories:
        entity_data[ et ] = []
        category_text[ et ] = []
        text_frequency[ et ] = dict()
    
    #Make detect_entities_v2 call in a loop to work with the text limit
    for i in range(0, len(doc_text), compre_text_size):
        try:
            entities = compre.detect_entities(Text=doc_text[i:i+compre_text_size], LanguageCode='en', EndpointArn=endpoint_custom_entity)
            
        except Exception as e:
            logger.info("Exiting - detect_entities_v2 terminated with exception")
            return []
        for e in entities["Entities"]:
            #For each of the recognized entities take only those that have confidence score higher than min_score, 
            #are printable, dont contain quotes and are previously unseen
            if ((e["Score"] > min_score) and (e["Text"].isprintable()) and (not '"' in e["Text"]) and (not e["Text"].upper() in category_text[e["Type"]])):
                #Append the text to entity data to be used for a Kendra custom attribute
                entity_data[e["Type"]].append(e["Text"])
                #Keep track of text in upper case so that we don't treat the same text written in different cases differently
                category_text[e["Type"]].append(e["Text"].upper())
                #Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance
                text_frequency[e["Type"]][e["Text"].upper()] = 1
            elif (e["Text"].upper() in category_text[e["Type"]]):
                #Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance
                text_frequency[e["Type"]][e["Text"].upper()] += 1
    #The Kendra attribute metadata JSON object to be populated
    metadata = dict()
    for et in categories:
        metadata[et] = []
        #Take at most elimit number of recognized text strings having the highest frequency of occurrance
        el = [pair[0] for pair in sorted(text_frequency[et].items(), key=lambda item: item[1], reverse=True)][0:elimit]
        for d in entity_data[et]:
            if (d.upper() in el):
                metadata[et].append(d)
    for md in metadata:
        metaUL.append({
            "name": md,
            "value": {
                "stringListValue": metadata[md]
            }
        })
    return metaUL

Note that as of this writing, CDE only supports synchronous calls or if it has to be asynchronous, then an explicit wait loop is needed. For post extraction Lambda the max execution time is 1 min. The Lambda custom logic can be changed based on the requirements that fit your use case.

Create and populate the Amazon Kendra index

In this step, we will ingest the data to the Amazon Kendra index and make it searchable for the users. During the ingestion, we will use the Lambda function created in the previous step as a post extraction step and the Lambda function will call the custom classification and custom entity recognition (NER) endpoints to create the custom metadata fields.

The high-level steps to implement this solution are as follows:

  1. Create Amazon Kendra Index.
  2. Create Amazon Kendra Data source – There are different data sources which can be used to ingest dataset. In this post we are using an S3 bucket.
  3. Create Facets ­ Law_Firm, Payout, Insurance_Company, Policy_Number, Policy_Holder_Name, Insurance_Type with string type as ‘STRING_LIST_VALUE’.
  4. Create Kendra CDE and point it to the post-extraction Lambda function previously created.
  5. Perform the sync process to ingest the dataset.

Once completed, you can populate the index with the insurance data, using the Kendra CDE with post extraction lambda, you can filter searches based on the custom entity types and custom classification as custom metadata fields.

Use the extracted entities to filter searches in Kendra

Now the index is populated and ready to use. In the Amazon Kendra console, choose Search Indexed Content under Data Management and do the following.

Query the following: List of insurance failed due to late filing?

The results show an answer from the policy type – HOME INSURANCE and brings text_18 and text_14 as the top results.

Choose “Filter search results” on the left. Now you will see all the Entity types and classification values extracted using Comprehend, and for each entity value and classification you will see the number of matching documents.

Under INSURANCE_TYPE choose “Auto-Insurance”, and then you will get an answer from text_25 file.

Note that your results may vary slightly from the results shown in the screenshot.

Try searching with your own queries, and observe how the entities and document classification identified by Amazon Comprehend quickly allows you to:

  • See how your search results are distributed across the categories.
  • Narrow your search by filtering on any of the entity/classification values.

Clean up

After you have experimented with the search and tried the notebook provided in the Github repository, delete the infrastructure you provisioned in your AWS account to avoid any unwanted charges. You can run the cleanup cells in the notebook. Alternatively, you can delete the resources manually through the AWS console:

  • Amazon Kendra Index
  • Comprehend custom classifier and custom entity recognition (NER) endpoints
  • Comprehend custom classifier and custom entity recognition (NER) custom models
  • Lambda function
  • S3 bucket
  • IAM roles and policies

Conclusion

In this post, we showed how Amazon Comprehend custom entities and custom classifier enables Amazon Kendra search powered by CDE feature to help end-users perform better searches on the structured/unstructured data. The custom entities of Amazon Comprehend and custom classifier makes it very useful for different use cases and various domain specific data. For more information about how to use Amazon Comprehend, refer to Amazon Comprehend developer resources and for Amazon Kendra, refer to Amazon Kendra developer resources.

Give this solution a try for your use case. We invite you to leave your feedback in the comments sections.


About the Authors

Amit Chaudhary is a Senior Solutions Architect at Amazon Web Services. His focus area is AI/ML, and he helps customers with generative AI, large language models, and prompt engineering. Outside of work, Amit enjoys spending time with his family.

Yanyan Zhang is a Senior Data Scientist in the Energy Delivery team with AWS Professional Services. She is passionate about helping customers solve real problems with AI/ML knowledge. Recently, her focus has been on exploring the potential of Generative AI and LLM. Outside of work, she loves traveling, working out and exploring new things.

Nikhil Jha is a Senior Technical Account Manager at Amazon Web Services. His focus areas include AI/ML, and analytics. In his spare time, he enjoys playing badminton with his daughter and exploring the outdoors.

Read More

Foundational data protection for enterprise LLM acceleration with Protopia AI

Foundational data protection for enterprise LLM acceleration with Protopia AI

This post is written in collaboration with Balaji Chandrasekaran, Jennifer Cwagenberg and Andrew Sansom and Eiman Ebrahimi from Protopia AI.

New and powerful large language models (LLMs) are changing businesses rapidly, improving efficiency and effectiveness for a variety of enterprise use cases. Speed is of the essence, and adoption of LLM technologies can make or break a business’s competitive advantage. AWS is especially well suited to provide enterprises the tools necessary for deploying LLMs at scale to enable critical decision-making.

In their implementation of generative AI technology, enterprises have real concerns about data exposure and ownership of confidential information that may be sent to LLMs. These concerns of privacy and data protection can slow down or limit the usage of LLMs in organizations. Enterprises need a responsible and safer way to send sensitive information to the models without needing to take on the often prohibitively high overheads of on-premises DevOps.

The post describes how you can overcome the challenges of retaining data ownership and preserving data privacy while using LLMs by deploying Protopia AI’s Stained Glass Transform to protect your data. Protopia AI has partnered with AWS to deliver the critical component of data protection and ownership for secure and efficient enterprise adoption of generative AI. This post outlines the solution and demonstrates how it can be used in AWS for popular enterprise use cases like Retrieval Augmented Generation (RAG) and with state-of-the-art LLMs like Llama 2.

Stained Glass Transform overview

Organizations seek to retain full ownership and control of their sensitive enterprise data. This is a pillar of responsible AI and an emerging data protection and privacy requirement above and beyond basic security and legal guarantees of LLM providers.

Although enterprise business units want to utilize LLMs for various tasks, they are also concerned about trade secrets, intellectual property, and other proprietary information leaking through data sent to these models. At the same time, enterprise security, compliance, data management, and information offices are apprehensive of exposing or leaking plain text customer information or other regulated data outside of the enterprise. AWS and Protopia AI are partnering to deliver the critical component that solves this common enterprise customer need.

Protopia AI’s Stained Glass Transform (SGT) solves these challenges by converting unprotected enterprise data to a randomized re-representation, referred to as RmoRed data, as shown in the following figure. This representation is a stochastic embedding of the original data, preserving the information the target LLM needs to function without exposing sensitive prompts or queries, context, or fine-tuning data. This re-representation is a one-way transformation that can’t be reversed, ensuring holistic privacy of enterprise data and protection against leaking plain text sensitive information to LLMs. SGT’s applicability is not limited to language models. Randomized re-representations can also be generated for visual and structured data. The name Stained Glass Transform is rooted in the visual appearance of randomized re-representations of visual data that can resemble viewing the data through stained glass, as demonstrated in this US Navy use case.

SGT works with state-of-the-art LLMs such as Llama 2. The following figure shows an example of applying SGT to a Llama 2 model for instruction following while adding a layer of protection to the instruction and context. The left side of the figure shows an example of a financial document as context, with the instruction asking the model to summarize the document. On the bottom left, the response generated by Llama 2 when operating on the raw prompt is shown. When using SGT, the embeddings associated with this prompt are transformed on the client side into stochastic embeddings, as described in more detail later in this post. The bottom right shows Llama 2 can still generate a correct response if the RmoRed data (post-transformation embeddings) are sent instead of the unprotected embeddings. The top right shows that if the RmoRed data leaked, a reconstruction of the original prompt would result in unintelligible text.

To create an SGT for a given model such as Llama 2, Protopia AI provides a lightweight library called the Stained Glass SDK, which is an extension of PyTorch. As shown in the following figure, after an SGT is created, it can be integrated into deployment pipelines in multiple ways. The transform that is created from the SDK can be deployed locally, in a hybrid setup, or completely on the cloud. This is possible because SGT is designed to be a lightweight process requiring very little compute resources and as such has minimal impact on the inference critical path. Another key evaluation is retention of model accuracy using re-represented data. We observe that across different data types and model variations, accuracy is retained within desirable tolerance limits when using re-represented data.

These options for deployment and maintaining the accuracy allows for confident adoption of SGT by all the stakeholders within an enterprise organization. To further protect the output of the LLM, Protopia AI can encode query outputs to a representation whose decoder is only available to the enterprise data owner.

Solution overview

The previous section described how you can use Stained Glass Transform in a variety of architectures. The following figure details the steps involved in creating, deploying, and using SGT for LLMs:

  • SGT creation – The team that trains the baseline LLM foundation model (providers of proprietary LLMs, cloud service provider, or enterprise ML teams creating their own LLMs) runs Protopia AI’s Stained Glass SDK software without altering their existing practices for training and deploying the LLM. After the foundation model training is complete, the SDK runs as an optimization pass over the language model to compute the SGT. This optimization pass is delivered through an extension to PyTorch. The SDK wraps the foundation model and mathematically discovers a unique Stained Glass Transform for that LLM. Further details of the underlying math can be found in the accompanying whitepaper. Note that because the team training the LLM itself is also running the Stained Glass SDK, there is no exposure or sending of model weights that is necessary for this step to be completed.
  • SGT release and deployment – The SGT that is output from the earlier optimization step is deployed as part of the data pipeline that feeds the trained LLM. As described in the previous section, the SGT sits on the enterprise client side.
  • SGT use – The SGT runs on the prompts created by the enterprise and generates protected prompts, which are sent to the deployed LLM. This enables the enterprise to retain ownership of their sensitive queries and context. Using Protopia AI Stained Glass, the unprotected sensitive data does not leave the enterprise’s site or trust zone.

You can use the Stained Glass SDK to create an SGT in multiple ways. For example, you can use the Stained Glass SDK in self-managed machine learning (ML) environments with Amazon Elastic Kubernetes Service (Amazon EKS) for training and inferencing or within Amazon Elastic Compute Cloud (Amazon EC2) directly. Another option is it can run within Amazon SageMaker to create an SGT for a given trained model. Transforming the input for deployment during inference from the client is independent of the chosen deployment implementation.

The following figure illustrates a possible implementation in a self-managed ML environment where training a Stained Glass Transform is performed on Amazon EKS.

In this workflow, a container is created using the Stained Glass SDK and deployed to Amazon Elastic Container Registry (Amazon ECR). This container is then deployed on Amazon EKS to train an SGT that is saved to Amazon Simple Storage Service (Amazon S3). If you’re using Amazon EC2, you can train a transformation directly on your instance as part of your ML setup. The Stained Glass SDK can run on a variety of instance types, including Amazon P5, P4, or G5 instance families, based on your base LLM requirements. After the LLM is deployed to be used for inference, the client application uses the created SGT, which is a lightweight operation, to transform prompts and context before sending them to the LLM. By doing so, only transformed data is exposed to the LLM, and ownership of the original input is retained on the client side.

The following figure demonstrates how you can train a transform and run inferencing on SageMaker.

The creation of the SGT follows a similar path as the Amazon EKS setup by ingesting the training data from Amazon S3, training an SGT on a container, and saving it to Amazon S3. You can use the Stained Glass SDK in your existing SageMaker setup with Amazon SageMaker Studio, SageMaker notebooks, and a SageMaker training job. The LLM is hosted as a SageMaker endpoint that is accessible by the client application. The inferencing for the client application is also identical to the Amazon EKS setup, except for what is serving the model.

Randomized re-representations to protect LLM prompts and fine-tuning data

This section covers a variety of use cases demonstrating how randomized re-representation protects LLM prompts. The examples illustrate major implications for enterprise generative AI efforts: opening new doors to AI use cases, accelerating speed to market while properly protecting enterprise data, and retaining ownership of the sensitive data required for use in LLM prompts.

RAG use case

A popular enterprise use case for LLMs is Retrieval Augmented Generation (RAG). The following figure shows an illustrative example where the prompts and sources are protected using Stained Glass. The left side of the figure shows the unprotected prompts and source information. In an enterprise implementation of RAG, the sources could include sensitive information such as enterprise trade secrets, intellectual property, or financial information. The right side shows the best possible reconstruction in human readable text from the RmoRed prompts created by the SGT.

We can observe that even in the best possible reconstruction, the information is completely obfuscated. However, the response from the model with and without the transformation is the same, with pointers to the original source documents, thereby preserving the accuracy of both the question and source documents while performing this popular enterprise use case.

Broad applicability across LLMs and languages

One of the highlights of the Stained Glass SDK is that it’s highly resilient to model advancements and adaptable to state-of-the-art models such as Llama 2. The following figure shows an SGT that was created on a Llama 2 LLM that was previously fine-tuned for working with Japanese text. This example further illustrates that SGTs can be created and applied for any language and that even inputs for fine-tuned models can be transformed. The general applicability of SGT is driven by the robust foundation of the Stained Glass SDK being model- and data-agnostic.

Protecting fine-tuning data as well as prompts

Stained Glass Transform is not limited solely to protecting data at inference time; it can also protect data used to fine-tune a foundation model. The process for creating the transformation for fine-tuning datasets is the same as that explained in the solution architecture section earlier in this post. The transformation is created for the foundation model to be fine-tuned without accessing the fine-tuning data. After the SGT has been created and trained for the foundation model, the fine-tuning dataset is transformed to randomized re-representations that will then be used to fine-tune the foundation model. This process is explained in more detail in the accompanying whitepaper.

In the following example, an enterprise customer needed to fine-tune an existing model for network log anomaly detection. They used Stained Glass to transform the sensitive fine-tuning dataset to randomized embeddings, which were used to fine-tune their foundation model. They found that the detection model that was fine-tuned on the transformed representations performed with almost identical accuracy compared to the hypothetical scenario of fine-tuning the foundation model on the unprotected fine-tuning dataset. The following table shows two examples of plain text data records from the fine-tuning dataset and a reconstruction to text of those same data records from the fine-tuning dataset.

Under the hood of Stained Glass Transform for LLMs

When applied to computer vision, SGT operates on input pixel features, and for LLMs, it operates at the embedding level. To highlight how Stained Glass Transform works, imagine the prompt embeddings as a matrix, as illustrated on the left of the following figure. In each entry, there is a deterministic value. This value can be mapped to the original data, exposing the unprotected prompt. Stained Glass Transform converts this matrix of deterministic values to a matrix whose elements are a cloud of possibilities.

The transformed prompt is rendered by sampling noise from probability distributions defined by the SGT and adding the sampled noise to the deterministic embeddings, which randomizes the original prompt values irreversibly. The model still understands the randomized re-represented prompt at the mathematical level and can carry out its task accurately.

Conclusion

This post discussed how Protopia AI’s Stained Glass Transform decouples raw data ownership and protection from the ML operations process, enabling enterprises to retain ownership and maintain privacy of sensitive information in LLM prompts and fine-tuning data. By using this state-of-the-art data protection for LLM usage, enterprises can accelerate adoption of foundation models and LLMs by worrying less about exposure of sensitive information. By safely unlocking the value in real enterprise data, organizations can enable the promised efficiencies and business outcomes of LLMs more efficiently and quickly. To learn more about this technology, you can find further reading in the accompanying whitepaper and connect with Protopia AI to get access and try it on your enterprise data.

About Protopia AI

Protopia AI is a leader in data protection and privacy-preserving AI/ML technologies based in Austin, Texas, and specializes in enabling AI algorithms and software platforms to operate without the need to access plain text information. Over the past 2 years, Protopia AI has successfully demonstrated its flagship Stained Glass Transform product across a variety of ML use cases and data types with the US Navy, leading financial services, and global technology providers.

Protopia AI works with enterprises, generative AI and LLM providers, and Cloud Service Providers (CSPs) to enable maintaining ownership and confidentiality of enterprise data while using AI/ML solutions. Protopia AI has partnered with AWS to deliver a critical component of data protection and ownership for enterprise adoption of generative AI, and was one of 21 startups selected for the inaugural AWS Generative AI Accelerator in 2023.


About the authors

Balaji Chandrasekaran is the VP for Go-to-Market & Customer Enablement at Protopia AI, works closely with clients to leverage AI in their business while prioritizing data protection and privacy. Prior to Protopia AI, Balaji was the Product Lead for AI Solutions at Infor, developing value-centric products while acting as a trusted partner for enterprise customers across diverse industries. Outside work, he enjoys music, hiking, and traveling with family.

Jennifer Cwagenberg leads the engineering team at Protopia AI and works to ensure that the Stained Glass technology meets the needs of their customers to protect their data. Jennifer has prior experience with security working at Toyota in their Product Cybersecurity Group, managing Cloud workloads at N-able, and responsible for data at Match.com.

Andrew Sansom is an AI Solutions Engineer at Protopia AI where he helps enterprises use AI while preserving private and sensitive information in their data. Prior to Protopia AI, he worked as a Technical Consultant focused on enabling AI solutions for clients across many industries including Finance, Manufacturing, Healthcare, and Education. He also taught Computer Science and Math to High School, University, and Professional students.

Eiman Ebrahimi, PhD, is a co-founder and the Chief Executive Officer of Protopia AI. Dr. Ebrahimi is passionate about enabling AI to enrich the human experience across different societal and industry verticals. Protopia AI is a vision for enhancing the lens through which AI observes the necessary and quality data it needs while creating novel capabilities for safeguarding sensitive information. Prior to Protopia AI, he was a Senior Research Scientist at NVIDIA for 9 years. His work at NVIDIA research aimed to solve problems of accessing massive datasets in ML/AI. He also co-authored peer-reviewed publications on how to utilize the power of thousands of GPUs to make training large language models feasible.

Rohit Talluri is a Generative AI GTM Specialist at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect, and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.

Read More