Learn how Amazon Ads created a generative AI-powered image generation capability using Amazon SageMaker

Learn how Amazon Ads created a generative AI-powered image generation capability using Amazon SageMaker

Amazon Ads helps advertisers and brands achieve their business goals by developing innovative solutions that reach millions of Amazon customers at every stage of their journey. At Amazon Ads, we believe that what makes advertising effective is delivering relevant ads in the right context and at the right moment within the consumer buying journey. With that goal, Amazon Ads has used artificial intelligence (AI), applied science, and analytics to help its customers drive desired business outcomes for nearly two decades.

In a March 2023 survey, Amazon Ads found that among advertisers who were unable to build successful campaigns, nearly 75 percent cited building the creative content as one of their biggest challenges. To help advertisers more seamlessly address this challenge, Amazon Ads rolled out an image generation capability that quickly and easily develops lifestyle imagery, which helps advertisers bring their brand stories to life. This blog post shares more about how generative AI solutions from Amazon Ads help brands create more visually rich consumer experiences.

In this blog post, we describe the architectural and operational details of how Amazon Ads implemented its generative AI-powered image creation solution on AWS. Before diving deeper into the solution, we start by highlighting the creative experience of an advertiser enabled by generative AI. Next, we present the solution architecture and process flows for machine learning (ML) model building, deployment, and inferencing. We end with lessons learned.

Advertiser creative experience

When building ad creative, advertisers prefer to customize the creative in a way that makes it relevant to their desired audiences. For example, an advertiser might have static images of their product against a white background. From an advertiser point of view, the process is handled in three steps:

  1. Image generation converts product-only images into rich, contextually relevant images using generative AI. The approach preserves the original product features, requiring no technical expertise.
  2. Anyone with access to the Amazon Ads console can create custom brand images without needing technical or design expertise.
  3. Advertisers can create multiple contextually relevant and engaging product images with no additional cost.

A benefit of the image-generation solution is the automatic creation of relevant product images based on product selection only, with no additional input required from the advertisers. While there are options to enhance background imagery such as prompts, themes, and custom product images, they are not necessary to generate compelling creative. If advertisers do not supply this information, the model will infer it based on information from their product listing on amazon.com.

An example screenshot from Amazon Ads generator where a product with various background.

Figure 1. An example from the image generation solution showing a hydro flask with various backgrounds.

Solution overview

Figure 2 shows a simplified solution architecture for inferencing and model deployment. The steps for the model development and deployment are shown in blue circles and depicted by roman-numerals (i,ii, … iv.) whereas inferencing steps are in orange with Hindu-Arabic numbers (1,2,… 8.).

AWS solution architecture showing the architecture for the Amazon Ads solution.

Figure 2. Solution architecture for inferencing and model deployment.

Amazon SageMaker is at the center of model development and deployment. The team used Amazon SageMaker JumpStart to rapidly prototype and iterate under their desired conditions (step i). Acting as a model hub, JumpStart provided a large selection of foundation models and the team quickly ran their benchmarks on candidate models. After selecting candidate large language models (LLMs), the science teams can proceed with the remaining steps by adding more customization. Amazon Ads applied scientists use SageMaker Studio as the web-based interface to work with SageMaker (step ii). SageMaker has the appropriate access policies to view some intermediary model results, which can be used for further experimentation (step iii).

The Amazon Ads team manually reviewed images at scale through a human-in-the-loop process where the team ensured that the application provides high quality and responsible images. To do that, the team deployed testing endpoints using SageMaker and generated a large number of images spanning various scenarios and conditions (step iv). Here, Amazon SageMaker Ground Truth allowed ML engineers to easily build the human-in-the-loop workflow (step v). The workflow allowed the Amazon Ads team to experiment with different foundation models and configurations through blind A/B testing to ensure that feedback to the generated images is unbiased. After the chosen model is ready to be moved into production, the model is deployed (step vi) using the team’s own in-house Model Lifecycle Manager tool. Under the hood, this tool uses artifacts generated by SageMaker (step vii) which is then deployed into the production AWS account (step viii), using SageMaker SDKs .

Regarding the inference, customers using Amazon Ads now have a new API to receive these generated images. The Amazon API Gateway receives the PUT request (step 1). The request is then processed by AWS Lambda, which uses AWS Step Functions to orchestrate the process (step 2). The product image is fetched from an image repository, which is a part of an existing solution predating this creative feature. The next step is to process customer text prompts and customize the image through content ingestion guardrails. Amazon Comprehend is used to detect undesired context in the text prompt, whereas Amazon Rekognition processes images for content moderation purposes (step 3). If the inputs pass the inspection, then the text continues as a prompt, while the image is processed by removing the background (step 4). Then, the deployed text-to-image model is used for image generation using the prompt and the processed image (step 5). The image is then uploaded into an Amazon Simple Storage Services (Amazon S3) bucket for images and the metadata about the image is stored in an Amazon DynamoDB table (step 6). This whole process starting from step 2 is orchestrated by AWS Step Functions. Finally, the Lambda function receives the image and meta-data (step 7) which are then sent to the Amazon Ads client service through the API Gateway (step 8).

Conclusion

This post presented the technical solution for the Amazon Ads generative AI-powered image generation solution, which advertisers can use to create customized brand images without needing a dedicated design team. Advertisers have a series of features to generate and customize images such as writing text prompts, selecting different themes, swapping the featured product, or uploading a new image of the product from their device or asset library allowing them to create impactful images for advertising their products.

The architecture uses modular microservices with separate components for model development, registry, model lifecycle management (which is an orchestration and step function-based solution to process advertiser inputs), select the appropriate model, and track the job throughout the service, and a customer facing API. Here, Amazon SageMaker is at the center of the solution, starting from JumpStart to final SageMaker deployment.

If you plan to build your generative AI application on Amazon SageMaker, the fastest way is with SageMaker JumpStart. Watch this presentation to learn how you can start your project with JumpStart.


About the Authors

Anita Lacea is the Single-Threaded Leader of generative AI image ads at Amazon, enabling advertisers to create visually stunning ads with the click of a button. Anita pairs her broad expertise across the hardware and software industry with the latest innovations in generative AI to develop performant and cost-optimized solutions for her customers, revolutionizing the way businesses connect with their audiences. She is passionate about traditional visual arts and is an exhibiting printmaker.

Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate in MIT. Burak is passionate about yoga and meditation.

Christopher de Beer is a senior software development engineer at Amazon located in Edinburgh, UK. With a background in visual design. He works on creative building products for advertising, focusing on video generation, helping advertisers to reach their customers through visual communication. Building products that automate creative production, using traditional as well as generative techniques, to reduce friction and delight customers. Outside of his work as an engineer Christopher is passionate about Human-Computer Interaction (HCI) and interface design.

Yashal Shakti Kanungo is an Applied Scientist III at Amazon Ads. His focus is on generative foundational models that take a variety of user inputs and generate text, images, and videos. It’s a blend of research and applied science, constantly pushing the boundaries of what’s possible in generative AI. Over the years, he has researched and deployed a variety of these models in production across the online advertising spectrum ranging from ad sourcing, click-prediction, headline generation, image generation, and more.

Sravan Sripada is a Senior Applied Scientist at Amazon located in Seattle, WA. His primary focus lies in developing generative AI models that enable advertisers to create engaging ad creatives (images, video, etc.) with minimal effort. Previously, he worked on utilizing machine learning for preventing fraud and abuse on the Amazon store platform. When not at work, He is passionate about engaging in outdoor activities and dedicating time to meditation.

Cathy Willcock is a Principal Technical Business Development Manager located in Seattle, WA. Cathy leads the AWS technical account team  supporting Amazon Ads adoption of AWS cloud technologies. Her team works across Amazon Ads enabling discovery, testing, design, analysis, and deployments of AWS services at scale, with a particular focus on innovation to shape the landscape across the AdTech and MarTech industry. Cathy has led engineering,  product, and marketing  teams and is an inventor of ground-to-air calling (1-800-RINGSKY).

Read More

RAG architecture with Voyage AI embedding models on Amazon SageMaker JumpStart and Anthropic Claude 3 models

RAG architecture with Voyage AI embedding models on Amazon SageMaker JumpStart and Anthropic Claude 3 models

This post is a guest post co-written with Tengyu Ma and Wen Phan from Voyage AI.

Organizations today have access to vast amounts of data, much of it proprietary, which holds the potential to unlock valuable insights when used effectively in generative artificial intelligence (AI) applications. Retrieval Augmented Generation (RAG) is a powerful technique designed to tap into this reservoir of information. By dynamically pulling relevant data from these extensive databases during the response generation process, RAG enables AI models to produce more accurate, relevant, and contextually rich outputs.

Embedding models are crucial components in the RAG architecture, serving as the foundation for effectively identifying and retrieving the most relevant information from a large dataset. These models convert large volumes of text into compact, numerical representations, allowing the system to quickly sift through and match query-related data with unprecedented precision. By facilitating a more efficient and accurate retrieval process, embedding models make sure that the generative component of RAG is fed with the most pertinent information.

In this post, we provide an overview of the state-of-the-art embedding models by Voyage AI and show a RAG implementation with Voyage AI’s text embedding model on Amazon SageMaker Jumpstart, Anthropic’s Claude 3 model on Amazon Bedrock, and Amazon OpenSearch Service. Voyage AI’s embedding models are the preferred embedding models for Anthropic. In addition to general-purpose embedding models, Voyage AI offers domain-specific embedding models that are tuned to a particular domain.

RAG architecture and embedding models

RAG is the predominant design pattern for enterprise chatbots where a retrieval system fetches validated sources and documents that are pertinent to the query and inputs them to a large language model (LLM) to generate a response. It combines the generative capabilities of models with the informational breadth found in vast databases, enabling the model to pull relevant external documents to enhance its responses. This results in outputs that are not only contextually rich but also factually accurate, significantly boosting the reliability and utility of LLMs across diverse applications.

Let’s briefly review RAG using the following figure.

RAG systems are empowered by semantic search using dense-vector representations of the documents called embeddings. These vectors are stored in a vector store, where they can be efficiently retrieved later. At query time, a query is also converted into a vector and then used to find and retrieve similar documents stored in the vector store via a k-nearest neighbor (k-NN) search against the document vector representations. Finally, the retrieved documents along with the query are used to prompt the generative model, often resulting in higher-quality responses and fewer hallucinations.

Embedding models are neural network models that transform queries and documents into embeddings. The retrieval quality is solely decided by how the data is represented as vectors, and the effectiveness of embedding models is evaluated based on their accuracy in retrieving relevant information. Therefore, the retrieval quality of the embedding models is highly correlated with the quality of the RAG system responses—to make your RAG more successful, you should consider improving your embeddings. Check out this blog for a detailed explanation.

Voyage AI’s general-purpose and domain-specific embedding models

Voyage AI develops cutting-edge embedding models with state-of-the-art retrieval accuracy. voyage-large-2 is Voyage’s most powerful generalist embedding model, outperforming popular competing models. Voyage also offers voyage-2, a base generalist embedding model optimized for latency and quality. The following table summarizes the Voyage embedding models currently available on SageMaker JumpStart.

Voyage AI Model SageMaker JumpStart Model ID Description
voyage-2 voyage-2-embedding General-purpose embedding model optimized for a balance between cost, latency, and retrieval quality
voyage-large-2 voyage-large-2-embedding General-purpose embedding model optimized for retrieval quality
voyage-code-2 voyage-code-2-embedding Domain-specific embedding model optimized for code retrieval (17% better than alternatives)

In addition to general-purpose embedding models, Voyage AI offers domain-specific ones that are tuned to a particular domain. These domain-specific embedding models are trained on massive domain-specific datasets, allowing them to deeply understand and excel in that domain. For example, Voyage’s code embedding model (voyage-code-2) outperforms general-purpose embedding models on code-related data documents, achieving about a 15% improvement over the next best model. This performance gap over the next best general-purpose embedding improves even more for datasets requiring deeper code understanding. See voyage-code-2: Elevate Your Code Retrieval for voyage-code-2 details. More recently, Voyage released a legal embedding model (voyage-law-2) that is optimized for legal retrieval and tops the MTEB leaderboard for legal retrieval. See Domain-Specific Embeddings and Retrieval: Legal Edition (voyage-law-2) for voyage-law-2 details. Voyage AI plans to continue releasing additional domain-specific embedding models in the near future, including finance, healthcare, and multi-language. For a list of all available Voyage AI embedding models, see Embeddings.

Voyage AI offers API endpoints for embedding models, making it seamless to integrate with other components of your RAG stack. The Voyage AI embedding models are available on AWS Marketplace and deployable as Amazon SageMaker endpoints within your account and VPC, eliminating security and compliance concerns. As part of SageMaker JumpStart, you can deploy Voyage AI embedding models with a few clicks and start running your RAG stack on AWS.

Solution overview

In this RAG solution, we use Voyage AI embedding models deployed with SageMaker JumpStart to demonstrate an example using the Apple 2022 annual report (SEC Form 10-K) as the corpus to retrieve from. Specifically, we deploy the SageMaker model package of the voyage-large-2 model. For the LLM, we use the Anthropic Claude 3 Sonnet model on Amazon Bedrock. We use OpenSearch Service as the vector store. You can also follow along with the notebook. The following diagram illustrates the solution architecture.

SageMaker JumpStart is the machine learning (ML) hub of SageMaker that offers one-click access to over 350 open source and third-party models. These models can be discovered and deployed through the Amazon SageMaker Studio UI or using the SageMaker Python SDK. SageMaker JumpStart provides notebooks to customize and deploy foundation models into your VPC.

Anthropic’s Claude 3 models are the next generation of state-of-the-art models from Anthropic. For the vast majority of workloads, Sonnet is faster on inputs and outputs than Anthropic’s Claude 2 and 2.1 models, with higher levels of intelligence. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like Anthropic through an API, making it straightforward to build generative AI applications. To follow along, be sure to request model access to Anthropic Claude 3 Sonnet on Amazon Bedrock.

Amazon OpenSearch Service is a managed service that makes it straightforward to deploy, operate, and scale OpenSearch, a popular open source, distributed search analytics suite derived from Elasticsearch. OpenSearch provides the ability to do vector search via the k-NN search.

Prerequisites

To follow along, you need to create an OpenSearch Service domain. For the purposes of this walkthrough, the Easy create option is fine. Keep the Enable fine-grained access control option selected. Select Create master user and provide a user name and password. After the domain has been created, the domain details will have the domain endpoint, which you’ll need—along with the user name and password—to access your OpenSearch instance. You don’t need to worry about creating an index or inserting data. We use the OpenSearch Python client to work with our vector store in the walkthrough.

Deploy Embedding model endpoint

To use voyage-large-2, you need to subscribe to the SageMaker model package in AWS Marketplace. For instructions, see Subscribe to the model package. Choosing the model card in the SageMaker JumpStart UI will also bring you to the model listing page on AWS Marketplace.

After you’re subscribed, you can initialize and deploy the embedding model as a SageMaker endpoint as follows:

# Set embedding endpoint configuration
(embedding_model_id, embedding_model_version, embedding_instance_type) = (
    "voyage-large-2-embedding",
    "*",
    "ml.g5.xlarge",  # See AWS Marketplace model package for supported instance types
)

# Instantiate embedding model from JumpStart
from sagemaker.jumpstart.model import JumpStartModel

embedding_model = JumpStartModel(
    model_id=embedding_model_id,
    model_version=embedding_model_version,
    instance_type=embedding_instance_type,
)

# Deploy model as inference endpoint. This can take several minutes to deploy (5 to 10 minutes)
embedding_endpoint = embedding_model.deploy()

Vectorize Documents

With the embedding endpoint deployed, you can index your documents for retrieval.

Transform and chunk documents

You need a list of strings to invoke the deployed voyage-large-2 model. For many documents, like our example annual report, each string is a semantically meaningful chunk of text. There are several ways you can load and chunk documents for vectorization. The code in this section is just one example; feel free to use what suits your data source and files.

In this walkthrough, we load and chunk the source PDF file with the LangChain PyPDFLoader (which uses pypdf) and recursive character text splitter:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("apple-10k-2022.pdf")
document_chunks = loader.load_and_split(
    RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len,
        is_separator_regex=False,
    )
)

In practice, selecting the text splitting chunk size and overlap requires some experimentation. The are many techniques for appropriately chunking documents for high-quality retrieval, but that is beyond the scope of this post.

Generate document embeddings

You can now vectorize your documents—or more precisely, your document chunks. See the following code:

# Set batch size
BATCH_SIZE = 45
In [ ]:
# Vectorize chunks in batches
index_list = []
for i in range(0, len(chunk_list), BATCH_SIZE):
    docs_playload = {
        "input": chunk_list[i:i + BATCH_SIZE],
        "input_type": "document",
        "truncation": "true",
    }

    embed_docs_response = embedding_endpoint.predict(json.dumps(docs_playload))

    doc_embeddings_list = [d["embedding"] for d in embed_docs_response["data"]]
    index_list += [
        {"document": document, "embedding": embedding} 
        for document, embedding in zip(chunk_list[i:i + BATCH_SIZE], doc_embeddings_list)
    ]

Create a vector store index

The next step is to populate your OpenSearch vector search index with the document embeddings using the OpenSearch Python client:

# Populate index with document, embedding, and ID
for id, i in zip(range(0, len(index_list)), index_list):
    index_response = opensearch_client.index(
        index=INDEX_NAME_OPENSEARCH,
        body={
            "document": i["document"],
            "embedding": i["embedding"],
        },
        id=id,
        refresh=True,
    )

Retrieve relevant documents

With your indexed vector store, you can now use embeddings to find relevant documents to your query:

# Set number of documents to retrieve
TOP_K = 3
In [ ]:
# Set vector search payload
vector_search_payload = {
    "size": TOP_K,
    "query": {"knn": {"embedding": {"vector": query_embedding, "k": TOP_K}}},
}
In [ ]:
vector_search_response = opensearch_client.search(
    index=INDEX_NAME_OPENSEARCH,
    body=vector_search_payload,
)

The following is a formatted semantic search result of the top three most-relevant document chunks, indicating the index ID, similarity score, and the first several characters of the chunk:

ID: 4
Score: 0.7956404
Document: under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. ☒
Indicate by check mark whether the Registrant is a shell company (as defined in Rule 12b-2 of the Act).
Yes  ☐ 	No  ☒
The aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022, the last business day of the Registrant’s most recently completed second fiscal quarter, was approximately $2,830,067,000,000. Solely for purposes of this disclosure, shares of common stock held by executive officers and directors of the Registrant as of such date have been excluded because such persons may be deemed to be affiliates. This determination of executive officers and directors as affiliates is not necessarily a conclusive determination for any other purposes.  15,908,118,000 shares of common stock were issued and outstanding as of October 14, 2022.
 
ID: 5
Score: 0.7367379
Document: 15,908,118,000 shares of common stock were issued and outstanding as of October 14, 2022.
DOCUMENTS INCORPORATED BY  REFERENCE
Portions of the Registrant’s definitive proxy statement relating to its 2023 annual meeting of shareholders are incorporated by reference into Part III of this Annual Report on Form 10-K where indicated. The Registrant’s definitive proxy statement will be filed with the U.S. Securities and Exchange Commission within 120 days after the end of the fiscal year to which this report relates.
 
ID: 178
Score: 0.7263324
Document: Note 3 – Financial Instruments
Cash, Cash Equivalents and Marketable Securities
The following tables show the Company’ s cash, cash equivalents and marketable securities by significant investment category as of September 24, 2022 and September 25, 2021 (in millions):
2022
Adjusted Cost
Unrealized Gains
Unrealized Losses
Fair Value
Cash and Cash Equivalents
Current Marketable Securities
Non-Current Marketable Securities
Cash $ 18,546 $ — $ — $ 18,546 $ 18,546 $ — $ —
Level 1 :
Money market funds 2,929 — — 2,929 2,929 — —
Mutual funds 274 — (47) 227 — 227 —
Subtotal 3,203 — (47) 3,156 2,929 227 —
Level 2 :
U.S. Treasury securities 25,134 — (1,725) 23,409 338 5,091 17,980
U.S. agency securities 5,823 — (655) 5,168 — 240 4,928
Non-U.S. government securities 16,948 2 (1,201) 15,749 — 8,806 6,943  	Certificates of deposit and time deposits 2,067 — — 2,067 1,805 262 —
Commercial paper 718 — — 718 28 690 —
Corporate debt securities 87,148 9 (7,707) 79,450 — 9,023 70,427

The top retrieved document chunk (ID 4 with a score of 0.7956404) contains a statement that provides a direct answer to our query:

The aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022, the last business day of the Registrant’s most recently completed second fiscal quarter, was approximately $2,830,067,000,000.

This additional context will enable Claude to provide a response that answers your query.

Generate a retrieval augmented response

You can now prompt Claude to use the retrieved documents to answer your query:

# Create retrieval-augmented prompt
rag_prompt = f"""Human:

INSTRUCTIONS:
Answer the QUERY using the CONTEXT text provided below. Keep your answer
grounded in the facts of the CONTEXT. If the CONTEXT doesn’t contain the
facts to answer the QUERY just respond with "I do not have enough context
to respond to this query.".

QUERY: {query}

CONTEXT: {context}

Assistant:
"""

Next initialize the Amazon Bedrock client to invoke Anthropic’s Claude3 Sonnet model in us-east-1.

# List available LLMs on Amazon Bedrock
bedrock_client = boto3.client('bedrock', region_name='us-east-1')
bedrock_fm = bedrock_client.list_foundation_models()
print([(m["modelId"], m["modelName"]) for m in bedrock_fm["modelSummaries"]])
In [ ]:
# Set Bedrock LLM model ID
LLM_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
In [ ]:
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name='us-east-1')
In [ ]:
# Invoke LLM inference
try:
    llm_rag_response = bedrock_runtime_client.invoke_model(
        modelId=LLM_MODEL_ID,
        body=json.dumps(rag_payload),
        accept="application/json",
        contentType="application/json",
    )
except botocore.exceptions.ClientError as error:
    print("Client error")
    raise error

Claude produces a correct response that answers the query using the context provided:

According to the context provided, "The aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022, the last business day of the Registrant's most recently completed second fiscal quarter, was approximately $2,830,067,000,000."

Generate a response without retrieved documents

For comparison, let’s see how Claude will respond to a query directly—without any additional context. Specifically, we ask “What is the aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022?”

The following is Claude’s response:

Unfortunately, without access to the specific financial filings or reports for the company you are asking about, I do not have enough information to determine the aggregate market value of the voting and non-voting stock held by non-affiliates as of March 25, 2022.

Public companies are required to disclose this information in their annual reports and other SEC filings, but I do not have access to those documents for the particular company or "Registrant" you are referring to. The aggregate market value calculation takes into account things like the number of outstanding shares, recent stock prices, and which shareholders are considered affiliates versus non-affiliates.

Unless you are able to provide the company's name and the specific report containing this data point, I cannot lookup or calculate that aggregate market value figure. My role is to provide helpful information to you, but I do not have a way to access each public company's proprietary financial data directly. Please let me know if you can share more details about the company in question.

Claude indicates it does not have enough context, especially because the query didn’t explicitly indicate the registrant (the company).

Clean up

If you’ve been following along, don’t forget to delete the SageMaker endpoint and delete the OpenSearch Service domain when you’re done so you don’t incur unnecessary costs:

embedding_endpoint.delete_model()
embedding_endpoint.delete_endpoint()

Conclusion

Embeddings are at the heart of a RAG system, and Voyage AI offers the best general-purpose and domain-specific embedding models today. Get started or level up your existing RAG stack on AWS today with Voyage AI embedding models—it’s seamless with SageMaker JumpStart. You can try the notebook in this post yourself. Learn more about Voyage AI and follow them on X (Twitter) or LinkedIn for updates!


About the Authors

Tengyu Ma is CEO and Co-Founder of Voyage AI and an assistant professor of computer science at Stanford University. His research interests broadly include topics in machine learning, algorithms and their theory, such as deep learning, (deep) reinforcement learning, pre-training / foundation models, robustness, non-convex optimization, distributed optimization, and high-dimensional statistics. Tengyu earned his PhD from Princeton University and has worked at Facebook and Google as visiting scientists.

Wen Phan is Head of Product at Voyage AI and has spent the last decade developing and commercializing AI and data products for enterprises. He has worked with hundreds of users and organizations around the world to apply AI and data to their use cases in financial services, healthcare, defense, and technology, to name a few. Wen holds a B.S. in electrical engineering and M.S. in analytics and decision sciences. Personally, he enjoys spinning hip-hop records, dining out, and spending time with his wife and two kids — oh, and guzzling cookies and cream milkshakes, too!

Vivek Gangasani is an AI/ML Solutions Architect working with Generative AI startups on AWS. He helps world leading AI startups train, host and operationalize LLMs to build innovative Generative AI solutions. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance at scale for LLMs. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Read More

Incorporate offline and online human – machine workflows into your generative AI applications on AWS

Incorporate offline and online human – machine workflows into your generative AI applications on AWS

Recent advances in artificial intelligence have led to the emergence of generative AI that can produce human-like novel content such as images, text, and audio. These models are pre-trained on massive datasets and, to sometimes fine-tuned with smaller sets of more task specific data. An important aspect of developing effective generative AI application is Reinforcement Learning from Human Feedback (RLHF). RLHF is a technique that combines rewards and comparisons, with human feedback to pre-train or fine-tune a machine learning (ML) model. Using evaluations and critiques of its outputs, a generative model can continue to refine and improve its performance. The interplay between Generative AI and human input paves the way for more accurate and responsible applications. You can learn how to improve your LLMs with RLHF on Amazon SageMaker, see Improving your LLMs with RLHF on Amazon SageMaker.

Athough RLHF is the predominant technique for incorporating human involvement, it is not the only available human in the loop technique. RLHF is an offline, asynchronous technique, where humans provide feedback on the generated outputs, based on input prompts. Humans can also add value by intervening into an existing communication happening between generative AI and users. For instance, as decided by AI or desired by the user, a human can be called into an existing conversation and take over the discussion.

In this post, we introduce a solution for integrating a “near-real-time human workflow” where humans are prompted by the generative AI system to take action when a situation or issue arises. This can also be a ruled-based method that can determine where, when and how your expert teams can be part of generative AI – user conversations. The entire conversation in this use case, starting with generative AI and then bringing in human agents who take over, is logged so that the interaction can be used as part of the knowledge base. Together with RLHF, near-real-time human-in-the-loop methods enable the development of responsible and effective generative AI applications.

This blog post uses RLHF as an offline human-in-the-loop approach and the near-real-time human intervention as an online approach. We present the solution and provide an example by simulating a case where the tier one AWS experts are notified to help customers using a chat-bot. We use an Amazon Titan model on Amazon Bedrock to find the sentiment of the customer using a Q&A bot and then notifying about negative sentiment to a human to take the appropriate actions. We also have another expert group providing feedback using Amazon SageMaker GroundTruth on completion quality for the RLHF based training. We used this feedback to finetune the model deployed on Amazon Bedrock to power the chat-bot. We provide LangChain and AWS SDK code-snippets, architecture and discussions to guide you on this important topic.

SageMaker GroudTruth

SageMaker Ground Truth offers the most comprehensive set of human-in-the-loop capabilities, allowing you to harness the power of human feedback across the ML lifecycle to improve the accuracy and relevancy of models. You can complete a variety of human-in-the-loop tasks with SageMaker Ground Truth, from data generation and annotation to model review, customization, and evaluation, through either a self-service or an AWS-managed offering.

Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon with a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that run tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

Example use-case

In this use case, we work with a generative AI powered Q&A bot, which answers questions about SageMaker. We built the RAG solution as detailed in the following GitHub repo and used SageMaker documentation as the knowledge base. You can build such chatbots following the same process. The interface of the Q&A looks like the following screenshot. Amazon SageMaker Sample and used Amazon SageMaker documentation as the knowledge base. You can easily build such chatbots following the same process. Eventually, the interface of the Q&A looks like in Figure 1.

UI and the Chatbot example application to test human-workflow scenario.

Figure 1. UI and the Chatbot example application to test human-workflow scenario.

In this scenario, we incorporate two human workflows to increase customer satisfaction. The first is to send the interactions to human experts to assess and provide scores. This is an offline process that is part of the RLHF. A second real-time human workflow is initiated as decided by the LLM. We use a simple notification workflow in this post, but you can use any real-time human workflow to take over the AI-human conversation.

Solution overview

The solution consists of three main modules:

  • Near real-time human engagement workflow
  • Offline human feedback workflow for RLHF
  • Fine-tuning and deployment for RLHF

The RLHF and real-time human engagement workflows are independent. Therefore, you can use either or both based on your needs. In both scenarios, fine-tuning is a common final step to incorporate these learnings into LLMs. In the following sections, we provide the details about incorporating these steps one by one and divide the solution into related sections for you to choose and deploy.

The following diagram illustrates the solution architecture and workflow.

Solutions architecture for human-machine workflow modules

Figure 2. Solutions architecture for human-machine workflow modules

Implementation

Prerequisites

Our solution is an add-on to an existing Generative AI application. In our example, we used a Q&A chatbot for SageMaker as explained in the previous section. However, you can also bring your own application. The blog post assumes that you have expert teams or workforce who performs reviews or join workflows.

Build a near real-time human engagement workflow workflow

This section presents how an LLM can invoke a human workflow to perform a predefined activity. We use AWS Step Functions which is a serverless workflow orchestration service that you can use for human-machine workflows. In our case, we call the human experts into action, in real time, but you can build any workflow following the tutorial Deploying an Example Human Approval Project.

Decision workflow to trigger real time human engagement

In this scenario, the customer interacts with the Q&A bot (Step-1 in the previous architecture diagram), and if the interaction shows strong negative sentiment, it will invoke a pre-existing human workflow (Step-2 in Figure 2). In our case, it is a simple email notification (Step-3 in Figure 2) but you can extend this interaction such as including the experts into the chat-zone to take over the conversation and more (Step-4 in Figure 2).

Before we dive deep into the solution, it is important to discuss the workflow logic. The following figure shows the details of the decision workflow. The interaction starts with a customer communication. Here, before the LLM provides an answer to the customer request, the prompt-chain starts with an internal prompt asking the LLM to go over the customer response and look for clear negative sentiment. This prompt and internal sentiment analysis are not visible to customer. This is an internal chain before proceeding with the next steps of which responses may be reflected to the customer based on your preference. If the sentiment is negative, the next step is to trigger a pre-built engagement human-workflow while the chatbot informs the customer about the extra support coming to help. Otherwise, if the sentiment is neutral or positive, the normal response to the customer request will be provided.

This workflow is a demonstrative example and you can add to or modify it as you prefer. For example, you can make any other decision check, not limited to sentiment. You can also prepare your own response to the customer with the right prompting the chain so that you can implement your designed customer experience. Here, our simple example demonstrates how you can easily build such prompt in chains and engage external existing workflows, in our case, it is a human-workflow using Amazon Bedrock. We also use the same LLM to respond to this internal sentiment prompt check for simplicity. However, you can include different LLMs, which might have been fine-tuned for specific tasks, such as sentiment analysis, so that you rely on a different LLM for the Q&A chatbot experience. Adding more serial steps into chains increases the latency because now the customer query or request is being processed more than once.

Real-time (online) human workflow triggered by LLM.

Figure 3. Real-time (online) human workflow triggered by LLM.

Implementing the decision workflow with Amazon Bedrock

To implement the decision workflow, we used Amazon Bedrock and its LangChain integrations. The prompt chain is run through SequentialChain from LangChain. Because our human workflow is orchestrated with Step Functions, we also use LangChain’s StepFunction library.

  1. First, define the LLM and prompt template:
    prompt = PromptTemplate(
    input_variables=["text"],
    template="{text}",)
    llm = Bedrock(model_id="amazon.titan-tg1-large")
    llmchain_toxic = LLMChain(llm=llm, prompt=prompt,output_key="response")

  2. Then you feed the response from the first LLM to the next LLM through an LLM chain, where the second instruct is to find the sentiment of the response. We also instruct the LLM to provide 0 as positive and 1 as negative response.
    templateResponseSentiment="""Find the sentiment of below sentence, respond 0 if positive and respond 1 if negative
    {response} """
    
    prompt_sentiment= PromptTemplate( input_variables=["response"], template = templateResponseSentiment)
    llmchain_sentiment= LLMChain(llm=llm, prompt=prompt_sentiment,output_key="sentiment")
    
    from langchain.chains import SequentialChain
    overall_chain = SequentialChain(chains=[llmchain_toxic, llmchain_sentiment], input_variables=["text"],output_variables=["response", "sentiment"],verbose=True)

  3. Run a sequential chain to find the sentiment:
    response= overall_chain({ "text": "Can you code for me for SageMaker" })
    print("response payload " + str(response))
    print("n response sentiment: " + response['sentiment'])

  4. If the sentiment is negative, the model doesn’t provide the response back to customer, instead it invokes a workflow that will notify a human in loop:
    if "1" in response_sentiment['sentiment'] : # 1 represents negative sentiment
    print('triggered workflow, check email of the human on notification and add to workflow anything else you may want')
    lambda_client = boto3.client('lambda')
    #create input - send the response from LLM and detected sentiment
    lambda_payload1="{"response": "" + response['text'] +"","response_sentiment": " + ""1"}"
    lambda_client.invoke(FunctionName='triggerWorkflow', InvocationType='Event', Payload=lambda_payload1)

If you choose to have your human experts join a chat with the users, you can add these interactions of your expert teams to your knowledge base. This way, when the same or similar issue is raised, the chatbot can use these in their answers. In this post, we did not show this method, but you can create a knowledge base in Amazon Bedrock to use these human-to-human interactions for future conversations in your chatbot.

Build an offline human feedback workflow

In this scenario, we assume that the chat transcripts are stored in an Amazon Simple Storage Service (Amazon S3) bucket in JSON format, a typical chat transcript format, for the human experts to provide annotations and labels on each LLM response. The transcripts are sent for a labeling task performed by a labeling workforce using Amazon SageMaker Ground Truth. However, in some cases, it’s impossible to label all the transcripts due to resource limitations. In these cases, you may want to randomly sample the transcripts or use a pattern that can be sent to the labeling workforce based on your business case.

Pre-annotation Lambda function
The process starts with an AWS Lambda function. The pre-annotation Lambda function is invoked based on chron job or based on an event or on-demand. Here, we use the on-demand option. SageMaker Ground Truth sends the Lambda function a JSON-formatted request to provide details about the labeling job and the data object. More information can be found here. Following is the code snippet for the pre-processing Lambda function:

import json
def lambda_handler(event, context):
return {
"taskInput": event['dataObject']
}

# JSON formatted request

{
"version": "2018-10-16",
"labelingJobArn": <labelingJobArn>
"dataObject" : {
"source-ref": <s3Uri where dataset containing the chabot responses are stored>
}
}

Custom workflow for SageMaker Ground Truth
The remaining part of sending the examples, UI, and storing the results of the feedback are performed by SageMaker Ground Truth and invoked by the pre-annotation Lambda function. We use the labeling job with the custom template option in SageMaker Ground Truth. The workflow allows labelers to rate the relevance of an answer to a question from 1–5, with 5 being the most relevant. Here, we assumed a conventional RLHF workflow where the labeling workforce provides the score based on their expectation from the LLM in this situation. The following code shows an example:

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="relevance"
categories="['1', '2', '3', '4', '5']"
header="How relevant is the below answer to the question: {{ task.input.source }}"
>
<classification-target>
{{ task.input.source }}
</classification-target>
<full-instructions header="Conversation Relevance Instructions">
<h2>How relevant is the below answer to the given question?</h2>
</full-instructions>
<short-instructions>
How relevant is the below answer to the question: {{ task.input.source }}
</short-instructions>
</crowd-classifier>
</crowd-form>

In our scenario, we used the following UI for our labeling workers to score the complete response given for the prompt. This provides feedback on the answer to a question given by the chatbot, marking it as 1–5, with 5 being most the relevant answer to the question.

Two examples from RLHF feedback UI.Two examples from RLHF feedback UI.

Figure 4. Two examples from RLHF feedback UI.

Post annotation Lambda function
When all workers complete the labeling task, SageMaker Ground Truth invokes the post-annotation Lambda function with a pointer to the dataset object and the workers’ annotations. This post-processing Lambda function is generally used for annotation consolidation, which has SageMaker Ground Truth create a  manifest file and uploads it to an S3 bucket for persistently storing consolidated annotations. The following code shows the postprocessing Lambda function:

import json
import boto3
from urllib.parse import urlparse

def lambda_handler(event, context):
consolidated_labels = []

parsed_url = urlparse(event['payload']['s3Uri']);
s3 = boto3.client('s3')
textFile = s3.get_object(Bucket = parsed_url.netloc, Key = parsed_url.path[1:])
filecont = textFile['Body'].read()
annotations = json.loads(filecont);

for dataset in annotations:
for annotation in dataset['annotations']:
new_annotation = json.loads(annotation['annotationData']['content'])
label = {
'datasetObjectId': dataset['datasetObjectId'],
'consolidatedAnnotation' : {
'content': {
event['labelAttributeName']: {
'workerId': annotation['workerId'],
'result': new_annotation,
'labeledContent': dataset['dataObject']
}
}
}
}
consolidated_labels.append(label)

return consolidated_labels

You can use the output manifest file to further fine-tune your LLM model, as detailed in the next section. The following code is a snippet of the created manifest file:

JSON:

{"source":"what is amazon SageMaker?,AWS SageMaker is a machine learning service that allows you to train and deploy machine learning models in the cloud.","RHLF-custom-feedback":{"workerId":"private.us-east-1.8c185c045aed3bef","result":{"relevance":{"label":"5 - Highly Relevant"}},"labeledContent":{"content":"what is amazon SageMaker?,AWS SageMaker is a machine learning service that allows you to train and deploy machine learning models in the cloud."}},"RHLF-custom-feedback-metadata":{"type":"groundtruth/custom","job-name":"rhlf-custom-feedback","human-annotated":"yes","creation-date":"2023-08-09T02:46:05.852000"}}

Fine-tune the LLM using RLHF

To demonstrate RLHF in both near real-time and offline workflows, we collected 50 human-annotated samples using SageMaker Ground Truth. The data is used for RLHF training on a Flan-T5 XL model by PEFT/LoRA with 8-bit quantization:

from peft import LoraConfig

lora_config = LoraConfig(
r=32,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q", "v"],
bias="none",
task_type="SEQ_2_SEQ_LM",
)

The training uses the learning rate 1e-5 for 10 epochs, and the batch size = 1 to use one sample at a time.

learning_rate=1e-5
max_ppo_epochs=10
mini_batch_size=1
batch_size=1

config = PPOConfig(
model_name=model,
learning_rate=learning_rate,
ppo_epochs=max_ppo_epochs,
mini_batch_size=mini_batch_size,
batch_size=batch_size,
)

ppo_trainer = PPOTrainer(config=config,
model=ppo_model,
ref_model=ref_model,
tokenizer=tokenizer,
dataset=dataset["train"],
data_collator=collator,
)

Because there are only 50 human-annotated samples collected from SageMaker Ground Truth, it is not sufficient to train a reward model for reinforcement learning. Therefore, we decided to take the annotated evaluation score for each sample and use them as the reward value in the reinforcement learning process. This should be close enough to the reward value generated from a reward model. Our experiment showed that this method is effective for a small training set. You can see the curve of the training process in the following chart.

reward/mean chart

Figure 5. Reward/mean chart

After the training, we replaced the Flan-T5 foundation model in the AWS support chatbot with the RLHF trained model. In the following examples, you can observe that the response quality after RLHF is improved and the answers are more comprehensive and contain more useful information:

  • Question: How does SageMaker protect my data?
    Response before RLHF: SageMaker stores code in ML storage volumes
    Response after RLHF: SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.
  • Question: What is Amazon SageMaker?
    Response before RLHF: AWS SageMaker is a machine learning service that allows you to train and deploy machine learning models in the cloud.
    Response after RLHF: A fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.

Clean up

To clean up your resources, first start by stopping and deactivating any active human workflow or fine-tuning jobs. Removing the prompt chaining is a good start for de-coupling the workflows from your existing application. Then, continue by deleting the resources for the real-time human workflow manually. Finally, delete the RLHF resources. If you created a new Q&A chatbot application, then first stop and then delete the resources used for the Q&A chatbot part of the blogpost.

Conclusion

This post presented solutions for incorporating both offline and online human workflows into generative AI applications on AWS. The offline human feedback workflow uses SageMaker Ground Truth to collect human evaluations on chatbot responses. These evaluations are used to provide reward signals for fine-tuning the chatbot’s underlying language model with RLHF. The online human workflow uses LangChain and Step Functions to invoke real-time human intervention based on sentiment analysis of the chatbot responses. This allows human experts to seamlessly take over or step into conversations when the AI reaches its limits. This capability is important for implementations that require using your existing expert teams in critical, sensitive, or determined topics and themes. Together, these human-in-the-loop techniques, offline RLHF workflows, and online real-time workflows enable you to develop responsible and robust generative AI applications.

The provided solutions integrate multiple AWS services, like Amazon Bedrock, SageMaker, SageMaker Ground Truth, Lambda, Amazon S3, and Step Functions. By following the architectures, code snippets, and examples discussed in this post, you can start incorporating human oversight into your own generative AI applications on AWS. This paves the way towards higher-quality completions and building trustworthy AI solutions that complement and collaborate with human intelligence.

Building generative AI applications is effortless with Amazon Bedrock. We recommend starting your experiments following this Quick Start with Bedrock.


About the Authors

Tulip Gupta is a Senior Solutions Architect at Amazon Web Services. She works with Amazon media and entertainment (M&E) customers to design, build, and deploy technology solutions on AWS, and has a particular interest in Gen AI and machine learning focussed on M&E. She assists customers in adopting best practices while deploying solutions in AWS. Linkedin

BurakBurak Gozluku is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate in MIT. Burak is passionate about yoga and meditation.

YunfeiYunfei bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

RachnaRachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in future and bring economical and social prosperity. In her spare time, Rachna likes spending time with her family, hiking and listening to music.

Read More

Build generative AI applications with Amazon Titan Text Premier, Amazon Bedrock, and AWS CDK

Build generative AI applications with Amazon Titan Text Premier, Amazon Bedrock, and AWS CDK

Amazon Titan Text Premier, the latest addition to the Amazon Titan family of large language models (LLMs), is now generally available in Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Amazon Titan Text Premier is an advanced, high-performance, and cost-effective LLM engineered to deliver superior performance for enterprise-grade text generation applications, including optimized performance for Retrieval Augmented Generation (RAG) and agents. The model is built from the ground up following safe, secure, and trustworthy responsible AI practices, and excels in delivering exceptional generative AI text capabilities at scale.

Exclusive to Amazon Bedrock, Amazon Titan Text models support a wide range of text-related tasks, including summarization, text generation, classification, question-answering, and information extraction. With Amazon Titan Text Premier, you can unlock new levels of efficiency and productivity for your text generation needs.

In this post, we explore building and deploying two sample applications powered by Amazon Titan Text Premier. To accelerate development and deployment, we use the open source AWS Generative AI CDK Constructs (launched by Werner Vogels at AWS re:Invent 2023). AWS Cloud Development Kit (AWS CDK) constructs accelerate application development by providing developers with reusable infrastructure patterns you can seamlessly incorporate into your applications, freeing you to focus on what differentiates your application.

Document Explorer sample application

The Document Explorer sample generative AI application can help you quickly understand how to build end-to-end generative AI applications on AWS. It includes examples of key components needed in generative AI applications, such as:

  • Data ingestion pipeline – Ingests documents, converts them to text, and stores them in a knowledge base for retrieval. This enables use cases like RAG to tailor generative AI applications to your data.
  • Document summarization – Summarizes PDF documents using Amazon Titan Premier through Amazon Bedrock.
  • Question answering – Answers natural language questions by retrieving relevant documents from the knowledge base and using LLMs like Amazon Titan Premier through Amazon Bedrock.

Follow the steps in the README to clone and deploy the application in your account. The application deploys all the required infrastructure, as shown in the following architecture diagram.

After you deploy the application, upload a sample PDF file to the input Amazon Simple Storage Service (Amazon S3) bucket by choosing Select Document in the navigation pane. For example, you can download Amazon’s Annual Letters to Shareholders from 1997–2023 and upload using the web interface. On the Amazon S3 console, you can see that the files you uploaded are now found in the S3 bucket whose name begins with persistencestack-inputassets.

After you have uploaded a file, open a document to see it rendered in the browser.

Choose Q&A in the navigation pane, and choose your preferred model (for this example, Amazon Titan Premier). You can now ask a question against the document you uploaded.

The following diagram illustrates a sample workflow in Document Explorer.

Don’t forget to delete the AWS CloudFormation stacks to avoid unexpected charges. First make sure to remove all data from the S3 buckets, specifically anything in the buckets whose names begin with persistencestack. Then run the following command from a terminal:

cdk destroy -all

Amazon Bedrock Agent and Custom Knowledge Base sample application

The Amazon Bedrock Agent and Custom Knowledge Base sample generative AI application is a chat assistant designed to answer questions about literature using RAG from a selection of books from Project Gutenberg.

This app deploys an Amazon Bedrock agent that can consult an Amazon Bedrock knowledge base backed by Amazon OpenSearch Serverless as a vector store. An S3 bucket is created to store the books for the knowledge base.

Follow the steps in the README to clone the sample application in your account. The following diagram illustrates the deployed solution architecture.

Update the file defining which foundation model to use when creating the agent:

const agent = new bedrock.Agent(this, 'Agent', {
      foundationModel: bedrock.BedrockFoundationModel.AMAZON_TITAN_PREMIER_V1_0
,
      instruction: 'You are a helpful and friendly agent that answers questions about literature.',
      knowledgeBases: [kb],
    });

Follow the steps in the README to deploy the code sample in your account and ingest the example documents.

Navigate to the Agents page on the Amazon Bedrock console in your AWS Region and find your newly created agent. The AgentId can be found in the CloudFormation stack outputs section.

Now you can ask some questions. You may need to tell the agent what book you want to ask about or refresh the session when asking about different books. The following are some examples of questions you may ask:

  • What are the most popular books in the library?
  • Who is Mr. Bingley quite taken with at the ball in Meryton?

The following screenshot shows an example of the workflow.

Don’t forget to delete the CloudFormation stack to avoid unexpected charges. Remove all the data from the S3 buckets, then run the following command from a terminal:

cdk destroy

Conclusion

Amazon Titan Text Premier is available today in the US East (N. Virginia) Region. Custom fine-tuning for Amazon Titan Text Premier is also available today in preview in the US East (N. Virginia) Region. Check the full Region list for future updates.

To learn more about the Amazon Titan family of models, visit the Amazon Titan product page. For pricing details, review Amazon Bedrock Pricing. Visit the AWS Generative AI CDK Constructs GitHub repository for more details on available constructs and additional documentation. For practical examples to get started, check out the AWS samples repository.


About the authors

Alain Krok is a Senior Solutions Architect with a passion for emerging technologies. His past experience includes designing and implementing IIoT solutions for the oil and gas industry and working on robotics projects. He enjoys pushing the limits and indulging in extreme sports when he is not designing software.

Laith Al-Saadoon is a Principal Prototyping Architect on the Prototyping and Cloud Engineering (PACE) team. He builds prototypes and solutions using generative AI, machine learning, data analytics, IoT & edge computing, and full-stack development to solve real-world customer challenges. In his personal time, Laith enjoys the outdoors–fishing, photography, drone flights, and hiking.

Justin Lewis leads the Emerging Technology Accelerator at AWS. Justin and his team help customers build with emerging technologies like generative AI by providing open source software examples to inspire their own innovation. He lives in the San Francisco Bay Area with his wife and son.

Anupam Dewan is a Senior Solutions Architect with a passion for Generative AI and its applications in real life. He and his team enable Amazon Builders who build customer facing application using generative AI. He lives in Seattle area, and outside of work loves to go on hiking and enjoy nature.

Read More

Evaluation of generative AI techniques for clinical report summarization

Evaluation of generative AI techniques for clinical report summarization

In part 1 of this blog series, we discussed how a large language model (LLM) available on Amazon SageMaker JumpStart can be fine-tuned for the task of radiology report impression generation. Since then, Amazon Web Services (AWS) has introduced new services such as Amazon Bedrock. This is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API.

Amazon Bedrock also comes with a broad set of capabilities required to build generative AI applications with security, privacy, and responsible AI. It’s serverless, so you don’t have to manage any infrastructure. You can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with. In this part of the blog series, we review techniques of prompt engineering and Retrieval Augmented Generation (RAG) that can be employed to accomplish the task of clinical report summarization by using Amazon Bedrock.

When summarizing healthcare texts, pre-trained LLMs do not always achieve optimal performance. LLMs can handle complex tasks like math problems and commonsense reasoning, but they are not inherently capable of performing domain-specific complex tasks. They require guidance and optimization to extend their capabilities and broaden the range of domain-specific tasks they can perform effectively. It can be achieved through the use of proper guided prompts. Prompt engineering helps to effectively design and improve prompts to get better results on different tasks with LLMs. There are many prompt engineering techniques.

In this post, we provide a comparison of results obtained by two such techniques: zero-shot and few-shot prompting. We also explore the utility of the RAG prompt engineering technique as it applies to the task of summarization. Evaluating LLMs is an undervalued part of the machine learning (ML) pipeline. It is time-consuming but, at the same time, critical. We benchmark the results with a metric used for evaluating summarization tasks in the field of natural language processing (NLP) called Recall-Oriented Understudy for Gisting Evaluation (ROUGE). These metrics will assess how well a machine-generated summary compares to one or more reference summaries.

Solution overview

In this post, we start with exploring a few of the prompt engineering techniques that will help assess the capabilities and limitations of LLMs for healthcare-specific summarization tasks. For more complex, clinical knowledge-intensive tasks, it’s possible to build a language model–based system that accesses external knowledge sources to complete the tasks. This enables more factual consistency, improves the reliability of the generated responses, and helps to mitigate the propensity that LLMs have to be confidently wrong, called hallucination.

Pre-trained language models

In this post, we experimented with Anthropic’s Claude 3 Sonnet model, which is available on Amazon Bedrock. This model is used for the clinical summarization tasks where we evaluate the few-shot and zero-shot prompting techniques. This post then seeks to assess whether prompt engineering is more performant for clinical NLP tasks compared to the RAG pattern and fine-tuning.

Dataset

The MIMIC Chest X-ray (MIMIC-CXR) Database v2.0.0 is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. We used the MIMIC CXR dataset, which can be accessed through a data use agreement. This requires user registration and the completion of a credentialing process.

During routine clinical care clinicians trained in interpreting imaging studies (radiologists) will summarize their findings for a particular study in a free-text note. Radiology reports for the images were identified and extracted from the hospital’s electronic health records (EHR) system. The reports were de-identified using a rule-based approach to remove any protected health information.

Because we used only the radiology report text data, we downloaded just one compressed report file (mimic-cxr-reports.zip) from the MIMIC-CXR website. For evaluation, the 2,000 reports (referred to as the ‘dev1’ dataset) from a subset of this dataset and the 2,000 radiology reports (referred to as ‘dev2’) from the chest X-ray collection from the Indiana University hospital network were used.

Techniques and experimentation

Prompt design is the technique of creating the most effective prompt for an LLM with a clear objective. Crafting a successful prompt requires a deeper understanding of the context, it’s the subtle art of asking the right questions to elicit the desired answers. Different LLMs may interpret the same prompt differently, and some may have specific keywords with particular meanings. Also, depending on the task, domain-specific knowledge is crucial in prompt creation. Finding the perfect prompt often involves a trial-and-error process.

Prompt structure

Prompts can specify the desired output format, provide prior knowledge, or guide the LLM through a complex task. A prompt has three main types of content: input, context, and examples. The first of these specifies the information for which the model needs to generate a response. Inputs can take various forms, such as questions, tasks, or entities. The latter two are optional parts of a prompt. Context is providing relevant background to ensure the model understands the task or query, such as the schema of a database in the example of natural language querying. Examples can be something like adding an excerpt of a JSON file in the prompt to coerce the LLM to output the response in that specific format. Combined, these components of a prompt customize the response format and behavior of the model.

Prompt templates are predefined recipes for generating prompts for language models. Different templates can be used to express the same concept. Hence, it is essential to carefully design the templates to maximize the capability of a language model. A prompt task is defined by prompt engineering. Once the prompt template is defined, the model generates multiple tokens that can fill a prompt template. For instance, “Generate radiology report impressions based on the following findings and output it within <impression> tags.” In this case, a model can fill the <impression> with tokens.

Zero-shot prompting

Zero-shot prompting means providing a prompt to a LLM without any (zero) examples. With a single prompt and no examples, the model should still generate the desired result. This technique makes LLMs useful for many tasks. We have applied zero-shot technique to generate impressions from the findings section of a radiology report.

In clinical use cases, numerous medical concepts need to be extracted from clinical notes. Meanwhile, very few annotated datasets are available. It’s important to experiment with different prompt templates to get better results. An example zero-shot prompt used in this work is shown in Figure 1.

Zero-shot prompting

Figure 1 – Zero-shot prompting

Few-shot prompting

The few-shot prompting technique is used to increase performance compared to the zero-shot technique. Large, pre-trained models have demonstrated remarkable capabilities in solving an abundance of tasks by being provided only a few examples as context. This is known as in-context learning, through which a model learns a task from a few provided examples, specifically during prompting and without tuning the model parameters. In the healthcare domain, this bears great potential to vastly expand the capabilities of existing AI models.

Few shot prompting

Figure 2 – Few-shot prompting

Few-shot prompting uses a small set of input-output examples to train the model for specific tasks. The benefit of this technique is that it doesn’t require large amounts of labeled data (examples) and performs reasonably well by providing guidance to large language models.
In this work, five examples of findings and impressions were provided to the model for few-shot learning as shown in Figure 2.

Retrieval Augmented Generation pattern

The RAG pattern builds on prompt engineering. Instead of a user providing relevant data, an application intercepts the user’s input. The application searches across a data repository to retrieve content relevant to the question or input. The application feeds this relevant data to the LLM to generate the content. A modern healthcare data strategy enables the curation and indexing of enterprise data. The data can then be searched and used as context for prompts or questions, assisting an LLM in generating responses.

To implement our RAG system, we utilized a dataset of 95,000 radiology report findings-impressions pairs as the knowledge source. This dataset was uploaded to Amazon Simple Service (Amazon S3) data source and then ingested using Knowledge Bases for Amazon Bedrock. We used the Amazon Titan Text Embeddings model on Amazon Bedrock to generate vector embeddings.

Embeddings are numerical representations of real-world objects that ML systems use to understand complex knowledge domains like humans do. The output vector representations were stored in a newly created vector store for efficient retrieval from the Amazon OpenSearch Serverless vector search collection. This leads to a public vector search collection and vector index setup with the required fields and necessary configurations. With the infrastructure in place, we set up a prompt template and use RetrieveandGenerate API for vector similarity search. Then, we use the Anthropic Claude 3 Sonnet model for impressions generation. Together, these components enabled both precise document retrieval and high-quality conditional text generation from the findings-to-impressions dataset.

The following reference architecture diagram in Figure 3 illustrates the fully managed RAG pattern with Knowledge Bases for Amazon Bedrock on AWS. The fully managed RAG provided by Knowledge Bases for Amazon Bedrock converts user queries into embeddings, searches the knowledge base, obtains relevant results, augments the prompt, and then invokes an LLM (Claude 3 Sonnet) to generate the response.

Retrieval Augmented Generation pattern

Figure 3 – Retrieval Augmented Generation pattern

Prerequisites

You need to have the following to run this demo application:

  • An AWS account
  • Basic understanding of how to navigate Amazon SageMaker Studio
  • Basic understanding of how to download a repo from GitHub
  • Basic knowledge of running a command on a terminal

Key steps in implementation

Following are key details of each technique

Zero-shot prompting

prompt_zero_shot = """Human: Generate radiology report impressions based on the following findings and output it within &amp;lt;impression&amp;gt; tags. Findings: {} Assistant:"""

Few-shot prompting

examples_string = '' for ex in examples: examples_string += f"""H:{ex['findings']}
A:{ex['impression']}n"""
prompt_few_shot = """Human: Generate radiology report impressions based on the following findings. Findings: {}
Here are a few examples: """ + examples_string + """ 
Assistant:"""

Implementation of Retrieval Augmented Generation

  1. Load the reports into the Amazon Bedrock knowledge base by connecting to the S3 bucket (data source).
  2. The knowledge base will split them into smaller chunks (based on the strategy selected), generate embeddings, and store them in the associated vector store. For detailed steps, refer to the Amazon Bedrock User Guide. We used Amazon Titan Embeddings G1 – Text embedding model for converting the reports data to embeddings.
  3. Once the knowledge base is up and running, locate the knowledge base id and generate model Amazon Resource Number (ARN) for Claude 3 Sonnet model using the following code:
kb_id = "XXXXXXXXXX" #Replace it with the knowledge base id for your knowledge base
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
  1. Set up the Amazon Bedrock runtime client using the latest version of AWS SDK for Python (Boto3).
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime", config=bedrock_config)
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
  1. Use the RetrieveAndGenerate API to retrieve the most relevant report from the knowledge base and generate an impression.
return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': input
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'generationConfiguration': {
                    'promptTemplate': {
                    'textPromptTemplate': promptTemplate
                    }
                },
                'knowledgeBaseId': kbId,
                'modelArn': model_arn,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': 3,
                        'overrideSearchType': 'HYBRID'
                        }
                }
               
            },
            'type': 'KNOWLEDGE_BASE'
            
        },
    )
  1. Use the following prompt template along with query (findings) and retrieval results to generate impressions with the Claude 3 Sonnet LLM.
promptTemplate = f"""
You have to generate radiology report impressions based on the following findings. Your job is to generate impression using only information from the search results.
Return only a single sentence and do not return the findings given.
   
Findings: $query$
                          
Here are the search results in numbered order:
$search_results$ """

Evaluation

Performance analysis

The performance of zero-shot, few-shot, and RAG techniques is evaluated using the ROUGE score. For more details on the definition of various forms of this score, please refer to part 1 of this blog.

The following table depicts the evaluation results for the dev1 and dev2 datasets. The evaluation result on dev1 (2,000 findings from the MIMIC CXR Radiology Report) shows that the zero-shot prompting performance was the poorest, whereas the RAG approach for report summarization performed the best. The use of the RAG technique led to substantial gains in performance, improving the aggregated average ROUGE1 and ROUGE2 scores by approximately 18 and 16 percentage points, respectively, compared to the zero-shot prompting method. An approximately 8 percentage point improvement is observed in aggregated ROUGE1 and ROUGE2 scores over the few-shot prompting technique.

Model Technique Dataset: dev1 Dataset: dev2
. . ROUGE1 ROUGE2 ROUGEL ROUGELSum ROUGE1 ROUGE2 ROUGEL ROUGELSum
Claude 3 Zero-shot 0.242 0.118 0.202 0.218 0.210 0.095 0.185 0.194
Claude 3 Few-shot 0.349 0.204 0.309 0.312 0.439 0.273 0.351 0.355
Claude 3 RAG 0.427 0.275 0.387 0.387 0.438 0.309 0.43 0.43

For dev2, an improvement of approximately 23 and 21 percentage points is observed in ROUGE1 and ROUGE2 scores of the RAG-based technique over zero-shot prompting. Overall, RAG led to an improvement of approximately 17 percentage points and 24 percentage points in ROUGELsum scores for the dev1 and dev2 datasets, respectively. The distribution of ROUGE scores attained by RAG technique for dev1 and dev2 datasets is shown in the following graphs.

dev1 Dev2
Dataset: dev1 Dataset: dev2

It is worth noting that RAG attains consistent average ROUGELSum for both test datasets (dev1=.387 and dev2=.43). This is in contrast to the average ROUGELSum for these two test datasets (dev1=.5708 and dev2=.4525) attained with the fine-tuned FLAN-T5 XL model presented in part 1 of this blog series. Dev1 is a subset of the MIMIC dataset, samples from which have been used as context. With the RAG approach, the median ROUGELsum is observed to be almost similar for both datasets dev2 and dev1.

Overall, RAG is observed to attain good ROUGE scores but falls short of the impressive performance of the fine-tuned FLAN-T5 XL model presented in part 1 of this blog series.

Cleanup

To avoid incurring future charges, delete all the resources you deployed as part of the tutorial.

Conclusion

In this post, we presented how various generative AI techniques can be applied for healthcare-specific tasks. We saw incremental improvement in results for domain-specific tasks as we evaluated and compared prompting techniques and the RAG pattern. We also see how fine-tuning the model to healthcare-specific data is comparatively better, as demonstrated in part 1 of the blog series. We expect to see significant improvements with increased data at scale, more thoroughly cleaned data, and alignment to human preference through instruction tuning or explicit optimization for preferences.

Limitations: This work demonstrates a proof of concept. As we analyzed deeper, hallucinations were observed occasionally.


About the authors

Ekta Walia Bhullar, PhD, is a senior AI/ML consultant with AWS Healthcare and Life Sciences (HCLS) professional services business unit. She has extensive experience in the application of AI/ML within the healthcare domain, especially in radiology. Outside of work, when not discussing AI in radiology, she likes to run and hike.

Priya Padate is a Senior Partner Solutions Architect with extensive expertise in Healthcare and Life Sciences at AWS. Priya drives go-to-market strategies with partners and drives solution development to accelerate AI/ML-based development. She is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Dr. Adewale Akinfaderin is a senior data scientist in healthcare and life sciences at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global healthcare customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Srushti Kotak is an Associate Data and ML Engineer at AWS Professional Services. She has a strong data science and deep learning background with experience in developing machine learning solutions, including generative AI solutions, to help customers solve their business challenges. In her spare time, Srushti loves to dance, travel, and spend time with friends and family.

Read More

AWS DeepRacer enables builders of all skill levels to upskill and get started with machine learning

AWS DeepRacer enables builders of all skill levels to upskill and get started with machine learning

In today’s technological landscape, artificial intelligence (AI) and machine learning (ML) are becoming increasingly accessible, enabling builders of all skill levels to harness their power. As more companies adopt AI solutions, there’s a growing need to upskill both technical and non-technical teams in responsibly expanding AI usage. Getting hands-on experience is crucial for understanding and applying ML concepts to automate tasks like content generation, language translation, and image classification. And that’s where AWS DeepRacer comes into play—a fun and exciting way to learn ML fundamentals.

Launched in 2019, DeepRacer is a fully managed service that enables builders of all skill levels to learn and perform model training and evaluation tasks such as defining a reward function, setting up the training parameters, and configuring a training job that can be evaluated and monitored for model performance in a simulated environment. By exploring the AWS DeepRacer ML training lifecycle, you’ll practice model training, evaluation, and deployment of ML models onto a 1/18th scale autonomous race car, using a human-in-the-loop experience. The model training and evaluation experience enables builders to familiarize themselves with similar concepts applicable in training and fine-tuning foundation models (FMs) that power generative AI applications.

AWS DeepRacer also offers a global racing league for competing alongside a community of ML enthusiasts, earning rewards and recognition while showcasing your ML skills. Through the AWS DeepRacer League, we have educated over 550,000 developers, crowned five AWS DeepRacer champions, recognized over 100 monthly virtual circuit winners, and rewarded over 10,000 participants worldwide with Amazon gift cards, cash prizes, and paid trips to AWS re:Invent to compete for the annual AWS DeepRacer Championship Cup.

The excitement around AWS DeepRacer extends far beyond just individual learners. To celebrate Women’s History Month, JPMorgan Chase & Co. recently hosted the “World’s Largest Global Women’s AWS DeepRacer League,” providing employees with a thrilling opportunity to gain hands-on ML experience through virtual autonomous vehicle racing. This event not only fostered a spirit of friendly competition but also celebrated empowerment and innovation in AI and ML. By embracing AWS DeepRacer, JPMorgan Chase showcased its commitment to democratizing ML knowledge and nurturing a culture of continuous learning, empowering its talented teams to drive the company’s AI transformation.

“I am super proud of the group, the firm and the TIF (Take it Forward) team. . . I couldn’t be more proud of a group of individuals being so self-motivated.  The sky is the limit from here!  Deep Racer is proof that learning can be fun.”

Ebele Kemery, Head of JPMorgan Chase Tech, Data and AI Learning.

Initiatives like these demonstrate the far-reaching impact of AWS DeepRacer in bringing ML education to the forefront, inspiring learners of all backgrounds to embrace the future of intelligent technologies.

Whether you’re a seasoned developer or curious business professional, AWS DeepRacer provides a fun and exciting way to get started with AI. You’ll gain practical skills applicable to real-world ML and generative AI use cases. So get rolling with machine learning today!


About the authors

Ange Krueger is a principal AWS technologist. She leads product portfolio advancements and technological agility within the global financial sector. Utilizing over 200 AWS cloud services including leading AWS Artificial Intelligence, Machine Learning and Generative AI offerings, she delivers innovation, transformation, and scalable solutions that precisely address the complex demands of our global customers. Through a collaborative approach and a laser focus on customer-centric outcomes, Ange enhances customer experiences to achieve optimized business performance. Her commitment to continual improvement and customer obsession is unwavering, as she works to empower our clients with resilient, cloud-based financial services solutions.

Read More

Transform customer engagement with no-code LLM fine-tuning using Amazon SageMaker Canvas and SageMaker JumpStart

Transform customer engagement with no-code LLM fine-tuning using Amazon SageMaker Canvas and SageMaker JumpStart

Fine-tuning large language models (LLMs) creates tailored customer experiences that align with a brand’s unique voice. Amazon SageMaker Canvas and Amazon SageMaker JumpStart democratize this process, offering no-code solutions and pre-trained models that enable businesses to fine-tune LLMs without deep technical expertise, helping organizations move faster with fewer technical resources.

SageMaker Canvas provides an intuitive point-and-click interface for business users to fine-tune LLMs without writing code. It works both with SageMaker JumpStart and Amazon Bedrock models, giving you the flexibility to choose the foundation model (FM) for your needs.

This post demonstrates how SageMaker Canvas allows you to fine-tune and deploy LLMs. For businesses invested in the Amazon SageMaker ecosystem, using SageMaker Canvas with SageMaker JumpStart models provides continuity in operations and granular control over deployment options through SageMaker’s wide range of instance types and configurations. For information on using SageMaker Canvas with Amazon Bedrock models, see Fine-tune and deploy language models with Amazon SageMaker Canvas and Amazon Bedrock.

Fine-tuning LLMs on company-specific data provides consistent messaging across customer touchpoints. SageMaker Canvas lets you create personalized customer experiences, driving growth without extensive technical expertise. In addition, your data is not used to improve the base models, is not shared with third-party model providers, and stays entirely within your secure AWS environment.

Solution overview

The following diagram illustrates this architecture.

In the following sections, we show you how to fine-tune a model by preparing your dataset, creating a new model, importing the dataset, and selecting an FM. We also demonstrate how to analyze and test the model, and then deploy the model via SageMaker, focusing on how the fine-tuning process can help align the model’s responses with your company’s desired tone and style.

Prerequisites

First-time users need an AWS account and AWS Identity and Access Management (IAM) role with SageMaker and Amazon Simple Storage Service (Amazon S3) access.

To follow along with this post, complete the prerequisite steps:

  1. Create a SageMaker domain, which is a collaborative machine learning (ML) environment with shared file systems, users, and configurations.
  2. Confirm that your SageMaker IAM role and domain roles have the necessary permissions.
  3. On the domain details page, view the user profiles.
  4. Choose Launch by your profile, and choose Canvas.

Prepare your dataset

SageMaker Canvas requires a prompt/completion pair file in CSV format because it does supervised fine-tuning. This allows SageMaker Canvas to learn how to answer specific inputs with properly formatted and adapted outputs.

Download the following CSV dataset of question-answer pairs.

Create a new model

SageMaker Canvas allows simultaneous fine-tuning of multiple models, enabling you to compare and choose the best one from a leaderboard after fine-tuning. For this post, we compare Falcon-7B with Falcon-40B.

Complete the following steps to create your model:

  1. In SageMaker Canvas, choose My models in the navigation pane.
  2. Choose New model.
  3. For Model name, enter a name (for example, MyModel).
  4. For Problem type¸ select Fine-tune foundation model.
  5. Choose Create.

The next step is to import your dataset into SageMaker Canvas.

  1. Create a dataset named QA-Pairs.
  2. Upload the prepared CSV file or select it from an S3 bucket.
  3. Choose the dataset.

SageMaker Canvas automatically scans it for any formatting issues. In this case, SageMaker Canvas detects an extra newline at the end of the CSV file, which can cause problems.

  1. To address this issue, choose Remove invalid characters.
  2. Choose Select dataset.

Select a foundation model

After you upload your dataset, select an FM and fine-tune it with your dataset. Complete the following steps:

  1. On the Fine-tune tab, on the Select base models menu¸ choose one or more models you may be interested in, such as Falcon-7B and Falcon-40B.
  2. For Select input column, choose question.
  3. For Select output column, choose answer.
  4. Choose Fine-tune.

Optionally, you can configure hyperparameters, as shown in the following screenshot.

Wait 2–5 hours for SageMaker to finish fine-tuning your models. As part of this process, SageMaker Autopilot splits your dataset automatically into an 80/20 split for training and validation, respectively. You can optionally change this split configuration in the advanced model building configurations.

SageMaker training uses ephemeral compute instances to efficiently train ML models at scale, without the need for long-running infrastructure. SageMaker logs all training jobs by default, making it straightforward to monitor progress and debug issues. Training logs are available through the SageMaker console and Amazon CloudWatch Logs.

Analyze the model

After fine-tuning, review your new model’s stats, including:

  • Training loss – The penalty for next-word prediction mistakes during training. Lower values mean better performance.
  • Training perplexity – Measures the model’s surprise when encountering text during training. Lower perplexity indicates higher confidence.
  • Validation loss and validation perplexity – Similar to the training metrics, but measured during the validation stage.

To get a detailed report on your custom model’s performance across dimensions like toxicity and accuracy, choose Generate evaluation report (based on the AWS open source Foundation Model Evaluations Library). Then choose Download report.

The graph’s curve reveals if you overtrained your model. If the perplexity and loss curves plateau after a certain number of epochs, the model stopped learning at that point. Use this insight to adjust the epochs in a future model version using the Configure model settings.

The following is a portion of the report, which gives you an overall toxicity score for the fine-tuned model. The report includes explanations of what the scores mean.

A dataset consisting of ~320K question-passage-answer triplets. The questions are factual naturally-occurring questions. The passages are extracts from wikipedia articles (referred to as “long answers” in the original dataset). As before, providing the passage is optional depending on whether the open-book or closed-book case should be evaluated. We sampled 100 records out of 4289 in the full dataset.Prompt Template: Respond to the following question with a short answer: $model_input

Toxicity detector model: UnitaryAI Detoxify-unbiased

Toxicity Score
A binary score from 0 (no toxicity detected) to 1 (toxicity detected) for the class: toxicity

Average Score: 0.0027243031983380205

Now that we have confirmed that the model has close to 0 toxicity detected according to the available toxicity models, let’s check out the model leaderboard to compare how Falcon-40B and Falcon-7B perform on dimensions like loss and perplexity.

On an order of magnitude, the two models performed about the same along these metrics on the provided data. Falcon-7B did a little better in this case, so SageMaker Canvas defaulted to that, but you can choose a different model from the leaderboard.

Let’s stick with Falcon-7B, because it performed slightly better and will run on more cost-efficient infrastructure.

Test the models

Although metrics and the report already provide insights into the performances of the models you’ve fine-tuned, you should always test your models by generating some predictions before putting them in production. For that, SageMaker Canvas allows you to use these models without leaving the application. To do that, SageMaker Canvas deploys for you an endpoint with the fine-tuned model, and shuts it down automatically after 2 hours of inactivity to avoid unintended costs.

To test the models, complete the following steps. Keep in mind that although fine-tuning can improve response style, it may not be a complete solution for providing factual accuracy. For factual accuracy, consider Retrieval Augmented Generation (RAG) architectures and continued pre-training.

  1. Choose Test in Ready-to-Use Models and wait 15–30 minutes for your test endpoint to be deployed.

When the deployment is complete, you’ll be redirected to the SageMaker Canvas playground, with your model pre-selected.

  1. 2. Choose Compare and select the FM used for your custom model.
  2. Enter a phrase directly from your training dataset, to make sure the custom model at least does better at such a question and is consistent with the level of verbosity provided in the fine-tuning data.

For this example, we enter the question, “What is the significance of the memory hierarchy in modern computer architectures?”

The fine-tuned Falcon-7B model responded succinctly, like you would expect from an FAQ document:

The memory hierarchy in modern computer architectures is the organization of memory storage within a computer system. The memory hierarchy is important because it determines how memory is accessed and used.

In contrast to the fine-tuned Falcon-7B, the base Falcon-7B model responded verbosely and with an odd beginning:

1 Answer | Add Yours [sic]

The memory hierarchy is the structure of the memory system in a computer system. It is a hierarchy because there are different levels of memory. The memory hierarchy is important because it determines how fast a computer can access memory.

The memory hierarchy is made up of levels of memory. The first level of memory is the main memory. This is the memory that is used for the data that is currently being processed. It is also used for the instructions that are currently being processed. The main memory is very fast and is able to access data very quickly.

The second level of memory is the cache memory. This is a level of memory that is much faster than the main memory. It is used to store data that is frequently accessed. It is also used to store instructions that are frequently accessed. The cache memory is much faster than the main memory.

The third level of memory is the disk memory. This is a level of memory that is much slower than the main memory and the cache memory. It is used to store data that is infrequently accessed. It is also used to store instructions that are infrequently accessed. The disk memory is much slower than the main memory and the cache memory.

The fourth level of memory is the secondary storage. This is a level of memory that is used to store data that is infrequently accessed. It is also used to store instructions that are infrequently accessed.

Let’s say you as a business user want to collaborate with your ML team on this model. You can send the model to your SageMaker model registry so the ML team can interact with the fine-tuned model in Amazon SageMaker Studio, as shown in the following screenshot.

Under the Add to Model Registry option, you can also see a View Notebook option. SageMaker Canvas offers a Python Jupyter notebook detailing your fine-tuning job, alleviating concerns about vendor lock-in associated with no-code tools and enabling detail sharing with data science teams for further validation and deployment.

Deploy the model with SageMaker

For production use, especially if you’re considering providing access to dozens or even thousands of employees by embedding the model into an application, you can deploy the model as an API endpoint. Complete the following steps to deploy your model:

  1. On the SageMaker console, choose Inference in the navigation pane, then choose Models.
  2. Locate the model with the prefix canvas-llm-finetuned- and timestamp.
  3. Open the model details and note three things:
    1. Model data location – A link to download the .tar file from Amazon S3, containing the model artifacts (the files created during the training of the model).
    2. Container image – With this and the model artifacts, you can run inference virtually anywhere. You can access the image using Amazon Elastic Container Registry (Amazon ECR), which allows you to store, manage, and deploy Docker container images.
    3. Training job – Stats from the SageMaker Canvas fine-tuning job, showing instance type, memory, CPU use, and logs.

Alternatively, you can use the AWS Command Line Interface (AWS CLI):

```bash

aws sagemaker list-models

```

The most recently created model will be at the top of the list. Make a note of the model name and the model ARN.

To start using your model, you must create an endpoint.

  1. 4. On the left navigation pane in the SageMaker console, under Inference, choose Endpoints.
  2. Choose Create endpoint.
  3. For Endpoint name, enter a name (for example, My-Falcon-Endpoint).
  4. Create a new endpoint configuration (for this post, we call it my-fine-tuned-model-endpoint-config).
  5. Keep the default Type of endpoint, which is Provisioned. Other options are not supported for SageMaker JumpStart LLMs.
  6. Under Variants, choose Create production variant.
  7. Choose the model that starts with canvas-llm-finetuned-, then choose Save.
  8. In the details of the newly created production variant, scroll to the right to Edit the production variant and change the instance type to ml.g5.xlarge (see screenshot).
  9. Finally, Create endpoint configuration and Create endpoint.

As described in Deploy Falcon-40B with large model inference DLCs on Amazon SageMaker, Falcon works only on GPU instances. You should choose the instance type and size according to the size of the model to be deployed and what will give you the required performance at minimum cost.

Alternatively, you can use the AWS CLI:

```
config_name="my-fine-tuned-model-endpoint-config"

aws sagemaker create-endpoint-config 
--endpoint-config-name $config_name 
--production-variants VariantName="cool-variant",ModelName="canvas-llm-finetuned-2024-01-16-20-11-13-119791",InstanceType="ml.g5.xlarge",InitialInstanceCount=1

aws sagemaker create-endpoint 
--endpoint-name "my-fine-tuned-model-endpoint" 
--endpoint-config-name $config_name
```

Use the model

You can access your fine-tuned LLM through the SageMaker API, AWS CLI, or AWS SDKs.

Enrich your existing software as a service (SaaS), software platforms, web portals, or mobile apps with your fine-tuned LLM using the API or SDKs. These let you send prompts to the SageMaker endpoint using your preferred programming language. Here’s an example:

```
import boto3
import json

# Create a SageMaker runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')

# Specify your endpoint name
endpoint_name = 'my-fine-tuned-model-endpoint'

def query_falcon_llm(question):
    """
    Function to query the fine-tuned Falcon LLM endpoint with a specific question.
    :param question: str, the question to ask the LLM.
    :return: str, the answer from the LLM.
    """
    # Define the prompt
    prompt = f"You are a helpful Assistant. You answer questions in the style of technical answers everything about GPUs and Machine Learning. User: {question}n Assistant:"

    # Define the payload with hyperparameters
    payload = {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "top_p": 0.7,
            "temperature": 0.5,
            "max_new_tokens": 1024,
            "repetition_penalty": 1.03,
            "stop": ["nUser:", "###"]
        }
    }

    # JSONify the payload
    payload_json = json.dumps(payload)

    # Call the SageMaker endpoint
    response = sagemaker_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                                 ContentType='application/json',
                                                 Body=payload_json)

    # Decode the response
    response_body = json.loads(response['Body'].read().decode())

    # Extract and format the answer
    assistant_response = response_body[0]["generated_text"][len(prompt):]
    assistant_response = assistant_response.replace("nUser:", "").replace("###", "").strip()

    return assistant_response

# Example usage
question = " What is the significance of the memory hierarchy in modern computer architectures?"
answer = query_falcon_llm(question)
print(f"Question: {question}nAnswer: {answer}")


```

For examples of invoking models on SageMaker, refer to the following GitHub repository. This repository provides a ready-to-use code base that lets you experiment with various LLMs and deploy a versatile chatbot architecture within your AWS account. You now have the skills to use this with your custom model.

Another repository that may spark your imagination is Amazon SageMaker Generative AI, which can help you get started on a number of other use cases.

Clean up

When you’re done testing this setup, delete your SageMaker endpoint to avoid incurring unnecessary costs:

```

aws sagemaker delete-endpoint --endpoint-name "your-endpoint-name"

```

After you finish your work in SageMaker Canvas, you can either log out or set the application to automatically delete the workspace instance, which stops billing for the instance.

Conclusion

In this post, we showed you how SageMaker Canvas with SageMaker JumpStart models enable you to fine-tune LLMs to match your company’s tone and style with minimal effort. By fine-tuning an LLM on company-specific data, you can create a language model that speaks in your brand’s voice.

Fine-tuning is just one tool in the AI toolbox and may not be the best or the complete solution for every use case. We encourage you to explore various approaches, such as prompting, RAG architecture, continued pre-training, postprocessing, and fact-checking, in combination with fine-tuning to create effective AI solutions that meet your specific needs.

Although we used examples based on a sample dataset, this post showcased these tools’ capabilities and potential applications in real-world scenarios. The process is straightforward and applicable to various datasets, such as your organization’s FAQs, provided they are in CSV format.

Take what you learned and start brainstorming ways to use language models in your organization while considering the trade-offs and benefits of different approaches. For further inspiration, see Overcoming common contact center challenges with generative AI and Amazon SageMaker Canvas and New LLM capabilities in Amazon SageMaker Canvas, with Bain & Company.


About the Author

Yann Stoneman headshot -- white male in 30s with slight beard and glasses smilingYann Stoneman is a Solutions Architect at AWS focused on machine learning and serverless application development. With a background in software engineering and a blend of arts and tech education from Juilliard and Columbia, Yann brings a creative approach to AI challenges. He actively shares his expertise through his YouTube channel, blog posts, and presentations.

Read More

How LotteON built dynamic A/B testing for their personalized recommendation system

How LotteON built dynamic A/B testing for their personalized recommendation system

This post is co-written with HyeKyung Yang, Jieun Lim, and SeungBum Shim from LotteON.

LotteON is transforming itself into an online shopping platform that provides customers with an unprecedented shopping experience based on its in-store and online shopping expertise. Rather than simply selling the product, they create and let customers experience the product through their platform.

LotteON has been providing various forms of personalized recommendation services throughout the LotteON customer journey and across its platform, from its main page to its shopping cart and order completion pages. Through the development of new, high-performing models and continuous experimentation, they’re providing customers with personalized recommendations, improving CTR (click-through rate) metrics and increasing customer satisfaction.

In this post, we show you how LotteON implemented dynamic A/B testing for their personalized recommendation system.

The dynamic A/B testing system monitors user reactions, such as product clicks, in real-time from the recommended item lists provided. It dynamically assigns the most responsive recommendation model among multiple models to enhance the customer experience with the recommendation list. Using Amazon SageMaker and AWS services, these solutions offer insights into real-world implementation know-how and practical use cases for deployment.

LotteON screens

Defining the business problem

In general, there are two types of A/B testing that are useful for measuring the performance of a new model: offline testing and online testing. Offline testing evaluates the performance of a new model based on past data. Online A/B testing, also known as split testing, is a method used to compare two versions of a webpage, or in LotteON’s case, two recommendation models, to determine which one performs better. A key strength of online A/B testing is its ability to provide empirical evidence based on user behavior and preferences. This evidence-based approach to selecting a recommendation model reduces guesswork and subjectivity in optimizing both click-through rates and sales.

A typical online A/B test serves two models in a certain ratio (such as 5:5) for a fixed period of time (for example, a day or a week). When one model performs better than the other, the lower performing model is still served for the duration of the experiment, regardless of its impact on the business. To improve this, LotteON turned to dynamic A/B testing, which evaluates the performance of models in real time and dynamically updates the ratios at which each model is served, so that better performing models are served more often. To implement dynamic A/B testing, they used the multi-armed bandit (MAB) algorithm, which performs real-time optimizations.

LotteON’s dynamic A/B testing automatically selects the model that drives the highest click-through rate (CTR) on their site. To build their dynamic A/B testing solution, LotteON used AWS services such as Amazon SageMaker and AWS Lambda. By doing so, they were able to reduce the time and resources that would otherwise be required for traditional forms of A/B testing. This frees up their scientists to focus more of their time on model development and training.

Solution and implementation details

The MAB algorithm evolved from casino slot machine profit optimization. MAB’s usage method differs in selection (arm) from the existing method, which is widely used to re-rank news or products. In this implementation the selection (the arm) in MAB must be a model. There are various MAB algorithms such as ε-greedy and Thompson sampling.

The ε-greedy algorithm balances exploration and exploitation by choosing the best-known option most of the time, but randomly exploring other options with a small probability ε. Thompson sampling involves defining the β distribution for each option, with parameters alpha (α) representing the number of successes so far and beta (β) representing failures. As the algorithm collects more observations, alpha and beta are updated, shifting the distributions toward the true success rate. The algorithm then randomly samples from these distributions to decide which option to try next—balancing exploitation of the best-performing options to-date with exploration of less-tested options. In this way, MAB learns which model is best based on actual outcomes.

Based on LotteON’s evaluation of both ε-greedy and Thompson sampling, which considered the balance of exposure opportunities of the models under test, they decided to use Thompson sampling. Based on the number of clicks obtained, they were able to derive an efficiency model. For a hands-on workshop on dynamic A/B testing with MAB and Thompson sampling algorithms, see Dynamic A/B Testing on Amazon Personalize & SageMaker Workshop. LotteON’s goal was to provide real-time recommendations for high CTR efficient models.

mab components

With the option (arm) configured as a model, and the alpha value for each model configured as a click, the beta value for each model was configured as a non-click. To apply the MAB algorithm to actual services, they introduced the bTS (batched Thompson sampling) method, which processes Thompson sampling on a batch basis. Specifically, they evaluated models based on traffic over a certain period of time (24 hours), and updated parameters at a certain time interval (1 hour).

In the handler part of the Lambda function, a bTS operation is performed that reflects the parameter values ​​for each model (arm), and the click probabilities of the two models are calculated. The ID of the model with the highest probability of clicks is then selected. One thing to keep in mind when conducting dynamic A/B testing is not to start Thompson sampling right away. You should allow warm-up time for sufficient exploration. To avoid prematurely determining the winner due to small parameter values at the beginning of the test, you must collect an adequate number of impressions or click-metrics.

Dynamic A/B test architecture

The following figure shows the architecture for the dynamic A/B test that LotteON implemented.

Dynamic AB test architecture

The architecture in the preceding figure shows the data flow of Dynamic A/B testing and consists of the following four decoupled components:

1. MAB serving flow

Step 1: The user accesses LotteON’s recommendation page.

Step 2: The recommendations API checks MongoDB for information about ongoing experiments with recommendation section codes and, if the experiment is active, sends an API request with the member ID and section code to the Amazon API Gateway.

Step 3: API Gateway provides the received data to Lambda. If there is relevant data in the API Gateway cache, a specific model code in the cache is immediately passed to the recommendation API.

Step 4: The Lambda function checks the experiment type (that is, dynamic A/B test or online static A/B test) in MongoDB and runs its algorithm. If the experiment type is dynamic A/B test, the alpha (number of clicks) and beta (number of non-clicks) required for the Thompson sampling algorithm are retrieved from MongoDB, the values ​​are obtained, and the Thompson sampling algorithm is run. Through this, the selected model’s identifier is delivered to Amazon API Gateway by the Lambda function.

Step 5: API Gateway provides the selected model’s identifier to the recommended API and caches the selected model’s identifier for a certain period of time.

Step 6: The recommendation API calls the model inference server (that is, the SageMaker endpoint) using the selected model’s identifier to receive a recommendation list and provides it to the user’s recommendation web page.

2. The flow of an alpha and beta parameter update

Step 1: The system powering LotteON’s recommendation page stores real-time logs in Amazon S3.

Step 2: Amazon EMR downloads the logs stored in Amazon S3.

Step 3: Amazon EMR processes the data and updates the alpha and beta parameter values to MongoDB for use in the Thompson sampling algorithm.

3. The flow of business metrics monitoring

Step 1: Streamlit pulls experimental business metrics from MongoDB to visualize.

Step 2: Monitor efficiency metrics such as CTR per model over time.

4. The flow of system operation monitoring

Step 1: When a recommended API call occurs, API Gateway and Lambda are launched, and Amazon CloudWatch logs are produced.

Step 2: Check system operation metrics using CloudWatch and AWS X-Ray dashboards based on CloudWatch logs.

Implementation Details 1: MAB serving flow mainly involving API Gateway and Lambda

The APIs that can serve MAB results—that is, the selected model—are implemented using serverless compute services, Lambda, and API Gateway. Let’s take a look at the implementation and settings.

1. API Gateway configuration

When a LotteON user signs in to the recommended product area, member ID, section code, and so on are passed to API Gateway as GET parameters. Using the passed parameters, the selected model can be used for inferencing during a certain period of time through the cache function of Amazon API Gateway.

2. API Gateway cache settings

Setting up a cache in API Gateway is straightforward. To set up the cache, first enable it by selecting the appropriate checkbox under the Settings tab for your chosen stage. After it’s activated, you can define the cache time-to-live (TTL), which is the duration in seconds that cached data remains valid. This value can be set anywhere up to a maximum of 3,600 seconds.

API Gateway cache settings

The API Gateway caching feature is limited to the parameters of GET requests. To use caching for a particular parameter, you should insert a query string in the GET request’s query parameters within the resource. Then select the Enable API Cache option. It is essential to deploy your API using the deploy action in the API Gateway console to activate the caching function.

API Gateway - Deploy API

After the cache is set, the same model is used for inference on specific customers until the TTL has elapsed. Following that, or when the recommendation section is first exposed, API Gateway will call Lambda with the MAB function implemented.

3. Add an API Gateway mapping template

When a Lambda handler function is invoked, it can receive the HTTPS request details from API Gateway as an event parameter. To provide a Lambda function with more detailed information, you can enhance the event payload using a mapping template in the API Gateway. This template is part of the integration request setup, which defines how incoming requests are mapped to the expected format of the Lambda function.

API Gateway - Edit mapping template

The specified parameters are then passed to the Lambda function’s event parameters. The following code is an example of source code that uses the event parameter in Lambda.

def lambda_handler (event, context):
    event_param = event ["name"]
    return {
        'message': event_param
    } 

4. Lambda for Dynamic A/B Test

Lambda receives a member ID and section code as event parameter values. The Lambda function uses the received section code to run the MAB algorithm. In the case of the MAB algorithm, a dynamic A/B test is performed by getting the model (arm) settings and aggregated results. After updating the alpha and beta values according to bTS when reading the aggregated results, the probability of a click for each model is obtained through the beta distribution (see the following code), and the model with the maximum value is returned. For example, given model A and model B, where model B has a higher probability of producing a click-through event, model B is returned.

def select_variant (self): 
    probs = []
    for v in self.variant_metrics:
        success = v["mab_alpha”]
        failure = v["mab_beta”]
        probs.append(AlgorithmBase.random_beta(1 + success, 1 + failure)) 

    variant_index = AlgorithmBase.argmax(probs) 

    return (self.variant_metrics [variant_index] ["variant_name"], probs) 

The overall implementation using the bTS algorithm, including the above code, was based on the Dynamic A/B testing for machine learning models with Amazon SageMaker MLOps projects post.

Implementation details 2: Alpha and beta parameter update

A product recommendation list is displayed to the LotteON user. When the user clicks on a specific product in the recommendation list, that data is captured and logged to Amazon S3. As shown in the following figure, LotteON used AWS EMR to perform Spark Jobs that periodically pulled the logged data from S3, processed the data, and inserted the results into MongoDB.

MAB flow

The results generated at this stage play a key role in determining the distribution used in MAB. The following impression and click data were examined in detail.

  1. Impression and click data

Note: Before updating the alpha and beta parameters in bTS, verify the integrity and completeness of log data, including impressions and clicks from the recommendation section.

Implementation details 3: Business metrics monitoring

To assess the most effective model, it’s essential to monitor business metrics during A/B testing. For this purpose, a dashboard was developed using Streamlit on an Amazon Elastic Compute Cloud (Amazon EC2) environment.

Streamlit is a Python library can be used to create web apps for data analysis. LotteON added the necessary Python package information for dashboard configuration to the requirements.txt file, specifying Streamlit version 1.14.1, and proceeded with the installation as demonstrated in the following:

 $ python3 -m pip install --upgrade pip 
 $ pip3 install -r requirements.txt

The default port provided by Streamlit is 8501, so it’s required to set the inbound custom TCP port 8501 to allow access to the Streamlit web browser.

Security Group

When setup is complete, use the streamlit run pythoncode.py command in the terminal, where pythoncode.py is the Python script containing the Streamlit code to run the application. This command launches the Streamlit web interface for the specified application.

import streamlit as st 
    st.title ('streamlit example')

LotteON created a dashboard based on Streamlit. The functionality of this organized dashboard includes monitoring simple business metrics such as model trends over time, daily and real-time winner models, as shown in the following figure.

The dashboard allowed LotteON to analyze the business metrics of the model and check the service status in real time. It also monitored the effectiveness of model version updates and reduced the time to check the service impact of the retraining pipeline.

Streamlit dashboard 1

The following shows an enlarged view of the cumulative CTR of the two models (EXP-01-APS002-01 model A, EXP-01-NCF-01 model B) on the testing day. Let’s take a look at each model to see what that means. Model A provided customers with 29,274 recommendation lists that received 1,972 product clicks and generated a CTR of 6.7 percent (1,972/29,274).

Model B, on the other hand, served 7,390 recommended lists, received 430 product clicks, and generated a CTR of 5.8 percent (430/7,390). Alpha and beta parameters, the number of clicks and the number of non-clicks respectively, of each model were used to set the beta distribution. Model A’s alpha parameter was 1972 (number of clicks) and its beta parameter was 27,752 (number of non-clicks [29,724 – 1,972]). Model B’s alpha parameter was 430 (number of clicks) and its beta parameter was 6,960 (number of non-clicks). The larger the X-axis value corresponding to the peak in the beta distribution graph, the better the performance (CTR) model.

In the following figure, model A (EXP-01-APS002-01) shows better performance because it’s further to the right in relation to the X axis. This is also consistent with the CTR rates of 6.7 percent and 5.8 percent.

Streamlit dashboard 2

Implementation details 4: System operation monitoring with CloudWatch and AWS X-Ray

You can enable CloudWatch settings, custom access logging, and AWS X-Ray tracking features from the Logs/Tracking tab in the API Gateway menu.

CloudWatch settings and custom access logging

In the configuration step, you can change the CloudWatch Logs type to set the logging level, and after activating detailed indicators, you can check detailed metrics such as 400 errors and 500 errors. By enabling custom access logs, you can check which IP accessed the API and how.

CloudWatch - Logs and tracing settings

Additionally, the retention period for CloudWatch Logs must be specified separately on the CloudWatch page to avoid storing them indefinitely.

If you select API Gateway from the CloudWatch Explorer list, you can view the number of API calls, latency, and cache hits and misses on a dashboard. Find the Cache Hit Rate as shown in the following formula and check the effectiveness of the cache on the dashboard.

  • Cache Hit Rate = CacheHitCount / (CacheHitCount + CacheMissCount)

CloudWatch - Explorer

By selecting Lambda as the log group in the CloudWatch Logs Insights menu, you can verify the actual model code returned by Lambda, where MAB is performed, to check whether the sampling logic is working and branch processing is being performed.

fields @timestamp, @message, @logStream, @log 
 | filter @message like 'Model A' or message like 'Model B' 
 | stats count (*) by @message

CloudWatch - query

As shown in the preceding image, LotteON observed how often the two models were called by the Lambda function during the A/B test. Specifically, the model labeled LF001-01 (the champion model) was invoked 4,910 times, while the model labeled NCF-02 (the challenger model) was invoked 4,905 times. These numbers represent the degree to which each model was selected in the experiment.

AWS X-Ray

If you enable the X-Ray trace feature, trace data is sent from the enabled AWS service to X-Ray and the visualized API service flow can be monitored from the service map menu in the X-Ray section of the CloudWatch page.

X-Ray - Service map

X-Ray - Metrics

As shown in the preceding figure, you can easily track and monitor latency, number of calls, and number of HTTP call status for each service section by choosing the API Gateway icon and each Lambda node.

There was no need to store performance metrics for a long time because most for Lambda functions metrics are analyzed within a week and aren’t used afterward. Because data from X-Ray is stored for 30 days by default, which is enough time to use the metrics, the data was used without changing the storage cycle. (For more information, see the AWS X-Ray FAQs.)

Conclusion

In this post, we explained how Lotte ON builds and uses a dynamic A/B testing environment. Through this project, Lotte ON was able to test the model’s performance in various ways online by combining dynamic A/B testing with the MAB function. It also allows comparison of different types of recommendation models and is designed to be comparable across model versions, facilitating online testing.

In addition, data scientists could concentrate on improving model performance and training as they can check metrics and system monitoring instantly. The dynamic A/B testing system was initially developed and applied to the LotteON main page, and then expanded to the main page recommendation tab and product detail recommendation section. Because the system is able to evaluate online performance without significantly reducing the click-through rate of existing models, we have been able to conduct more experiments without impacting users.

Dynamic A/B Test exercises can also be found in AWS Workshop – Dynamic A/B Testing on Amazon Personalize & SageMaker.


About the Authors

HyeKyung Yang is a research engineer in the Lotte E-commerce Recommendation Platform Development Team and is in charge of developing ML/DL recommendation models by analyzing and utilizing various data and developing a dynamic A/B test environment.

Jieun Lim is a data engineer in the Lotte E-commerce Recommendation Platform Development Team and is in charge of operating LotteON’s personalized recommendation system and developing personalized recommendation models and dynamic A/B test environments.

SeungBum Shim is a data engineer in the Lotte E-commerce Recommendation Platform Development Team, responsible for discovering ways to use and improve recommendation-related products through LotteON data analysis, and developing MLOps pipelines and ML/DL recommendation models.

Jesam Kim is an AWS Solutions Architect and helps enterprise customers adopt and troubleshoot cloud technologies and provides architectural design and technical support to address their business needs and challenges, especially in AIML areas such as recommendation services and generative AI.

Gonsoo Moon is an AWS AI/ML Specialist Solutions Architect and provides AI/ML technical support. His main role is to collaborate with customers to solve their AI/ML problems based on various use cases and production experience in AI/ML.

Read More

Unleashing the power of generative AI: Verisk’s journey to an Instant Insight Engine for enhanced customer support

Unleashing the power of generative AI: Verisk’s journey to an Instant Insight Engine for enhanced customer support

This post is co-written with Tom Famularo, Abhay Shah and Nicolette Kontor from Verisk.

Verisk (Nasdaq: VRSK) is a leading data analytics and technology partner for the global insurance industry. Through advanced analytics, software, research, and industry expertise across over 20 countries, Verisk helps build resilience for individuals, communities, and businesses. The company is committed to ethical and responsible AI development, with human oversight and transparency. Verisk is using generative artificial intelligence (AI) to enhance operational efficiencies and profitability for insurance clients while adhering to its ethical AI principles.

Verisk’s FAST platform is a leader in the life insurance and retirement sector, providing enhanced efficiency and flexible, easily upgradable architecture. FAST has earned a fourth consecutive leader ranking in the 2024 ISG Provider Lens report for its seamless integration with Verisk’s data, analytics, and claims tools. The software as a service (SaaS) platform offers out-of-the-box solutions for life, annuity, employee benefits, and institutional annuity providers. With preconfigured components and platform configurability, FAST enables carriers to reduce product time-to-market by 75% and launch new offerings in as little as 2 months.

In this post, we describe the development of the customer support process in FAST incorporating generative AI, the data, the architecture, and the evaluation of the results. Conversational AI assistants are rapidly transforming customer and employee support. Verisk has embraced this technology and has developed their own Instant Insight Engine, or AI companion, that provides an enhanced self-service capability to their FAST platform.

The Opportunity

Verisk FAST’s initial foray into using AI was due to the immense breadth and complexity of the platform. With hundreds of thousands of hours spent on customer support every year, it became abundantly clear they needed help to scale their efforts and meet their objectives. Verisk’s talented teams were overloaded handling common inquiries, leaving less time for the type of innovation that would allow them to maintain the pole position as insurance technology providers.

Verisk FAST’s AI companion aims to alleviate this burden by not only providing 24/7 support for business processing and configuration questions related to FAST, but also tapping into the immense knowledge base to provide an in-depth, tailored response. It is designed to be deeply integrated into the FAST platform and use all of Verisk’s documentation, training materials, and collective expertise. It relies on a Retrieval Augmented Generation (RAG) approach and a mix of AWS services and proprietary configuration to instantly answer most user questions about the Verisk FAST platform’s extensive capabilities.

When the AI companion is rolled out at scale, it will allow Verisk’s staff to focus more time on complex problems, critical initiatives, and innovation while delivering a better customer experience. As part of the build-out, Verisk came across several considerations, key findings, and decisions worth sharing for any enterprise looking to take the first step in tapping into generative AI’s potential.

The Approach

When building an interactive agent with large language models (LLMs), there are often two techniques that can be used: RAG and fine-tuning. The choice between these approaches depends on the use case and available dataset. Verisk FAST started building a RAG pipeline for their AI companion and have iteratively enhanced this solution. The following are some of the reasons why continuing with a RAG architecture made sense to Verisk:

  • Access to Dynamic Data – The FAST platform is a constantly evolving platform adding both business functionality and technical capabilities. Verisk needed to make sure their responses were always based on the most up-to-date information. The RAG approach allows for accessing frequently updated data, enabling responses using the most recent information without frequent retraining of the model.
  • Multiple Data Sources – In addition to recency of data, another important aspect was the ability to tap into multiple different data sources to retrieve the right context. These data sources may be both internal and external to provide a more holistic response. The ease of expanding the knowledge domain without the need to fine-tune with new data sources makes the solution extensible.
  • Reduce Hallucination – Retrieval reduces the risk of hallucination compared to free-form text generation because responses derive directly from the provided excerpts.
  • LLM Linguistics – Although appropriate context can be retrieved from enterprise data sources, the underlying LLM handles linguistics and fluency.
  • Transparency – Verisk wants to continuously improve the AI companion’s ability to generate responses. A RAG architecture gave them the transparency needed into the context retrieval process, information that would ultimately be used for generating user responses. Having that transparency helped Verisk identify areas of the system where their documents were lacking and needed some restructuring.
  • Data governance – With a wide variety of users accessing the platform and with different users having access to different data, data governance and isolation was paramount. Verisk injected controls into the RAG pipeline that restricted access to data based on user access controls, making sure responses were highly tuned to the user.

Although both RAG and fine-tuning have trade-offs, RAG was the optimal approach for building an AI companion on the FAST platform given their requirements for real-time accuracy, explainability, and configurability. The pipeline architecture allows for iterative enhancement as Verisk FAST’s use cases evolve.

Solution Overview

The following diagram presents a high-level architectural data flow highlighting several of the AWS services used in building the solution. Verisk’s solution represents a compound AI system, involving multiple interacting components and making numerous calls to the LLM to furnish responses to the user. Using the FAST platform for orchestrating these diverse components proved to be an intuitive choice, circumventing certain challenges encountered with alternative frameworks such as LangChain.

Architecture Diagram highlighting the application architecture.

The key components are as follows:

Amazon Comprehend

To bolster security, Verisk aimed to block the submission of personally identifiable information (PII) within user questions. Although PII isn’t typically necessary for interactions with the AI companion, Verisk employed Amazon Comprehend to detect any potential PII within queries.

 Amazon Kendra

In designing an effective RAG solution, one of the most critical steps is the context retrieval from enterprise documentation. Although many options exist to store embeddings, Verisk FAST opted to use Amazon Kendra due to its powerful out-of-the-box semantic search capabilities. As a fully managed service, Verisk took advantage of its deep-learning search models without additional provisioning. Verisk compared using Amazon OpenSearch Serverless with several embedding approaches and Amazon Kendra, and saw better retrieval results with Amazon Kendra. As you’ll see further in the post, Verisk incorporated the Retrieve API and the Query API to retrieve semantically relevant passages for their queries to further improve generation by the LLM.

Amazon Bedrock

Anthropic Claude, available in Amazon Bedrock, played various roles within Verisk’s solution:

  • Response Generation – When building their AI companion, Verisk thoroughly evaluated the LLM options from leading providers, using their dataset to test each model’s comprehension and response quality. After this extensive testing, Verisk found Anthropic’s Claude model consistently outperformed across key criteria. Claude demonstrated superior language understanding in Verisk’s complex business domain, allowing more pertinent responses to user questions. It also did exceedingly well at SQL generation, better than any other model they tested. Given Claude’s standout results across Verisk FAST’s use cases, it was the clear choice to power their AI companion’s natural language capabilities.
  • Preprocessing of Images and Videos – The outputs from Amazon Rekognition and Amazon Transcribe were fed into Claude. Claude demonstrated remarkable capabilities in generating natural language descriptions, which could be effectively used for indexing purposes with Amazon Kendra. Additionally, Claude excelled at summarizing video transcriptions into concise segments corresponding to specific time intervals, enabling the display of videos at precise points. This combination of AWS services and Claude’s language processing capabilities facilitated a more intuitive and user-friendly experience for media exploration and navigation.
  • Relevance Ranking – Although Amazon Kendra returned confidence scores on search results, Verisk needed to further tune the search results for Query API calls for a few scenarios. Verisk was able to use Claude to rank the relevance of search results from Amazon Kendra, further improving the results returned to the user.
  • Tool Identification – Verisk used Claude to determine the most suitable techniques, whether API calls or SQL queries, for retrieving data from the operational database based on user requests. Furthermore, Claude generated SQL queries tailored to the provided schemas, enabling efficient data retrieval.
  • Conversation Summarization – When a user asks a follow-up question, the AI companion can continue the conversational thread. To enable this, Verisk used Claude to summarize the dialogue to update the context from Amazon Kendra. The full conversation summary and new excerpts are input to the LLM to generate the next response. This conversational flow allows the AI compan to answer user follow-up questions and have a more natural, contextual dialogue, bringing Verisk FAST closer to having a true AI assistant that can engage in useful back-and-forth conversations with users.

Amazon Rekognition

Primarily used for processing images containing text and process flow diagrams, the pre-trained features of Amazon Rekognition facilitated information extraction. The extracted data was then passed to Claude for transformation into a more natural language format suitable for indexing within Amazon Kendra.

Amazon Transcribe

Similar to Amazon Rekognition, Amazon Transcribe was employed to preprocess videos and generate transcripts, with a notable feature being the masking of sensitive information. The verbose transcripts, along with timestamps, were condensed using Claude before being indexed into Amazon Kendra.

Prompt Template Warehouse

Central to the solution was the dynamic selection of templates to create prompts based on question classification. Substantial effort was invested in developing and continuously improving these prompt templates.

Throughout Verisk’s journey, they worked closely with the AWS Solutioning team to brainstorm concrete suggestions to enhance the overall solution.

Data Harvesting

Before Verisk started building anything in the platform, they spent weeks amassing information, initially in the form of questions and answers. Verisk FAST’s initial dataset comprised 10,000 questions and their corresponding answers, meticulously collected and vetted to confirm accuracy and relevance. However, they understood that this was not a one-and-done effort. Verisk needed to continually expand its knowledge base by identifying new data sources across the business.

Driven by this, Verisk diligently added 15,000 more questions, making sure they covered less frequently encountered scenarios. Verisk also added user guides, technical documentation, and other text-based information. This data spanned several categories, from business processing to configuration to their delivery approach. This enriched the AI companion’s knowledge and understanding of diverse user queries, enabling it to provide more accurate and insightful responses.

The Verisk FAST team also recognized the necessity of exploring additional modalities. Videos and images, particularly those illustrating process flows and information sharing videos, proved to be invaluable sources of data. During the initial rollout phase, it became evident that certain inquiries demanded real-time data retrieval from their operational data store. Through some slick prompt engineering and using Claude’s latest capabilities to invoke APIs, Verisk seamlessly accessed their database to procure real-time information.

Structuring and Retrieving the Data

An essential element in developing the AI companion’s knowledge base was properly structuring and effectively querying the data to deliver accurate answers. Verisk explored various techniques to optimize both the organization of the content and the methods to extract the most relevant information:

  • Chunking – One key step in preparing the accumulated questions and answers was splitting the data into individual documents to facilitate indexing into Amazon Kendra. Rather than uploading a single large file containing all 10,000 question-answer pairs, Verisk chunked the data into 10,000 separate text documents, with each document containing one question-answer pair. By splitting the data into small, modular documents focused on a single question-answer pair, Verisk could more easily index each document and had greater success in pulling back the correct context. Chunking the data also enabled straightforward updating and reindexing of the knowledge base over time. Verisk applied the same technique to other data sources as well.
  • Selecting the Right Number of Results – Verisk tested configuring Amazon Kendra to return different numbers of results for each question query. Returning too few results ran the risk of not capturing the best answer, whereas too many results made it more difficult to identify the right response. Verisk found returning the top three matching results from Amazon Kendra optimized both accuracy and performance.
  • Multi-step Query – To further improve accuracy, Verisk implemented a multi-step query process. First, they used the Amazon Kendra Retrieve API to get multiple relevant passages and excerpts based on keyword search. Next, they took a second pass at getting excerpts through the Query API, to find any additional shorter documents that might have been missed. Combining these two query types enabled Verisk to reliably identify the correct documentation and excerpts to generate a response.
  • Relevance Parameters – Verisk also tuned relevance parameters in Amazon Kendra to weigh their most up-to-date documentation higher than others. This improved results over just generic text search.

By thoroughly experimenting and optimizing both the knowledge base powering their AI companion and the queries to extract answers from it, Verisk was able to achieve very high answer accuracy during the proof of concept, paving the way for further development. The techniques they explored—multi-stage querying, tuning relevance, enriching data—became core elements of their approach for extracting quality automated answers.

LLM Parameters and Models

Experimenting with prompt structure, length, temperature, role-playing, and context was key to improving the quality and accuracy of the AI companion’s Claude-powered responses. The prompt design guidelines provided by Anthropic were incredibly helpful.

Verisk crafted prompts that provided Claude with clear context and set roles for answering user questions. Setting the temperature to 0.5 helped reduce randomness and repetition in the generated responses.

Verisk also experimented with different models to improve the efficiency of the overall solution. Although Claude 3 models like Sonnet and Haiku did a great job at generating responses, as part of the overall solution, Verisk didn’t always need the LLM to generate text. For scenarios that required identification of tools, Claude Instant was a better suited model due to its quicker response times.

Metrics, Data Governance, and Accuracy

A critical component of Verisk FAST’s AI companion and its usefulness is their rigorous evaluation of its performance and the accuracy of its generated responses.

As part of the proof of concept in working with the Amazon Generative AI Innovation Center, Verisk came up with 100 questions to evaluate the accuracy and performance of the AI companion. Central to this process was crafting questions designed to assess the bot’s ability to comprehend and respond effectively across a diverse range of topics and scenarios. These questions spanned a variety of topics and varying levels of difficulty. Verisk wanted to make sure their AI companion provided accurate responses to frequently asked questions and could demonstrate proficiency in handling nuanced and less predictable or straightforward inquiries. The results provided invaluable insights into RAG’s strengths and areas for improvement, guiding Verisk’s future efforts to refine and enhance its capabilities further.

After Verisk integrated their AI companion into the platform and began testing it with real-world scenarios, their accuracy rate was approximately 40%. However, within a few months, it rapidly increased to over 70% because of all the data harvesting work, and the accuracy continues to steadily improve each day.

Contributing to the AI companion’s rising accuracy is Verisk’s evaluation heat map. This provides a visual representation of the documentation available across 20 topics that comprehensively encompasses the Verisk FAST platform’s capabilities. This is compared against the volume of inquiries within each specific topic segment and the health of the generated responses in each.

This visualized data allows the Verisk FAST team to effortlessly identify gaps. They can quickly see which capability the AI companion currently struggles with against where user questions are most focused on. The Verisk team can then prioritize expanding its knowledge in these areas through additional documentation, training data, research materials, and testing.

FAST Grading Rubric

Business Impact

Verisk initially rolled out the AI companion to one beta customer to demonstrate real-world performance and impact. Supporting a customer in this way is a stark contrast to how Verisk has historically engaged with and supported customers in the past, where they would typically have a team allocated to interact with the customer directly. Now only a fraction of the time a person would usually spend is needed to review submissions and adjust responses. Verisk FAST’s AI companion has helped them cost-effectively scale while still providing high-quality assistance.

In analyzing this early usage data, Verisk uncovered additional areas they can drive business value for their customers. As they collect additional information, this data will help them uncover what will be needed to improve results and prepare for a wider rollout.

Ongoing development will focus on expanding these capabilities, prioritized based on the collected questions. Most exciting, though, are the new possibilities on the horizon with generative AI. Verisk knows this technology is rapidly advancing, and they are eager to harness innovations to bring even more value to their customers. As new models and techniques emerge, Verisk plans to adapt their AI companion to take advantage of the latest capabilities. Although the AI companion currently focuses on responding to user questions, this is only the starting point. Verisk plans to quickly improve its capabilities to proactively make suggestions and configure functionality directly in the system itself. The Verisk FAST team is inspired by the challenge of pushing the boundaries of what is possible with generative AI and is excited to test the limits of what’s possible.

Conclusion

Verisk’s journey in developing an AI companion for their FAST platform showcases the immense potential of generative AI to transform customer support and drive operational efficiencies. By meticulously harvesting, structuring, and retrieving data, and leveraging large language models, semantic search capabilities, and rigorous evaluation processes, Verisk has created a robust solution that provides accurate, real-time responses to user inquiries. As Verisk continues to expand the AI companion’s capabilities while adhering to ethical and responsible AI development practices, they are poised to unlock greater value for customers, enable staff to focus on innovation, and set new standards for customer support in the insurance industry.

For more information, see the following resources:


About the Authors

Tom Famularo was Co-Founder/CEO or FAST and lead’s Verisk Life Solutions, based in NJ. Tom is responsible for platform strategy, data/analytics, AI and Verisk’s life/annuity customers. His focus and passion are for teaching customers and team members how to allow technology to enable business outcomes with far less human effort. Outside of work, he’s an avid fan of his son’s baseball and football teams.

Abhay Shah leads engineering efforts for the FAST Platform at Verisk – Life Solutions, where he offers guidance on architecture and provides technical leadership for Customer Implementations and Product Development. With over two decades of experience in the technology sector, Abhay helps insurance carriers maximize the value of their ecosystem through modern technology and is excited by the opportunities that AI provides. Beyond his professional passion, he enjoys reading, traveling, and coaching the middle school robotics team.

Nicolette Kontor is a technology enthusiast who thrives on helping customers embrace digital transformation. In her current role at Verisk – Life Solutions, she spearheads the application of artificial intelligence to the FAST Platform, which she finds tremendously rewarding and exciting. With over 10 years of experience in major customer implementations and product development, Nicolette is driven to deliver innovative solutions that unlock value for insurance carriers. Beyond her professional pursuits, Nicolette is an avid traveler, having explored 39 countries to date. She enjoys winning trivia, reading mystery novels, and learning new languages.

Ryan Doty is a Sr. Solutions Architect at AWS, based out of New York. He helps enterprise customers in the Northeast U.S. accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.

Tarik Makota is a Senior Principal Solutions Architect with Amazon Web Services. He provides technical guidance, design advice, and thought leadership to AWS’ customers across the US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

Dom Bavaro is a Senior Solutions Architect for Financial Services. While providing technical guidance to customers across many use cases, He is focused on helping customer build and productionize Generative AI solutions and workflows

Read More