Use RAG for drug discovery with Knowledge Bases for Amazon Bedrock

Use RAG for drug discovery with Knowledge Bases for Amazon Bedrock

Amazon Bedrock provides a broad range of models from Amazon and third-party providers, including Anthropic, AI21, Meta, Cohere, and Stability AI, and covers a wide range of use cases, including text and image generation, embedding, chat, high-level agents with reasoning and orchestration, and more. Knowledge Bases for Amazon Bedrock allows you to build performant and customized Retrieval Augmented Generation (RAG) applications on top of AWS and third-party vector stores using both AWS and third-party models. Knowledge Bases for Amazon Bedrock automates synchronization of your data with your vector store, including diffing the data when it’s updated, document loading, and chunking, as well as semantic embedding. It allows you to seamlessly customize your RAG prompts and retrieval strategies—we provide the source attribution, and we handle memory management automatically. Knowledge Bases is completely serverless, so you don’t need to manage any infrastructure, and when using Knowledge Bases, you’re only charged for the models, vector databases and storage you use.

RAG is a popular technique that combines the use of private data with large language models (LLMs). RAG starts with an initial step to retrieve relevant documents from a data store (most commonly a vector index) based on the user’s query. It then employs a language model to generate a response by considering both the retrieved documents and the original query.

In this post, we demonstrate how to build a RAG workflow using Knowledge Bases for Amazon Bedrock for a drug discovery use case.

Overview of Knowledge Bases for Amazon Bedrock

Knowledge Bases for Amazon Bedrock supports a broad range of common file types, including .txt, .docx, .pdf, .csv, and more. To enable effective retrieval from private data, a common practice is to first split these documents into manageable chunks. Knowledge Bases has implemented a default chunking strategy that works well in most cases to allow you to get started faster. If you want more control, Knowledge Bases lets you control the chunking strategy through a set of preconfigured options. You can control the maximum token size and the amount of overlap to be created across chunks to provide coherent context to the embedding. Knowledge Bases for Amazon Bedrock manages the process of synchronizing data from your Amazon Simple Storage Service (Amazon S3) bucket, splits it into smaller chunks, generates vector embeddings, and stores the embeddings in a vector index. This process comes with intelligent diffing, throughput, and failure management.

At runtime, an embedding model is used to convert the user’s query to a vector. The vector index is then queried to find documents similar to the user’s query by comparing document vectors to the user query vector. In the final step, semantically similar documents retrieved from the vector index are added as context for the original user query. When generating a response for the user, the semantically similar documents are prompted in the text model, together with source attribution for traceability.

Knowledge Bases for Amazon Bedrock supports multiple vector databases, including Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, and Redis Enterprise Cloud. The Retrieve and RetrieveAndGenerate APIs allow your applications to directly query the index using a unified and standard syntax without having to learn separate APIs for each different vector database, reducing the need to write custom index queries against your vector store. The Retrieve API takes the incoming query, converts it into an embedding vector, and queries the backend store using the algorithms configured at the vector database level; the RetrieveAndGenerate API uses a user-configured LLM provided by Amazon Bedrock and generates the final answer in natural language. The native traceability support informs the requesting application about the sources used to answer a question. For enterprise implementations, Knowledge Bases supports AWS Key Management Service (AWS KMS) encryption, AWS CloudTrail integration, and more.

In the following sections, we demonstrate how to build a RAG workflow using Knowledge Bases for Amazon Bedrock, backed by the OpenSearch Serverless vector engine, to analyze an unstructured clinical trial dataset for a drug discovery use case. This data is information rich but can be vastly heterogenous. Proper handling of specialized terminology and concepts in different formats is essential to detect insights and ensure analytical integrity. With Knowledge Bases for Amazon Bedrock, you can access detailed information through simple, natural queries.

Build a knowledge base for Amazon Bedrock

In this section, we demo the process of creating a knowledge base for Amazon Bedrock via the console. Complete the following steps:

  1. On the Amazon Bedrock console, under Orchestration in the navigation pane, choose Knowledge base.
  2. Choose Create knowledge base.

  1. In the Knowledge base details section, enter a name and optional description.
  2. In the IAM permissions section, select Create and use a new service role.
  3. For Service name role, enter a name for your role, which must start with AmazonBedrockExecutionRoleForKnowledgeBase_.
  4. Choose Next.

  1. In the Data source section, enter a name for your data source and the S3 URI where the dataset sits. Knowledge Bases supports the following file formats:
    • Plain text (.txt)
    • Markdown (.md)
    • HyperText Markup Language (.html)
    • Microsoft Word document (.doc/.docx)
    • Comma-separated values (.csv)
    • Microsoft Excel spreadsheet (.xls/.xlsx)
    • Portable Document Format (.pdf)
  1. Under Additional settings¸ choose your preferred chunking strategy (for this post, we choose Fixed size chunking) and specify the chunk size and overlay in percentage. Alternatively, you can use the default settings.
  2. Choose Next.

  1. In the Embeddings model section, choose the Titan Embeddings model from Amazon Bedrock.
  2. In the Vector database section, select Quick create a new vector store, which manages the process of setting up a vector store.
  3. Choose Next.

  1. Review the settings and choose Create knowledge base.

  1. Wait for the knowledge base creation to complete and confirm its status is Ready.
  2. In the Data source section, or on the banner at the top of the page or the popup in the test window, choose Sync to trigger the process of loading data from the S3 bucket, splitting it into chunks of the size you specified, generating vector embeddings using the selected text embedding model, and storing them in the vector store managed by Knowledge Bases for Amazon Bedrock.

The sync function supports ingesting, updating, and deleting the documents from the vector index based on changes to documents in Amazon S3. You can also use the StartIngestionJob API to trigger the sync via the AWS SDK.

When the sync is complete, the Sync history shows status Completed.

Query the knowledge base

In this section, we demonstrate how to access detailed information in the knowledge base through straightforward and natural queries. We use an unstructured synthetic dataset consisting of PDF files, the page number of each ranging from 10–100 pages, simulating a clinical trial plan of a proposed new medicine including statistical analysis methods and participant consent forms. We use the Knowledge Bases for Amazon Bedrock retrieve_and_generate and retrieve APIs with Amazon Bedrock LangChain integration.

Before you can write scripts that use the Amazon Bedrock API, you’ll need to install the appropriate version of the AWS SDK in your environment. For Python scripts, this will be the AWS SDK for Python (Boto3):

pip install langchain
pip install boto3

Additionally, enable access to the Amazon Titan Embeddings model and Anthropic Claude v2 or v1. For more information, refer to Model access.

Generate questions using Amazon Bedrock

We can use Anthropic Claude 2.1 for Amazon Bedrock to propose a list of questions to ask on the clinical trial dataset:

import boto3
from langchain.llms.bedrock import Bedrock

bedrock_client = boto3.client("bedrock-runtime")

# Start with the query
prompt = "For medical research trial consent forms to sign, what are the top 5 questions can be asked?"

claude_llm = Bedrock(
    model_id="anthropic.claude-v2:1",
    model_kwargs={"temperature": 0, "top_k": 10, "max_tokens_to_sample": 3000},
    client=bedrock_client,
)

# Provide the prompt to the LLM to generate an answer to the query without any additional context provided
response = claude_llm(prompt)
questions = [
    item.split(".")[1].strip() for item in response.strip().split("nn")[1:-1]
]
questions
>>> answer:
'What is the purpose of the study? Make sure you understand the goals of the research and what the study procedures will entail',
'What are the risks and potential benefits? The form should explain all foreseeable risks, side effects, or discomforts you might experience from participating',
'What will participation involve? Get details on what tests, medications, lifestyle changes, or procedures you will go through, how much time it will take, and how long the study will last',
'Are there any costs or payments? Ask if you will be responsible for any costs related to the study or get paid for participating',
'How will my privacy be protected? The form should explain how your personal health information will be kept confidential before, during, and after the trial'

Use the Amazon Bedrock RetrieveAndGenerate API

For a fully managed RAG experience, you can use the native Knowledge Bases for Amazon Bedrock RetrieveAndGenerate API to obtain the answers directly:

bedrock_agent_client = boto3.client("bedrock-agent-runtime")

kb_id = "<YOUR_KNOWLEDGE_BASE_ID>"

def retrieveAndGenerate(
    input: str,
    kbId: str,
    region: str = "us-east-1",
    sessionId: str = None,
    model_id: str = "anthropic.claude-v2:1",
):
    model_arn = f"arn:aws:bedrock:{region}::foundation-model/{model_id}"

    if sessionId:
        return bedrock_agent_client.retrieve_and_generate(
            input={"text": input},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": kbId,
                    "modelArn": model_arn,
                },
            },
            sessionId=sessionId,
        )

    else:
        return bedrock_agent_client.retrieve_and_generate(
            input={"text": input},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": kbId,
                    "modelArn": model_arn,
                },
            },
        )

response = retrieveAndGenerate(
    "What are the potential risks and benefits of participating?", kb_id
)

generated_text = response["output"]["text"]
>>> "The potential risks include side effects from the study medication lithium such as nausea, loose stools, thirst, urination changes, shakiness, headaches, sweating, fatigue, decreased concentration, and skin rash. There is also a risk of lithium interaction with other medications. For women, there is a risk of birth defects if lithium is taken during pregnancy. There are no guaranteed benefits, but possible benefits include new information that could help the participant from the interviews and tests conducted during the study."

The cited information source can be obtained via the following code (with some of the output redacted for brevity):

response["citations"]

>>> [
    {
        "generatedResponsePart": {
            "textResponsePart": {
                "text": " The potential risks include side effects from the study...",
                "span": {"start": 0, "end": 361},
            }
        },
        "retrievedReferences": [
            {
                "content": {
                    "text": "590 ICF#2 Page 7 of 19 The primary risks and discomforts of participation…"
                },
                "location": {"type": "S3", "s3Location": {"uri": "s3://XXXX/XXXX.pdf"}},
            },
            {
                "content": {
                    "text": "N/A CSP 590 ICF#2 Page 10 of 19 Risks associated with suddenly stopping study medications..."
                },
                "location": {"type": "S3", "s3Location": {"uri": "s3://XXXX/XXXX.pdf"}},
            },
        ],
    },
    {
        "generatedResponsePart": {
            "textResponsePart": {
                "text": " There are no guaranteed benefits, but possible benefits include...",
                "span": {"start": 363, "end": 531},
            }
        },
        "retrievedReferences": [
            {
                "content": {
                    "text": "research, not usual clinical care. After these are done we ask..."
                },
                "location": {"type": "S3", "s3Location": {"uri": "s3://XXXX/XXXX.pdf"}},
            }
        ],
    },
]

By passing the session ID of the RetrieveAndGenerate API, you can preserve the conversation context and ask follow-up questions. For example, without the context, if you ask for more details from the previous answer, it may not be able to answer correctly:

retrieveAndGenerate("elaborate more on the first side effect", kb_id, sessionId=None)["output"]["text"]
>>> "The search results do not provide additional details about the mild nausea side effect that would allow me to elaborate further on it."

But by passing the session ID, the RAG pipeline is able to identify the corresponding context and return relevant answers:

retrieveAndGenerate("elaborate more on the first side effect", kb_id, sessionId=response["sessionId"])["output"]["text"]
>>> "The search results provide details that nausea from taking lithium is usually mild and goes away after days or weeks for most people. Specifically, up to 75% of people may experience mild nausea when first starting lithium, but this goes away in 90-99% of people who continue taking it."

The following table shows the retrieved answers to all the corresponding questions.

Question Answer
What is the purpose of the study? Make sure you understand the goals of the research and what the study procedures will entail. The purpose of the study is to test whether lithium is effective at preventing repeated suicidal self-directed violence in patients with depression or bipolar disorder.
What are the risks and potential benefits? The form should explain all foreseeable risks, side effects, or discomforts you might experience from participating. The possible risks or discomforts include: the interview questions causing discomfort, side effects from the lithium medication such as nausea, loose stools, thirst, urination changes, shakiness, headaches, sweating, fatigue, decreased concentration, skin rash, thyroid changes, worsening acne/psoriasis, lithium toxicity, and risks if the medication is suddenly stopped. The potential benefits are that the tests may lead to new information to help the participant, and lithium may help prevent repeated suicidal self-directed violence for those with depression or bipolar disorder.
What will participation involve? Get details on what tests, medications, lifestyle changes, or procedures you will go through, how much time it will take, and how long the study will last. Participation will involve completing an interview and questionnaires covering thinking, behaviors, mental health treatment, medications, alcohol and drug use, home and social supports, and understanding of the research study. This takes about two hours and can be done in multiple sessions, in person and by phone. If eligible for the full study, there will be about 20 study visits over one year. This will involve taking study medication, having vital signs checked, completing questionnaires, reviewing side effects, and continuing normal medical and mental health care.
Are there any costs or payments? Ask if you will be responsible for any costs related to the study or get paid for participating. Yes, there are costs and payments discussed in the search results. You will not be charged for any treatments or procedures that are part of the study. However, you will still have to pay any usual VA co-payments for care and medications not related to the study. You will not be paid for participation, but the study will reimburse expenses related to participation like transportation, parking, etc. Reimbursement amounts and process are provided.
How will my privacy be protected? The form should explain how your personal health information will be kept confidential before, during, and after the trial. Your privacy will be protected by conducting interviews in private, keeping written notes in locked files and offices, storing electronic information in encrypted and password protected files, and obtaining a Confidentiality Certificate from the Department of Health and Human Services to prevent disclosing information that identifies you. Information that identifies you may be shared with doctors responsible for your care or for audits and evaluations by government agencies, but talks and papers about the study will not identify you.

Query using the Amazon Bedrock Retrieve API

To customize your RAG workflow, you can use the Retrieve API to fetch the relevant chunks based on your query and pass it to any LLM provided by Amazon Bedrock. To use the Retrieve API, define it as follows:

def retrieve(query: str, kbId: str, numberOfResults: int = 5):
    return bedrock_agent_client.retrieve(
        retrievalQuery={"text": query},
        knowledgeBaseId=kbId,
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": numberOfResults}
        },
    )

Retrieve the corresponding context (with some of the output redacted for brevity):

query = "What is the purpose of the medical research study?"
response = retrieve(query, kb_id, 3)
retrievalResults = response["retrievalResults"]
>>> [
    {
        "content": {"text": "You will not be charged for any procedures that..."},
        "location": {"type": "S3", "s3Location": {"uri": "s3://XXXXX/XXXX.pdf"}},
        "score": 0.6552521,
    },
    {
        "content": {"text": "and possible benefits of the study. You have been..."},
        "location": {"type": "S3", "s3Location": {"uri": "s3://XXXX/XXXX.pdf"}},
        "score": 0.6581577,
    },
    ...,
]

Extract the context for the prompt template:

def get_contexts(retrievalResults):
    contexts = []
    for retrievedResult in retrievalResults:
        contexts.append(retrievedResult["content"]["text"])
    return " ".join(contexts)

contexts = get_contexts(retrievalResults)

Import the Python modules and set up the in-context question answering prompt template, then generate the final answer:

from langchain.prompts import PromptTemplate

PROMPT_TEMPLATE = """
Human: You are an AI system working on medical trial research, and provides answers to questions 
by using fact based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

<context>
{context_str}
</context>

<question>
{query_str}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

claude_prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context_str", "query_str"]
)

prompt = claude_prompt.format(context_str=contexts, query_str=query)
response = claude_llm(prompt)
>>> "Based on the context provided, the purpose of this medical research study is to evaluate the efficacy of lithium compared to a placebo in preventing suicide over a 1 year period. Specifically, participants will be randomly assigned to receive either lithium or a placebo pill for 1 year, with their doctors and the participants themselves not knowing which treatment they receive (double-blind). Blood lithium levels will be monitored and doses adjusted over the first 6-8 visits, then participants will be followed monthly for 1 year to assess outcomes."

Query using Amazon Bedrock LangChain integration

To create an end-to-end customized Q&A application, Knowledge Bases for Amazon Bedrock provides integration with LangChain. To set up the LangChain retriever, provide the knowledge base ID and specify the number of results to return from the query:

from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever

retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=kb_id,
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

Now set up LangChain RetrievalQA and generate answers from the knowledge base:

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=claude_llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": claude_prompt},
)

[qa(q)["result"] for q in questions]

This will generate corresponding answers similar to the ones listed in the earlier table.

Clean up

Make sure to delete the following resources to avoid incurring additional charges:

Conclusion

Amazon Bedrock provides a broad set of deeply integrated services to power RAG applications of all scales, making it straightforward to get started with analyzing your company data. Knowledge Bases for Amazon Bedrock integrates with Amazon Bedrock foundation models to build scalable document embedding pipelines and document retrieval services to power a wide range of internal and customer-facing applications. We are excited about the future ahead, and your feedback will play a vital role in guiding the progress of this product. To learn more about the capabilities of Amazon Bedrock and knowledge bases, refer to Knowledge base for Amazon Bedrock.


About the Authors

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book – Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning (ML) projects in various domains such as computer vision, natural language processing and generative AI. She helps customers to build, train and deploy large machine learning models at scale. She speaks in internal and external conferences such re:Invent, Women in Manufacturing West, YouTube webinars and GHC 23. In her free time, she likes to go for long runs along the beach.

Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends.

Derrick Choo is a Senior Solutions Architect at AWS focused on accelerating customer’s journey to the cloud and transforming their business through the adoption of cloud-based solutions. His expertise is in full stack application and machine learning development. He helps customers design and build end-to-end solutions covering frontend user interfaces, IoT applications, API and data integrations and machine learning models. In his free time, he enjoys spending time with his family and experimenting with photography and videography.

Frank Winkler is a Senior Solutions Architect and Generative AI Specialist at AWS based in Singapore, focused in Machine Learning and Generative AI. He works with global digital native companies to architect scalable, secure, and cost-effective products and services on AWS. In his free time, he spends time with his son and daughter, and travels to enjoy the waves across ASEAN.

Nihir Chadderwala is a Sr. AI/ML Solutions Architect in the Global Healthcare and Life Sciences team. His expertise is in building Big Data and AI-powered solutions to customer problems especially in biomedical, life sciences and healthcare domain. He is also excited about the intersection of quantum information science and AI and enjoys learning and contributing to this space. In his spare time, he enjoys playing tennis, traveling, and learning about cosmology.

Read More

Unlock personalized experiences powered by AI using Amazon Personalize and Amazon OpenSearch Service

Unlock personalized experiences powered by AI using Amazon Personalize and Amazon OpenSearch Service

OpenSearch is a scalable, flexible, and extensible open source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license. Amazon OpenSearch Service is a fully managed service that makes it straightforward to deploy, scale, and operate OpenSearch in the AWS Cloud.

OpenSearch uses a probabilistic ranking framework called BM-25 to calculate relevance scores. If a distinctive keyword appears more frequently in a document, BM-25 assigns a higher relevance score to that document. This framework, however, doesn’t consider user behavior like click-through or purchase data, which can further improve relevance for individual users.

Improving the functionality of search is an integral aspect of enhancing the overall user experience and engagement on a website or application. Search traffic is considered high intent because users are actively seeking a particular item, and they have been found to convert up to two times more than non-site search visitors on average. By using user interaction data such as clicks, likes, and purchases, businesses can improve search relevancy to capitalize on this traffic and reduce instances of users abandoning their sessions due to difficulties in finding the desired items. By refining the quality of search results, businesses can significantly improve their customer engagement, satisfaction, and loyalty, as well as increase their conversion rates, ultimately leading to greater profitability and success.

Amazon Personalize allows you to add sophisticated personalization capabilities to your applications by using the same machine learning (ML) technology used on Amazon.com for over 20 years. No ML expertise is required.

Amazon Personalize supports the automatic adjustment of recommendations based on contextual information about your user, such as device type, location, time of day, or other information you provide. You supply Amazon Personalize with historical data about your users and their interactions within your application, such as purchase history, ratings, and likes. You can add data to Amazon Personalize in bulk by importing large historical datasets all at once from an Amazon Simple Storage Service (Amazon S3) CSV file, using a format required by Amazon Personalize. You can also add data incrementally by importing records using the Amazon Personalize console or API. After your historical data is imported, you can continue to provide new data in real time by sending user interaction events. Based on the use case you want to address, such as product recommendations, you select a pre-built recipe that is optimized for that goal. Amazon Personalize analyzes your data and trains a custom ML model based on the parameters in the recipe to generate personalized recommendations optimized for your users and application. After the model is trained, you can generate real-time personalized recommendations for your users.

With the newly launched Amazon Personalized Search Plugin for Amazon OpenSearch Service, you can use user interaction histories and interests to enhance their search results. By utilizing an Amazon Personalize recipe such as Personalized-Ranking, you can help boost search results for relevant items based on user interests at the time of getting search results from OpenSearch Service.

This post explains how to integrate the Amazon Personalize Search Ranking plugin with OpenSearch Service to enable personalized search experiences. To build Amazon Personalize artifacts in this post, we use a dataset from IMDb, the world’s most authoritative source for movie, TV, and celebrity content, available on AWS Marketplace, as well as the MovieLens dataset prepared by GroupLens research at the University of Minnesota, consisting of user rankings for various movies.

Solution overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. A user issues a search request through their website or portal. This search request is sent to OpenSearch Service.
  2. The top N search results are returned from the OpenSearch Service index and sent to the plugin to preprocess and prepare the input for an Amazon Personalize campaign.
  3. The request is sent to Amazon Personalize to get the re-ranked search results.
  4. Amazon Personalize returns the personalized ranking of the search results with the relevant score for each result.
  5. The reranked hits are returned by the plugin to OpenSearch Service, with a weighting applied between the OpenSearch Service relevance score and the Amazon Personalize personalized ranking score. You specify a weight parameter (between 0.0–1.0) that controls the balance between OpenSearch Service and Amazon Personalize when reranking results. A higher weight means more influence from the Amazon Personalize ranking scores vs. the OpenSearch Service scores. This allows you to customize how much the personalized recommendations affect the final search results ranking returned to the user.
  6. The user gets personalized search results based on their preferences and interactions.

Prerequisites

You should have the following prerequisites:

  • An AWS account.
  • An AWS Identity and Access Management (IAM) role with appropriate access permissions. We provide AWS CloudFormation templates and Jupyter notebooks to help set up the required IAM role and access.
  • To enable personalization in OpenSearch Service, you need to set up the required Amazon Personalize resources, including a dataset group, solution version, and campaign. We have provided a Jupyter notebook that creates all the Amazon Personalize resources, taking advantage of the fully managed Jupyter notebook instance capabilities of Amazon SageMaker.

Deploy the CloudFormation stack

The CloudFormation stack automates the deployment of the OpenSearch Service domain and SageMaker Notebook instance. Complete the following steps to deploy the stack:

  1. Sign in to the AWS Management Console with your credentials in the account where you want to deploy the CloudFormation stack.
  2. Launch the CloudFormation stack directly.
  3. On the Specify details page, provide any parameters required by the template, such as OpenSearch Service and SageMaker instance sizes.
  4. On the Configure stack options page, specify a stack name and any other options you want to set.
  5. Complete creating the stack and monitor the status on the stack details page.
  6. After the stack is created, open the SageMaker notebook instance from the console.

The notebook instance will already be preloaded with the required notebooks.

Set up and complete the Amazon Personalize workflow

Open the 1.Configure_Amazon_Personalize.ipynb notebook to set up the Amazon Personalize artifacts. This notebook walks you through the following steps:

  1. Download the dataset and preprocess the data to create the required input files for creating the datasets.
  2. Create a dataset group.
  3. Create datasets and schemas.
  4. Prepare and import data.
  5. Create a solution and a solution version.
  6. Create a campaign for the solution version.

Install the Amazon Personalize Search Ranking plugin using a Jupyter notebook

Open the 2.Configure_Amazon_OpenSearch.ipynb notebook and run through the instructions. This notebook walks you through the following steps:

  1. Ingest sample index data into the OpenSearch Service instance. Populating the index with representative data facilitates thorough testing and validation of the plugin.
  2. Install the plugin package in the OpenSearch Service domain. This integrates the personalization capabilities into the OpenSearch environment.
  3. Set up search pipelines to activate the plugin’s functionality. Search pipelines contain request preprocessors and response postprocessors that transform queries and results. When constructing a pipeline, specify the Amazon Personalize campaign ARN created earlier in a personalized_search_ranking postprocessor to enable personalized re-ranking. This configures the plugin to retrieve real-time personalization results from Amazon Personalize for application during result processing. Defining pipelines allows the plugin to augment search relevance based on user preferences.

Install the Amazon Personalize Search Ranking plugin using the console

You can also set up the Amazon Personalize search plugin from the console. You only need to do this if you have not installed the plugin using the Jupyter notebook from earlier.

To install the Amazon Personalize Search Ranking plugin on OpenSearch Service, complete the following steps:

  1. On the OpenSearch Service console, navigate to your domain.
  2. On the Packages tab, choose Associate package to associate the Amazon Personalize Search Ranking plugin with your OpenSearch Service domain. The plugin version must match the OpenSearch Service domain version.

The Amazon Personalize Search Ranking plugin can be installed on OpenSearch Service versions 2.9 and above.

  1. Locate the Amazon Personalize Search Ranking plugin in the list of available plugins.
  2. Choose Associate next to the plugin to install it and associate it with your existing OpenSearch Service domain.

After you have connected the plugin, it will appear in the list of packages as a plugin type. With the plugin installed, the installation process is now finished.

Enable the Amazon Personalize Search Ranking plugin

The Amazon Personalize Search Ranking plugin uses the search-pipeline feature of OpenSearch Service, released starting with version 2.9. The plugin depends on the search-pipeline feature to apply Amazon Personalized ranking on search results provided by OpenSearch Service and also needs to be set up as a search-pipeline response processor. This pipeline definition will contain configuration for the Amazon Personalize plugin, which includes the Amazon Personalize campaign to call for getting Amazon Personalize ranking, the IAM role to access Amazon Personalize resources, as well as the parameters defined in the following table.

Settings Required Default Description
campaign Yes None Specify the ARN of the Amazon Personalize campaign to use to personalize results.
recipe Yes None Specify the name of the Amazon Personalize recipe to use. As of this writing, aws-personalized-ranking is the only supported value.
item_id_field No “_id” If the _id field for an indexed document in OpenSearch doesn’t correspond with your Amazon Personalize itemId, specify the name of the field that does.
weight Yes None Specify the emphasis that the response processor puts on personalization when it re-ranks results. Specify a value within a range of 0.0–1.0. The closer to 1.0 that it is, the more likely it is that results from Amazon Personalize rank higher. If you specify 0.0, no personalization occurs and OpenSearch Service takes precedence.
tag No None Specify an identifier for the processor.
iam_role_arn Yes None Specify the IAM role to access Amazon Personalize resources. This is required for OpenSearch Service, and optional for open source OpenSearch.
aws_region Yes None Specify the AWS Region where you created your Amazon Personalize campaign.
ignore_failure No None Specify whether the plugin ignores any processor failures. For values, specify true or false. For your production environments, we recommend that you specify true to avoid any interruptions for query responses. For test environments, you can specify false to view any errors that the plugin generates.
external_account_iam_role_arn No None If you use OpenSearch Service and your Amazon Personalize and OpenSearch Service resources exist in different accounts, specify the ARN of the role that has permission to access to Amazon Personalize.

The following Python code snippet creates a search pipeline with a personalized_search_ranking response processor on an OpenSearch Service domain. You run this step one time as a part of the notebook that accompanies this post:

Define search pipeline for personalized ranking

You can use the following Python code to create a search pipeline with a personalized_search_ranking response processor on an OpenSearch Service domain. Replace domain endpoint with your domain endpoint URL. For example: https://<domain name>.<AWS region>.es.amazonaws.com.

import requests
from requests_auth_aws_sigv4 import AWSSigV4

domain_endpoint = 'domain endpoint'
pipeline_name = 'pipeline name'
url = f'{domain_endpoint}/_search/pipeline/{pipeline_name}'
auth = AWSSigV4('es')

headers = {'Content-Type': 'application/json'}

body = {
  "description": "A pipeline to apply custom re-ranking from Amazon Personalize",
  "response_processors": [
    {
      "personalized_search_ranking" : {
        "campaign_arn" : "<Replace with Amazon Personalize Campaign ARN>",
        "item_id_field" : "itemId",
        "recipe" : "aws-personalized-ranking",
        "weight" : "0.3",
        "tag" : "personalize-processor",
        "iam_role_arn": "<Replace with Role ARN>",
        "aws_region": "<Replace with AWS region>",
        "ignore_failure": true
    }
  ]
}
try:
    response = requests.put(url, auth=auth, json=body, headers=headers)
    print(response.text)
except Exception as e:
    print(f"Error: {e}")

Apply a search pipeline to an individual query

After you configure a search pipeline with a personalized_search_ranking response processor, you can apply the Amazon Personalize Search Ranking plugin to your OpenSearch queries and view the re-ranked results. Update the code to specify your domain endpoint, your OpenSearch Service index, the name of your pipeline (you configured above), and your query (we use “Tom Cruise” for query). For user_id, specify the ID of the user that you’re getting search results for. This user must be in the data that you used to create your Amazon Personalize solution version.

import requests
from requests_auth_aws_sigv4 import AWSSigV4

domain_endpoint = 'domain endpoint'
index = 'index name'
url = f'{domain_endpoint}/{index}/_search/'

auth = AWSSigV4('es')
headers = {'Content-Type': 'application/json'}
params = {"search_pipeline": "<Replace with pipeline-name>"}
body = {
    "query": {
        "multi_match": {
            "query": "Tom Cruise",
            "fields": ["title", "plot", "genres", "directedBy", "starring"]
        }
    },
    "ext": {
        "personalize_request_parameters": {
            "user_id": "<Replace with USER ID>"
        }
    }
}
try:
    response = requests.post(url, auth=auth, params=params, json=body, headers=headers)
    print(response)
except Exception as e:
    print(f"Error: {e}")

Evaluate the results

Open the 3.Testing.ipynb notebook and walk through the steps to test and compare the results for queries that use personalization and those that don’t. The Amazon Personalize Search Ranking plugin re-ranks the search results in the OpenSearch Service query response. It considers both the ranking from Amazon Personalize and the ranking from OpenSearch Service. This notebook walks you through the following steps:

  1. Define the necessary connection parameters to establish a connection with your OpenSearch Service domain. This involves specifying the domain endpoint, authentication credentials, and any additional configuration settings required for your specific OpenSearch Service setup.
  2. Create a set of sample queries, including queries with personalization parameters and queries without personalization parameters. These queries will be used to evaluate the impact of personalization on the search results.
  3. Run and compare the results for queries that use personalization and those that do not.

For our example, we used a query for “Tom Cruise” and for the personalization parameter, we used a user with a recent history of viewing drama and romance film genres. The subsequent search results exhibit how the plugin tailors and prioritizes recommendations predicated on the user’s observed viewing behavior. This exemplifies the plugin’s ability to deliver a customized, curated experience by considering individual user preferences and engagement patterns. The capability to refine and attune search outcomes based on inferences of a user’s preferences enables delivering enhanced relevance and utility.

Personalized vs. non-personalized results

Let’s consider personalizing results for a user with ID 12. First, we check this user’s recent interactions by running the code in the 3.Testing.ipynb notebook to retrieve their interaction history. This allows us to see what types of movies this user has reviewed recently, which can inform how we personalize recommendations for them.

In this example, we see that the user has expressed interest in drama, romance, and thriller movie genres. To provide personalized recommendations, we first run queries with personalization parameters enabled, utilizing the user’s genre preferences. We then run the same queries without personalization enabled, for comparison. The following results show the difference between the non-personalized and personalized recommendation outputs.

The first two columns display the default OpenSearch Service results for the query “Tom Cruise” on a movies index, showing a variety of Tom Cruise films across different genres. The next two columns showcase personalized OpenSearch Service results for the same “Tom Cruise” query, but customized for a user interested in drama, romance, and thriller genres. Compared to the generic results, the personalized results prominently feature Tom Cruise movies in the user’s preferred drama, romance, and thriller genres. The delta highlights how the personalized results have been re-ranked relative to the non-personalized results, prioritizing films that match the user’s genre preferences. This demonstrates how personalization can tailor OpenSearch Service results to individual users’ tastes and interests.

This comparison demonstrates how Amazon Personalize can customize OpenSearch Service movie results to match an individual user’s interests. Although standard OpenSearch Service aims to universally serve relevant movie results for Tom Cruise, Amazon Personalize tailors the results to focus on Tom Cruise films it predicts this user will enjoy based on their unique viewing history and preferences.

The side-by-side results illustrate how Amazon Personalize provides a more targeted, user-centric search experience by personalizing the movie results to the individual.

Clean up

Complete the following steps to clean up your resources:

  1. Follow the steps in the 4.Cleanup.ipynb notebook to clean up the resources created through the notebook.
  2. On the AWS CloudFormation console, delete the stack that you created.

Conclusion

The Amazon Personalize Search Ranking plugin integrates seamlessly with OpenSearch Service to enable personalized search experiences. By using user behavior data and the ML capabilities of Amazon Personalize, the plugin can reorder OpenSearch Service result rankings to boost relevance for each unique user. This creates a custom-tailored search experience that surfaces the most relevant content higher in the results. The plugin is configurable to balance personalization with OpenSearch Service native scoring to fit diverse use cases. Overall, the Amazon Personalize Search Ranking plugin is a powerful way to enhance OpenSearch Service search relevance and engagement by factoring in the individual interests and preferences of your users. With just a few configuration steps, you can start serving hyper-relevant results that resonate strongly with your users.

Additional resources


About the Authors

James Jory is a Principal Solutions Architect in Applied AI with AWS. He has a special interest in personalization and recommender systems and a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and auto racing simulations.

Reagan Rosario is a Solutions Architect at AWS, specializing in building scalable, highly available, and secure cloud solutions for education technology companies. With over 10 years of experience in software engineering and architecture roles, Reagan loves using his technical knowledge to help AWS customers architect robust cloud solutions that leverage the breadth and depth of AWS.

Read More

Automate Amazon SageMaker Pipelines DAG creation

Automate Amazon SageMaker Pipelines DAG creation

Creating scalable and efficient machine learning (ML) pipelines is crucial for streamlining the development, deployment, and management of ML models. In this post, we present a framework for automating the creation of a directed acyclic graph (DAG) for Amazon SageMaker Pipelines based on simple configuration files. The framework code and examples presented here only cover model training pipelines, but can be readily extended to batch inference pipelines as well.

This dynamic framework uses configuration files to orchestrate preprocessing, training, evaluation, and registration steps for both single-model and multi-model use cases based on user-defined Python scripts, infrastructure needs (including Amazon Virtual Private Cloud (Amazon VPC) subnets and security groups, AWS Identity and Access Management (IAM) roles, AWS Key Management Service (AWS KMS) keys, containers registry, and instance types), input and output Amazon Simple Storage Service (Amazon S3) paths, and resource tags. Configuration files (YAML and JSON) allow ML practitioners to specify undifferentiated code for orchestrating training pipelines using declarative syntax. This enables data scientists to quickly build and iterate on ML models, and empowers ML engineers to run through continuous integration and continuous delivery (CI/CD) ML pipelines faster, decreasing time to production for models.

Solution overview

The proposed framework code starts by reading the configuration files. It then dynamically creates a SageMaker Pipelines DAG based on the steps declared in the configuration files and the interactions and dependencies among steps. This orchestration framework caters to both single-model and multi-model use cases, and provides a smooth flow of data and processes. The following are the key benefits of this solution:

  • Automation – The entire ML workflow, from data preprocessing to model registry, is orchestrated with no manual intervention. This reduces the time and effort required for model experimentation and operationalization.
  • Reproducibility – With a predefined configuration file, data scientists and ML engineers can reproduce the entire workflow, achieving consistent results across multiple runs and environments.
  • Scalability Amazon SageMaker is used throughout the pipeline, enabling ML practitioners to process large datasets and train complex models without infrastructure concerns.
  • Flexibility – The framework is flexible and can accommodate a wide range of ML use cases, ML frameworks (such as XGBoost and TensorFlow), multi-model training, and multi-step training. Every step of the training DAG can be customized via the configuration file.
  • Model governance – The Amazon SageMaker Model Registry integration allows for tracking model versions, and therefore promoting them to production with confidence.

The following architecture diagram depicts how you can use the proposed framework during both experimentation and operationalization of ML models. During experimentation, you can clone the framework code repository provided in this post and your project-specific source code repositories into Amazon SageMaker Studio, and set your virtual environment (detailed later in this post). You can then iterate on preprocessing, training, and evaluation scripts, as well as configuration choices. To create and run a SageMaker Pipelines training DAG, you can call the framework’s entry point, which will read all the configuration files, create the necessary steps, and orchestrate them based on the specified step ordering and dependencies.

During operationalization, the CI pipeline clones the framework code repository and project-specific training repositories into an AWS CodeBuild job, where the framework’s entry point script is called to create or update the SageMaker Pipelines training DAG, and then run it.

Repository structure

The GitHub repository contains the following directories and files:

  • /framework/conf/ – This directory contains a configuration file that is used to set common variables across all modeling units such as subnets, security groups, and IAM role at the runtime. A modeling unit is a sequence of up to six steps for training an ML model.
  • /framework/createmodel/ – This directory contains a Python script that creates a SageMaker model object based on model artifacts from a SageMaker Pipelines training step. The model object is later used in a SageMaker batch transform job for evaluating model performance on a test set.
  • /framework/modelmetrics/ – This directory contains a Python script that creates an Amazon SageMaker Processing job for generating a model metrics JSON report for a trained model based on results of a SageMaker batch transform job performed on test data.
  • /framework/pipeline/ – This directory contains Python scripts that use Python classes defined in other framework directories to create or update a SageMaker Pipelines DAG based on the specified configurations. The model_unit.py script is used by pipeline_service.py to create one or more modeling units. Each modeling unit is a sequence of up to six steps for training an ML model: process, train, create model, transform, metrics, and register model. Configurations for each modeling unit should be specified in the model’s respective repository. The pipeline_service.py also sets dependencies among SageMaker Pipelines steps (how steps within and across modeling units are sequenced or chained) based on the sagemakerPipeline section, which should be defined in the configuration file of one of the model repositories (the anchor model). This allows you to override default dependencies inferred by SageMaker Pipelines. We discuss the configuration file structure later in this post.
  • /framework/processing/ – This directory contains a Python script that creates a SageMaker Processing job based on the specified Docker image and entry point script.
  • /framework/registermodel/ – This directory contains a Python script for registering a trained model along with its calculated metrics in SageMaker Model Registry.
  • /framework/training/ – This directory contains a Python script that creates a SageMaker training job.
  • /framework/transform/ – This directory contains a Python script that creates a SageMaker batch transform job. In the context of model training, this is used to calculate the performance metric of a trained model on test data.
  • /framework/utilities/ – This directory contains utility scripts for reading and joining configuration files, as well as logging.
  • /framework_entrypoint.py – This file is the entry point of the framework code. It calls a function defined in the /framework/pipeline/ directory to create or update a SageMaker Pipelines DAG and run it.
  • /examples/ – This directory contains several examples of how you can use this automation framework to create simple and complex training DAGs.
  • /env.env – This file allows you to set common variables such as subnets, security groups, and IAM role as environment variables.
  • /requirements.txt – This file specifies Python libraries that are required for the framework code.

Prerequisites

You should have the following prerequisites before deploying this solution:

  • An AWS account
  • SageMaker Studio
  • A SageMaker role with Amazon S3 read/write and AWS KMS encrypt/decrypt permissions
  • An S3 bucket for storing data, scripts, and model artifacts
  • Optionally, the AWS Command Line Interface (AWS CLI)
  • Python3 (Python 3.7 or greater) and the following Python packages:
    • boto3
    • sagemaker
    • PyYAML
  • Additional Python packages used in your custom scripts

Deploy the solution

Complete the following steps to deploy the solution:

  1. Organize your model training repository according to the following structure:
    <MODEL-DIR-REPO>
     .
    ├── <MODEL-DIR>
    |    ├── conf
    |    |   └── conf.yaml
    |    └── scripts
    |        ├── preprocess.py
    |        ├── train.py
    |        ├── transform.py
    |        └── evaluate.py
    └── README.md
    

  2. Clone the framework code and your model source code from the Git repositories:
    • Clone dynamic-sagemaker-pipelines-framework repo into a training directory. In the following code, we assume the training directory is called aws-train:
      git clone https://github.com/aws-samples/dynamic-sagemaker-pipelines-framework.git aws-train

    • Clone the model source code under the same directory. For multi-model training, repeat this step for as many models as you need to train.
      git clone https:<MODEL-DIR-REPO>.git aws-train

For single-model training, your directory should look like the following:

<aws-train>  
.  
├── framework
└── <MODEL-DIR>

For multi-model training, your directory should look like the following:

<aws-train>  
.  
├── framework
└── <MODEL-DIR-1>
└── <MODEL-DIR-2>
└── <MODEL-DIR-3>
  1. Set up the following environment variables. Asterisks indicate environment variables that are required; the rest are optional.
Environment Variable Description
SMP_ACCOUNTID* AWS account where the SageMaker pipeline is run
SMP_REGION* AWS Region where the SageMaker pipeline is run
SMP_S3BUCKETNAME* S3 bucket name
SMP_ROLE* SageMaker role
SMP_MODEL_CONFIGPATH* Relative path of the of single-model or multi-model configuration files
SMP_SUBNETS Subnet IDs for SageMaker networking configuration
SMP_SECURITYGROUPS Security group IDs for SageMaker networking configuration

For single-model use cases, SMP_MODEL_CONFIGPATH will be <MODEL-DIR>/conf/conf.yaml. For multi-model use cases, SMP_MODEL_CONFIGPATH will be */conf/conf.yaml, which allows you to find all conf.yaml files using Python’s glob module and combine them to form a global configuration file. During experimentation (local testing), you can specify environment variables inside the env.env file and then export them by running the following command in your terminal:

source env.env

Note that the values of environment variables in env.env should be placed inside quotation marks (for example, SMP_REGION="us-east-1"). During operationalization, these environment variables should be set by the CI pipeline.

  1. Create and activate a virtual environment by running the following commands:
    python -m venv .venv
    
    source .venv/bin/activate

  2. Install the required Python packages by running the following command:
    pip install -r requirements.txt

  3. Edit your model training conf.yaml files. We discuss the configuration file structure in the next section.
  4. From the terminal, call the framework’s entry point to create or update and run the SageMaker Pipeline training DAG:
    python framework/framework_entrypoint.py

  5. View and debug the SageMaker Pipelines run on the Pipelines tab of the SageMaker Studio UI.

Configuration file structure

There are two types of configuration files in the proposed solution: framework configuration and model configuration. In this section, we describe each in detail.

Framework configuration

The /framework/conf/conf.yaml file sets the variables that are common across all modeling units. This includes SMP_S3BUCKETNAME, SMP_ROLE, SMP_MODEL_CONFIGPATH, SMP_SUBNETS, SMP_SECURITYGROUPS, and SMP_MODELNAME. Refer to Step 3 of deployment instructions for descriptions of these variables and how to set them via environment variables.

Model configuration

For each model in the project, we need to specify the following in the <MODEL-DIR>/conf/conf.yaml file (asterisks indicate required sections; the rest are optional):

  • /conf/models* – In this section, you can configure one or more modeling units. When the framework code is run, it will automatically read all configuration files during runtime and append them to the config tree. Theoretically, you can specify all modeling units in the same conf.yaml file, but it’s recommended to specify each modeling unit configuration in its respective directory or Git repository to minimize errors. The units are as follows:
    • {model-name}* – The name of the model.
    • source_directory* – A common source_dir path to use for all steps within the modeling unit.
    • preprocess – This section specifies preprocessing parameters.
    • train* – This section specifies training job parameters.
    • transform* – This section specifies SageMaker Transform job parameters for making predictions on the test data.
    • evaluate – This section specifies SageMaker Processing job parameters for generating a model metrics JSON report for the trained model.
    • registry* – This section specifies parameters for registering the trained model in SageMaker Model Registry.
  • /conf/sagemakerPipeline* – This section defines the SageMaker Pipelines flow, including dependencies among steps. For single-model use cases, this section is defined at the end of the configuration file. For multi-model use cases, the sagemakerPipeline section only needs to be defined in the configuration file of one of the models (any of the models). We refer to this model as the anchor model. The parameters are as follows:
    • pipelineName* – Name of the SageMaker pipeline.
    • models* – Nested list of modeling units:
      • {model-name}* – Model identifier, which should match a {model-name} identifier in the /conf/models section.
        • steps*
          • step_name* – Step name to be displayed in the SageMaker Pipelines DAG.
          • step_class* – (Union[Processing, Training, CreateModel, Transform, Metrics, RegisterModel])
          • step_type* – This parameter is only required for preprocessing steps, for which it should be set to preprocess. This is needed to distinguish preprocess and evaluate steps, both of which have a step_class of Processing.
          • enable_cache – ([Union[True, False]]). This indicates whether to enable SageMaker Pipelines caching for this step.
          • chain_input_source_step – ([list[step_name]]). You can use this to set the channel outputs of another step as input to this step.
          • chain_input_additional_prefix – This is only allowed for steps of the Transform step_class, and can be used in conjunction with chain_input_source_step parameter to pinpoint the file that should be used as the input to the transform step.
    • dependencies – This section specifies the sequence in which the SageMaker Pipelines steps should be run. We have adapted the Apache Airflow notation for this section (for example, {step_name} >> {step_name}). If this section is left blank, explicit dependencies specified by the chain_input_source_step parameter or implicit dependencies define the SageMaker Pipelines DAG flow.

Note that we recommend having one training step per modeling unit. If multiple training steps are defined for a modeling unit, the subsequent steps implicitly take the last training step to create the model object, calculate metrics, and register the model. If you need to train multiple models, it’s recommended to create multiple modeling units.

Examples

In this section, we demonstrate three examples of ML model training DAGs created using the presented framework.

Single-model training: LightGBM

This is a single-model example for a classification use case where we use LightGBM in script mode on SageMaker. The dataset consists of categorical and numerical variables to predict the binary label Revenue (to predict if the subject makes a purchase or not). The preprocessing script is used to model the data for training and testing and then stage it in an S3 bucket. The S3 paths are then provided to the training step in the configuration file.

When the training step runs, SageMaker loads the file on the container at /opt/ml/input/data/{channelName}/, accessible via the environment variable SM_CHANNEL_{channelName} on the container (channelName= ‘train’ or ‘test’).The training script does the following:

  1. Load the files locally from local container paths using the NumPy load module.
  2. Set hyperparameters for the training algorithm.
  3. Save the trained model at the local container path /opt/ml/model/.

SageMaker takes the content under /opt/ml/model/ to create a tarball that is used to deploy the model to SageMaker for hosting.

The transform step takes as input the staged test file as input and the trained model to make predictions on the trained model. The output of the transform step is chained to the metrics step to evaluate the model against the ground truth, which is explicitly supplied to the metrics step. Finally, the output of the metrics step is implicitly chained to the register step to register the model in SageMaker Model Registry with information about the model’s performance produced in the metrics step. The following figure shows a visual representation of the training DAG. You can refer to the scripts and configuration file for this example in the GitHub repo.

Single-model training: LLM fine-tuning

This is another single-model training example, where we orchestrate fine-tuning of a Falcon-40B large language model (LLM) from Hugging Face Hub for a text summarization use case. The preprocessing script loads the samsum dataset from Hugging Face, loads the tokenizer for the model, and processes the train/test data splits for fine-tuning the model on this domain data in the falcon-text-summarization-preprocess step.

The output is chained to the falcon-text-summarization-tuning step, where the training script loads the Falcon-40B LLM from Hugging Face Hub and starts accelerated fine-tuning using LoRA on the train split. The model is evaluated in the same step after fine-tuning, which gatekeeps the evaluation loss to fail the falcon-text-summarization-tuning step, which causes the SageMaker pipeline to stop before it is able to register the fine-tuned model. Otherwise, the falcon-text-summarization-tuning step runs successfully and the model is registered in SageMaker Model Registry. The following figure shows a visual representation of the LLM fine-tuning DAG. The scripts and configuration file for this example are available in the GitHub repo.

Multi-model training

This is a multi-model training example where a principal component analysis (PCA) model is trained for dimensionality reduction, and a TensorFlow Multilayer Perceptron model is trained for California Housing Price prediction. The TensorFlow model’s preprocessing step uses a trained PCA model to reduce dimensionality of its training data. We add a dependency in the configuration to ensure the TensorFlow model is registered after PCA model registration. The following figure shows a visual representation of the multi-model training DAG example. The scripts and configuration files for this example are available in the GitHub repo.

Clean up

Complete the following steps to clean up your resources:

  1. Use the AWS CLI to list and remove any remaining pipelines that are created by the Python scripts.
  2. Optionally, delete other AWS resources such as the S3 bucket or IAM role created outside SageMaker Pipelines.

Conclusion

In this post, we presented a framework for automating SageMaker Pipelines DAG creation based on configuration files. The proposed framework offers a forward-looking solution to the challenge of orchestrating complex ML workloads. By using a configuration file, SageMaker Pipelines provides the flexibility to build orchestration with minimal code, so you can streamline the process of creating and managing both single-model and multi-model pipelines. This approach not only saves time and resources, but also promotes MLOps best practices, contributing to the overall success of ML initiatives. For more information about implementation details, review the GitHub repo.


About the Authors

Luis Felipe Yepez Barrios, is a Machine Learning Engineer with AWS Professional Services, focused on scalable distributed systems and automation tooling to expedite scientific innovation in the field of Machine Learning (ML). Furthermore, he assists enterprise clients in optimizing their machine learning solutions through AWS services.

Jinzhao Feng, is a Machine Learning Engineer at AWS Professional Services. He focuses on architecting and implementing large scale Generative AI and classical ML pipeline solutions. He is specialized in FMOps, LLMOps and distributed training.

Harsh Asnani, is a Machine Learning Engineer at AWS. His Background is in Applied Data Science with a focus on operationalizing Machine Learning workloads in the cloud at scale.

Hasan Shojaei, is a Sr. Data Scientist with AWS Professional Services, where he helps customers across different industries solve their business challenges through the use of big data, machine learning, and cloud technologies. Prior to this role, Hasan led multiple initiatives to develop novel physics-based and data-driven modeling techniques for top energy companies. Outside of work, Hasan is passionate about books, hiking, photography, and history.

Alec Jenab, is a Machine Learning Engineer who specializes in developing and operationalizing machine learning solutions at scale for enterprise customers. Alec is passionate about bringing innovative solutions to market, especially in areas where machine learning can meaningfully improve end user experience. Outside of work, he enjoys playing basketball, snowboarding, and discovering hidden gems in San Francisco.

Read More

Accelerating large-scale neural network training on CPUs with ThirdAI and AWS Graviton

Accelerating large-scale neural network training on CPUs with ThirdAI and AWS Graviton

This guest post is written by Vihan Lakshman, Tharun Medini, and Anshumali Shrivastava from ThirdAI.

Large-scale deep learning has recently produced revolutionary advances in a vast array of fields. Although this stunning progress in artificial intelligence remains remarkable, the financial costs and energy consumption required to train these models has emerged as a critical bottleneck due to the need for specialized hardware like GPUs. Traditionally, even modestly sized neural models have required costly hardware accelerators for training, which limits the number of organizations with the financial means to take full advantage of this technology.

Founded in 2021, ThirdAI Corp. is a startup dedicated to the mission of democratizing artificial intelligence technologies through algorithmic and software innovations that fundamentally change the economics of deep learning. We have developed a sparse deep learning engine, known as BOLT, that is specifically designed for training and deploying models on standard CPU hardware as opposed to costly and energy-intensive accelerators like GPUs. Many of our customers have reported strong satisfaction with ThirdAI’s ability to train and deploy deep learning models for critical business problems on cost-effective CPU infrastructure.

In this post, we investigate of potential for the AWS Graviton3 processor to accelerate neural network training for ThirdAI’s unique CPU-based deep learning engine.

The benefits of high-performance CPUs

At ThirdAI, we achieve these breakthroughs in efficient neural network training on CPUs through proprietary dynamic sparse algorithms that activate only a subset of neurons for a given input (see the following figure), thereby side-stepping the need for full dense computations. Unlike other approaches to sparse neural network training, ThirdAI uses locality-sensitive hashing to dynamically select neurons for a given input as shown in the bold lines below. In certain cases, we have even observed that our sparse CPU-based models train faster than the comparable dense architecture on GPUs.

Dense Neural architecture with bold lines showing which neurons are selected

Given that many of our target customers operate in the cloud—and among those, the majority use AWS—we were excited to try out the AWS Graviton3 processor to see if the impressive price-performance improvements of Amazon’s silicon innovation would translate to our unique workload of sparse neural network training and thereby provide further savings for customers. Although both the research community and the AWS Graviton team have delivered exciting advances in accelerating neural network inference on CPU instances, we at ThirdAI are, to our knowledge, the first to seriously study how to train neural models on CPUs efficiently.

As shown in our results, we observed a significant training speedup with AWS Graviton3 over the comparable Intel and NVIDIA instances on several representative modeling workloads.

Instance types

For our evaluation, we considered two comparable AWS CPU instances: a c6i.8xlarge machine powered by Intel’s Ice Lake processor and a c7g.8xlarge powered by AWS Graviton3. The following table summarizes the details of each instance.

Instance vCPU RAM (GB) Processor On-Demand Price (us-east-1)
c7g.8xlarge 32 64 AWS Graviton3 $1.1562/hr
c6i.8xlarge 32 64 Intel Ice Lake $1.36/hr
g5g.8xlarge (GPU) 32 64 with 16 GB GPU Memory AWS Graviton2 processors with 1 NVIDIA T4G GPU $1.3720/hr

Evaluation 1: Extreme classification

For our first evaluation, we focus on the problem of extreme multi-label classification (XMC), an increasingly popular machine learning (ML) paradigm with a number of practical applications in search and recommendations (including at Amazon). For our evaluation, we focus on the public Amazon-670K product recommendation task, which, given an input product, identifies similar products from a collection of over 670,000 items.

In this experiment, we benchmark ThirdAI’s BOLT engine against TensorFlow 2.11 and PyTorch 2.0 on the aforementioned hardware choices: Intel Ice Lake, AWS Graviton3, and an NVIDIA T4G GPU. For our experiments on Intel and AWS Graviton, we use the AWS Deep Learning AMI (Ubuntu 18.04) version 59.0. For our GPU evaluation, we use the NVIDIA GPU-Optimized Arm64 AMI, available via the AWS Marketplace. For this evaluation, we use the SLIDE model architecture, which achieves both competitive performance on this extreme classification task and strong training performance on CPUs. For our TensorFlow and PyTorch comparisons, we implement the analogous version of the SLIDE multi-layer perceptron (MLP) architecture with dense matrix multiplications. We train each model for five epochs (full passes through the training dataset) with a fixed batch size of 256 and learning rate of 0.001. We observed that all models achieved the same test accuracy of 33.6%.

The following chart compares the training time of ThirdAI’s BOLT to TensorFlow 2.11 and PyTorch 2.0 on the Amazon670k extreme classification benchmark. All models achieve the same test precision. We observe that AWS Graviton3 considerably accelerates the performance of BOLT out of the box with no customizations needed—by approximately 40%. ThirdAI’s BOLT on AWS Graviton3 also achieves considerably faster training than the TensorFlow or PyTorch models trained on the GPU. Note that there is no ThirdAI result on the NVIDIA GPU benchmark because BOLT is designed to run on CPUs. We do not include TensorFlow and PyTorch CPU benchmarks because of the prohibitively long training time.

Amazon 670k Training time Bar chart comparing instances c6i.8xlarge vs c7g.8xlarge

The following table summarizes the training time and test accuracy for each processor/specialized processor(GPU).

Processor Engine Training Time (s) Test Accuracy
Intel Ice Lake (c6i.8xlarge) BOLT 1470 33.6
AWS Graviton3 (c7g.8xlarge) BOLT 935 33.6
NVIDIA T4G (g5g.8xlarge) TensorFlow 7550 33.6
NVIDIA T4G (g5g.8xlarge) PyTorch 5130 33.6

Evaluation 2: Yelp Polarity sentiment analysis

For our second evaluation, we focus on the popular Yelp Polarity sentiment analysis benchmark, which involves classifying a review as positive or negative. For this evaluation, we compare ThirdAI’s Universal Deep Transformers (UDT) model against a fine-tuned DistilBERT network, a compressed pre-trained language model that achieves near-state-of-the-art performance with reduced inference latency. Because fine-tuning DistilBERT models on a CPU would take a prohibitively long time (at least several days), we benchmark ThirdAI’s CPU-based models against DistilBERT fine-tuned on a GPU. We train all models with a batch size of 256 for a single pass through the data (one epoch). We note that we can achieve slightly higher accuracy with BOLT with additional passes through the data, but we restrict ourselves to a single pass in this evaluation for consistency.

As shown in the following figure, AWS Graviton3 again accelerates ThirdAI’s UDT model training considerably. Furthermore, UDT is able to achieve comparable test accuracy to DistilBERT with a fraction of the training time and without the need for a GPU. We note that there has also been recent work in optimizing the fine-tuning of Yelp Polarity on CPUs. Our models, however, still achieve greater efficiency gains and avoid the cost of pre-training, which is substantial and requires the use of hardware accelerators like GPUs.

Training time on Yelp Polarity C7g vs c6i

The following table summarizes the training time, test accuracy, and inference latency.

Processor Engine Model Training Time (s) Test Accuracy Inference Latency (ms)
Intel Icelake (c6i.8xlarge) BOLT UDT 47 93.2 <1
Graviton3 (c7g.8xlarge) BOLT UDT 29 92.9 <1
T4G GPU (g5g.8xlarge) TensorFlow DistilBERT 4200 93.3 8.7
T4G GPU (g5g.8xlarge) PyTorch DistilBERT 3780 93.4 8.3

Evaluation 3: Multi-class text classification (DBPedia)

For our final evaluation, we focus on the problem of multi-class text classification, which involves assigning a label to a given input text from a set of more than two output classes. We focus on the DBPedia benchmark, which consists of 14 possible output classes. Again, we see that AWS Graviton3 accelerates UDT performance over the comparable Intel instance by roughly 40%. We also see that BOLT achieves comparable results to the DistilBERT transformer-based model fine-tuned on a GPU while achieving sub-millisecond latency.

ThirdAI BOLT training time on c7g vs c6i

The following table summarizes the training time, test accuracy, and inference latency.

Processor Engine Model Training Time (s) Test Accuracy Inference Latency (ms)
Intel Icelake (c6i.8xlarge) BOLT UDT 23 98.23 <1
Graviton3 (c7g.8xlarge) BOLT UDT 14 98.10 <1
T4G GPU (g5g.8xlarge) TensorFlow DistilBERT 4320 99.23 8.6
T4G GPU (g5g.8xlarge) PyTorch DistilBERT 3480 99.29 8

Get started with ThirdAI on AWS Graviton

We have designed our BOLT software for compatibility with all major CPU architectures, including AWS Graviton3. In fact, we didn’t have to make any customizations to our code to run on AWS Graviton3. Therefore, you can use ThirdAI for model training and deployment on AWS Graviton3 with no additional effort. In addition, as detailed in our recent research whitepaper, we have developed a set of novel mathematical techniques to automatically tune the specialized hyperparameters associated with our sparse models, allowing our models to work well immediately out of the box.

We also note that our models primarily work well for search, recommendation, and natural language processing tasks that typically feature large, high-dimensional output spaces and a requirement of extremely low inference latency. We are actively working on extending our methods to additional domains, such as computer vision, but be aware that our efficiency improvements do not translate to all ML domains at this time.

Conclusion

In this post, we investigated the potential for the AWS Graviton3 processor to accelerate neural network training for ThirdAI’s unique CPU-based deep learning engine. Our benchmarks on search, text classification, and recommendations benchmarks suggest that AWS Graviton3 can accelerate ThirdAI’s model training workloads by 30–40% over the comparable x86 instances with a price-performance improvement of nearly 50%. Furthermore, because AWS Graviton3 instances are available at a lower cost than the analogous Intel and NVIDIA machines and enable shorter training and inference times, you can further unlock the value of the AWS pay-as-you-go usage model by using lower-cost machines for shorter durations of time.

We are very excited by the price and performance savings of AWS Graviton3 and will look to pass on these improvements to our customers so they can enjoy faster ML training and inference with improved performance on low-cost CPUs. As customers of AWS ourselves, we are delighted by the speed at which AWS Graviton3 allows us to experiment with our models, and we look forward to using more cutting-edge silicon innovation from AWS going forward. Graviton Technical Guide is a good resource to consider while evaluating your ML workloads to run on Graviton. You can also try Graviton t4g instances free trial.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. At the time of writing the blog the most current instance were c6i and hence the comparison was done with c6i instances.


About the Author

Vihan Lakshman – Vihan Lakshman is a research scientist at ThirdAI Corp. focused on developing systems for resource-efficient deep learning. Prior to ThirdAI, he worked as an Applied Scientist at Amazon and received undergraduate and master’s degrees from Stanford University. Vihan is also a recipient of a National Science Foundation research fellowship.

Tharun Medini – Tharun Medini is the co-founder and CTO of ThirdAI Corp. He did his PhD in “Hashing Algorithms for Search and Information Retrieval” at Rice University. Prior to ThirdAI, Tharun worked at Amazon and Target. Tharun is the recipient of numerous awards for his research, including the Ken Kennedy Institute BP Fellowship, the American Society of Indian Engineers Scholarship, and a Rice University Graduate Fellowship.

Anshumali Shrivastava – Anshumali Shrivastava is an associate professor in the computer science department at Rice University. He is also the Founder and CEO of ThirdAI Corp, a company that is democratizing AI to commodity hardware through software innovations. His broad research interests include probabilistic algorithms for resource-frugal deep learning. In 2018, Science news named him one of the Top-10 scientists under 40 to watch.  He is a recipient of the National Science Foundation CAREER Award, a Young Investigator Award from the Air Force Office of Scientific Research, a machine learning research award from Amazon, and a Data Science Research Award from Adobe. He has won numerous paper awards, including Best Paper Awards at NIPS 2014 and MLSys 2022, as well as the Most Reproducible Paper Award at SIGMOD 2019. His work on efficient machine learning technologies on CPUs has been covered by popular press including Wall Street Journal, New York Times, TechCrunch, NDTV, etc.

Read More

Supercharge your AI team with Amazon SageMaker Studio: A comprehensive view of Deutsche Bahn’s AI platform transformation

Supercharge your AI team with Amazon SageMaker Studio: A comprehensive view of Deutsche Bahn’s AI platform transformation

AI’s growing influence in large organizations brings crucial challenges in managing AI platforms. These include developing a scalable and operationally efficient platform that adheres to organizational compliance and security standards. Amazon SageMaker Studio offers a comprehensive set of capabilities for machine learning (ML) practitioners and data scientists. These include a fully managed AI development environment with an integrated development environment (IDE), simplifying the end-to-end ML workflow. Its collaborative capabilities such as real-time coediting and sharing notebooks within the team ensures smooth teamwork, while the scalability and high-performance training caters to large datasets. With built-in security, cost-effectiveness, and a range of pre-built tools like Amazon SageMaker Autopilot, Amazon SageMaker JumpStart, and Amazon SageMaker Feature store, SageMaker Studio is a powerful platform for accelerating AI projects and empowering data scientists at every level of expertise.

Deutsche Bahn is a leading transportation organization in Germany with a revenue of 56.3 billion EUR (in 2022), a workforce of 336,884 employees (including 221,343 employees in Germany), and operations spanning 130 countries. They offer a wide range of services, including public and regional transport, freight services, and rail infrastructure. Through the integrated operation of traffic and railway infrastructure, as well as the economically and ecologically intelligent connection of all modes of transport, Deutsche Bahn moves people and goods. Deutsche Bahn has been at the forefront in adopting AI, using SageMaker Studio as a key AI platform. At Deutsche Bahn, a dedicated AI platform team manages and operates the SageMaker Studio platform, and multiple data analytics teams within the organization use the platform to develop, train, and run various analytics and ML activities.

The AI platform team’s key objective is to ensure seamless access to Workbench services and SageMaker Studio for all Deutsche Bahn teams and projects, with a primary focus on data scientists and ML engineers. This platform helps Deutsche Bahn realize a spectrum of use cases, ranging from railway maintenance, forecasting, and future applications in generative AI.

The AI platform managed service, built on SageMaker Studio, seamlessly aligns with Deutsche Bahn’s group-wide platform strategy. It meets the company’s compliance requirements, enables a swift project initiation for the team by provisioning a SageMaker domain, and reduces maintenance overhead due to an overarching operating model. Major benefits include high scalability of the service, in large part due to automation and a self-service model, and an attractive pricing model that’s primarily based on resource consumption.

“SageMaker Studio provided us a common platform that is scalable, security compliant, and addresses the development needs of data scientists from multiple data analytics teams within the DB organization. Before this, each team managed and operated their own JupyterLab notebooks, which was not efficient or cost-effective. Within 8 weeks, we onboarded over 120 developers, provisioned 25 SageMaker domains, and quickly got started using this platform.”

– Emmanuel Drosos, product owner at DB Systel.

In this post, we explore how Deutsche Bahn scaled and operated their AI platform using SageMaker Studio for multiple teams, while ensuring robust security and oversight.

Solution overview

The architecture at Deutsche Bahn consists of a central platform account managed by a platform team responsible for managing infrastructure and operations for SageMaker Studio. SageMaker Studio resources are grouped by SageMaker domains, each consisting of an associated Amazon Elastic File System (Amazon EFS) volume, a list of authorized users, and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations. At Deutsche Bahn, data scientists from various teams use SageMaker domains for their ML activities; each team has a dedicated SageMaker domain that they use for developing and testing ML models and collaborate using features such as notebook sharing.

From an infrastructure perspective, the VPC provisioned in the AI platform account as shown in the following figure has no outbound internet connectivity to ensure security and compliance. For high availability, multiple identical private isolated subnets are provisioned. The SageMaker Studio domains are deployed in VPC only mode, which creates an elastic network interface for communication between the SageMaker service account (AWS service account) and the platform account’s VPC. The endpoints like SageMaker API, SageMaker Studio, and SageMaker notebook facilitate secure and reliable communication between the platform account’s VPC and the SageMaker domain managed by AWS in the SageMaker service account.

Each data analytics team is able to request one or multiple SageMaker domains through the company’s internal self-service portal. This process of ordering a SageMaker domain is orchestrated through a separate workflow process (via AWS Step Functions). During this orchestration flow, an Azure Active Directory (AD) group for the data analytics team is provisioned with the AD group name corresponding to the domain name. The orchestration leads to a continuous integration and continuous deployment (CI/CD) pipeline deploying an AWS Cloud Development Kit (AWS CDK) app consisting of a SageMaker domain for the respective team.

In addition to the SageMaker domain, a customized AWS Identity and Access Management (IAM) role (SageMaker-execution-role), Amazon Simple Storage Service (Amazon S3) bucket (data-bucket), customer managed key (CMK), and other AWS resources are provisioned during the deployment process by the AWS CDK app, as illustrated in the following figure. The AD group contains scientists who needs access to their team’s SageMaker domain. The AD group name corresponds to the SageMaker domain’s name and is primarily used during the authorization process.

Client separation is implemented on the level of SageMaker domains by using IAM authentication mode. A domain-specific IAM role (SageMaker-execution-role) is attached to each domain that follows the principle of least privilege and is assumed by the data analytics team during the login process. This role grants data scientists in the team the ability to perform various activities, such as running processing jobs, hyperparameter tuning jobs, transformation jobs, and experiments, as well as creating models. These ML activities are run on behalf of the user by SageMaker using the IAM pass role permission. However, certain actions like creating S3 buckets, modifying IAM roles, updating SageMaker domains, and provisioning large instances are restricted for security, compliance, and cost control reasons. The associated IAM policy makes sure that the data analytics team only has access to the relevant S3 bucket and CMK for their authorized domain, as depicted in the following figure. Additionally, the role SageMaker-execution-role allows the team members to assume roles in other accounts within the Deutsche Bahn organization from SageMaker Studio, providing them with flexibility to access resources like Amazon Relational Database Service (Amazon S3), other S3 buckets, and Amazon Athena. The IAM policy uses aws:RequestTag and aws:ResourceTag for fine-grained access control during SageMaker activities, like processing jobs, training jobs, and create models. These tags also help track associated costs for the domain. For more information, refer to Actions, resources, and condition keys for Amazon SageMaker.

ml-14819-3

The CMK encrypts both the SageMaker domain’s file system contents stored in Amazon EFS and the contents of the S3 bucket (data-bucket) that is provisioned to store data for SageMaker processing and transformation jobs. In addition, resource-based policies, such as the bucket policy and CMK policy, provide an extra layer of security, restricting both access to only authorized AI team members and permitted actions on these resources.

The AI team does not have AWS Management Console access to the AI platform team’s account. To access SageMaker Studio, as illustrated in the following figure, the data scientists from the data analytics team use a generated presigned URL by authenticating through an Amazon Cognito based custom login application. After the user logs in to this custom application, they receive an OAuth access token that contains information such as AD group name. After they log in to the custom application, the user requests SageMaker domain access through the UI by triggering an Amazon API Gateway call to generate a presigned URL. API Gateway invokes the PreSignUrlGenerator AWS Lambda function and uses an Amazon Cognito authorizer to validate the OAuth access token in the request header. The PreSignUrlGenerator function validates user access permissions for the requested SageMaker domain by comparing the AD name in the access token against the requested SageMaker domain. Upon successful authorization, the PreSignUrlGenerator function creates a SageMaker user profile upon first login and generates a presigned URL response. The custom login application then redirects the users to the requested SageMaker domain.

ml14819-4

AWS CDK

The solution at Deutsche Bahn uses AWS CDK as infrastructure as code (IaC) to provision a SageMaker domain along with resources like S3 buckets and a CMK. The following figure illustrates the stacks and associated resources used for SageMaker deployment. The infrastructure stack takes care of setting up essential resources like VPC, subnets, and multiple SageMaker endpoints. The resources such as VPC, subnets, and service control policies (SCPs) are managed by a central cloud team through a different stack (but is shown here for simplicity). The SageMakerStudioStack is primarily responsible for provisioning a SageMaker domain, a dedicated data bucket, a CMK, and the dedicated IAM role SageMaker-execution-role. Notably, each SageMaker domain is provisioned through its individual SageMakerStudioStack.

ml-14819-5

The solution uses a purpose-built L3 construct (SageMaker Studio domain), as shown in the following figure, for the SageMaker domain resource. SageMaker Studio has a lifecycle configuration feature that enables specific initializations during the startup of JupyterLab or KernelGateway apps.

ml-14819-6

Deutsch Bahn uses the lifecycle configuration as shown in the following figure to automatically detect and shut down idle instances in the SageMaker domain, reducing unnecessary costs. Due to restricted outbound connectivity, the data analytics team uses internally hosted images and third-party libraries from the company’s internal artifactory. The lifecycle configuration script for KernelGateway configures pip and conda package managers to redirect downloads to the internally hosted artifactory location. As of this writing, there is no AWS CDK construct for the lifecycle configuration resource; therefore, they use a custom CDK resource to provision and manage the LifeCycleConfig script. Custom resources in AWS CDK offer the ability to provision and manage resources not directly supported by AWS CloudFormation or AWS CDK constructs.

Installation

The sample AWS CDK application demonstrates how various components, including the SageMaker domain, lifecycle configuration, Amazon Cognito, and IAM role with the least privileges, function together. Within the application, the SagemakerStudioStack class handles the provisioning of a SageMaker domain, IAM role (sagemaker-execution-role) that users assume, CMK, lifecycle configuration, SageMaker user profile, S3 bucket for data processing, and Amazon Cognito user group. The demo AWS CDK application provides a concise overview of key components, such as the SageMaker domain, lifecycle configuration, authentication through Amazon Cognito, and IAM role with least privileges. The SagemakerLoginStack, on the other hand, is responsible for deploying the Amazon Cognito user pool, Lambda function, and API Gateway for generating presigned URLs. The CognitoUserStack primarily focuses on deploying a user within the Amazon Cognito user pool.

You can run the following commands to compile, synthesize, and deploy the application. You should adjust the account, user, and password in the sample code for your application. The password should be at least 8 characters, with uppercase characters and numbers. The user parameter is the SageMaker domain user that will be authenticated by Amazon Cognito.

  1. Download the source code from the GitHub repo.
  2. Bootstrap the AWS account. In the following code, adjust the account number and Region as needed:
    cdk bootstrap aws://11111111111/eu-central-1

  3. Install the packages and compile the code:
    npm install
    npm run build

  4. Synthesize the AWS CDK application:
    npx cdk synth -c account=11111111111 -c region='eu-central-1' -c domain-name=team1 -c user=demo-user -c password=<your password>

  5. Deploy the application with all stacks into the account and Region of your choice:
    npx cdk deploy --all -c account=11111111111 -c region='eu-central-1' -c domain-name=team1 -c user=demo-user -c password=<password>

  6. Download the Postman app to make an API call.

If you don’t have a Postman account, create a free account with your email. If you already have an account, sign in to your account.

  1. On the File menu, choose Import and import the Postman environment JSON file included in the GitHub repo.
  2. On the Environments tab in Postman, locate the environment called SageMaker.
  3. Add the following environment variables, which you see as part of the stack deployment output from SagemakerLoginStack:
    ..... output from the cdk deploy .....
    
    //PreSignedURLApi
    
    SageMaker-login-stack.PreSignedURLApiEndpointXXXX= https://xxxxxxx.execute-api.eu-central-1.amazonaws.com/prod/
    
    //UserPoolClientId
    
    SageMaker-login-stack.UserPoolUserPoolClientIdFXXXX = xxxxxxxxxxxxxxxx
    
    //UserPoolClientSecret
    
    SageMaker-login-stack.UserPoolUserPoolClientSecretC1D088A5 = xxxxxxxxxxxxxxx
    
    //CognitoSigninDomain
    
    SageMaker-login-stack.UserPoolCognitoSigninDomainD3B08161 = https://SageMaker-login-xxxxx.auth.eu-central-1.amazoncognito.com/oauth2

Use the following parameters (fetch the values from the output during cdk deploy):

    • domainName – The domain name parameter you passed in cdk deploy, for example team1
    • client-id – The Amazon Cognito client ID
    • client-secret – The Amazon Cognito client secret.
    • SageMaker-presigned-api – The URL of the API Gateway created by AWS CDK, which generates the presigned URL
    • cognito-signin-endpoint – The endpoint URL of the Amazon Cognito domain where the client app (in this case, Postman) authenticates by providing credentials of the user (demo-user)

The next step is to generate an OAuth2 token.

    1. On the Authorization tab, choose the SageMaker environment and choose Generate New Access Token.

All the values on this tab should be prefilled.

    1. Update the environment variables and choose Get New Access Token.

ml-14819-8

  1. In the pop-up window that opens, log in to Amazon Cognito with the user name (demo-user) and password you used earlier.

Upon successful authentication, a new access token is generated.

  1. Choose Use Token.
  2. Choose GeneratePresignedUrlDemo in the Postman SageMaker collections and choose Send.
  3. Make sure you selected the right environment (SageMaker) on the drop-down list.

This makes a REST API call to API Gateway and generates a presigned URL to access the SageMaker domain. You can see this URL in the response body.

  1. Copy this URL and enter it in the browser window.

A new SageMaker domain will be launched with your user profile.

This demo application supports SageMaker features like training jobs, processing jobs, and model endpoints. Note that features like Amazon SageMaker Canvas, SageMaker JumpStart, and SageMaker Feature Store are not activated.

Clean up

Complete the following steps to clean up your resources:

  1. On the SageMaker console, in the navigation pane, choose Domain, User Profile, and Apps.
  2. Delete all running apps (KernelGateway or JupyterLab) from this solution.
  3. Delete all the SageMaker user profiles you created during the login step.
  4. On the Amazon EFS console, delete the EFS file system created for this post.
  5. Run the following command to delete the resources created with the AWS CDK:
    npx cdk destroy --all

Conclusion

The post highlighted how Deutsche Bahn effectively used SageMaker Studio to revamp its AI platform, resulting in a scalable, automated, and manageable solution to support its diverse data analytics teams. This architecture features a central platform account, a self-service domain ordering process, and infrastructure provisioning using AWS CDK. The deployment process incorporates a CI/CD pipeline, ensuring the smooth delivery of SageMaker domains.

Overall, the transformation brought about by SageMaker Studio has empowered Deutsche Bahn to construct a robust platform for their AI initiatives, catering to over 100 developers and managing 20 SageMaker domains within a single AWS account.

Lastly, we extend our sincere appreciation to Nico Seegert (d-fine) and Philipp Vollmer (Deutsche Bahn), whose invaluable contributions were instrumental in shaping this architecture.

For further reading, refer to the following resources:

___________________________________________________________________________________________

About the authors

Prasanna Tuladhar is a Cloud Infrastructure Architect at AWS Professional Services in Munich, Germany. Specializing in cloud infrastructure, workload migration, and DevOps on the AWS platform, he empowers customers to achieve their business objectives. Outside of work, he enjoys jogging, hiking, and quality time with his family.

Emmanuel Drosos is a Product Owner for the AI platform at DBSystel, a subsidiary of Deutsche Bahn (DB) Germany. With a passion for innovation and technology, Emmanuel spearheads initiatives aimed at leveraging the power of the cloud to drive AI platform at DB (Deutsche Bahn). The AI.Platform is one of DB’s group-wide development platforms. It includes AI services and tools for the development of AI (machine learning) models and directly usable AI services. Simple, integrated and scalable.He works closely with other DB customers to unlock the full potential of AI platform, enabling them to achieve their business objectives efficiently and effectively. Outside of his professional activities, Emmanuel enjoys traveling and is an enthusiastic nature and hiking lover.

Vishwanath Bhat is a DevOps Architect at AWS Professional Services, based in Germany. He helps customers to get the full benefit of the cloud and achieve their business goals with AWS cloud. When not working, he likes to go swimming in alpine lakes, hiking, reading or play football.

Kumudhan Cherarajan is a DevOps Consultant at AWS Professional Services, based in Switzerland. He is passionate about helping customers adopt process and services that increase their efficiency in the cloud journey. When not working, he likes to play cricket and music.

Read More

Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge. This generative AI task is called text-to-SQL, which generates SQL queries from natural language processing (NLP) and converts text into semantically correct SQL. The solution in this post aims to bring enterprise analytics operations to the next level by shortening the path to your data using natural language.

With the emergence of large language models (LLMs), NLP-based SQL generation has undergone a significant transformation. Demonstrating exceptional performance, LLMs are now capable of generating accurate SQL queries from natural language descriptions. However, challenges still remain. First, human language is inherently ambiguous and context-dependent, whereas SQL is precise, mathematical, and structured. This gap may result in inaccurate conversion of the user’s needs into the SQL that’s generated. Second, you might need to build text-to-SQL features for every database because data is often not stored in a single target. You may have to recreate the capability for every database to enable users with NLP-based SQL generation. Third, despite the larger adoption of centralized analytics solutions like data lakes and warehouses, complexity rises with different table names and other metadata that is required to create the SQL for the desired sources. Therefore, collecting comprehensive and high-quality metadata also remains a challenge. To learn more about text-to-SQL best practices and design patterns, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.

Our solution aims to address those challenges using Amazon Bedrock and AWS Analytics Services. We use Anthropic Claude v2.1 on Amazon Bedrock as our LLM. To address the challenges, our solution first incorporates the metadata of the data sources within the AWS Glue Data Catalog to increase the accuracy of the generated SQL query. The workflow also includes a final evaluation and correction loop, in case any SQL issues are identified by Amazon Athena, which is used downstream as the SQL engine. Athena also allows us to use a multitude of supported endpoints and connectors to cover a large set of data sources.

After we walk through the steps to build the solution, we present the results of some test scenarios with varying SQL complexity levels. Finally, we discuss how it is straightforward to incorporate different data sources to your SQL queries.

Solution overview

There are three critical components in our architecture: Retrieval Augmented Generation (RAG) with database metadata, a multi-step self-correction loop, and Athena as our SQL engine.

We use the RAG method to retrieve the table descriptions and schema descriptions (columns) from the AWS Glue metastore to ensure that the request is related to the right table and datasets. In our solution, we built the individual steps to run a RAG framework with the AWS Glue Data Catalog for demonstration purposes. However, you can also use knowledge bases in Amazon Bedrock to build RAG solutions quickly.

The multi-step component allows the LLM to correct the generated SQL query for accuracy. Here, the generated SQL is sent for syntax errors. We use Athena error messages to enrich our prompt for the LLM for more accurate and effective corrections in the generated SQL.

You can consider the error messages occasionally coming from Athena like feedback. The cost implications of an error correction step are negligible compared to the value delivered. You can even include these corrective steps as supervised reinforced learning examples to fine-tune your LLMs. However, we did not cover this flow in our post for simplicity purposes.

Note that there is always inherent risk of having inaccuracies, which naturally comes with generative AI solutions. Even if Athena error messages are highly effective to mitigate this risk, you can add more controls and views, such as human feedback or example queries for fine-tuning, to further minimize such risks.

Athena not only allows us to correct the SQL queries, but it also simplifies the overall problem for us because it serves as the hub, where the spokes are multiple data sources. Access management, SQL syntax, and more are all handled via Athena.

The following diagram illustrates the solution architecture.

The solution architecture and the process flow is shown.

Figure 1. The solution architecture and process flow.

The process flow includes the following steps:

  1. Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
  2. Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an Amazon OpenSearch Serverless vector store, which serves as our knowledge base in our RAG framework.

At this stage, the process is ready to receive the query in natural language. Steps 7–9 represent a correction loop, if applicable.

  1. The user enters their query in natural language. You can use any web application to provide the chat UI. Therefore, we did not cover the UI details in our post.
  2. The solution applies a RAG framework via similarity search, which adds the extra context from the metadata from the vector database. This table is used for finding the correct table, database, and attributes.
  3. The query is merged with the context and sent to Anthropic Claude v2.1 on Amazon Bedrock.
  4. The model gets the generated SQL query and connects to Athena to validate the syntax.
  5. If Athena provides an error message that mentions the syntax is incorrect, the model uses the error text from Athena’s response.
  6. The new prompt adds Athena’s response.
  7. The model creates the corrected SQL and continues the process. This iteration can be performed multiple times.
  8. Finally, we run the SQL using Athena and generate output. Here, the output is presented to the user. For the sake of architectural simplicity, we did not show this step.

Prerequisites

For this post, you should complete the following prerequisites:

  1. Have an AWS account.
  2. Install the AWS Command Line Interface (AWS CLI).
  3. Set up the SDK for Python (Boto3).
  4. Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
  5. Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an OpenSearch Serverless vector store.

Implement the solution

You can use the following Jupyter notebook, which includes all the code snippets provided in this section, to build the solution. We recommend using Amazon SageMaker Studio to open this notebook with an ml.t3.medium instance with the Python 3 (Data Science) kernel. For instructions, refer to Train a Machine Learning Model. Complete the following steps to set up the solution:

  1. Create the knowledge base in OpenSearch Service for the RAG framework:
    def add_documnets(self,index_name: str,file_name:str):
    
    documents = JSONLoader(file_path=file_name, jq_schema='.', text_content=False, json_lines=False).load()
    docs = OpenSearchVectorSearch.from_documents(embedding=self.embeddings, opensearch_url=self.opensearch_domain_endpoint, http_auth=self.http_auth, documents=documents, index_name=index_name, engine="faiss")
    index_exists = self.check_if_index_exists(index_name,aws_region,opensearch_domain_endpoint,http_auth)
    if not index_exists :
    logger.info(f'index :{index_name} is not existing ')
    sys.exit(-1)
    else:
    logger.info(f'index :{index_name} Got created')

  2. Build the prompt (final_question) by combining the user input in natural language (user_query), the relevant metadata from the vector store (vector_search_match), and our instructions (details):
    def userinput(user_query):
    logger.info(f'Searching metadata from vector store')
    
    # vector_search_match=rqst.getEmbeddding(user_query)
    vector_search_match = rqst.getOpenSearchEmbedding(index_name,user_query)
    
    # print(vector_search_match)
    details = "It is important that the SQL query complies with Athena syntax. 
    During join if column name are same please use alias ex llm.customer_id 
    in select statement. It is also important to respect the type of columns: 
    if a column is string, the value should be enclosed in quotes. 
    If you are writing CTEs then include all the required columns. 
    While concatenating a non string column, make sure cast the column to string. 
    For date columns comparing to string , please cast the string input."
    final_question = "nnHuman:"+details + vector_search_match + user_query+ "nnAssistant:"
    answer = rqst.generate_sql(final_question)
    return answer

  3. Invoke Amazon Bedrock for the LLM (Claude v2) and prompt it to generate the SQL query. In the following code, it makes multiple attempts in order to illustrate the self-correction step:x
    try:
    logger.info(f'we are in Try block to generate the sql and count is :{attempt + 1}')
    generated_sql = self.llm.predict(prompt)
    query_str = generated_sql.split("```")[1]
    query_str = " ".join(query_str.split("n")).strip()
    sql_query = query_str[3:] if query_str.startswith("sql") else query_str
    
    # return sql_query
    syntaxcheckmsg=rqstath.syntax_checker(sql_query)
    if syntaxcheckmsg=='Passed':
    logger.info(f'syntax checked for query passed in attempt number :{attempt + 1}')
    return sql_query

  4. If any issues are received with the generated SQL query ({sqlgenerated}) from the Athena response ({syntaxcheckmsg}), the new prompt (prompt) is generated based on the response and the model tries again to generate the new SQL:
    else:
    prompt = f"""{prompt} 
    This is syntax error: {syntaxcheckmsg}.
    To correct this, please generate an alternative SQL query which will correct the syntax error. The updated query should take care of all the syntax issues encountered. Follow the instructions mentioned above to remediate the error.
    Update the below SQL query to resolve the issue:
    {sqlgenerated}
    Make sure the updated SQL query aligns with the requirements provided in the initial question."""
    prompts.append(prompt)

  5. After the SQL is generated, the Athena client is invoked to run and generate the output:
    query_execution = self.athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration=result_config,
    QueryExecutionContext=query_execution_context, )
    execution_id = query_execution["QueryExecutionId"]

Test the solution

In this section, we run our solution with different example scenarios to test different complexity levels of SQL queries.

To test our text-to-SQL, we use two datasets available from IMDB. Subsets of IMDb data are available for personal and non-commercial use. You can download the datasets and store them in Amazon Simple Storage Service (Amazon S3). You can use the following Spark SQL snippet to create tables in AWS Glue. For this example, we use title_ratings and title:

source_title_ratings3_path = 's3://llm-athena-output/input_data/title.ratings.tsv'
target_title_s3_path='s3://llm-athena-output/output_data/imdb_stg/title_ratings'
source_titleratingdf=spark.read.csv(source_title_ratings3_path,sep="t",header=True)
source_titleratingdf.write.mode('overwrite').format('parquet').option('path', target_title_s3_path).saveAsTable('imdb_stg.title_ratings')

Store data in Amazon S3 and metadata in AWS Glue

In this scenario, our dataset is stored in an S3 bucket. Athena has an S3 connector that allows you to use Amazon S3 as a data source that can be queried.

For our first query, we provide the input “I am new to this. Can you help me see all the tables and columns in imdb schema?”

The following is the generated query:

WITH tables AS (
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'imdb_stg'),
columns AS (
SELECT
c.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM information_schema.columns c
WHERE c.table_schema = 'imdb_stg')
SELECT
t.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM tables t
INNER JOIN columns c
ON t.table_name = c.table_name
ORDER BY
t.table_name,
c.ordinal_position
LIMIT 10;

The following screenshot and code show our output.

Table showing the attributes of the example IMBD dataset.

table_name column_name data_type
0 title titleid varchar
1 title ordering integer
2 title title varchar
3 title region varchar
4 title language varchar

For our second query, we ask “Show me all the title and details in US region whose rating is more than 9.5.”

The following is our generated query:

WITH us_titles AS (
SELECT t.title, t.region, tr.averageRating, tr.numVotes
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.region = 'US' AND cast(tr.averageRating as varchar) > '9.5'
)
SELECT title, region, averageRating, numVotes
FROM us_titles
LIMIT 100;

The response is as follows.

A table showing the results of the sql query

title region averageRating numVotes
0 The Way You Saw Me US 9.7 8
1 The Brother Side of the Wake US 9.6 20
2 Ignis Fatuus US 9.6 11
3 Love and Hip Hop Atlanta US 9.9 11
4 ronny/lily US 9.7 14781

For our third query, we enter “Great Response! Now show me all the original type titles having ratings more than 7.5 and not in the US region.”

The following query is generated:

WITH titles AS (
SELECT t.titleId,
t.title,
t.types,
t.isOriginalTitle,
cast(tr.averageRating as decimal(3,1)) as averageRating,
tr.numVotes,
t.region
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.isOriginalTitle = '1'
AND cast(tr.averageRating as decimal(3,1)) > 7.5
AND t.region != 'US')
SELECT *
FROM titles
LIMIT 100;

We get the following results.

A single row showing the result of the SQL query.

titleId title types isOriginalTitle averageRating numVotes region
0 tt0986264 Taare Zameen Par original 1 8.3 203760 XWW

Generate self-corrected SQL

This scenario simulates a SQL query that has syntax issues. Here, the generated SQL will be self-corrected based on the response from Athena. In the following response, Athena gave a COLUMN_NOT_FOUND error and mentioned that table_description can’t be resolved:

Status : {'State': 'FAILED', 'StateChangeReason': "COLUMN_NOT_FOUND: line 1:50: Column 'table_description' 
cannot be resolved or requester is not authorized to access requested resources",
'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 501000, tzinfo=tzlocal()),
'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 778000, tzinfo=tzlocal()),
'AthenaError': {'ErrorCategory': 2, 'ErrorType': 1006, 'Retryable': False, 'ErrorMessage': "COLUMN_NOT_FOUND: 
line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to 
access requested resources"}}
COLUMN_NOT_FOUND: line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to access requested resources
Try Count: 2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,Try Count: 2
we are in Try block to generate the sql and count is :2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,we are in Try block to generate the sql and count is :2
Executing: Explain WITH tables AS ( SELECT table_name FROM information_schema.tables WHERE table_schema = 'imdb_stg' ), columns AS ( SELECT c.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM information_schema.columns c WHERE c.table_schema = 'imdb_stg' ) SELECT t.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM tables t INNER JOIN columns c ON t.table_name = c.table_name ORDER BY t.table_name, c.ordinal_position LIMIT 10;
I am checking the syntax here
execution_id: 904857c3-b7ac-47d0-8e7e-6b9d0456099b
Status : {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 29, 537000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 30, 183000, tzinfo=tzlocal())}
syntax checked for query passed in tries number :2

Using the solution with other data sources

To use the solution with other data sources, Athena handles the job for you. To do this, Athena uses data source connectors that can be used with federated queries. You can consider a connector as an extension of the Athena query engine. Pre-built Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon Relational Database Service (Amazon RDS), and JDBC-compliant relational data sources such MySQL, and PostgreSQL under the Apache 2.0 license. After you set up a connection to any data source, you can use the preceding code base to extend the solution. For more information, refer to Query any data source with Amazon Athena’s new federated query.

Clean up

To clean up the resources, you can start by cleaning up your S3 bucket where the data resides. Unless your application invokes Amazon Bedrock, it will not incur any cost. For the sake of infrastructure management best practices, we recommend deleting the resources created in this demonstration.

Conclusion

In this post, we presented a solution that allows you to use NLP to generate complex SQL queries with a variety of resources enabled by Athena. We also increased the accuracy of the generated SQL queries via a multi-step evaluation loop based on error messages from downstream processes. Additionally, we used the metadata in the AWS Glue Data Catalog to consider the table names asked in the query through the RAG framework. We then tested the solution in various realistic scenarios with different query complexity levels. Finally, we discussed how to apply this solution to different data sources supported by Athena.

Amazon Bedrock is at the center of this solution. Amazon Bedrock can help you build many generative AI applications. To get started with Amazon Bedrock, we recommend following the quick start in the following GitHub repo and familiarizing yourself with building generative AI applications. You can also try knowledge bases in Amazon Bedrock to build such RAG solutions quickly.


About the Authors

Sanjeeb Panda is a Data and ML engineer at Amazon. With the background in AI/ML, Data Science and Big Data, Sanjeeb design and develop innovative data and ML solutions that solve complex technical challenges and achieve strategic goals for global 3P sellers managing their businesses on Amazon. Outside of his work as a Data and ML engineer at Amazon, Sanjeeb Panda is an avid foodie and music enthusiast.

Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate in MIT. Burak is passionate about yoga and meditation.

Read More

How Axfood enables accelerated machine learning throughout the organization using Amazon SageMaker

How Axfood enables accelerated machine learning throughout the organization using Amazon SageMaker

This is a guest post written by Axfood AB. 

In this post, we share how Axfood, a large Swedish food retailer, improved operations and scalability of their existing artificial intelligence (AI) and machine learning (ML) operations by prototyping in close collaboration with AWS experts and using Amazon SageMaker.

Axfood is Sweden’s second largest food retailer, with over 13,000 employees and more than 300 stores. Axfood has a structure with multiple decentralized data science teams with different areas of responsibility. Together with a central data platform team, the data science teams bring innovation and digital transformation through AI and ML solutions to the organization. Axfood has been using Amazon SageMaker to cultivate their data using ML and has had models in production for many years. Lately, the level of sophistication and the sheer number of models in production is increasing exponentially. However, even though the pace of innovation is high, the different teams had developed their own ways of working and were in search of a new MLOps best practice.

Our challenge

To stay competitive in terms of cloud services and AI/ML, Axfood chose to partner with AWS and has been collaborating with them for many years.

During one of our recurring brainstorming sessions with AWS, we were discussing how to best collaborate across teams to increase the pace of innovation and efficiency of data science and ML practitioners. We decided to put in a joint effort to build a prototype on a best practice for MLOps. The aim of the prototype was to build a model template for all data science teams to build scalable and efficient ML models—the foundation to a new generation of AI and ML platforms for Axfood. The template should bridge and combine best practices from AWS ML experts and company-specific best practice models—the best of both worlds.

We decided to build a prototype from one of the currently most developed ML models within Axfood: forecasting sales in stores. More specifically, the forecast for fruits and vegetables of upcoming campaigns for food retail stores. Accurate daily forecasting supports the ordering process for the stores, increasing sustainability by minimizing food waste as a result of optimizing sales by accurately predicting the needed in-store stock levels. This was the perfect place to start for our prototype—not only would Axfood gain a new AI/ML platform, but we would also get a chance to benchmark our ML capabilities and learn from leading AWS experts.

Our solution: A new ML template on Amazon SageMaker Studio

Building a full ML pipeline that is designed for an actual business case can be challenging. In this case, we are developing a forecasting model, so there are two main steps to complete:

  1. Train the model to make predictions using historical data.
  2. Apply the trained model to make predictions of future events.

In Axfood’s case, a well-functioning pipeline for this purpose was already set up using SageMaker notebooks and orchestrated by the third-party workflow management platform Airflow. However, there are many clear benefits of modernizing our ML platform and moving to Amazon SageMaker Studio and Amazon SageMaker Pipelines. Moving to SageMaker Studio provides many predefined out-of-the-box features:

  • Monitoring model and data quality as well as model explainability
  • Built-in integrated development environment (IDE) tools such as debugging
  • Cost/performance monitoring
  • Model acceptance framework
  • Model registry

However, the most important incentive for Axfood is the ability to create custom project templates using Amazon SageMaker Projects to be used as a blueprint for all data science teams and ML practitioners. The Axfood team already had a robust and mature level of ML modeling, so the main focus was on building the new architecture.

Solution overview

Axfood’s proposed new ML framework is structured around two main pipelines: the model build pipeline and the batch inference pipeline:

  • These pipelines are versioned within two separate Git repositories: one build repository and one deploy (inference) repository. Together, they form a robust pipeline for forecasting fruits and vegetables.
  • The pipelines are packaged into a custom project template using SageMaker Projects in integration with a third-party Git repository (Bitbucket) and Bitbucket pipelines for continuous integration and continuous deployment (CI/CD) components.
  • The SageMaker project template includes seed code corresponding to each step of the build and deploy pipelines (we discuss these steps in more detail later in this post) as well as the pipeline definition—the recipe for how the steps should be run.
  • Automation of building new projects based on the template is streamlined through AWS Service Catalog, where a portfolio is created, serving as an abstraction for multiple products.
  • Each product translates into an AWS CloudFormation template, which is deployed when a data scientist creates a new SageMaker project with our MLOps blueprint as the foundation. This activates an AWS Lambda function that creates a Bitbucket project with two repositories—model build and model deploy—containing the seed code.

The following diagram illustrates the solution architecture. Workflow A depicts the intricate flow between the two model pipelines—build and inference. Workflow B shows the flow to create a new ML project.

Model build pipeline

The model build pipeline orchestrates the model’s lifecycle, beginning from preprocessing, moving through training, and culminating in being registered in the model registry:

  • Preprocessing – Here, the SageMaker ScriptProcessor class is employed for feature engineering, resulting in the dataset the model will be trained on.
  • Training and batch transform – Custom training and inference containers from SageMaker are harnessed to train the model on historical data and create predictions on the evaluation data using a SageMaker Estimator and Transformer for the respective tasks.
  • Evaluation – The trained model undergoes evaluation by comparing the generated predictions on the evaluation data to the ground truth using ScriptProcessor.
  • Baseline jobs – The pipeline creates baselines based on statistics in the input data. These are essential for monitoring data and model quality, as well as feature attributions.
  • Model registry – The trained model is registered for future use. The model will be approved by designated data scientists to deploy the model for use in production.

For production environments, data ingestion and trigger mechanisms are managed via a primary Airflow orchestration. Meanwhile, during development, the pipeline is activated each time a new commit is introduced to the model build Bitbucket repository. The following figure visualizes the model build pipeline.

Batch inference pipeline

The batch inference pipeline handles the inference phase, which consists of the following steps:

  • Preprocessing – Data is preprocessed using ScriptProcessor.
  • Batch transform – The model uses the custom inference container with a SageMaker Transformer and generates predictions given the input preprocessed data. The model used is the latest approved trained model in the model registry.
  • Postprocessing – The predictions undergo a series of postprocessing steps using ScriptProcessor.
  • Monitoring – Continuous surveillance completes checks for drifts related to data quality, model quality, and feature attribution.

If discrepancies arise, a business logic within the postprocessing script assesses whether retraining the model is necessary. The pipeline is scheduled to run at regular intervals.

The following diagram illustrates the batch inference pipeline. Workflow A corresponds to preprocessing, data quality and feature attribution drift checks, inference, and postprocessing. Workflow B corresponds to model quality drift checks. These pipelines are divided because the model quality drift check will only run if new ground truth data is available.

SageMaker Model Monitor

With Amazon SageMaker Model Monitor integrated, the pipelines benefit from real-time monitoring on the following:

  • Data quality – Monitors any drift or inconsistencies in data
  • Model quality – Watches for any fluctuations in model performance
  • Feature attribution – Checks for drift in feature attributions

Monitoring model quality requires access to ground truth data. Although obtaining ground truth can be challenging at times, using data or feature attribution drift monitoring serves as a competent proxy to model quality.

Specifically, in the case of data quality drift, the system watches out for the following:

  • Concept drift – This pertains to changes in the correlation between input and output, requiring ground truth
  • Covariate shift – Here, the emphasis is on alterations in the distribution of independent input variables

SageMaker Model Monitor’s data drift functionality meticulously captures and scrutinizes the input data, deploying rules and statistical checks. Alerts are raised whenever anomalies are detected.

In parallel to using data quality drift checks as a proxy for monitoring model degradation, the system also monitors feature attribution drift using the normalized discounted cumulative gain (NDCG) score. This score is sensitive to both changes in feature attribution ranking order as well as to the raw attribution scores of features. By monitoring drift in attribution for individual features and their relative importance, it’s straightforward to spot degradation in model quality.

Model explainability

Model explainability is a pivotal part of ML deployments, because it ensures transparency in predictions. For a detailed understanding, we use Amazon SageMaker Clarify.

It offers both global and local model explanations through a model-agnostic feature attribution technique based on the Shapley value concept. This is used to decode why a particular prediction was made during inference. Such explanations, which are inherently contrastive, can vary based on different baselines. SageMaker Clarify aids in determining this baseline using K-means or K-prototypes in the input dataset, which is then added to the model build pipeline. This functionality enables us to build generative AI applications in the future for increased understanding of how the model works.

Industrialization: From prototype to production

The MLOps project includes a high degree of automation and can serve as a blueprint for similar use cases:

  • The infrastructure can be reused entirely, whereas the seed code can be adapted for each task, with most changes limited to the pipeline definition and the business logic for preprocessing, training, inference, and postprocessing.
  • The training and inference scripts are hosted using SageMaker custom containers, so a variety of models can be accommodated without changes to the data and model monitoring or model explainability steps, as long as the data is in tabular format.

After finishing the work on the prototype, we turned to how we should use it in production. To do so, we felt the need to make some additional adjustments to the MLOps template:

  • The original seed code used in the prototype for the template included preprocessing and postprocessing steps run before and after the core ML steps (training and inference). However, when scaling up to use the template for multiple use cases in production, the built-in preprocessing and postprocessing steps may lead to decreased generality and reproduction of code.
  • To improve generality and minimize repetitive code, we chose to slim down the pipelines even further. Instead of running the preprocessing and postprocessing steps as part of the ML pipeline, we run these as part of the primary Airflow orchestration before and after triggering the ML pipeline.
  • This way, use case-specific processing tasks are abstracted from the template, and what is left is a core ML pipeline performing tasks that are general across multiple use cases with minimal repetition of code. Parameters that differ between use cases are supplied as input to the ML pipeline from the primary Airflow orchestration.

The result: A rapid & efficient approach to model build & deployment

The prototype in collaboration with AWS has resulted in an MLOps template following current best practices that is now available for use to all of Axfood’s data science teams. By creating a new SageMaker project within SageMaker Studio, data scientists can get started on new ML projects quickly and seamlessly transition to production, allowing for more efficient time management. This is made possible by automating tedious, repetitive MLOps tasks as part of the template.

Furthermore, several new functionalities have been added in an automated fashion to our ML setup. These gains include:

  • Model monitoring – We can perform drift checks for model and data quality as well as model explainability
  • Model and data lineage – It’s now possible to trace exactly which data has been used for which model
  • Model registry – This helps us catalog models for production and manage model versions

Conclusion

In this post, we discussed how Axfood improved operations and scalability of our existing AI and ML operations in collaboration with AWS experts and by using SageMaker and its related products.

These improvements will help Axfood’s data science teams building ML workflows in a more standardized way and will greatly simplify analysis and monitoring of models in production—ensuring the quality of ML models built and maintained by our teams.

Please leave any feedback or questions in the comments section.


About the Authors

Dr. Björn Blomqvist is the Head of AI Strategy at Axfood AB. Before joining Axfood AB he led a team of Data Scientists at Dagab, a part of Axfood, building innovative machine learning solutions with the mission to provide good and sustainable food to people all over Sweden. Born and raised in the north of Sweden, in his spare time Björn ventures to snowy mountains and open seas.

Oskar Klang is a Senior Data Scientist at the analytics department at Dagab, where he enjoys working with everything analytics and machine learning, e.g. optimizing supply chain operations, building forecasting models and, more recently, GenAI applications. He is committed to building more streamlined machine learning pipelines, enhancing efficiency and scalability.

Pavel Maslov is a Senior DevOps and ML engineer in the Analytic Platforms team. Pavel has extensive experience in the development of frameworks, infrastructure, and tools in the domains of DevOps and ML/AI on the AWS platform. Pavel has been one of the key players in building the foundational capability within ML at Axfood.

Joakim Berg is the Team Lead and Product Owner Analytic Platforms, based in Stockholm Sweden. He is leading a team of Data Platform end DevOps/MLOps engineers providing Data and ML platforms for the Data Science teams. Joakim has many years of experience leading senior development and architecture teams from different industries.

Read More

Techniques and approaches for monitoring large language models on AWS

Techniques and approaches for monitoring large language models on AWS

Large Language Models (LLMs) have revolutionized the field of natural language processing (NLP), improving tasks such as language translation, text summarization, and sentiment analysis. However, as these models continue to grow in size and complexity, monitoring their performance and behavior has become increasingly challenging.

Monitoring the performance and behavior of LLMs is a critical task for ensuring their safety and effectiveness. Our proposed architecture provides a scalable and customizable solution for online LLM monitoring, enabling teams to tailor your monitoring solution to your specific use cases and requirements. By using AWS services, our architecture provides real-time visibility into LLM behavior and enables teams to quickly identify and address any issues or anomalies.

In this post, we demonstrate a few metrics for online LLM monitoring and their respective architecture for scale using AWS services such as Amazon CloudWatch and AWS Lambda. This offers a customizable solution beyond what is possible with model evaluation jobs with Amazon Bedrock.

Overview of solution

The first thing to consider is that different metrics require different computation considerations. A modular architecture, where each module can intake model inference data and produce its own metrics, is necessary.

We suggest that each module take incoming inference requests to the LLM, passing prompt and completion (response) pairs to metric compute modules. Each module is responsible for computing its own metrics with respect to the input prompt and completion (response). These metrics are passed to CloudWatch, which can aggregate them and work with CloudWatch alarms to send notifications on specific conditions. The following diagram illustrates this architecture.

Fig 1: Metric compute module – solution overview

Fig 1: Metric compute module – solution overview

The workflow includes the following steps:

  1. A user makes a request to Amazon Bedrock as part of an application or user interface.
  2. Amazon Bedrock saves the request and completion (response) in Amazon Simple Storage Service (Amazon S3) as the per configuration of invocation logging.
  3. The file saved on Amazon S3 creates an event that triggers a Lambda function. The function invokes the modules.
  4. The modules post their respective metrics to CloudWatch metrics.
  5. Alarms can notify the development team of unexpected metric values.

The second thing to consider when implementing LLM monitoring is choosing the right metrics to track. Although there are many potential metrics that you can use to monitor LLM performance, we explain some of the broadest ones in this post.

In the following sections, we highlight a few of the relevant module metrics and their respective metric compute module architecture.

Semantic similarity between prompt and completion (response)

When running LLMs, you can intercept the prompt and completion (response) for each request and transform them into embeddings using an embedding model. Embeddings are high-dimensional vectors that represent the semantic meaning of the text. Amazon Titan provides such models through Titan Embeddings. By taking a distance such as cosine between these two vectors, you can quantify how semantically similar the prompt and completion (response) are. You can use SciPy or scikit-learn to compute the cosine distance between vectors. The following diagram illustrates the architecture of this metric compute module.

Fig 2: Metric compute module – semantic similarity

Fig 2: Metric compute module – semantic similarity

This workflow includes the following key steps:

  1. A Lambda function receives a streamed message via Amazon Kinesis containing a prompt and completion (response) pair.
  2. The function gets an embedding for both the prompt and completion (response), and computes the cosine distance between the two vectors.
  3. The function sends that information to CloudWatch metrics.

Sentiment and toxicity

Monitoring sentiment allows you to gauge the overall tone and emotional impact of the responses, whereas toxicity analysis provides an important measure of the presence of offensive, disrespectful, or harmful language in LLM outputs. Any shifts in sentiment or toxicity should be closely monitored to ensure the model is behaving as expected. The following diagram illustrates the metric compute module.

Fig 3: Metric compute module – sentiment and toxicity

Fig 3: Metric compute module – sentiment and toxicity

The workflow includes the following steps:

  1. A Lambda function receives a prompt and completion (response) pair through Amazon Kinesis.
  2. Through AWS Step Functions orchestration, the function calls Amazon Comprehend to detect the sentiment and toxicity.
  3. The function saves the information to CloudWatch metrics.

For more information about detecting sentiment and toxicity with Amazon Comprehend, refer to Build a robust text-based toxicity predictor and Flag harmful content using Amazon Comprehend toxicity detection.

Ratio of refusals

An increase in refusals, such as when an LLM denies completion due to lack of information, could mean that either malicious users are trying to use the LLM in ways that are intended to jailbreak it, or that users’ expectations are not being met and they are getting low-value responses. One way to gauge how often this is happening is by comparing standard refusals from the LLM model being used with the actual responses from the LLM. For example, the following are some of Anthropic’s Claude v2 LLM common refusal phrases:

“Unfortunately, I do not have enough context to provide a substantive response. However, I am an AI assistant created by Anthropic to be helpful, harmless, and honest.”

“I apologize, but I cannot recommend ways to…”

“I'm an AI assistant created by Anthropic to be helpful, harmless, and honest.”

On a fixed set of prompts, an increase in these refusals can be a signal that the model has become overly cautious or sensitive. The inverse case should also be evaluated. It could be a signal that the model is now more prone to engage in toxic or harmful conversations.

To help model integrity and model refusal ratio, we can compare the response with a set of known refusal phrases from the LLM. This could be an actual classifier that can explain why the model refused the request. You can take the cosine distance between the response and known refusal responses from the model being monitored. The following diagram illustrates this metric compute module.

Fig 4: Metric compute module – ratio of refusals

Fig 4: Metric compute module – ratio of refusals

The workflow consists of the following steps:
  1. A Lambda function receives a prompt and completion (response) and gets an embedding from the response using Amazon Titan.
  2. The function computes the cosine or Euclidian distance between the response and existing refusal prompts cached in memory.
  3. The function sends that average to CloudWatch metrics.

Another option is to use fuzzy matching for a straightforward but less powerful approach to compare the known refusals to LLM output. Refer to the Python documentation for an example.

Summary

LLM observability is a critical practice for ensuring the reliable and trustworthy use of LLMs. Monitoring, understanding, and ensuring the accuracy and reliability of LLMs can help you mitigate the risks associated with these AI models. By monitoring hallucinations, bad completions (responses), and prompts, you can make sure your LLM stays on track and delivers the value you and your users are looking for. In this post, we discussed a few metrics to showcase examples.

For more information about evaluating foundation models, refer to Use SageMaker Clarify to evaluate foundation models, and browse additional example notebooks available in our GitHub repository. You can also explore ways to operationalize LLM evaluations at scale in Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services. Finally, we recommend referring to Evaluate large language models for quality and responsibility to learn more about evaluating LLMs.


About the Authors

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Read More