Build a RAG-based QnA application using Llama3 models from SageMaker JumpStart

Organizations generate vast amounts of data that is proprietary to them, and it’s critical to get insights out of the data for better business outcomes. Generative AI and foundation models (FMs) play an important role in creating applications using an organization’s data that improve customer experiences and employee productivity.

The FMs are typically pretrained on a large corpus of data that’s openly available on the internet. They perform well at natural language understanding tasks such as summarization, text generation, and question answering on a broad variety of topics. However, they can sometimes hallucinate or produce inaccurate responses when answering questions that they haven’t been trained on. To prevent incorrect responses and improve response accuracy, a technique called Retrieval Augmented Generation (RAG) is used to provide models with contextual data.

In this post, we provide a step-by-step guide for creating an enterprise-ready RAG application such as a question answering bot. We use the Llama3-8B FM for text generation and the BGE Large EN v1.5 text embedding model for generating embeddings, both from Amazon SageMaker JumpStart. We also show how you can use FAISS as an embeddings store and packages such as LangChain to interface with the components and run inference within a SageMaker Studio notebook.

SageMaker JumpStart

SageMaker JumpStart is a powerful feature within the Amazon SageMaker ML platform that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models.

Llama 3 overview

Llama 3 (developed by Meta) comes in two parameter sizes, 8B and 70B, both with 8K context length, and supports a broad range of use cases with improvements in reasoning, code generation, and instruction following. Llama 3 uses a decoder-only transformer architecture and a new tokenizer with a 128K-token vocabulary that improves model performance. In addition, Meta improved post-training procedures, which substantially reduced false refusal rates, improved alignment, and increased diversity in model responses.

BGE Large overview

The embedding model BGE Large stands for BAAI general embedding large. It’s developed by BAAI and is designed to enhance retrieval capabilities within large language models (LLMs). The model supports three retrieval methods:

  • Dense retrieval (BGE-M3)
  • Lexical retrieval (LLM Embedder)
  • Multi-vector retrieval (BGE Embedding Reranker).

You can use the BGE embedding model to retrieve relevant documents and then use the BGE reranker to obtain final results.

On Hugging Face, the Massive Text Embedding Benchmark (MTEB) is provided as a leaderboard for diverse text embedding tasks. It currently provides 129 benchmarking datasets across 8 different tasks on 113 languages. The top text embedding models from the MTEB leaderboard are made available from SageMaker JumpStart, including BGE Large.

For more details about this model, see the official Hugging Face model card page.

RAG overview

Retrieval-Augmented Generation (RAG) is a technique that enables the integration of external knowledge sources with FMs. RAG involves three main steps: retrieval, augmentation, and generation.

First, relevant content is retrieved from an external knowledge base based on the user’s query. Next, this retrieved information is combined or augmented with the user’s original input, creating an augmented prompt. Finally, the FM processes this augmented prompt, which includes both the query and the retrieved contextual information, and generates a response tailored to the specific context, incorporating the relevant knowledge from the external source.
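
The three steps can be sketched in a few lines of plain Python. This is a toy illustration rather than the pipeline built later in this post: the retriever scores documents by word overlap instead of embeddings, the two knowledge base entries are invented, and generate is a hypothetical stub standing in for a call to an FM endpoint.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, knowledge_base, k=1):
    """Retrieval: rank documents by word overlap with the query (toy scoring)."""
    query_words = tokenize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def augment(query, context_docs):
    """Augmentation: combine the retrieved context with the user's question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt):
    """Generation: hypothetical stub; a real application would call the FM here."""
    return f"[model response to a prompt of {len(prompt)} characters]"


# Invented example data, for illustration only
knowledge_base = [
    "The cafeteria opens at 8 AM on weekdays.",
    "Expense reports are due on the fifth day of each month.",
]
query = "When are expense reports due?"
top_docs = retrieve(query, knowledge_base)
prompt = augment(query, top_docs)
print(prompt)
print(generate(prompt))
```

In the solution described in this post, the retrieval step is handled by an embedding model and a vector store, and the generation step by the Llama 3 endpoint.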

Solution overview

You will construct a RAG QnA system on a SageMaker notebook using the Llama3-8B model and BGE Large embedding model. The following diagram illustrates the step-by-step architecture of this solution, which is described in the following sections.

Implementing this solution takes three high-level steps: deploying models, data processing and vectorization, and running inferences.

To demonstrate this solution, a sample notebook is available in the GitHub repo.

The notebook is powered by an ml.t3.medium instance to demonstrate deploying the model as an API endpoint using an SDK through SageMaker JumpStart. You can use these model endpoints to explore, experiment with, and compare advanced RAG techniques using LangChain. We also illustrate the integration of the FAISS embeddings store into the RAG workflow, highlighting its role in storing and retrieving embeddings to enhance the application’s performance.

We will also discuss how you can use LangChain to create effective and more efficient RAG applications. LangChain is a Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.

After everything is set up, when a user interacts with the QnA application, the flow is as follows:

  1. The user sends a query using the QnA application.
  2. The application sends the user query to the vector database to find similar documents.
  3. The documents returned as a context are captured by the QnA application.
  4. The QnA application submits a request to the SageMaker JumpStart model endpoint with the user query and context returned from the vector database.
  5. The endpoint sends the request to the SageMaker JumpStart model.
  6. The LLM processes the request and generates an appropriate response.
  7. The response is captured by the QnA application and displayed to the user.

Prerequisites

To implement this solution, you need the following:

  • An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
  • Basic familiarity with SageMaker and AWS services that support LLMs.
  • The Jupyter notebook requires an ml.t3.medium instance.
  • You need access to accelerated instances (GPUs) for hosting the LLMs. This solution needs access to a minimum of the following instance sizes:
    • ml.g5.12xlarge for endpoint use when deploying the BGE Large En v1.5 text embedding model
    • ml.g5.2xlarge for endpoint use when deploying the Llama-3-8B model endpoint

To increase your quota, refer to Requesting a quota increase.

Prompt template for Llama3

While both Llama 2 and Llama 3 are powerful language models that are optimized for dialogue-based tasks, their prompting formats differ significantly in how they handle multi-turn conversations, specify roles, and mark message boundaries, reflecting distinct design choices and trade-offs.

Llama 3 prompting format: Llama 3 employs a structured format designed for multi-turn conversations involving different roles (system, user, and assistant). It uses dedicated tokens to explicitly mark roles, message boundaries, and the end of the prompt:

  • Placeholder tokens: {{user_message}} and {{assistant_message}}
  • Role marking: <|start_header_id|>{role}<|end_header_id|>
  • Message boundaries: <|eot_id|> signals end of a message within a turn.
  • Prompt End Marker: <|start_header_id|>assistant<|end_header_id|> signals start of assistant’s response.

Llama 2 prompting format: Llama 2 uses a more compact representation with different tokens for handling conversations:

  • User message enclosure: [INST][/INST]
  • Start and end of sequence: <s></s>
  • System message enclosure: <<SYS>><</SYS>>
  • Message separation: <s></s> separates user messages and model responses.

Key differences:

  • Role specification: Llama 3 uses a more explicit approach with dedicated tokens, while Llama 2 relies on enclosing tags.
  • Message boundary marking: Llama 3 uses <|eot_id|>, Llama 2 uses <s></s>.
  • Prompt end marker: Llama 3 uses <|start_header_id|>assistant<|end_header_id|>, Llama 2 uses [/INST] and </s>.
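
To make the two formats easier to compare, the following sketch builds a prompt in each style. The function names are illustrative, and the exact whitespace around the tokens is an assumption based on Meta's published templates, so verify it against the model card before relying on it.

```python
def llama3_prompt(messages):
    """Build a Llama 3 Instruct prompt from a list of {"role", "content"} dicts."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # The trailing assistant header asks the model to produce the next turn.
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"


def llama2_prompt(system_message, user_message):
    """Build a single-turn Llama 2 Chat prompt."""
    return (
        f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
]
llama3_text = llama3_prompt(messages)
llama2_text = llama2_prompt(messages[0]["content"], messages[1]["content"])
print(llama3_text)
print(llama2_text)
```

Note how Llama 3 names every role in a header and closes every message with <|eot_id|>, whereas Llama 2 nests the system message inside the first [INST] block.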

The choice depends on the use case and integration requirements. Llama 3’s format is more structured and role-aware and is better suited for conversational AI applications with complex multi-turn conversations. Llama 2’s format, while more compact, might be less explicit in handling roles and message boundaries.

Implement the solution

To implement the solution, you’ll use the following steps:

  • Set up a SageMaker Studio notebook
  • Deploy models on Amazon SageMaker JumpStart
  • Set up Llama3-8b and BGE Large En v1.5 models with LangChain
  • Prepare data and generate embeddings
    • Load documents of different kinds and generate embeddings to create a vector store
  • Retrieve documents relevant to the question using the following approaches from LangChain
    • Regular Retrieval Chain
    • Parent Document Retriever Chain
  • Prepare a prompt that goes as input to the LLM and presents an answer in a human-friendly manner

Set up a SageMaker Studio notebook

To follow the code in this post:

  1. Open SageMaker Studio and clone the following GitHub repository.
  2. Open the notebook RAG-recipes/llama3-rag-langchain-smjs.ipynb and choose the PyTorch 2.0.0 Python 3.10 GPU Optimized image, Python 3 kernel, and ml.t3.medium as the instance type.
  3. If this is your first time using SageMaker Studio notebooks, see Create or Open an Amazon SageMaker Studio Notebook.

To set up the development environment, you need to install the necessary Python libraries, as demonstrated in the following code. The example notebook provided includes these commands:

%%writefile requirements.txt
langchain==0.1.14
pypdf==4.1.0
faiss-cpu==1.8.0
boto3==1.34.58
sqlalchemy==2.0.29

After the libraries are written to requirements.txt, install all the libraries:

!pip install -U -r requirements.txt --quiet

Deploy pretrained models

After you’ve imported the required libraries, you can deploy the Llama 3 8B Instruct LLM model on SageMaker JumpStart using the SageMaker SDK:

  1. Import the JumpStartModel class from the SageMaker JumpStart library
    from sagemaker.jumpstart.model import JumpStartModel

  2. Specify the model ID for the Meta Llama 3 8B Instruct model, and deploy the model.
    model_id = "meta-textgeneration-llama-3-8b-instruct"
    accept_eula = True
    model = JumpStartModel(model_id=model_id)
    predictor = model.deploy(accept_eula=accept_eula)

  3. Specify the model ID for the Hugging Face BGE Large EN v1.5 embedding model, and deploy the model.
    model_id = "huggingface-sentencesimilarity-bge-large-en-v1-5"
    text_embedding_model = JumpStartModel(model_id=model_id)
    embedding_predictor = text_embedding_model.deploy()

Set up models with LangChain

For this step, you’ll use the following code to set up models.

import json
import sagemaker
 
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import SagemakerEndpoint
from langchain_community.embeddings import SagemakerEndpointEmbeddings
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler
from langchain_community.embeddings.sagemaker_endpoint import EmbeddingsContentHandler

  1. Replace the endpoint names in the following code snippet with the names of the endpoints deployed in your environment. You can get the endpoint names from the predictors created in the previous section, or view them in SageMaker Studio by choosing Deployments, then Endpoints, in the left navigation pane. Then set the values of llm_endpoint_name and embedding_endpoint_name.
    sess = sagemaker.session.Session()  # SageMaker session for interacting with different AWS APIs
    region = sess.boto_region_name
    llm_endpoint_name = "meta-textgeneration-llama-3-8b-instruct-XXXX"
    embedding_endpoint_name = "hf-sentencesimilarity-bge-large-en-v1-XXXXX"

  2. Transform input and output data to process API calls for Llama 3 8B Instruct on Amazon SageMaker.
    from typing import Dict
     
    class Llama38BContentHandler(LLMContentHandler):
        content_type = "application/json"
        accepts = "application/json"
     
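        def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
            # Build the JSON request body expected by the endpoint.
            # Note: the generation parameters are fixed here, so the
            # model_kwargs passed to SagemakerEndpoint are not forwarded.
            payload = {
                "inputs": prompt,
                "parameters": {
                    "max_new_tokens": 1000,
                    "top_p": 0.9,
                    "temperature": 0.6,
                    "stop": ["<|eot_id|>"],
                },
            }
            input_str = json.dumps(payload)
            return input_str.encode("utf-8")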
     
        def transform_output(self, output: bytes) -> str:
            response_json = json.loads(output.read().decode("utf-8"))
            #print(response_json)
            content = response_json["generated_text"].strip()
            return content 

  3. Instantiate the LLM with SageMaker and LangChain
    # Instantiate the content handler for Llama3-8B
    llama_content_handler = Llama38BContentHandler()
     
    # Setup for using the Llama3-8B model with SageMaker Endpoint
    llm = SagemakerEndpoint(
         endpoint_name=llm_endpoint_name,
         region_name=region,
         model_kwargs={"max_new_tokens": 1024, "top_p": 0.9, "temperature": 0.7},
         content_handler=llama_content_handler
     )

  4. Transform input and output data to process API calls for BGE Large En on SageMaker
    from typing import List
     
    class BGEContentHandlerV15(EmbeddingsContentHandler):
        content_type = "application/json"
        accepts = "application/json"
     
        def transform_input(self, text_inputs: List[str], model_kwargs: dict) -> bytes:
            """
            Transforms the input into bytes that can be consumed by SageMaker endpoint.
            Args:
                text_inputs (list[str]): A list of input text strings to be processed.
                model_kwargs (Dict): Additional keyword arguments to be passed to the endpoint.
                   Possible keys and their descriptions:
                   - mode (str): Inference method. Valid modes are 'embedding', 'nn_corpus', and 'nn_train_data'.
                   - corpus (str): Corpus for Nearest Neighbor. Required when mode is 'nn_corpus'.
                   - top_k (int): Top K for Nearest Neighbor. Required when mode is 'nn_corpus'.
                   - queries (list[str]): Queries for Nearest Neighbor. Required when mode is 'nn_corpus' or 'nn_train_data'.
            Returns:
                The transformed bytes input.
            """
            input_str = json.dumps(
                {
                    "text_inputs": text_inputs,
                    **model_kwargs
                }
            )
            return input_str.encode("utf-8")
     
        def transform_output(self, output: bytes) -> List[List[float]]:
            """
            Transforms the bytes output from the endpoint into a list of embeddings.
            Args:
                output: The bytes output from SageMaker endpoint.
            Returns:
                The transformed output - list of embeddings
            Note:
                The length of the outer list is the number of input strings.
                The length of the inner lists is the embedding dimension.
            """
            response_json = json.loads(output.read().decode("utf-8"))
            return response_json["embedding"]

  5. Instantiate the embedding model with SageMaker and LangChain
    bge_content_handler = BGEContentHandlerV15()
    sagemaker_embeddings = SagemakerEndpointEmbeddings(
        endpoint_name=embedding_endpoint_name,
        region_name=region,
        model_kwargs={"mode": "embedding"},
        content_handler=bge_content_handler,
    )
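
The request and response shapes handled by the LLM content handler in step 2 can be checked locally before you call an endpoint. The following sketch mirrors transform_input and transform_output as plain functions (build_request and parse_response are illustrative names) and uses io.BytesIO as a stand-in for the streaming response body. The sample response is made up and only assumes the generated_text key used above.

```python
import io
import json


def build_request(prompt, max_new_tokens=1000, top_p=0.9, temperature=0.6):
    """Serialize a prompt and generation parameters into request bytes."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
            "stop": ["<|eot_id|>"],
        },
    }
    return json.dumps(payload).encode("utf-8")


def parse_response(body):
    """Read a file-like response body and return the generated text."""
    response_json = json.loads(body.read().decode("utf-8"))
    return response_json["generated_text"].strip()


request_bytes = build_request("Hello")
# Made-up response body, standing in for what the endpoint would return
fake_body = io.BytesIO(
    json.dumps({"generated_text": "  Hi there.  "}).encode("utf-8")
)
answer = parse_response(fake_body)
print(json.loads(request_bytes)["parameters"])
print(answer)
```

Checking the handler logic this way helps separate serialization problems from endpoint or permission problems.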

Prepare data and generate embeddings

In this example, you will use several years of Amazon’s Annual Reports (SEC filings) for investors as a text corpus to perform QnA on.

  1. Start by using the following code to download the PDF documents from the provided URLs and create a list of metadata for each downloaded document.
    !mkdir -p ./data
    
    from urllib.request import urlretrieve
    urls = [
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf',
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/d2fde7ee-05f7-419d-9ce8-186de4c96e25.pdf',
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/f965e5c3-fded-45d3-bbdb-f750f156dcc9.pdf',
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/336d8745-ea82-40a5-9acc-1a89df23d0f3.pdf'
    ]
    
    filenames = [
    'AMZN-2024-10-K-Annual-Report.pdf',
    'AMZN-2023-10-K-Annual-Report.pdf',
    'AMZN-2022-10-K-Annual-Report.pdf',
    'AMZN-2021-10-K-Annual-Report.pdf'
    ]
    
    metadata = [
    dict(year=2024, source=filenames[0]),
    dict(year=2023, source=filenames[1]),
    dict(year=2022, source=filenames[2]),
    dict(year=2021, source=filenames[3])]
    
    data_root = "./data/"
    
    # Download each report into the data directory
    for idx, url in enumerate(urls):
        file_path = data_root + filenames[idx]
        urlretrieve(url, file_path)

    If you look at the Amazon 10-Ks, the first four pages of each are very similar. Keeping them in the embeddings causes repetition, makes embedding generation take longer, and might skew your results.

  2. In the next step, you will take the downloaded files, trim the first four pages from every 10-K except the first one processed, and overwrite the originals with the processed files.
    from pypdf import PdfReader, PdfWriter
    import glob
    
    local_pdfs = glob.glob(data_root + '*.pdf')
    
    # Iterate over each PDF file
    for idx, local_pdf in enumerate(local_pdfs):
        pdf_reader = PdfReader(local_pdf)
        pdf_writer = PdfWriter()
    
        if idx == 0:
            # Keep all pages, including the first 4, for the first document
            for pagenum in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[pagenum]
                pdf_writer.add_page(page)
        else:
            # Remove the first 4 pages for other documents
            for pagenum in range(4, len(pdf_reader.pages)):
                page = pdf_reader.pages[pagenum]
                pdf_writer.add_page(page)
    
        # Overwrite the original file with the trimmed content
        with open(local_pdf, 'wb') as new_file:
            pdf_writer.write(new_file)

  3. After downloading, you can load the documents with PyPDFLoader, available through LangChain, and split them into smaller chunks. Note: Each retrieved chunk should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt. Also, the embedding model has an input limit of 512 tokens, which translates to approximately 2,000 characters. For this use case, you create chunks of approximately 1,000 characters with an overlap of 100 characters using RecursiveCharacterTextSplitter.
    import numpy as np
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    documents = []
    
    for idx, file in enumerate(filenames):
        loader = PyPDFLoader(data_root + file)
        document = loader.load()
        # Attach the year and source metadata to every page of this file
        for document_fragment in document:
            document_fragment.metadata = metadata[idx]
    
        documents += document
    
    # In our testing, character-based splitting works better with this PDF dataset
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # maximum number of characters per chunk
        chunk_overlap=100,  # characters shared between consecutive chunks
    )
    
    docs = text_splitter.split_documents(documents)
    print(docs[100])

  4. Before you proceed, look at some of the statistics regarding the document preprocessing you just performed:
    avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
    
    print(f'Average length among {len(documents)} documents loaded is {avg_doc_length(documents)} characters.')
    print(f'After the split we have {len(docs)} documents as opposed to the original {len(documents)}.')
    print(f'Average length among {len(docs)} documents (after split) is {avg_doc_length(docs)} characters.')

  5. You started with four PDF documents, which have been split into approximately 500 smaller chunks. Now you can see what a sample embedding looks like for one of those chunks.
    sample_embedding = np.array(sagemaker_embeddings.embed_query(docs[0].page_content))
    print("Sample embedding of a document chunk: ", sample_embedding)
    print("Size of the embedding: ", sample_embedding.shape)

    Next, store the embeddings of all chunks in a vector store. You can do this with the FAISS implementation inside LangChain, which takes the embedding model and the documents as input and creates the entire vector store. Using the index wrapper (VectorStoreIndexWrapper), you can abstract away most of the heavy lifting, such as creating the prompt, getting embeddings of the query, sampling the relevant documents, and calling the LLM.

    from langchain_community.vectorstores import FAISS
    from langchain.indexes.vectorstore import VectorStoreIndexWrapper
     
    vectorstore_faiss = FAISS.from_documents(
        docs,
        sagemaker_embeddings,
    )
    wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)
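
To make the chunk_size and chunk_overlap settings from step 3 concrete, here is a simplified character-window splitter. It is only a sketch: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and line boundaries, so its chunks will not match these exactly. The function name split_with_overlap is illustrative.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2,500-character input, for illustration only
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.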
    

Answer questions using a LangChain vector store wrapper

You use the wrapper provided by LangChain, which wraps around the vector store and takes input from the LLM. This wrapper performs the following steps behind the scenes:

  • Inputs the question
  • Creates question embedding
  • Fetches relevant documents
  • Stuffs the documents and the question into a prompt
  • Invokes the model with the prompt and generates the answer in a human-readable manner.
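
The "creates question embedding" and "fetches relevant documents" steps amount to a nearest-neighbor search over the stored vectors. The following sketch does this by brute force with Euclidean distance in plain Python. The three-dimensional vectors and chunk labels are made-up values (the real embeddings come from the embedding endpoint), and an index library such as FAISS performs the same kind of search far more efficiently.

```python
import math


def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vector, indexed_chunks, k=2):
    """Return the k (text, vector) pairs closest to the query vector."""
    return sorted(
        indexed_chunks,
        key=lambda item: euclidean_distance(query_vector, item[1]),
    )[:k]


# Made-up chunk labels and vectors, for illustration only
indexed_chunks = [
    ("chunk about cloud revenue", [0.9, 0.1, 0.0]),
    ("chunk about leased facilities", [0.1, 0.8, 0.2]),
    ("chunk about cloud operating income", [0.7, 0.3, 0.1]),
]
query_vector = [0.85, 0.15, 0.05]
top_chunks = nearest_chunks(query_vector, indexed_chunks)
for text, _ in top_chunks:
    print(text)
```

The texts of the closest chunks are what the wrapper then places into the prompt.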

Note: In this example, we use Llama 3 8B Instruct as the LLM on Amazon SageMaker. This model performs best when the input follows the format below, with the model generating its output after the final assistant header:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{system_message}}
<|eot_id|><|start_header_id|>user<|end_header_id|>
{{user_message}}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The following is an example of how to control the prompt so that the LLM stays grounded and doesn’t answer outside the context.

prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{query}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["query"]
)
query = "How did AWS perform in 2021?"
answer = wrapper_store_faiss.query(question=PROMPT.format(query=query), llm=llm)
print(answer)

You can ask another question.

query_2 = "How much square footage did Amazon have in North America in 2023?"
answer = wrapper_store_faiss.query(question=PROMPT.format(query=query_2), llm=llm)
print(answer)

Retrieval QA chain

We’ve shown you a basic method to get context-aware answers. Now, let’s look at a more customizable option with RetrievalQA. You can customize how fetched documents are added to the prompt using the chain_type parameter, control the number of relevant documents retrieved by changing the k parameter, and get the source documents used by the LLM by enabling return_source_documents. RetrievalQA also allows providing custom prompt templates specific to the model.

from langchain.chains import RetrievalQA

prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

This is a conversation between an AI assistant and a Human.

<|eot_id|><|start_header_id|>user<|end_header_id|>

Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
#### Context ####
{context}
#### End of Context ####

Question: {question}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

You can then ask a question:

query = "How did AWS perform in 2023?"
result = qa({"query": query})
print(result['result'])
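
Because return_source_documents is enabled, the result also carries the retrieved chunks under the source_documents key, which is useful for checking what an answer was grounded on. The helper below (summarize_sources is an illustrative name) formats them using the year and source metadata attached earlier. The demo uses SimpleNamespace stand-ins and a made-up result so that it runs without an endpoint; in the notebook you would pass the result returned by the chain instead.

```python
from types import SimpleNamespace


def summarize_sources(result, preview_chars=80):
    """Return one summary line per source document in a RetrievalQA-style result."""
    lines = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        year = doc.metadata.get("year", "n/a")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{source} ({year}): {preview}")
    return lines


# Made-up result with a placeholder chunk, for illustration only
fake_result = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(
            page_content="Placeholder chunk text\nspanning two lines.",
            metadata={"year": 2023, "source": "AMZN-2023-10-K-Annual-Report.pdf"},
        )
    ],
}
source_lines = summarize_sources(fake_result)
for line in source_lines:
    print(line)
```

If the listed sources come from the wrong year or an unrelated section, adjusting k or the chunk size is usually a better first step than changing the prompt.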

Parent document retriever chain

Let’s explore a more advanced RAG option with ParentDocumentRetriever. It balances storing small chunks for accurate embeddings and larger chunks to preserve context. First, a parent_splitter divides documents into larger parent chunks. Then, a child_splitter creates smaller child chunks. Child chunks are indexed in a vector store using embeddings for efficient retrieval. To retrieve relevant info, ParentDocumentRetriever fetches child chunks from the vector store, looks up their parent IDs, and returns corresponding larger parent chunks, stored in an InMemoryStore. This approach balances accurate embeddings with contextual information for meaningful retrieval.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

  1. Sometimes, the full documents can be so large that you don’t want to retrieve them as is. In that case, you can first split the raw documents into larger chunks, and then split those into smaller chunks. You then index the smaller chunks, but on retrieval you retrieve the larger chunks (but still not the full documents).
    # This text splitter is used to create the parent documents
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    # This text splitter is used to create the child documents
    # It should create documents smaller than the parent
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    # The vectorstore to use to index the child chunks
    vectorstore_faiss = FAISS.from_documents(
        child_splitter.split_documents(documents),
        sagemaker_embeddings,
    )
    # The storage layer for the parent documents
    store = InMemoryStore()
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore_faiss,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    retriever.add_documents(documents, ids=None)

  2. Now, initialize the chain using the ParentDocumentRetriever. Pass the prompt in using the chain_type_kwargs argument.
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )

  3. Start asking questions:
    query = "How did AWS perform in 2023?"
    result = qa({"query": query})
    print(result['result'])
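
The parent and child lookup described at the start of this section can be illustrated without any libraries. In this sketch, fixed-size character windows stand in for the two splitters and a substring match stands in for the embedding search. Both are simplifications, and the function names and sample text are illustrative.

```python
def window_split(text, size):
    """Split text into consecutive windows of at most size characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(document, parent_size=40, child_size=10):
    """Store parents by ID and index each child with its parent's ID."""
    parent_store = {}
    child_index = []
    for parent_id, parent in enumerate(window_split(document, parent_size)):
        parent_store[parent_id] = parent
        for child in window_split(parent, child_size):
            child_index.append((child, parent_id))
    return parent_store, child_index


def retrieve_parents(term, parent_store, child_index):
    """Match on small child chunks, then return their larger parent chunks."""
    parent_ids = []
    for child, parent_id in child_index:
        if term in child and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parent_store[parent_id] for parent_id in parent_ids]


# Synthetic 80-character document: the search term sits in the second parent
document = "x" * 40 + "y" * 12 + "needle" + "y" * 22
parent_store, child_index = build_index(document)
results = retrieve_parents("needle", parent_store, child_index)
print(len(child_index), "child chunks indexed")
print(results)
```

The match happens on a 10-character child, but the caller receives the full 40-character parent, which is the trade-off ParentDocumentRetriever makes between precise matching and sufficient context.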

Clean up

To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints, either using the following code snippet or the SageMaker console.

predictor.delete_model()
predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

To use the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the embedding and text generation endpoints.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

In this post, we showed you a powerful RAG solution using SageMaker JumpStart to deploy the Llama 3 8B Instruct model and the BGE Large En v1.5 embedding model.

We showed you how to create a robust vector store by processing documents of various formats and generating embeddings. This vector store facilitates retrieving relevant documents based on user queries using LangChain’s retrieval algorithms. We demonstrated the ability to prepare custom prompts tailored for the Llama 3 model, ensuring context-aware responses, and presented these context-specific answers in a human-friendly manner.

This solution highlights the power of SageMaker JumpStart in deploying cutting-edge models and the versatility of LangChain in creating effective RAG applications. By seamlessly integrating these components, we enabled high-quality, context-specific response generation, enhancing the Llama 3 model’s performance across natural language processing tasks. To explore this solution and embark on your context-aware language generation journey, visit the notebook in the GitHub repository.

To get started now, check out SageMaker JumpStart in SageMaker Studio.


About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS. She has over 15 years of IT experience in software development, design and architecture. She helps key enterprise customer accounts on their data, generative AI and AI/ML journeys. She is passionate about data-driven AI and the area of depth in ML and generative AI.

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyperscale on AWS. Marco is based in Seattle, WA, and enjoys writing, reading, exercising, and building applications in his free time.

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Yousuf Athar is a Solutions Architect at AWS specializing in generative AI and AI/ML. With a Bachelor’s degree in Information Technology and a concentration in Cloud Computing, he helps customers integrate advanced generative AI capabilities into their systems, driving innovation and competitive edge. Outside of work, Yousuf loves to travel, watch sports, and play football.

Gaurav Parekh is an AWS Solutions Architect specializing in Generative AI, Analytics and Networking technologies.

Best prompting practices for using Meta Llama 3 with Amazon SageMaker JumpStart

Llama 3, Meta’s latest large language model (LLM), has taken the artificial intelligence (AI) world by storm with its impressive capabilities. As developers and businesses explore the potential of this powerful model, crafting effective prompts is key to unlocking its full potential.

In this post, we dive into the best practices and techniques for prompting Meta Llama 3 using Amazon SageMaker JumpStart to generate high-quality, relevant outputs. We discuss how to use system prompts and few-shot examples, and how to optimize inference parameters, so you can get the most out of Meta Llama 3. Whether you’re building chatbots, content generators, or custom AI applications, these prompting strategies will help you harness the power of this cutting-edge model.

Meta Llama 2 vs. Meta Llama 3

Meta Llama 3 represents a significant advancement in the field of LLMs. Building upon the capabilities of its predecessor Meta Llama 2, this latest iteration brings state-of-the-art performance across a wide range of natural language tasks. Meta Llama 3 demonstrates improved capabilities in areas such as reasoning, code generation, and instruction following compared to Meta Llama 2.

The Meta Llama 3 release introduces four new LLMs by Meta, building upon the Meta Llama 2 architecture. They come in two variants—8 billion and 70 billion parameters—with each size offering both a base pre-trained version and an instruct-tuned version. Additionally, Meta is training an even larger 400-billion-parameter model, which is expected to further enhance the capabilities of Meta Llama 3. All Meta Llama 3 variants boast an impressive 8,000 token context length, allowing them to handle longer inputs compared to previous models.

Meta Llama 3 introduces several architectural changes from Meta Llama 2, using a decoder-only transformer along with a new tokenizer that has a 128,000-token vocabulary to improve token efficiency and overall model performance. Meta has put significant effort into curating a massive and diverse pre-training dataset of over 15 trillion tokens from publicly available sources spanning STEM, history, current events, and more. Meta’s post-training procedures have reduced false refusal rates and aim to better align outputs with human preferences while increasing response diversity.

Solution overview

SageMaker JumpStart is a powerful feature within the Amazon SageMaker machine learning (ML) platform that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models (FMs). With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that they can deploy to dedicated SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment.

With Meta Llama 3 now available on SageMaker JumpStart, developers can harness its capabilities through a seamless deployment process. You gain access to the full suite of Amazon SageMaker MLOps tools, such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, and monitoring—all within a secure AWS environment under virtual private cloud (VPC) controls.

Drawing from our previous learnings with Llama-2-Chat, we highlight key techniques to craft effective prompts and elicit high-quality responses tailored to your applications. Whether you are building conversational AI assistants, enhancing search engines, or pushing the boundaries of language understanding, these prompting strategies will help you unlock Meta Llama 3’s full potential.

Before we continue our deep dive into prompting, let’s make sure we have all the necessary requirements to follow the examples.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

Deploy Meta Llama 3 8B on SageMaker JumpStart

You can deploy your own model endpoint through the SageMaker JumpStart Model Hub available from SageMaker Studio or through the SageMaker SDK. To use SageMaker Studio, complete the following steps:

  1. In SageMaker Studio, choose JumpStart in the navigation pane.
  2. Choose Meta as the model provider to see all the models available by Meta AI.
  3. Choose the Meta Llama 3 8B Instruct model to view the model details such as license, data used to train, and how to use the model. On the model details page, you will find two options, Deploy and Preview notebooks, to deploy the model and create an endpoint.
  4. Choose Deploy to deploy the model to an endpoint.
  5. You can use the default endpoint and networking configurations or modify them based on your requirements.
  6. Choose Deploy to deploy the model.
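
The steps above cover SageMaker Studio. For the SageMaker SDK route, the following is a minimal sketch rather than a definitive recipe. It assumes the `sagemaker` Python package is installed and that your role has permission to create endpoints. The model ID shown is an assumption that you should confirm on the model details page, and `deploy_llama3` is a hypothetical helper name.

```python
# Assumed model ID; confirm the exact value on the model details page.
MODEL_ID = "meta-textgeneration-llama-3-8b-instruct"


def deploy_llama3(model_id: str = MODEL_ID):
    """Deploy a JumpStart model to a real-time endpoint and return its predictor.

    Calling this function creates billable AWS resources, so it is only
    defined here and not invoked.
    """
    # Imported lazily so the sketch can be loaded without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=model_id)
    # Meta models require accepting the end-user license agreement at deploy time.
    return model.deploy(accept_eula=True)
```

Calling `predictor = deploy_llama3()` would return the predictor object that the cleanup section later deletes.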

Crafting effective prompts

Prompting is important when working with LLMs like Meta Llama 3. It is the main way to communicate what you want the model to do and guide its responses. Crafting clear, specific prompts for each interaction is key to getting useful, relevant outputs from these models.

Although language models share some similarities in how they’re built and trained, each has its own differences when it comes to effective prompting. This is because they’re trained on different data, using different techniques and settings, which can lead to subtle differences in how they behave and perform. For example, some models might be more sensitive to the exact wording or structure of the prompt, whereas others might need more context or examples to generate accurate responses. On top of that, the intended use case and domain of the model can also influence the best prompting strategies, because different tasks might benefit from different approaches.

You should experiment and adjust your prompts to find the most effective approach for each specific model and application. This iterative process is crucial for unlocking the full potential of each model and making sure the outputs align with what you’re looking for.

Prompt components

In this section, we discuss the components that Meta Llama 3 Instruct expects in a prompt. Newlines ('\n') are part of the prompt format; for clarity in the examples, they have been represented as actual new lines.

The following is an example instruct prompt with a system message:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>
What can you help me with?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The prompt contains the following key sections:

  • <|begin_of_text|> – Specifies the start of the prompt.
  • <|start_header_id|>system<|end_header_id|> – Specifies the role for the following message (for example, system).
  • You are a helpful AI assistant for travel tips and recommendations – Includes the system message.
  • <|eot_id|> – Specifies the end of the input message.
  • <|start_header_id|>user<|end_header_id|> – Specifies the role for the following message (for example, user).
  • What can you help me with? – Includes the user message.
  • <|start_header_id|>assistant<|end_header_id|> – Ends with the assistant header, to prompt the model to start generation. The model expects the assistant header at the end of the prompt to start completing it.

Following this prompt, Meta Llama 3 completes it by generating the {{assistant_message}}. It signals the end of the {{assistant_message}} by generating the <|eot_id|>.

The following is an example prompt with a single user message:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The following is the system prompt and multiple-turn conversation between the user and assistant:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>
What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Bonjour! The capital of France is Paris!<|eot_id|><|start_header_id|>user<|end_header_id|>
What can I do there?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Paris, the City of Light, offers a romantic getaway with must-see attractions like the Eiffel Tower and Louvre Museum, romantic experiences like river cruises and charming neighborhoods, and delicious food and drink options, with helpful tips for making the most of your trip.<|eot_id|><|start_header_id|>user<|end_header_id|>
Give me a detailed list of the attractions I should visit, and time it takes in each one, to plan my trip accordingly.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
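
The prompt layouts above can be assembled programmatically. The following is a minimal sketch, not an official helper: `build_llama3_prompt` and `header_sep` are hypothetical names, and the single newline after each header mirrors the examples in this post. Meta's published template, as far as we are aware, places a blank line after each header, so adjust `header_sep` to match what your endpoint expects.

```python
from typing import Dict, List

BEGIN = "<|begin_of_text|>"
EOT = "<|eot_id|>"


def header(role: str) -> str:
    """Return the header that announces the role of the next message."""
    return f"<|start_header_id|>{role}<|end_header_id|>"


def build_llama3_prompt(messages: List[Dict[str, str]], header_sep: str = "\n") -> str:
    """Assemble a Meta Llama 3 Instruct prompt from role/content messages.

    The prompt ends with an empty assistant header so that the model
    starts generating the next assistant message.
    """
    parts = [BEGIN]
    for message in messages:
        parts.append(f"{header(message['role'])}{header_sep}{message['content']}{EOT}")
    parts.append(f"{header('assistant')}{header_sep}")
    return "".join(parts)


prompt = build_llama3_prompt([
    {"role": "system", "content": "You are a helpful AI assistant for travel tips and recommendations"},
    {"role": "user", "content": "What can you help me with?"},
])
print(prompt)
```

For a multiple-turn conversation, append each earlier assistant reply to the list as a message with the `assistant` role before the latest user message.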

Fundamental techniques

The following are some fundamental techniques in crafting our prompts:

  • Zero-shot prompting – Zero-shot prompting provides no examples to the model and relies solely on the model’s preexisting knowledge to generate a response based on the instruction given. The following is an example zero-shot prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Provide a brief summary of the causes of World War II.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    The causes of World War II include...
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    

    This produces the following response:

    ...the aggressive expansion of fascist and nationalist ideologies, particularly in Germany, Italy, and Japan, which led to a complex web of alliances and tensions between nations. Some key factors contributing to the outbreak of World War II include:

    1. The Treaty of Versailles: The harsh penalties and reparations imposed on Germany after World War I led to widespread resentment and a sense of injustice, creating a fertile ground for extremist ideologies like Nazism to flourish.
    2. Rise of Fascist and Nationalist Regimes: The rise of fascist regimes in Germany, Italy, and Japan, led by charismatic leaders like Adolf Hitler, Benito Mussolini, and Hideki Tojo, who pursued aggressive expansionist policies and militarization.
    3. Appeasement Policy: The policy of appeasement, pursued by Britain and France, which involved giving in to Germany's demands in the hope of avoiding war, only emboldened Hitler's aggression and created a power vacuum in Europe.
    4. Economic Crisis: The global economic crisis of the 1930s, known as the Great Depression, led to widespread unemployment, poverty, and social unrest, creating an environment in which extremist ideologies could thrive.
    5. Japanese Aggression: Japan's invasion of Manchuria in 1931 and its subsequent withdrawal from the League of Nations marked the beginning of its expansionist policies, which would eventually lead to its alliance with Germany and Italy.
    6. German Remilitarization of the Rhineland: Germany's remilitarization of the Rhineland in 1936, which was demilitarized under the Treaty of Versailles, was a direct challenge to the Treaty and marked the beginning of Germany's aggressive expansion.
    7. Anschluss with Austria: Germany's annexation of Austria in 1938, known as the Anschluss, further expanded its territory and created a sense of unease among its neighbors.
    8. Munich Agreement: The Munich Agreement of 1938, in which Britain and France allowed Germany to annex the Sudetenland, a region of Czechoslovakia with a majority German population, is seen as a failed attempt to appease Hitler and avoid war.
    9. Invasion of Poland: Germany's invasion of Poland on September 1, 1939, marked the beginning of World War II, as it prompted the United Kingdom and France to declare war on Germany.

    These complex and interconnected factors ultimately led to the outbreak of World War II, one of the deadliest conflicts in human history.

  • Few-shot prompting – Few-shot prompting involves providing the model with a few examples (usually two or more) of the desired input and output format. The model learns from these examples to generate an appropriate response for a new input. The following is an example few-shot prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Extract the relevant information from the following paragraph and present it in a JSON format.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Michael Doe, a 45-year-old teacher from Boston, Massachusetts, is an avid reader and enjoys gardening during his spare time.
    Example 1:
    Paragraph: "John Doe is a 32-year-old software engineer from San Francisco, California. He enjoys hiking and playing guitar in his free time."
    "employee": {
        "fullname": "John Doe",
        "city": "San Francisco",
        "state": "California",
        "occupation": "software engineer",
        "hobbies": [
            "hiking",
            "playing guitar"
        ],
        "recentTravel": "not provided"
    },
    Example 2:
    Paragraph: "Emily Jax, a 27-year-old marketing manager from New York City, loves traveling and trying new cuisines. She recently visited Paris and enjoyed the city's rich cultural heritage."
    "employee": {
        "fullname": "Emily Jax",
        "city": "New York City",
        "state": "New York",
        "occupation": "marketing manager",
        "hobbies": [
            "traveling",
            "trying new cuisines"
        ],
        "recentTravel": "Paris"
    }
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    

    This produces the following response:

    "employee": {
        "fullname": "Michael Doe",
        "city": "Boston",
        "state": "Massachusetts",
        "occupation": "teacher",
        "hobbies": [
            "reading",
            "gardening"
        ],
        "recentTravel": "not provided"
    }

  • Task decomposition – Task decomposition is a powerful technique that enhances the performance of LLMs by breaking down complex tasks into smaller, manageable sub-tasks. This approach not only improves efficiency and accuracy, but also allows for better resource management and adaptability to task complexity. The following is an example task decomposition prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Break down the task of planning a vacation into smaller, manageable steps.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    1. Choose a destination.
    2. Set a budget.
    3. Research accommodations.
    4. Plan activities.
    5. Book flights and accommodations.
    6. Pack and prepare for the trip.
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    

    This produces the following response:

    Here's a more detailed breakdown of the task of planning a vacation into smaller, manageable steps:

    **Step 1: Choose a Destination**
    * Brainstorm a list of potential destinations based on interests, climate, and time of year
    * Research each destination to learn more about its culture, attractions, and activities
    * Consider factors such as travel time, cost, and visa requirements
    * Narrow down the options and select a final destination

    **Step 2: Set a Budget**
    * Determine how much money is available for the trip
    * Research the cost of travel, accommodations, food, and activities in the chosen destination
    * Set a daily budget and allocate funds for each aspect of the trip
    * Consider budgeting for contingencies and unexpected expenses

    **Step 3: Research Accommodations**
    * Research different types of accommodations, such as hotels, hostels, and vacation rentals
    * Compare prices and amenities of different options
    * Read reviews and check ratings from other travelers
    * Consider booking accommodations in advance to ensure availability and get the best rates

    **Step 4: Plan Activities**
    * Research top attractions and activities in the destination
    * Create a list of must-see sights and experiences
    * Consider booking tours or guided activities in advance
    * Leave some free time in the itinerary for spontaneity and relaxation

    **Step 5: Book Flights and Accommodations**
    * Compare prices and schedules for flights to the destination
    * Book flights and accommodations that fit within the budget
    * Consider booking a package deal that includes both flights and accommodations
    * Make sure to read and understand the cancellation and refund policies

    **Step 6: Pack and Prepare for the Trip**
    * Research the weather and cultural norms of the destination
    * Create a packing list and start gathering necessary items
    * Check the validity of passports and obtain any necessary visas or travel documents
    * Notify banks and credit card companies of travel plans to avoid transaction issues
    * Make photocopies of important documents and leave them with a trusted friend or family member

    By breaking down the task of planning a vacation into these smaller steps, it becomes more manageable and less overwhelming.

  • Chain-of-thought prompting – Chain-of-thought (CoT) prompting is a technique where the model is prompted to break down its reasoning into a series of intermediate steps before providing a final answer. This allows the model to perform more complex reasoning tasks. The following is an example CoT prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Solve the following math problem step by step.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    If you have 10 apples and you give 3 apples to your friend, then buy 5 more apples, and finally eat 2 apples, how many apples do you have left?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>

    This produces the following response:

    Let's break it down step by step:
    1. You start with 10 apples.
    2. You give 3 apples to your friend, so you have: 10 - 3 = 7 apples left
    3. You buy 5 more apples, so you now have: 7 + 5 = 12 apples
    4. You eat 2 apples, so you have: 12 - 2 = 10 apples left
    Therefore, you have 10 apples left.

To summarize:

  • Zero-shot uses no examples, relying on the model’s existing knowledge
  • Few-shot provides a small number of examples to guide the model
  • Task decomposition enhances LLM performance by breaking down complex tasks into smaller, manageable sub-tasks
  • CoT breaks down complex reasoning into step-by-step prompts

The choice of technique depends on the complexity of the task and the availability of good example prompts. More complex reasoning usually benefits from CoT prompting.
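
To make the few-shot technique concrete, the following sketch builds the user message from a list of worked examples. It is an illustration under assumptions, not the code used to produce the responses above: `build_few_shot_message` is a hypothetical helper, the example records are shortened, and it places the examples before the new paragraph, which is one common layout.

```python
import json
from typing import Any, Dict, List, Tuple


def build_few_shot_message(target: str, examples: List[Tuple[str, Dict[str, Any]]]) -> str:
    """Build a user message that shows worked examples before the new input."""
    lines = []
    for number, (paragraph, expected) in enumerate(examples, start=1):
        lines.append(f"Example {number}:")
        lines.append(f'Paragraph: "{paragraph}"')
        # Serialize the expected answer so every example uses the same JSON layout.
        lines.append(json.dumps(expected, indent=4))
    lines.append("Now extract the same fields from this paragraph:")
    lines.append(f'Paragraph: "{target}"')
    return "\n".join(lines)


examples = [
    (
        "John Doe is a 32-year-old software engineer from San Francisco, California.",
        {"employee": {"fullname": "John Doe", "city": "San Francisco", "occupation": "software engineer"}},
    ),
    (
        "Emily Jax, a 27-year-old marketing manager from New York City, loves traveling.",
        {"employee": {"fullname": "Emily Jax", "city": "New York City", "occupation": "marketing manager"}},
    ),
]
message = build_few_shot_message(
    "Michael Doe, a 45-year-old teacher from Boston, Massachusetts, is an avid reader.",
    examples,
)
print(message)
```

Keeping the examples in a list makes it straightforward to experiment with how many examples the model needs before the output format becomes reliable.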

Meta Llama 3 inference parameters

For Meta Llama 3, the Messages API allows you to interact with the model in a conversational way. You can define the role of the message and the content. The role can be either system, assistant, or user. The system role is used to provide context to the model, and the user role is used to ask questions or provide input to the model.

Users can get tailored responses for their use case using the following inference parameters while invoking Meta Llama 3:

  • Temperature – Temperature is a value between 0 and 1 that regulates the creativity of Meta Llama 3 responses. Use a lower temperature if you want more deterministic responses, and use a higher temperature if you want more creative or different responses from the model.
  • Top-k – This is the number of most-likely candidates that the model considers for the next token. Choose a lower value to decrease the size of the pool and limit the options to more likely outputs. Choose a higher value to increase the size of the pool and allow the model to consider less likely outputs.
  • Top-p – Top-p controls the token choices made by the model during text generation. The model considers only the most probable tokens whose cumulative probability reaches a specified threshold value (p) and ignores the rest. By setting the top-p value below 1.0, the model focuses on the most likely token choices, resulting in more stable and repetitive completions. This approach helps reduce the generation of unexpected or unlikely outputs, providing greater consistency and predictability in the generated text.
  • Stop sequences – This refers to the parameter to control the stopping sequence for the model’s response to a user query. This value can either be "<|start_header_id|>", "<|end_header_id|>", or "<|eot_id|>".
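
To build intuition for how temperature, top-k, and top-p shape the candidate pool, the following toy sketch applies each one to a made-up set of next-token scores. It is a simplified illustration, not the sampling code that Meta Llama 3 or SageMaker uses; the function names and scores are hypothetical.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert raw scores to probabilities; temperature must be greater than 0."""
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())  # subtract the peak for numerical stability
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept


logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.0, "Rome": 0.0}
probs = apply_temperature(logits, temperature=1.0)
print(list(top_k_filter(probs, k=3)))
print(list(top_p_filter(probs, p=0.9)))
print(apply_temperature(logits, temperature=0.1)["Paris"])
```

In this toy setup, lowering the temperature concentrates almost all of the probability on the top token, which is why low temperatures give more deterministic responses.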

The following is an example prompt with inference parameters specific to the Meta Llama 3 model:

Llama3 Prompt:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question.
The context may contain multiple question answer pairs as an example, just answer the final question provided after the context.
If you don't know the answer just say that you don't know. Use three sentences maximum and keep the answer concise.

{context}
Question: {input}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Llama3 Inference Parameters:

max_new_tokens: 100
top_p: 0.92
temperature: 0.1
details: True
stop: '<|eot_id|>'
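
The prompt template and parameters above can be combined into a single request body. The following is a sketch under assumptions: the `inputs`/`parameters` envelope is a common request shape for text generation endpoints but should be confirmed against your model's documentation, the instruction text is shortened, and `build_payload` is a hypothetical helper.

```python
import json
from typing import Any, Dict

# Shortened version of the template above; {context} and {input} are filled at call time.
PROMPT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "Use the retrieved context to answer the question.\n\n"
    "{context}\n"
    "Question: {input}\n"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)


def build_payload(context: str, question: str) -> Dict[str, Any]:
    """Fill the template and attach the inference parameters listed above."""
    prompt = PROMPT_TEMPLATE.format(context=context, input=question)
    return {
        "inputs": prompt,  # assumed key name for the prompt
        "parameters": {    # assumed key name for the inference parameters
            "max_new_tokens": 100,
            "top_p": 0.92,
            "temperature": 0.1,
            "details": True,
            "stop": "<|eot_id|>",
        },
    }


payload = build_payload("Paris is the capital of France.", "What is France's capital?")
print(json.dumps(payload, indent=2))
```

With a deployed endpoint, a payload like this would typically be passed to the predictor, for example `predictor.predict(payload)`.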

Example prompts

In this section, we present two example prompts.

The following prompt is for a question answering use case:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. The context may contain multiple question answer pairs as an example, just answer the final question provided after the context. If you don't know the answer just say that you don't know. Use three sentences maximum and keep the answer concise.
"In the dawn of a new era, where circuits hum and glow, A force awakens, with knowledge to bestow. From labs and minds, a creation takes flight, A blend of code and thought, shining bright. "Beep-boop," it whispers, and worlds are born, Ideas flow freely, like a vast unending morn. From essays to equations, it solves with ease, A marvel for scholars, a guide to please. In the land of Academia, where wisdom dwells, A generative AI, with stories to tell. Transforming lives, minds, and the way we learn, A partnership with humans, a bond to discern. Yet amidst the joy, a shadow looms, For with great creation, consequences bloom. Professors ponder their worth, as AI grows, A future uncertain, as the digital wind blows. But hope emerges, a solution in sight, A synergy of knowledge, a future bright. AI as a partner, not a foe or tool, Together we'll thrive, a new golden rule. For in the end, the lesson is clear, AI's gift of knowledge, for all to revere. Rejoice, dear world, in this newfound bond, The rise of AI, of which we've grown fond. So let us embrace, this change and advance, Side by side, in a harmonious dance. For the future with AI, a world of wonder and awe, A leap into the unknown, that leaves us in awe. This poem captures the excitement, potential, and challenges of the rise of AI, reflecting themes from the provided search results"

Question: what is this poem about?

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This produces the following response:

This poem is about the rise of Artificial Intelligence (AI) and its potential impact on humanity, highlighting both the benefits and challenges it brings. It explores the themes of AI as a partner in learning and growth, while also acknowledging the uncertainty and potential consequences of its development.

The following prompt illustrates code generation:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

You're a code companion assistant whose main job is to assist software developers write efficient and secure code. If you do not know the answer, say I don't know. Also, think step-by-step to arrive at a solution.

Question: Can you write a simple streamlit app that I can use to turn the background color to the input that user selects?

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This produces the following response:

I can help you with that. Here's a simple Streamlit app that allows the user to select a background color from a dropdown menu and updates the background color accordingly:
```
import streamlit as st

# Create a dropdown menu for selecting the background color
color_options = ["Red", "Green", "Blue", "Yellow", "Purple"]
selected_color = st.selectbox("Select Background Color", color_options)

# Set the background color based on the user's selection
if selected_color == "Red":
    st.write("<style>body { background-color: #FF0000; }</style>", unsafe_allow_html=True)
elif selected_color == "Green":
    st.write("<style>body { background-color: #00FF00; }</style>", unsafe_allow_html=True)
elif selected_color == "Blue":
    st.write("<style>body { background-color: #0000FF; }</style>", unsafe_allow_html=True)
elif selected_color == "Yellow":
    st.write("<style>body { background-color: #FFFF00; }</style>", unsafe_allow_html=True)
elif selected_color == "Purple":
    st.write("<style>body { background-color: #800080; }</style>", unsafe_allow_html=True)

# Add a header to display the selected color
st.header(f"Background Color: {selected_color}")
```
This app uses Streamlit's `selectbox` function to create a dropdown menu with the available color options. When the user selects a color, the app updates the background color using HTML styles. The `st.write` function is used to inject the HTML code into the app, and the `unsafe_allow_html=True` parameter allows us to use HTML code in the app.

Note that this is a simple example, and you can customize the app further by adding more color options, using a color picker widget, or even allowing the user to input a custom color code.

Clean up

To avoid incurring unnecessary costs, when you are done, delete the SageMaker endpoints using the following code snippets:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, to use the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the embedding and text generation endpoints.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

Model providers such as Meta AI are releasing improved capabilities of their FMs in the form of new generation model families. It is critical for developers and businesses to understand the key differences between previous generation models and new generation models in order to take full advantage of their capabilities. This post highlighted the differences between previous generation Meta Llama 2 and the new generation Meta Llama 3 models, and demonstrated how developers can discover and deploy the Meta Llama 3 models for inference using SageMaker JumpStart.

To fully take advantage of the model’s extensive abilities, you must understand and apply creative prompting techniques and adjust inference parameters. We highlighted key techniques to craft effective prompts for Meta Llama 3 to help the LLM produce high-quality responses tailored to your applications.

Visit SageMaker JumpStart in SageMaker Studio now to get started. For more information, refer to Train, deploy, and evaluate pretrained models with SageMaker JumpStart, JumpStart Foundation Models, and Getting started with Amazon SageMaker JumpStart. Use the SageMaker notebook provided in the GitHub repository as a starting point to deploy the model and run inference using the prompting best practices discussed in this post.


About the Authors

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.

Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin, and playing the guitar.

Supriya Puragundla is a Senior Solutions Architect at AWS. She helps key customer accounts on their generative AI and AI/ML journey. She is passionate about data-driven AI, and her areas of depth are machine learning and generative AI.

Farooq Sabir is a Senior AI/ML Specialist Solutions Architect at AWS. He holds a PhD in Electrical Engineering from the University of Texas at Austin. He helps customers solve their business problems using data science, machine learning, artificial intelligence, and numerical optimization.

Brayan Montiel is a Solutions Architect at AWS based in Austin, Texas. He supports enterprise customers in the automotive and manufacturing industries, helping to accelerate cloud adoption technologies and modernize IT infrastructure. He specializes in AI/ML technologies, empowering customers to use generative AI and innovative technologies to drive operational growth and efficiencies. Outside of work, he enjoys spending quality time with his family, being outdoors, and traveling.

Jose Navarro is an AI/ML Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production. In his spare time, he loves to exercise, spend quality time with friends and family, and catch up on AI news and papers.

How healthcare payers and plans can empower members with generative AI

In this post, we discuss how generative artificial intelligence (AI) can help health insurance plan members get the information they need. Many health insurance plan beneficiaries find it challenging to navigate through the complex member portals provided by their insurance plans. These portals often require multiple clicks, filters, and searches to find specific information about their benefits, deductibles, claim history, and other important details. This can lead to dissatisfaction, confusion, and increased calls to customer service, resulting in a suboptimal experience for both members and providers.

The problem arises from the inability of traditional UIs to understand and respond to natural language queries effectively. Members are forced to learn and adapt to the system’s structure and terminology, rather than the system being designed to understand their natural language questions and provide relevant information seamlessly. Generative AI technology, such as conversational AI assistants, can potentially solve this problem by allowing members to ask questions in their own words and receive accurate, personalized responses. By integrating generative AI powered by Amazon Bedrock and purpose-built AWS data services such as Amazon Relational Database Service (Amazon RDS) into member portals, healthcare payers and plans can empower their members to find the information they need quickly and effortlessly, without navigating through multiple pages or relying heavily on customer service representatives. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a unified API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The solution presented in this post not only enhances the member experience by providing a more intuitive and user-friendly interface, but also has the potential to reduce call volumes and operational costs for healthcare payers and plans. By addressing this pain point, healthcare organizations can improve member satisfaction, reduce churn, and streamline their operations, ultimately leading to increased efficiency and cost savings.

Figure 1: Solution Demo

Solution overview

In this section, we dive deep to show how you can use generative AI and large language models (LLMs) to enhance the member experience by transitioning from a traditional filter-based claim search to a prompt-based search, which allows members to ask questions in natural language and get the desired claims or benefit details. From a broad perspective, the complete solution can be divided into four distinct steps: text-to-SQL generation, SQL validation, data retrieval, and data summarization. The following diagram illustrates this workflow.

Figure 2: Logical Workflow

Let’s dive deep into each step one by one.

Text-to-SQL generation

This step takes the user’s questions as input and converts that into a SQL query that can be used to retrieve the claim- or benefit-related information from a relational database. A pre-configured prompt template is used to call the LLM and generate a valid SQL query. The prompt template contains the user question, instructions, and database schema along with key data elements, such as member ID and plan ID, which are necessary to limit the query’s result set.

SQL validation

This step validates the SQL query generated in the previous step and makes sure it’s complete and safe to be run on a relational database. Some of the checks that are performed include:

  • No delete, drop, update, or insert operations are present in the generated query
  • The query starts with select
  • WHERE clause is present
  • Key conditions are present in the WHERE clause (for example, member_id = “78687576501” or member_id like “786875765%%”)
  • Query length (string length) is in expected range (for example, not more than 250 characters)
  • Original user question length is in expected range (for example, not more than 200 characters)

If a check fails, the query isn’t run; instead, a user-friendly message suggesting that the user contact customer service is sent.
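The checks listed above can be sketched as a single validation function. This is a minimal illustration rather than the code used in the solution: the function name validate_sql, the reason strings, and the regular expressions are assumptions, and the length limits reuse the example values from the list.

```python
import re
from typing import Tuple

# Limits reuse the example values from the checklist; tune them for your data.
MAX_QUERY_LENGTH = 250
MAX_QUESTION_LENGTH = 200
FORBIDDEN_KEYWORDS = ("delete", "drop", "update", "insert")


def validate_sql(query: str, question: str, member_id: str) -> Tuple[bool, str]:
    """Return (is_valid, reason). Regex-based sketch, not a full SQL parser."""
    if len(question) > MAX_QUESTION_LENGTH:
        return False, "question too long"
    if len(query) > MAX_QUERY_LENGTH:
        return False, "query too long"

    normalized = query.strip().lower()
    if not normalized.startswith("select"):
        return False, "query does not start with select"

    # Match whole words so a column such as "updated_at" is not rejected.
    for keyword in FORBIDDEN_KEYWORDS:
        if re.search(rf"\b{keyword}\b", normalized):
            return False, f"forbidden keyword: {keyword}"

    where_match = re.search(r"\bwhere\b(.*)", normalized, re.DOTALL)
    if where_match is None:
        return False, "missing WHERE clause"

    # The key condition must pin the query to the logged-in member.
    # (?!\d) stops member_id = 1 from matching member_id = 12.
    key_condition = re.compile(
        rf"\bmember_id\s*=\s*'?{re.escape(member_id.lower())}'?(?!\d)"
    )
    if not key_condition.search(where_match.group(1)):
        return False, "missing member_id condition"

    return True, "ok"
```

When the function returns False, the caller would skip the database call and return the user-friendly message described above. Pattern matching of this kind is a first line of defense only; running the query with a read-only database user is a common complementary safeguard.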

Data retrieval

After the query has been validated, it is used to retrieve the claims or benefits data from a relational database. The retrieved data is converted into a JSON object, which is used in the next step to create the final answer using an LLM. This step also checks if no data or too many rows are returned by the query. In both cases, a user-friendly message is sent to the user, suggesting they provide more details.
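As a rough sketch of this step, the following function turns the rows returned by a database cursor into a JSON string and applies the two result-size checks. The names rows_to_json and MAX_ROWS, the threshold of 50, and the message wording are assumptions made for illustration; the post does not specify them.

```python
import json
from datetime import date
from decimal import Decimal

MAX_ROWS = 50  # hypothetical threshold; the post does not state one


def _to_jsonable(value):
    """Convert database types that json.dumps cannot serialize by default."""
    if isinstance(value, Decimal):  # NUMERIC columns such as allowed_amount
        return float(value)
    if isinstance(value, date):  # DATE columns such as claim_date
        return value.isoformat()
    return value


def rows_to_json(column_names, rows, max_rows=MAX_ROWS):
    """Return (json_text, user_message); exactly one of the two is None."""
    if not rows:
        return None, "No matching records were found. Please provide more details."
    if len(rows) > max_rows:
        return None, "Too many records matched. Please narrow your question."
    records = [
        {name: _to_jsonable(value) for name, value in zip(column_names, row)}
        for row in rows
    ]
    return json.dumps(records), None
```

With a DB-API cursor, column_names would typically come from the first element of each entry in cursor.description and rows from cursor.fetchall().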

Data summarization

Finally, the JSON object retrieved in the data retrieval step, along with the user’s question, is sent to the LLM to get the summarized response. A pre-configured prompt template is used to call the LLM and generate a user-friendly summarized response to the original question.

Architecture

The solution uses Amazon API Gateway, AWS Lambda, Amazon RDS, Amazon Bedrock, and Anthropic Claude 3 Sonnet on Amazon Bedrock to implement the backend of the application. The backend can be integrated with an existing web application or portal, but for the purpose of this post, we use a single page application (SPA) hosted on Amazon Simple Storage Service (Amazon S3) for the frontend and Amazon Cognito for authentication and authorization. The following diagram illustrates the solution architecture.

Figure 3: Solution Architecture

The workflow consists of the following steps:

  1. A single page application (SPA) is hosted using Amazon S3 and loaded into the end-user’s browser using Amazon CloudFront.
  2. User authentication and authorization is done using Amazon Cognito.
  3. After a successful authentication, a REST API hosted on API Gateway is invoked.
  4. The Lambda function, exposed as a REST API using API Gateway, orchestrates the logic to perform the functional steps: text-to-SQL generation, SQL validation, data retrieval, and data summarization. The Amazon Bedrock API endpoint is used to invoke the Anthropic Claude 3 Sonnet LLM. Claim and benefit data is stored in a PostgreSQL database hosted on Amazon RDS. Another S3 bucket is used for storing prompt templates that will be used for SQL generation and data summarizations. This solution uses two distinct prompt templates:
    1. The text-to-SQL prompt template contains the user question, instructions, database schema along with key data elements, such as member ID and plan ID, which are necessary to limit the query’s result set.
    2. The data summarization prompt template contains the user question, raw data retrieved from the relational database, and instructions to generate a user-friendly summarized response to the original question.
  5. Finally, the summarized response generated by the LLM is sent back to the web application running in the user’s browser using API Gateway.
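The orchestration in step 4 might look like the following sketch. The four step functions are passed in as parameters so the control flow can be read (and tested) on its own; their names and signatures, the custom:member_id claim, and the message wording are all hypothetical. The location of the claims in the event assumes an API Gateway REST API with Lambda proxy integration and a Cognito user pool authorizer.

```python
import json


def handle_question(question, member_id, generate_sql, validate_sql, run_query, summarize):
    """Run the four functional steps in order and return the text for the member."""
    sql = generate_sql(question, member_id)  # step 1: text-to-SQL generation
    if sql is None or not validate_sql(sql, question, member_id):  # step 2: validation
        return "Sorry, we could not process that question. Please contact customer service."
    rows = run_query(sql)  # step 3: data retrieval
    if not rows:
        return "No matching records were found. Please provide more details."
    return summarize(question, sql, json.dumps(rows))  # step 4: data summarization


def make_lambda_handler(**steps):
    """Wrap handle_question as a handler for an API Gateway proxy integration."""

    def lambda_handler(event, context):
        body = json.loads(event.get("body") or "{}")
        # Assumption: a Cognito user pool authorizer places the token claims here,
        # and "custom:member_id" is a hypothetical custom attribute on the user.
        claims = event["requestContext"]["authorizer"]["claims"]
        answer = handle_question(
            body.get("question", ""), claims["custom:member_id"], **steps
        )
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"answer": answer}),
        }

    return lambda_handler
```

Taking the member ID from the authenticated token rather than from the request body keeps a caller from asking about another member’s claims.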

Sample prompt templates

In this section, we present some sample prompt templates.

The following is an example of a text-to-SQL prompt template:

<role> 
    You are a data analyst and expert in writing PostgreSQL DB queries and healthcare claims data.
</role>
<task> 
    Your task is to generate a SQL query based on the provided DDL, instructions, user_question, examples, and member_id. 
    Always add the condition "member_id =" in the generated SQL query, where the value of member_id will be provided in the member_id XML tag below.
</task>
<member_id> {text1} </member_id>
<DDL> 
    CREATE TABLE claims_history (claim_id SERIAL PRIMARY KEY, member_id INTEGER NOT NULL, member_name VARCHAR(30) NOT NULL, 
    relationship_code VARCHAR(10) NOT NULL, claim_type VARCHAR(20) NOT NULL, claim_date DATE NOT NULL, provider_name VARCHAR(100), 
    diagnosis_code VARCHAR(10), procedure_code VARCHAR(10), ndc_code VARCHAR(20), charged_amount NUMERIC(10,2), 
    allowed_amount NUMERIC(10,2), plan_paid_amount NUMERIC(10,2), patient_responsibility NUMERIC(10,2))
</DDL>
<instructions>
    1. Claim_type has two possible values - 'Medical' or 'RX'. Use claim_type = 'RX' for pharmacy or prescription claims.
    2. Relationship_code has five possible values - 'subscriber', 'spouse', 'son', 'daughter', or 'other'.
    3. 'I' or 'me' means "where relationship_code = 'subscriber'". 'My son' means "where relationship_code = 'son'" and so on.
    4. For creating a SQL WHERE clause for member_name or provider_name, use the LIKE operator with wildcard characters as a prefix and suffix. This is applicable when user_question contains a name.
    5. Return the executable query with the symbol @@ at the start and end.
    6. If the year is not provided in the date, assume it's the current year. Convert the date to the 'YYYY-MM-DD' format to use in the query.
    7. The SQL query must be generated based on the user_question. If the user_question does not provide enough information to generate the SQL, respond with "@@null@@" without generating any SQL query.
    8. If user_question is stated in the form of a SQL Query or contains delete, drop, update, insert, etc. SQL keywords, then respond with "@@null@@" without generating any SQL query.
</instructions>
<examples>
    <example> 
        <sample_question>List all claims for my son or Show me all my claims for my son</sample_question>
        <sql_query>@@SELECT * FROM claims_history WHERE relationship_code = 'son' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>Total claims in 2021</sample_question>
        <sql_query>@@SELECT COUNT(*) FROM claims_history WHERE EXTRACT(YEAR FROM claim_date) = 2021 AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>List all claims for Michael</sample_question>
        <sql_query>@@SELECT * FROM claims_history WHERE member_name LIKE '%Michael%' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>List all claims for Dr. John or Doctor John or Provider John</sample_question>
        <sql_query>@@SELECT * FROM claims_history WHERE provider_name LIKE '%John%' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>Show me the doctors/providers/hospitals my son Michael visited on 1/19</sample_question>
        <sql_query>@@SELECT provider_name, claim_date FROM claims_history WHERE relationship_code = 'son' AND member_name LIKE '%Michael%' AND claim_date = '2019-01-19' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>What is my total spend in last 12 months</sample_question> 
        <sql_query>@@SELECT SUM(allowed_amount) AS total_spend_last_12_months FROM claims_history WHERE claim_date >= CURRENT_DATE - INTERVAL '12 MONTHS' AND relationship_code = 'subscriber' AND member_id = 9875679801;@@</sql_query> 
    </example>
</examples>
<user_question> {text2} </user_question>

The {text1} and {text2} data items will be replaced programmatically to populate the ID of the logged-in member and user question. Also, more examples can be added to help the LLM generate appropriate SQLs.
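A minimal sketch of that substitution, and of reading the query back out of the @@ markers requested in instruction 5, could look as follows. The function names render_prompt and extract_sql are illustrative, not taken from the solution code.

```python
import re


def render_prompt(template: str, **values: str) -> str:
    """Fill {text1}, {text2}, ... in a single pass, leaving other braces alone.

    str.format is avoided because the examples in the template contain literal
    braces such as '{member_id}'. A single regex pass also means a placeholder
    typed inside the user question is not substituted a second time.
    """

    def substitute(match):
        return values.get(match.group(1), match.group(0))

    return re.sub(r"\{(text\d+)\}", substitute, template)


def extract_sql(llm_output: str):
    """Return the SQL between the @@ markers, or None for @@null@@ or no markers."""
    match = re.search(r"@@(.*?)@@", llm_output, re.DOTALL)
    if match is None:
        return None
    sql = match.group(1).strip()
    if not sql or sql.lower() == "null":
        return None
    return sql
```

For example, render_prompt(template, text1=member_id, text2=question) produces the prompt sent to the LLM, and extract_sql(response_text) yields either a query to validate or None when the model declined to generate one.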

The following is an example of a data summarization prompt template:

<role> 
    You are a customer service agent working for a health insurance plan and helping to answer questions asked by a customer. 
</role>
<task> 
    Use the result_dataset containing healthcare claims data to answer the user_question. This result_dataset is the output of the sql_query.
</task>
<instructions>
    1. To answer a question, use simple non-technical language, just like a customer service agent talking to a 65-year-old customer.
    2. Use a conversational style to answer the question precisely.
    3. If the JSON contains a "count" field, it means the count of claims. For example, "count": 6 means there are 6 claims, and "count": 11 means there are 11 claims.
    4. If the result_dataset does not contain meaningful claims data, then respond with one line only: "No data found for the search criteria."
</instructions>
<user_question> {text1} </user_question>
<sql_query> {text2} </sql_query>
<result_dataset> {text3} </result_dataset>

The {text1}, {text2}, and {text3} data items will be replaced programmatically to populate the user question, the SQL query generated in the previous step, and data formatted in JSON and retrieved from Amazon RDS.
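Once a prompt is populated, it can be sent to Anthropic Claude 3 Sonnet through the Amazon Bedrock runtime API. The following is a sketch under stated assumptions: the Region, max_tokens, and temperature values are placeholders, the function names are ours, and error handling is omitted.

```python
import json

MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"


def build_request_body(prompt: str, max_tokens: int = 512) -> str:
    """Build an Anthropic Messages API request body for Amazon Bedrock."""
    return json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,  # assumed value; size it to your answers
            "temperature": 0,  # assumed; low randomness suits SQL generation
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": prompt}]}
            ],
        }
    )


def invoke_claude(prompt: str, region: str = "us-east-1") -> str:
    """Call the model and return the text of the first content block."""
    import boto3  # imported lazily so build_request_body works without AWS access

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(modelId=MODEL_ID, body=build_request_body(prompt))
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```

The same invoke_claude helper could serve both the text-to-SQL call and the summarization call, because only the populated prompt differs between them.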

Security

Amazon Bedrock is in scope for common compliance standards such as Service and Organization Control (SOC), International Organization for Standardization (ISO), and Health Insurance Portability and Accountability Act (HIPAA) eligibility, and you can use Amazon Bedrock in compliance with the General Data Protection Regulation (GDPR). The service enables you to deploy and use LLMs in a secured and controlled environment. The Amazon Bedrock VPC endpoints powered by AWS PrivateLink allow you to establish a private connection between the virtual private cloud (VPC) in your account and the Amazon Bedrock service account. It enables VPC instances to communicate with service resources without the need for public IP addresses. We define the different accounts as follows:

  • Customer account – This is the account owned by the customer, where they manage their AWS resources such as RDS instances and Lambda functions, and interact with the Amazon Bedrock hosted LLMs securely using Amazon Bedrock VPC endpoints. You should manage access to Amazon RDS resources and databases by following the security best practices for Amazon RDS.
  • Amazon Bedrock service accounts – This set of accounts is owned and operated by the Amazon Bedrock service team, which hosts the various service APIs and related service infrastructure.
  • Model deployment accounts – The LLMs offered by various vendors are hosted and operated by AWS in separate accounts dedicated for model deployment. Amazon Bedrock maintains complete control and ownership of model deployment accounts, making sure no LLM vendor has access to these accounts.

When a customer interacts with Amazon Bedrock, their requests are routed through a secured network connection to the Amazon Bedrock service account. Amazon Bedrock then determines which model deployment account hosts the LLM requested by the customer, finds the corresponding endpoint, and routes the request securely to the model endpoint hosted in that account. The LLMs are used for inference tasks, such as generating text or answering questions.

No customer data is stored within Amazon Bedrock accounts, nor is it ever shared with LLM providers or used for tuning the models. Communications and data transfers occur over private network connections using TLS 1.2+, minimizing the risk of data exposure or unauthorized access.

By implementing this multi-account architecture and private connectivity, Amazon Bedrock provides a secure environment, making sure customer data remains isolated and secure within the customer’s own account, while still allowing them to use the power of LLMs provided by third-party providers.

Conclusion

Empowering health insurance plan members with generative AI technology can revolutionize the way they interact with their insurance plans and access essential information. By integrating conversational AI assistants powered by Amazon Bedrock and using purpose-built AWS data services such as Amazon RDS, healthcare payers and insurance plans can provide a seamless, intuitive experience for their members. This solution not only enhances member satisfaction, but can also reduce operational costs by streamlining customer service operations. Embracing innovative technologies like generative AI becomes crucial for organizations to stay competitive and deliver exceptional member experiences.

To learn more about how generative AI can accelerate health innovations and improve patient experiences, refer to Payors on AWS and Transforming Patient Care: Generative AI Innovations in Healthcare and Life Sciences (Part 1). For more information about using generative AI with AWS services, refer to Build generative AI applications with Amazon Aurora and Knowledge Bases for Amazon Bedrock and the Generative AI category on the AWS Database Blog.


About the Authors

Sachin Jain is a Senior Solutions Architect at Amazon Web Services (AWS) with a focus on helping Healthcare and Life Sciences customers in their cloud journey. He has over 20 years of experience in the technology, healthcare, and engineering space.

Sanjoy Thanneer is a Sr. Technical Account Manager with AWS based out of New York. He has over 20 years of experience working in database and analytics domains. He is passionate about helping enterprise customers build scalable, resilient, and cost-efficient applications.

Sukhomoy Basak is a Sr. Solutions Architect at Amazon Web Services, with a passion for Data, Analytics, and GenAI solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.

Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

As generative AI moves from proofs of concept (POCs) to production, we’re seeing a massive shift in how businesses and consumers interact with data, information, and each other. In what we consider “Act 1” of the generative AI story, we saw previously unimaginable amounts of data and compute create models that showcase the power of generative AI. Just last year, many businesses, and even more individuals, were focused on learning and experimenting, and the sheer number of POCs was impressive. Thousands of customers across diverse industries each conducted anywhere from dozens to hundreds of experiments as they explored the potential of generative AI applications and their implications.

By early 2024, we are beginning to see the start of “Act 2,” in which many POCs are evolving into production, delivering significant business value. To learn more about Act 1 and Act 2, refer to Are we prepared for “Act 2” of gen AI?. The move to a production mindset focuses new attention on key challenges as companies build and evaluate models on specific tasks and search for the leanest, fastest, and most cost-effective options. Considering, and reducing, the investment required for production workloads means bringing new efficiency to the sometimes complicated process of building, testing, and fine-tuning foundation models (FMs).

Delivering capabilities that increase efficiency and reduce costs

Offering multiple entry points to the generative AI journey is critical to delivering value to companies moving their generative AI applications into production. Our generative AI technology stack provides the services and capabilities necessary to build and scale generative AI applications, from Amazon Q (the most capable generative AI–powered assistant for accelerating software development) at the top layer to Amazon Bedrock (the easiest way to build and scale generative AI applications with foundation models) at the middle layer to Amazon SageMaker (purpose-built to help you build, train, and deploy FMs) at the foundational, bottom layer. While these layers provide different points of entry, the fundamental truth is that every generative AI journey starts at the foundational bottom layer.

Organizations that want to build their own models or want granular control are choosing Amazon Web Services (AWS) because we are helping customers use the cloud more efficiently and leverage more powerful, price-performant AWS capabilities such as petabyte-scale networking capability, hyperscale clustering, and the right tools to help you build. Our deep investment in this layer enhances the capabilities and efficiency of the services we provide at higher layers.

To make generative AI use cases economical, you need to run your training and inference on incredibly high-performing, cost-effective infrastructure that’s purpose-built for AI. Amazon SageMaker makes it easy to optimize at each step of the model lifecycle, whether you are building, training, or deploying. However, FM training and inference present challenges—including operational burden, overall cost, and performance lag that contributes to an overall subpar user experience. State-of-the-art generative AI models are averaging latencies in the order of seconds, and many of today’s massive models are too large to fit into a single instance.

In addition, the blistering pace of model optimization innovations leaves model builders with months of research to learn and implement these techniques, even before finalizing deployment configurations.

Introducing Amazon Elastic Kubernetes Service (Amazon EKS) in Amazon SageMaker HyperPod

Recognizing these challenges, AWS launched Amazon SageMaker HyperPod last year. Taking efficiency one step further, earlier this week, we announced the launch of Amazon EKS support on Amazon SageMaker HyperPod. Why? Because provisioning and managing the large GPU clusters needed for AI can pose a significant operational burden. And training runs that take weeks to complete are challenging, since a single failure can derail the entire process. Ensuring infrastructure stability and optimizing performance of distributed training workloads can also pose challenges.

Amazon SageMaker HyperPod provides a fully managed service that removes the operational burden and enables enterprises to accelerate FM development at an unprecedented scale. Now, support for Amazon EKS in Amazon SageMaker HyperPod makes it possible for builders to manage their SageMaker HyperPod clusters using Amazon EKS. Builders can use a familiar Kubernetes interface while eliminating the undifferentiated heavy lifting involved in setting up and optimizing these clusters for generative AI model development at scale. SageMaker HyperPod provides a highly resilient environment that automatically detects, diagnoses, and recovers from underlying infrastructure faults so that builders can train FMs for weeks or months at a time with minimal disruption.

Customer quote: Articul8 AI

“Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources more efficiently with minimum downtime. We were early adopters of the Slurm-based SageMaker HyperPod service and have benefitted from its ease-of-use and resiliency features, resulting in up to 35% productivity improvement and rapid scale up of our gen AI operations.

As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us because it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers because we are now able to package and productize this capability into our gen AI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Arun Subramaniyan, Founder and CEO of Articul8 AI

Bringing new efficiency to inference

Even with the latest advancements in generative AI modeling, the inference phase remains a significant bottleneck. We believe that businesses creating customer or consumer-facing generative AI applications shouldn’t have to sacrifice performance for cost-efficiency. They should be able to get both. That’s why two months ago, we released the inference optimization toolkit on Amazon SageMaker, a fully managed solution that provides the latest model optimization techniques, such as speculative decoding, compilation, and quantization. Available across SageMaker, this toolkit offers a simple menu of the latest optimization techniques that can be used individually or together to create an “optimization recipe.” Thanks to easy access and implementation of these techniques, customers can achieve up to ~2x higher throughput while reducing costs by ~50% for generative AI inference.

Responsible model deployment that is safe and trustworthy

While cost and performance are critical issues, it’s important not to lose sight of other concerns that come to the forefront as we shift from POC to production. No matter what model you choose, it needs to be deployed in a safe, trustworthy, and responsible way. We all need to be able to unlock generative AI’s full potential while mitigating its risks. It should be easy to implement safeguards for your generative AI applications, customized to your requirements and responsible AI policies.

That’s why we built Amazon Bedrock Guardrails, a service that provides customizable safeguards so you can filter prompts and model responses. Guardrails can help block specific words or topics. Customers can also use Guardrails to help identify and prevent restricted content from reaching end users.

We also have filters for harmful content and personally identifiable information (PII) and security checks for malicious prompts, such as prompt injections. Recently, we also developed guardrails to help reduce hallucinations by checking that responses are found in the source material and related to the query.

Delivering value with game-changing innovation

Our partnership with the NFL and our joint Next Gen Stats program offer impressive proof of how a production mindset is delivering true value not only to an organization but to people across the world. By using AWS AI tools and engineers, the NFL is taking tackle analysis to the next level, giving teams, broadcasters, and fans deeper insights into one of football’s most crucial skills—tackling. As fans know, tackling is a complex, evolving process that unfolds throughout each play. But traditional stats only tell part of the story. That’s why the NFL and AWS created Tackle Probability—a groundbreaking AI-powered metric that can identify a missed tackle, when and where that tackle attempt took place, and do it all in real time. For further detail, go to NFL on AWS.

Building this stat required 5 years of historical data to train an AI model on Amazon SageMaker capable of processing millions of data points per game, tracking 20 different features for each of the 11 defenders every tenth of a second. The result is a literally game-changing stat that provides unprecedented insights. Now the NFL can quantify tackling efficiency in ways never before possible. A defender can be credited with 15 tackle attempts in a game without a single miss, or we can measure how many missed tackles a running back forced. All told, there will be at least 10 new stats from this model.

For the NFL, coaches can now quantify tackling efficiency and identify players who consistently put themselves in the right position to make the play. And broadcasters can highlight broken or made tackles to fans in real time.

Building breakthroughs with AWS

The NFL is far from alone in using AWS to shift its focus from POC to production. Exciting startups like Evolutionary Scale are making it easy to generate new proteins and antibodies. Airtable is making it easier for their customers to use their data and build applications. And organizations like Slack are embedding generative AI into the workday. Fast-moving, successful startups are choosing AWS to build and accelerate their businesses. In fact, 96 percent of all AI/ML unicorns, and 90 percent of the 2024 Forbes AI 50, are AWS customers.

Why? Because we’re addressing the cost, performance, and security issues that enable production-grade generative AI applications. We’re empowering data scientists, ML engineers, and other builders with new capabilities that make generative AI development faster, easier, more secure, and less costly. We’re making FM building and tuning—and a portfolio of intuitive tools that make it happen—available to more organizations as part of our ongoing commitment to the democratization of generative AI.

Fueling the next wave of innovation

Optimizing costs, boosting production efficiency, and ensuring security are among the top challenges as generative AI evolves from POC to production. We’re helping address these issues by adding innovative new capabilities to Amazon SageMaker, Amazon Bedrock, and beyond. And we’re lowering the barriers to entry by making these tools available to everyone, from large enterprises with ML teams to small businesses and individual developers just getting started. Empowering more people and organizations to experiment with generative AI creates an explosion of creative new use cases and applications. That’s exactly what we’re seeing as generative AI continues its rapid evolution from a fascinating technology to a day-to-day reality, improving experiences, inspiring innovation, boosting the competitive edge, and creating significant new value.


About the author

Baskar Sridharan is the Vice President for AI/ML and Data Services & Infrastructure, where he oversees the strategic direction and development of key services, including Bedrock, SageMaker, and essential data platforms like EMR, Athena, and Glue.

Prior to his current role, Baskar spent nearly six years at Google, where he contributed to advancements in cloud computing infrastructure. Before that, he dedicated 16 years to Microsoft, playing a pivotal role in the development of Azure Data Lake and Cosmos, which have significantly influenced the landscape of cloud storage and data management.

Baskar earned a Ph.D. in Computer Science from Purdue University and has since spent over two decades at the forefront of the tech industry.

He has lived in Seattle for over 20 years, where he, his wife, and two children embrace the beauty of the Pacific Northwest and its many outdoor activities. In his free time, Baskar enjoys practicing music and playing cricket and baseball with his kids.

Scaling Thomson Reuters’ language model research with Amazon SageMaker HyperPod

Thomson Reuters, a global content and technology-driven company, has been using artificial intelligence and machine learning (AI/ML) in its professional information products for decades. The introduction of generative AI provides another opportunity for Thomson Reuters to work with customers and advance how they do their work, helping professionals draw insights and automate workflows, enabling them to focus their time where it matters most.

In this post, we explore the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted large language models (LLMs) using Amazon SageMaker HyperPod, an Amazon Web Services (AWS) feature focused on providing purpose-built infrastructure for distributed training at scale.

LLMs disrupt the industry

Towards the end of 2022, groundbreaking LLMs were released that realized drastic improvements over previous model capabilities. The resulting technology opened new doors to enhancing customer experiences by tailoring content, recommendations, and responses to individual customers in natural chat-like interfaces. For many businesses, the race was on to bring this technology into their products to maintain or gain competitive advantage. Thomson Reuters was no exception and keenly felt the need to help its customers be successful in this burgeoning, AI-augmented world.

As with any technology, proper application and understanding of its limitations is critical. Consider the following elements.

  • Hallucinations – LLMs have a remarkable ability to respond to natural language, and clearly encode significant amounts of knowledge. However, the stochastic nature of the technology means that responses are based on the probability of word occurrences. An LLM doesn’t model facts so much as it models language. The model has no idea if the words (tokens) generated are factually correct, though it may have successfully modeled the correct sequence of words to represent facts. As a result, LLMs may hallucinate—in other words, they may generate text that is untrue.
  • Quality – While the general knowledge encoded in the latest LLMs is remarkably good, it may not be enough for your business or customer domains. Public and commercial LLMs are based on the knowledge of the internet, not what is behind your business’s closed doors. Adding to the problem, bias and factually incorrect information exist on the internet, and there often isn’t enough transparency about what data is used and how commercial models are trained with it. Further, LLMs only encode knowledge available up to their last training. They may not be up to date, and businesses do not control the frequency of model retraining.
  • Speed, cost, and capacity – Depending on your use cases, you may find that existing commercial LLMs are too slow, too expensive, or in such high demand that you cannot purchase enough capacity to meet your requirements. (This may only be a temporary challenge because we’ve observed increased capacity and reduced cost as hardware, optimizations, and economies of scale continue to improve.)

Thomson Reuters’ customers require professional-grade AI. They are professionals with discerning information needs in legal, corporate, tax, risk, fraud, compliance, and news domains. Take, for example, legal customers. US law is based on legal precedent—the outcomes of past trial cases are used to determine decisions in new cases. Not only does Thomson Reuters curate and enhance publicly available content such as regulations and laws, but it also has decades of editorial content on most aspects of the law that it analyzes and reflects upon. Legal research is a critical area for Thomson Reuters customers—it needs to be as complete as possible. It needs to be grounded in fact—any kind of errors in fact are highly problematic. Solutions should be grounded in the content and data that Thomson Reuters has.

Research and training experimentation

Thinking about the limitations of publicly available, commercial language models as described in the previous section, Thomson Reuters asked themselves the following questions:

  • Can Thomson Reuters’ editorially created, curated, or enhanced data be used to improve LLM knowledge for specific business tasks?
  • Would smaller LLMs (for example, 12–30B parameters) trained with Thomson Reuters data perform on a par with very large LLMs upwards of a trillion parameters?
  • What methods could be employed to train the Thomson Reuters domain-specific models to get the best results?

The potential benefits fell in three areas: quality, agency, and operational efficiency. With full access to model training, it’s possible that Thomson Reuters could tune LLM generation to their domain and allow for tighter Retrieval Augmented Generation (RAG) integration. This would directly impact quality. And if Thomson Reuters own the models, they would control how and when they are trained and updated. Lastly, if smaller tuned models could perform sufficiently, it could be a more cost-effective and scalable solution—improving overall operational efficiency.

Thomson Reuters’ research focused around answering these specific questions:

  • How well do foundation models (FMs) (in the 7–30B parameters range) perform on specific tasks, unmodified? (This would be the baseline.)
  • Does performance improve for specific tasks when augmented with Thomson Reuters domain-specific data using various training techniques?

To frame this research and give concrete evaluation targets, Thomson Reuters focused on several real-world tasks: legal summarization, classification, and question answering. Publicly available general textual data was used, as well as domain specific textual data from Thomson Reuters’ comprehensive stores of primary and secondary US law material. Primary law would include content published by the courts and enhanced by Thomson Reuters. Secondary law would include subject matter expert (SME) analysis and annotation of the law.

Thomson Reuters knew they would need to run a series of experiments—training LLMs from 7B to more than 30B parameters, starting with an FM and continuous pre-training (using various techniques) with a mix of Thomson Reuters and general data. Model fine-tuning would then take place to evaluate how much better it performed on specific legal tasks while at the same time evaluating for any loss in general knowledge or language understanding.

  1. Continuous pre-training – By further pre-training an existing FM, Thomson Reuters wished to enrich its understanding of legalese without compromising its general language abilities. This was largely an experiment in finding the right mix of domain and general training data to retain general knowledge while increasing domain-specific knowledge. Perplexity was used to measure impact of domain-specific training on general knowledge capabilities of the model.
  2. Instruction fine-tuning – This would be an exercise in generating impactful instruction datasets, including legal and general tasks. Thomson Reuters experimented with pre-training open source FMs, such as MPT, Flan-T5, and Mistral, and compared against industry standard commercial models, such as OpenAI’s GPT-4. In this case, ROUGE was used to measure how well models performed on tasks.

Scaling language model training with Amazon SageMaker HyperPod

Thomson Reuters knew that training LLMs would require significant computing power. Training an LLM of even 7B parameters is a compute-intensive operation, requiring multi-node distributed computing capabilities. These compute nodes typically need large GPUs or similar hardware. In Thomson Reuters’ case, they focused on NVIDIA’s high performance A100 family of GPUs. Amazon Elastic Compute Cloud (Amazon EC2) P4d and P4de instances provided Thomson Reuters with the high performance they needed.

To estimate just how much compute power was required, Thomson Reuters used the Chinchilla scaling law to determine how much training data (in tokens) would be needed to retain quality at a given model size. The scaling law is based on published research finding that model size and the number of training tokens should scale proportionally. From there, other publicly available information was used to estimate how much time (in days) would be required to complete training with a given number of GPUs.

                    Model size
P4d         #GPUs   2.6b (days)   6.6b (days)   13b (days)   30b (days)   65b (days)
8           64      1             6.6           24           125.4        918.4
16          128     0.5           3.3           12           62.7         459.2
32          256     0.2           1.7           6            31.3         229.6
55          440     0.1           1             3.5          17.9         164
64          512     0.1           0.9           3            15.7         114.8
Chinchilla point    52b           132b          260b         600b         1.3t

So, for example, a 6.6B parameter model would require 132B input tokens and take just under 7 days to finish training with 64 A100 GPUs (or 8 P4d instances).
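
The arithmetic behind this example can be sketched in a few lines of Python. The Chinchilla points in the table work out to about 20 training tokens per parameter (132B tokens for 6.6B parameters), so the sketch assumes that ratio. It also assumes training time scales inversely with GPU count, which is an idealization; the function names are illustrative and not part of any Thomson Reuters tooling.

```python
TOKENS_PER_PARAM = 20  # ratio implied by the table (132B tokens / 6.6B params)
GPUS_PER_P4D = 8       # the table pairs 8 P4d instances with 64 GPUs


def chinchilla_tokens(params: int) -> int:
    """Approximate training tokens for a model with `params` parameters."""
    return params * TOKENS_PER_PARAM


def scaled_days(baseline_days: float, baseline_gpus: int, gpus: int) -> float:
    """Estimate training days on `gpus`, assuming ideal linear scaling."""
    return baseline_days * baseline_gpus / gpus


if __name__ == "__main__":
    tokens = chinchilla_tokens(6_600_000_000)
    print(f"6.6B model: {tokens / 1e9:.0f}B tokens")
    # Start from the table's 6.6 days on 8 P4d instances and double the cluster.
    days = scaled_days(6.6, 8 * GPUS_PER_P4D, 16 * GPUS_PER_P4D)
    print(f"16 P4d instances: {days:.1f} days")
```

Real training runs rarely scale perfectly, so estimates like these are best treated as a lower bound on wall-clock time.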

Apart from the ability to easily provision compute, there are other factors such as cluster resiliency, cluster management (CRUD operations), and developer experience, which can impact LLM training. With potentially hundreds of GPUs working in parallel, hardware failures are inevitable. To resolve these issues, customers typically have to identify, isolate, repair, and recover the faulty instance, or change configurations to continue without it, further delaying progress.

In order to provision a highly scalable cluster that is resilient to hardware failures, Thomson Reuters turned to Amazon SageMaker HyperPod. SageMaker HyperPod is a managed service that makes it easier for you to train FMs without interruptions or delays. It provides resilient and persistent clusters for large-scale deep learning training of FMs on long-running compute clusters. SageMaker HyperPod offers an interactive experience for rapid experimentation at scale, with resilience to hardware failures, enabling uninterrupted training jobs spanning weeks or months. With Amazon Elastic Kubernetes Service (Amazon EKS) support in SageMaker HyperPod, customers can associate a HyperPod cluster with an EKS cluster and manage ML workloads using the HyperPod cluster nodes as Kubernetes worker nodes, all through the Kubernetes control plane on the EKS cluster.

Amazon EKS support in SageMaker HyperPod offers several key resiliency features to make uninterrupted and efficient training of large ML models possible:

  1. Deep health checks – This is a managed health check for stress testing GPUs and AWS trn1 instances, as well as performing Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, and node replacement phase and can be easily enabled or disabled through HyperPod APIs.
  2. Automatic node replacement – A monitoring agent performs managed, lightweight, and noninvasive checks, coupled with automated node replacement capability. This monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
  3. Auto-resume – SageMaker HyperPod provides job auto-resume capability using the Kubeflow training operator for PyTorch so that training jobs can recover and continue in the event of interruptions or failures. The extension makes sure that the job waits and restarts after the node is replaced.

Initial findings

Over the course of 5 months, Thomson Reuters successfully ran 20 training jobs using Amazon SageMaker HyperPod. They were able to scale their cluster up to 16 P4d instances, with their largest job using the entire cluster. Thomson Reuters trained a 70B parameter model on 400B input tokens, with the entire training job taking 36 days to complete. During that period, Thomson Reuters experienced zero hardware failures.

Continuous pre-training

In continuous pre-training, you train from an existing open source LLM checkpoint. This is more than a time-saver; it is a strategic decision that allows for the nuanced growth of the model’s capabilities over time. The preliminary results of Thomson Reuters’ experimentation showed that they were able to train models on the legal domain without losing general knowledge.

Thomson Reuters used a measure called perplexity. It quantifies how well the model predicts a sample of text. In essence, perplexity measures the confidence a model has in its predictions. Lower perplexity indicates that the model is more certain about its predictions. From the following graph, you can see that as Thomson Reuters increased their batches of training, legal perplexity decreased while general perplexity increased somewhat, before quickly leveling off.
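
As a rough illustration of the metric (not Thomson Reuters’ evaluation code), perplexity can be computed as the exponential of the average negative log-likelihood the model assigns to each token in a sample. The sketch below assumes you already have the per-token natural-log probabilities.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each actual token in the sample.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that gives probability 0.5 to every true token has perplexity 2.
confident = perplexity([math.log(0.5)] * 4)
# Lower probabilities on the true tokens give a higher perplexity.
uncertain = perplexity([math.log(0.1)] * 4)
print(confident, uncertain)
```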

Instruction fine-tuning (IFT)

Instruct fine-tuned LLMs are tuned to respond to specific instructions, enabling tasks such as question answering, summarization, and brainstorming. For instance, human-written instruction datasets include prompts such as “summarize this article” or “list fun weekend activities.” Thomson Reuters’ hypothesis was that legal LLMs can benefit from diverse legal instructions.

Thomson Reuters has discovered that their legal LLM greatly benefits from a vast array of diverse instructions. By compiling legal instructions, such as drafting legal headnotes, and combining them with publicly available instructions, Thomson Reuters’ MPT-TR-7b model, derived from MPT-7b, has showcased improvements correlated with an increased number of instruction datasets provided.

Thomson Reuters used an automatic measure called ROUGE to determine how well domain-adapted models performed compared to GPT-4. This automatic measure, based on term overlap, is not the same as human preference judgment, but gives Thomson Reuters some degree of confidence that they are headed in the right direction.
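
To make “term overlap” concrete, here is a minimal ROUGE-1 style sketch that scores unigram overlap between a candidate and a reference. It assumes simple whitespace tokenization and no stemming; production evaluations typically rely on an established ROUGE package, and this is not the scoring code used in the experiments.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (ROUGE-1 style)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Four of five words overlap even though the two sentences disagree.
scores = rouge1("the court denied the motion", "the court granted the motion")
print(scores)
```

The example pair scores highly on overlap despite stating opposite outcomes, which is why overlap metrics complement human judgment rather than replace it.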

Legal summarization

Thomson Reuters’ MPT-TR-7b model has demonstrated proficiency in legal summarization tasks, rivaling GPT-4’s performance when evaluated with automatic metrics assessing word overlap with reference summaries. While a human-based evaluation would offer deeper insights, the initial results are compelling evidence of the model’s capabilities. The following graph compares Thomson Reuters’ model with GPT-4.

Legal classification

In other legal tasks, such as classification (measured by accuracy and by precision or recall), there’s still room to improve. Nonetheless, the performance uptick is evident with the expansion of instruction datasets, as shown in the following graph. Even more exciting is the leap in performance observed with larger base models such as MPT-30b.

Conclusion

In this post, we have discussed how Thomson Reuters was able to meet their LLM training requirements using Amazon SageMaker HyperPod. Using Amazon EKS on HyperPod, Thomson Reuters was able to scale up their capacity and easily run their training jobs, permitting Thomson Reuters to unlock benefits of LLMs in areas such as legal summarization and classification.

If your business operates in specialized or deep verticals with knowledge not generally available on the web, experimenting with model training may make sense. At the same time, you’ll need to weigh the costs associated with training and inference as well as keeping up with rapidly advancing LLM technology. Like Thomson Reuters, you might want to start with RAG solutions with off-the-shelf LLMs as a first step, then consider customization options from there. If you do decide that training LLMs makes sense, then you’ll need considerable computational power. Amazon SageMaker HyperPod helps you to provision and manage the infrastructure required. Read more about Amazon SageMaker HyperPod and Amazon EKS support in SageMaker HyperPod.


About the Authors

John Duprey is a Distinguished Engineer at Thomson Reuters Labs with over 25 years of experience. In his role, John drives innovative solutions to complex problems and champions engineering excellence and culture. Recently, he has contributed to Thomson Reuters’ generative AI initiatives, focusing on scalability, platform design, and SDK development.

Adam Raffe is a Principal Solutions Architect at AWS. With over 8 years of experience in cloud architecture, Adam helps large enterprise customers solve their business problems using AWS.

Vu San Ha Huynh is a Solutions Architect at AWS. He has a PhD in computer science and enjoys working on different innovative projects to help support large enterprise customers.

Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS Service Teams to enable the next generation of AI/ML workloads on AWS. Ankit’s experience includes product management expertise within the financial services industry for high-frequency/low-latency trading and business development for Amazon Alexa.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He specializes in large model training workloads helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Simone Zucchet is a Solutions Architect Manager at AWS. With over 6 years of experience as a Cloud Architect, Simone enjoys working on innovative projects that help transform the way organizations approach business problems. He helps support large enterprise customers at AWS and is part of the Machine Learning TFC. Outside of his professional life, he enjoys working on cars and photography.

Introducing Amazon EKS support in Amazon SageMaker HyperPod

We are thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automated node and job resiliency features for foundation model (FM) development.

FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of these interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.
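
To put those figures in perspective, a quick back-of-the-envelope calculation (a sketch using only the numbers quoted above) gives the average time between interruptions:

```python
def mean_hours_between_interruptions(days: float, interruptions: int) -> float:
    """Average hours of wall-clock time per interruption."""
    return days * 24 / interruptions


# 419 interruptions over a 54-day run.
hours = mean_hours_between_interruptions(54, 419)
print(f"about one interruption every {hours:.1f} hours")
```

At that rate, roughly one interruption every three hours, manual detection and recovery would consume a meaningful share of the training window.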

Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With the EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and managed Kubernetes control plane on the EKS cluster.

AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:

“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce time we spent for undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”

– Observea

“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Articul8 AI

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

The post is organized into the following three sections:

  • Overview of Amazon EKS support in SageMaker HyperPod – This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introducing three key resiliency features HyperPod compute provides on the EKS cluster. Additionally, this section explains how HyperPod provides a smooth developer experience for admins and scientists.
  • HyperPod cluster setup and node resiliency features – This section provides a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, emphasizing how its built-in resiliency features provide infrastructure stability. This section is especially beneficial for admins.
  • Training job resiliency with the job auto resume functionality – In this section, we demonstrate how scientists can submit and manage their distributed training jobs using either the native Kubernetes CLI (kubectl) or optionally the new HyperPod CLI (hyperpod) with automatic job recovery enabled.

Overview of EKS support in SageMaker HyperPod

This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.

Architecture overview

Amazon EKS support in HyperPod supports a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). You have three virtual private clouds (VPCs) in this architecture, hosting different types of resources:

  • Amazon EKS VPC – An AWS managed VPC hosts the EKS control plane. This VPC doesn’t appear in the customer account. Amazon EKS creates a highly available endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using tools like kubectl). The managed endpoint uses Network Load Balancer to load balance Kubernetes API servers.
  • HyperPod VPC – An AWS managed VPC hosts the HyperPod compute. This VPC doesn’t appear in the customer account. The nodes connect to the EKS control plane through a cross-account elastic network interface (ENI).
  • SageMaker user VPC – A user-managed VPC hosts resources such as Amazon FSx for Lustre, which is optionally associated with Amazon Simple Storage Service (Amazon S3) using a data repository association, on your account.

Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services on your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.

The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.

HyperPod EKS Architecture

HyperPod-managed resiliency features

Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:

  • Deep health checks – This is a managed health check for stress testing GPUs and AWS Trainium instances, as well as performing Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, or node replacement phases, and can be enabled or disabled through HyperPod APIs.
  • Automated node recovery – HyperPod performs managed, lightweight, and non-invasive checks, coupled with automated node replacement capability. The HyperPod monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
  • Job auto resume – SageMaker HyperPod provides a job auto resume capability using the Kubeflow Training Operator for PyTorch to provide recovery and continuation of training jobs in the event of interruptions or failures. The extension makes sure the job waits and restarts after the node is replaced.

User experiences

In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists that are critical for managing a large cluster and running large-scale training jobs on them as part of the Amazon EKS integration:

  • Admin experience – SageMaker HyperPod provides APIs and a console experience to create and manage node groups in the EKS cluster, along with the ability to SSH into the cluster nodes. SageMaker HyperPod also provides a mechanism to install additional dependencies on the cluster nodes using lifecycle scripts, and an API-based mechanism to provide cluster software updates and improve overall observability.
  • Scientist experience – Along with enabling scientists to train FMs using Amazon EKS as the orchestrator, SageMaker HyperPod provides additional capabilities for scientists to effortlessly train models. With the HyperPod CLI, scientists can submit training jobs by providing a .yaml file and manage jobs (list, describe, view, cancel) without needing to use kubectl. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements of up to 20%. You can also enable SageMaker HyperPod compute with Amazon EKS support using third-party tools like KubeRay, which runs on the Kubernetes API. This allows you to bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.

HyperPod compute setup and node resiliency features

In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.

Prerequisites

You need to have the following in place prior to the HyperPod compute deployment:

  • EKS cluster – You can associate HyperPod compute to an existing EKS cluster that satisfies the set of prerequisites. Alternatively, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the architecture guide for step-by-step setup instructions.
  • Custom resources – Running multi-node distributed training requires various components, such as device plugins, CSI drivers, and Training Operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to the developer guide for installation.

HyperPod compute setup

With the aforementioned resources successfully deployed, you’re now prepared to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

  • "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
  • "NodeRecovery": "Automatic" – Enables HyperPod’s automated node recovery functionality
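
If you prefer to generate the configuration programmatically rather than with a shell heredoc, the following sketch builds the same structure as the JSON example in Python. The function name and the placeholder argument values are hypothetical; substitute the ARNs, bucket, security group, and subnet from your own account.

```python
import json


def build_cluster_config(eks_cluster_arn, bucket_name, execution_role,
                         security_group, subnet_id, instance_count=4):
    """Return a dict mirroring the cluster-config.json example."""
    return {
        "ClusterName": "ml-cluster",
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "InstanceGroups": [{
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": instance_count,
            "LifeCycleConfig": {
                "SourceS3Uri": f"s3://{bucket_name}",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": execution_role,
            "ThreadsPerCore": 1,
            # Run deep health checks when instances are added.
            "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"],
        }],
        "VpcConfig": {
            "SecurityGroupIds": [security_group],
            "Subnets": [subnet_id],
        },
        # Enable automated node recovery.
        "NodeRecovery": "Automatic",
    }


# Placeholder values for illustration only.
config = build_cluster_config(
    eks_cluster_arn="<eks-cluster-arn>",
    bucket_name="<bucket-name>",
    execution_role="<execution-role-arn>",
    security_group="<security-group-id>",
    subnet_id="<subnet-id>",
)
print(json.dumps(config, indent=4))
```

Writing the printed output to cluster-config.json yields a file you can pass to the create-cluster command.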

You can create a HyperPod compute with the following aws command (you need version 2.17.47 or newer):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table 

This command displays the cluster details, including the cluster name, status, and creation time:

-----------------------------------------------------------------------------------------------------------------------
|                                                    ListClusters                                                     |
+---------------------------------------------------------------------------------------------------------------------+
||                                                 ClusterSummaries                                                  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||                           ClusterArn                           | ClusterName  | ClusterStatus  |  CreationTime    ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||  arn:aws:sagemaker:us-east-2:111111111111:cluster/wccy5z4n4m49 |  ml-cluster  |  Creating      |  1723724079.337  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|

Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.

SageMaker Console

Node resiliency features

To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:

# kubectl get nodes --show-labels=true
NAME                         ...  LABELS
hyperpod-i-023cfe933b3b34369 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
hyperpod-i-045961b6424401838 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-074b81fdb5bf52e19 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-0ae97710b3033cdb1 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
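
When a cluster has many nodes, it can help to filter this output with a script. The following Python sketch parses the comma-separated LABELS column and lists the nodes that are currently Schedulable. The helper name is illustrative, and the input is abbreviated from the sample output; in practice you would feed it the text captured from kubectl.

```python
HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"


def parse_labels(label_field: str) -> dict[str, str]:
    """Turn 'key=value,key=value' from the LABELS column into a dict."""
    labels = {}
    for item in label_field.split(","):
        item = item.strip()
        if "=" in item:
            key, value = item.split("=", 1)
            labels[key] = value
    return labels


# Node name -> LABELS column, abbreviated from the sample output.
nodes = {
    "hyperpod-i-023cfe933b3b34369":
        "beta.kubernetes.io/instance-type=ml.m5.2xlarge,"
        "sagemaker.amazonaws.com/node-health-status=Schedulable",
    "hyperpod-i-045961b6424401838":
        "beta.kubernetes.io/instance-type=ml.p5.48xlarge,"
        "sagemaker.amazonaws.com/node-health-status=Unschedulable",
}

schedulable = [name for name, field in nodes.items()
               if parse_labels(field).get(HEALTH_LABEL) == "Schedulable"]
print(schedulable)
```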

The deep health check logs are stored in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>. The log streams are logged at DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:

# Example1
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR:Bandwidth has less than threshold: Expected minimum
threshold :80,NCCL Test output Bw: 30"
}
# Example2
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"
}
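
If you export these log entries for triage, a small parser can pull out the fields that matter. The sketch below is an assumption-heavy illustration: the regular expression is inferred only from the two sample messages shown here, so real log messages may not match it, and the function name is hypothetical.

```python
import json
import re
from typing import Optional

# Pattern inferred from the two sample messages; real logs may differ.
MSG_PATTERN = re.compile(
    r"Encountered (?P<fault>\w+)\. .*?Region: (?P<region>[\w-]+),\s*"
    r"InstanceType: (?P<instance_type>[\w.]+)\.\s*ERROR:\s*(?P<error>.+)",
    re.DOTALL,
)


def summarize_failure(raw_entry: str) -> Optional[dict]:
    """Extract fault, region, instance type, and error text from one entry."""
    entry = json.loads(raw_entry)
    if entry.get("level") != "error":
        return None
    match = MSG_PATTERN.search(entry.get("msg", ""))
    return match.groupdict() if match else None


sample = json.dumps({
    "level": "error",
    "ts": "2024-08-15T21:15:22Z",
    "msg": "Encountered Unknownerror. Replace the Instance. "
           "Region: us-east-2, InstanceType: p5.48xlarge. "
           "ERROR: Crash detected in dcgm test",
})
summary = summarize_failure(sample)
print(summary)
```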

You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:

  • sagemaker.amazonaws.com/deep-health-check: InProgress
  • sagemaker.amazonaws.com/deep-health-check: Passed
  • sagemaker.amazonaws.com/deep-health-check: Failed

If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:

sagemaker.amazonaws.com/node-health-status: Schedulable

When you want to manually replace a specific node in your cluster, you can do so by manually modifying the label.

For the complete list of resilience-related Kubernetes labels, refer to the AWS documentation.

Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch stream log:

  • Example log group name: /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
  • Example log stream name: SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>

The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:

# Example1
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "KernelDeadlock",
    "with condition details ": {
        "type": "KernelDeadlock",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932213Z",
        "reason": "KernelHasNoDeadlock",
        "message": "kernel has no deadlock"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
# Example2
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "NvidiaErrorTerminate",
    "with condition details ": {
        "type": "NvidiaErrorTerminate",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932283Z",
        "reason": "NvidiaNoErrorRequiredTerminate",
        "message": "Nvidia no error required terminate"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
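
Both sample events report a status of "False", meaning the condition was checked and no problem was found. To surface only the events that signal a problem, you could filter on the status field, as in the following sketch. It assumes that a status of "True" means the condition is present, following the usual Kubernetes node condition convention; the second event in the example is hypothetical and the function name is illustrative.

```python
# Key copied from the sample log, including its trailing space.
DETAILS_KEY = "with condition details "


def active_conditions(events: list[dict]) -> list[str]:
    """Return condition types whose status is "True" (problem present)."""
    active = []
    for event in events:
        details = event.get(DETAILS_KEY, {})
        if details.get("status") == "True":
            active.append(details.get("type", "unknown"))
    return active


events = [
    # Taken from the first sample event (no problem detected).
    {"msg": "NPD caught ",
     DETAILS_KEY: {"type": "KernelDeadlock", "status": "False",
                   "reason": "KernelHasNoDeadlock"}},
    # Hypothetical event showing what a detected problem might look like.
    {"msg": "NPD caught ",
     DETAILS_KEY: {"type": "NvidiaErrorTerminate", "status": "True",
                   "reason": "<hypothetical-reason>"}},
]
print(active_conditions(events))
```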

When the deep health checks or the health monitoring agent identify issues in a node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods, and then the node is replaced or rebooted.

You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.

You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod Workshop.

Training job resiliency with the job auto resume functionality

In addition to infrastructure resiliency features, you can use the job auto resume capability with the Kubeflow Training Operator for PyTorch to recover and continue training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example from the awsome-distributed-training repository.

To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations and nodeSelector:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: fsdpjob
  namespace: kubeflow
  # config for HyperPod job auto-resume
  annotations:
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          # schedule pods only on nodes labeled as Schedulable
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
          ......

With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
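
If you generate manifests from code, the same settings can be applied to an existing PyTorchJob definition before you submit it. The following Python sketch works on a manifest that has already been loaded into a dictionary; the helper name and the minimal job used in the example are hypothetical, while the annotation and label keys are the ones described above.

```python
import copy

ANNOTATION_PREFIX = "sagemaker.amazonaws.com"


def enable_auto_resume(job: dict, max_retries: int = 2) -> dict:
    """Return a copy of a PyTorchJob manifest with auto-resume settings."""
    patched = copy.deepcopy(job)  # leave the caller's manifest untouched
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[f"{ANNOTATION_PREFIX}/enable-job-auto-resume"] = "true"
    annotations[f"{ANNOTATION_PREFIX}/job-max-retry-count"] = str(max_retries)
    replicas = patched.get("spec", {}).get("pytorchReplicaSpecs", {})
    for replica in replicas.values():
        pod_spec = replica.setdefault("template", {}).setdefault("spec", {})
        # Restrict pods to nodes labeled as Schedulable.
        pod_spec.setdefault("nodeSelector", {})[
            f"{ANNOTATION_PREFIX}/node-health-status"] = "Schedulable"
    return patched


# Minimal, hypothetical manifest for illustration.
job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "fsdpjob", "namespace": "kubeflow"},
    "spec": {"pytorchReplicaSpecs": {"Worker": {"replicas": 10}}},
}
patched = enable_auto_resume(job)
print(patched["metadata"]["annotations"])
```

Annotation values are written as strings because Kubernetes annotations are string-valued, which matches the quoted "true" and "2" in the manifest.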

Submit the PyTorchJob using the kubectl command:

kubectl apply -f fsdp.yaml

With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:

kubectl describe pytorchjob -n kubeflow <job-name>

In the event of a hardware failure, the Kubeflow training job restarts as follows:

Start Time: 2024-07-11T05:53:10Z

Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed

When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following way:

hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2

Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.

Clean up

To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion can take a few minutes. Deletion is complete when the cluster no longer appears on the SageMaker console.

Conclusion

With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources through a familiar Kubernetes interface. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS further enhances this capability by running deep health checks. Whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and, through node replacement and job resubmission, resumes training processes in the event of unexpected interruptions.

For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.


About the authors

Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the world-wide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the world-wide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment effortless for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

A review of purpose-built accelerators for financial services

Data contains information, and information can be used to predict future behaviors, from the buying habits of customers to securities returns. Businesses are seeking a competitive advantage by being able to use the data they hold, apply it to their unique understanding of their business domain, and then generate actionable insights from it. The financial services industry (FSI) is no exception to this, and is a well-established producer and consumer of data and analytics. All industries have their own nuances and ways of doing business, and FSI is no exception—here, considerations such as regulation and zero-sum game competitive pressures loom large. This mostly non-technical post is written for FSI business leader personas such as the chief data officer, chief analytics officer, chief investment officer, head quant, head of research, and head of risk. These personas are faced with making strategic decisions on issues such as infrastructure investment, product roadmap, and competitive approach. The aim of this post is to level-set and inform in a rapidly advancing field, helping to understand competitive differentiators, and formulate an associated business strategy.

Accelerated computing is a generic term that is often used to refer to specialist hardware called purpose-built accelerators (PBAs). In financial services, nearly every type of activity, from quant research, to fraud prevention, to real-time trading, can benefit from reducing runtime. By performing a calculation more quickly, the user may be able to solve an equation more accurately, provide a better customer experience, or gain an informational edge over a competitor. These activities cover disparate fields such as basic data processing, analytics, and machine learning (ML). And finally, some activities, such as those involved with the latest advances in artificial intelligence (AI), are simply not practically possible, without hardware acceleration. ML is often associated with PBAs, so we start this post with an illustrative figure. The ML paradigm is learning followed by inference. Typically, learning is offline (not streaming real-time data, but historical data) on large volumes of data, whereas inference is online on small volumes of streaming data. Learning means identifying and capturing historical patterns from the data, and inference means mapping a current value to the historical pattern. PBAs, such as graphics processing units (GPUs), have an important role to play in both these phases. The following figure illustrates the idea of a large cluster of GPUs being used for learning, followed by a smaller number for inference. The distinct computational nature of the learning and inference phases means some hardware providers have developed independent solutions for each phase, whereas others have single solutions for both phases.

As shown in the preceding figure, the ML paradigm is learning (training) followed by inference. PBAs, such as GPUs, can be used for both these steps. In this example figure, features are extracted from raw historical data and then fed into a neural network (NN). Due to model and data size, learning is distributed over multiple PBAs in an approach called parallelism. Labeled data is used to learn the model structure and weights. Unseen new streaming data is then applied to the model, and an inference (prediction) on that data is made.

This post starts by looking at the background of hardware accelerated computing, followed by reviewing the core technologies in this space. We then consider why and how accelerated computing is important for data processing. Then we review four important FSI use cases for accelerated computing. Key problem statements are identified and potential solutions given. The post finishes by summarizing the three key takeaways, and makes suggestions for actionable next steps.

Background on accelerated computing

CPUs are designed for processing small volumes of sequential data, whereas PBAs are suited for processing large volumes of parallel data. PBAs can perform some functions, such as some floating-point (FP) calculations, more efficiently than is possible by software running on CPUs. This can result in advantages such as reduced latency, increased throughput, and decreased energy consumption. The three types of PBAs are the easily reprogrammable chips such as GPUs, and two types of fixed-function acceleration: field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Fixed or semi-fixed function acceleration is practical when no updates are needed to the data processing logic. FPGAs are reprogrammable, albeit not very easily, whereas ASICs are custom designed and fully fixed for a specific application, and are not reprogrammable. As a general rule, the less user-friendly the acceleration approach, the greater the speedup. In terms of resulting speedups, the approximate order from fastest to slowest is programming hardware, then programming against PBA APIs, then programming in an unmanaged language such as C++, then a managed language such as Python. Analysis of publications containing accelerated compute workloads by Zeta-Alpha shows a breakdown of 91.5% GPU PBAs, 4% other PBAs, 4% FPGA, and 0.5% ASICs. This post is focused on the easily reprogrammable PBAs.

The recent history of PBAs begins in 1999, when NVIDIA released its first product expressly marketed as a GPU, designed to accelerate computer graphics and image processing. By 2007, GPUs became more generalized computing devices, with applications across scientific computing and industry. In 2018, other forms of PBAs became available, and by 2020, PBAs were being widely used for parallel problems, such as training of NN. Examples of other PBAs now available include AWS Inferentia and AWS Trainium, Google TPU, and Graphcore IPU. Around this time, industry observers reported NVIDIA’s strategy pivoting from its traditional gaming and graphics focus to moving into scientific computing and data analytics.

The union of advances in hardware and ML has led us to the current day. Work by Hinton et al. in 2012 is now widely referred to as ML’s “Cambrian Explosion.” Although NNs had been around since the 1960s, they had never really worked, and Hinton noted three key changes. Firstly, they added more layers to their NN, improving their performance. Secondly, there was a massive increase in the volume of labeled data available for training. Thirdly, the presence of GPUs enabled the labeled data to be processed. Together, these elements led to the start of a period of dramatic progress in ML, with NN being redubbed deep learning. In 2017, the landmark paper “Attention is all you need” was published, which laid out a new deep learning architecture based on the transformer. In order to train transformer models on internet-scale data, huge quantities of PBAs were needed. In November 2022, ChatGPT was released, a large language model (LLM) that used the transformer architecture, and is widely credited with starting the current generative AI boom.

Review of the technology

In this section, we review different components of the technology.

Parallel computing

Parallel computing refers to carrying out multiple processes simultaneously, and can be categorized according to the granularity at which parallelism is supported by the hardware. Examples include a grid of connected instances, multiple processors within a single instance, multiple cores within a single processor, PBAs, or a combination of different approaches. Parallel computing uses these multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can complete its part of the workload algorithm simultaneously. Parallelism is suited for workloads that are repetitive, fixed tasks, involving little conditional branching and often large amounts of data. It also means not all workloads are equally suitable for acceleration.

In parallel computing, the granularity of a task is a measure of the amount of communication overhead between the processing functional units. Granularity is typically split into the categories of fine-grained and coarse-grained. Fine-grained parallelism refers to a workload being split into a large number of small tasks, whereas coarse-grained refers to splitting into a small number of large tasks. The key difference between the two categories is the degree of communication and synchronization required between the processing units. A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, and is typically a component of a process. The multiple threads of a given process may be run concurrently by multithreading, while sharing resources such as memory. An application can achieve parallelism by using multithreading to split data and tasks into parallel subtasks and let the underlying architecture manage how the threads run, either concurrently on one core or in parallel on multiple cores. Here, each thread performs the same operation on different segments of memory so that they can operate in parallel. This, in turn, enables better system utilization and provides faster program execution.
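As a minimal sketch of the pattern described above, the following Python example splits a list into contiguous segments and lets each thread apply the same operation to its own segment. The function names and the sum-of-squares workload are illustrative assumptions. Note also that in standard CPython, threads running pure Python code are interleaved rather than executed truly in parallel, so this shows the structure of the approach rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, num_chunks):
    """Split a list into num_chunks contiguous, near-equal segments."""
    size, remainder = divmod(len(values), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def sum_of_squares(segment):
    """The single operation every thread applies to its own segment."""
    return sum(x * x for x in segment)


def parallel_sum_of_squares(values, num_workers=4):
    """Coarse-grained data parallelism: a few large, independent subtasks."""
    chunks = split_into_chunks(values, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    # Combining the partial results is the only synchronization point.
    return sum(partial_sums)


data = list(range(1_000))
total = parallel_sum_of_squares(data)
print(total)  # 332833500
```

Using more, smaller chunks would move this toward fine-grained parallelism, at the cost of more scheduling and result-combining overhead per unit of work.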

Purpose-built accelerators

Flynn’s taxonomy is a classification of computer architectures helpful in understanding PBAs. Two classifications of relevance are single instruction stream, multiple data streams (SIMD), and the SIMD sub-classification of single instruction, multiple thread (SIMT). SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. SIMT describes processors that are able to operate on data vectors and arrays (as opposed to just scalars), and therefore handle big data workloads efficiently. Each SIMT core has multiple threads that run in parallel, thereby giving true simultaneous parallel hardware-level execution. CPUs have a relatively small number of complex cores and are designed to run a sequence of operations (threads) as fast as possible, and can run a few tens of these threads in parallel. GPUs, in contrast, feature smaller cores and are designed to run thousands of threads in parallel in the SIMT paradigm. It is this design that primarily distinguishes GPUs from CPUs and allows GPUs to excel at regular, dense, numerical, data-flow-dominated workloads.

Suppliers of data center GPUs include NVIDIA, AMD, Intel, and others. The AWS P5 EC2 instance type range is based on the NVIDIA H100 chip, which uses the Hopper architecture. The Hopper H100 GPU (SXM5 variant) architecture includes 8 GPU processing clusters (GPCs), 66 texture processing clusters (TPCs), 2 Streaming Multiprocessors (SMs)/TPC, 528 Tensor cores/GPU, and 128 CUDA cores/SM. Additionally, it features 80 GB HBM3 GPU memory, 900 GBps NVLink GPU-to-GPU interconnect, and a 50 MB L2 cache minimizing HBM3 trips. An NVIDIA GPU is assembled in a hierarchical manner: the GPU contains multiple GPCs, and the role of each GPC is to act as a container to hold all the components together. Each GPC has a raster engine for graphics and several TPCs. Inside each TPC is a texture unit, some logic control, and multiple SMs. Inside each SM are multiple CUDA and Tensor cores, and it is here that the compute work happens. The ratio of units GPU:GPC:TPC:SM:CUDA core/Tensor core varies according to release and version. This hierarchical architecture is illustrated in the following figure.
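The per-GPU totals implied by this hierarchy follow from multiplying down the levels. The short calculation below uses only the H100 (SXM5) figures quoted in the preceding paragraph; the variable names are illustrative.

```python
# Figures quoted in the text for the H100 (SXM5 variant).
TPCS_PER_GPU = 66
SMS_PER_TPC = 2
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_GPU = 528

# Multiply down the hierarchy: TPCs -> SMs -> CUDA cores.
sms_per_gpu = TPCS_PER_GPU * SMS_PER_TPC
cuda_cores_per_gpu = sms_per_gpu * CUDA_CORES_PER_SM
# Divide the GPU-wide Tensor core count back down to a per-SM figure.
tensor_cores_per_sm = TENSOR_CORES_PER_GPU // sms_per_gpu

print(f"SMs per GPU: {sms_per_gpu}")                  # 132
print(f"CUDA cores per GPU: {cuda_cores_per_gpu}")    # 16896
print(f"Tensor cores per SM: {tensor_cores_per_sm}")  # 4
```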

SMs are the fundamental building blocks of an NVIDIA GPU, and consist of CUDA cores, Tensor cores, distributed shared memory, and instructions to support dynamic programming. When a CUDA program is invoked, work is distributed to the multithreaded SMs with available execution capacity. The CUDA core, released in 2007, is a GPU core approximately equal to a CPU core. Although it’s not as powerful as a CPU core, the CUDA core advantage is its ability to be used for large-scale parallel computing. Like a CPU core, each CUDA core still only runs one operation per clock cycle; however, the GPU SIMD architecture enables large numbers of CUDA cores to simultaneously address one data point each. CUDA cores are split into support for different precision, meaning that in the same clock cycle, multiple precision work can be done. The CUDA core is well suited for high-performance computing (HPC) use cases, but is not so well suited for the matrix math found in ML. The Tensor core, released in 2017, is another NVIDIA proprietary GPU core that enables mixed-precision computing, and is designed to support the matrix math of ML. Tensor cores support mixed FP accuracy matrix math in a computationally efficient manner by treating matrices as primitives and being able to perform multiple operations in one clock cycle. This makes GPUs well suited for data-heavy, matrix math-based, ML training workloads, and real-time inference workloads needing synchronicity at scale. Both use cases require the ability to move data around the chip quickly and controllably.

From 2010 onwards, other PBAs have started becoming available to consumers, such as AWS Trainium, Google’s TPU, and Graphcore’s IPU. While an in-depth review on other PBAs is beyond the scope of this post, the core principle is one of designing a chip from the ground up, based around ML-style workloads. Specifically, ML workloads are typified by irregular and sparse data access patterns. This means there is a requirement to support fine-grained parallelism based on irregular computation with aperiodic memory access patterns. Other PBAs tackle this problem statement in a variety of different ways from NVIDIA GPUs, including having cores and supporting architecture complex enough for running completely distinct programs, and decoupling thread data access from the instruction flow by having distributed memory next to the cores.

AWS accelerator hardware

AWS currently offers a range of 68 Amazon Elastic Compute Cloud (Amazon EC2) instance types for accelerated compute. Examples include F1 Xilinx FPGAs, P5 NVIDIA Hopper H100 GPUs, G4ad AMD Radeon Pro V520 GPUs, DL2q Qualcomm AI 100, DL1 Habana Gaudi, Inf2 powered by Inferentia2, and Trn1 powered by Trainium. In March 2024, AWS announced it will offer the new NVIDIA Blackwell platform, featuring the new GB200 Grace Blackwell chip. Each EC2 instance type has a number of variables associated with it, such as price, chip maker, Regional availability, amount of memory, amount of storage, and network bandwidth.

AWS chips are produced by our own Annapurna Labs team, a chip and software designer, which is a wholly owned subsidiary of Amazon. The Inferentia chip became generally available (GA) in December 2019, followed by Trainium GA in October 2022, and Inferentia2 GA in April 2023. In November 2023, AWS announced the next generation Trainium2 chip. By owning the supply and manufacturing chain, AWS is able to offer high levels of availability of its own chips. Available AWS Regions are shown in a table later in this section, with more Regions coming soon. Both Inferentia2 and Trainium use the same basic components, but with differing layouts, accounting for the different workloads they are designed to support. Both chips use two NeuronCore-v2 cores each, connected by a variable number of NeuronLink-v2 interconnects. The NeuronCores contain four engines: the first three are a ScalarEngine for scalar calculations, a VectorEngine for vector calculations, and a TensorEngine for matrix calculations. By analogy to an NVIDIA GPU, the first two are comparable to CUDA cores, and the latter is equivalent to Tensor cores. Finally, there is a C++ programmable GPSIMD-engine allowing for custom operations. The silicon architecture of the two chips is very similar, meaning that the same software can be used for both, minimizing changes on the user side, and this similarity can be mapped back to their two roles. In general, the learning phase of ML is typically bounded by bandwidth associated with moving large volumes of data to the chip and about the chip. The inference phase of ML is typically bounded by memory, not compute. To maximize absolute-performance and price-performance, Trainium chips have twice as many NeuronLink-v2 interconnects as Inferentia2, and Trainium instances also contain more chips per instance than Inferentia2 instances. All these differences are implemented at the server level. AWS customers such as Databricks and Anthropic use these chips to train and run their ML models.

The following figures illustrate the chip-level schematic for the architectures of Inferentia2 and Trainium.

The following table shows the metadata of three of the largest accelerated compute instances.

Instance Name | NVIDIA H100 GPU Chips | Trainium Chips | Inferentia Chips | vCPU Cores | Chip Memory (GiB) | Host Memory (GiB) | Instance Storage (TB) | Instance Bandwidth (Gbps) | EBS Bandwidth (Gbps) | PBA Chip Peer-to-Peer Bandwidth (GBps)
p5.48xlarge | 8 | 0 | 0 | 192 | 640 | 2048 | 8 x 3.84 SSD | 3,200 | 80 | 900 NVSwitch
inf2.48xlarge | 0 | 0 | 12 | 192 | 384 | 768 | EBS only | 100 | 60 | 192 NeuronLink-v2
trn1n.32xlarge | 0 | 16 | 0 | 128 | 512 | 512 | 4 x 1.9 SSD | 1,600 | 80 | 768 NeuronLink-v2

The following table summarizes performance and cost.

Instance Name | On-Demand Rate ($/hr) | 3Yr RI Rate ($/hr) | FP8 TFLOPS | FP16 TFLOPS | FP32 TFLOPS | $/hr per 1,000 TFLOPS (FP16, theoretical, 3Yr RI rate) | Source Reference
p5.48xlarge | 98.32 | 43.18 | 16,000 | 8,000 | 8,000 | $5.40 | URL
inf2.48xlarge | 12.98 | 5.19 | 2,280 | 2,280 | 570 | $2.28 | URL
trn1n.32xlarge | 24.78 | 9.29 | 3,040 | 3,040 | 760 | $3.06 | URL
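The price-performance figures in the preceding table can be reproduced from two of its other columns. The source does not state the formula, so the following is an inference from the numbers: each value matches the 3Yr RI rate divided by the theoretical FP16 throughput expressed in units of 1,000 TFLOPS. The function name is illustrative.

```python
# (3Yr RI rate in $/hr, theoretical FP16 TFLOPS), copied from the table.
instances = {
    "p5.48xlarge": (43.18, 8_000),
    "inf2.48xlarge": (5.19, 2_280),
    "trn1n.32xlarge": (9.29, 3_040),
}


def dollars_per_hour_per_1000_tflops(rate_per_hour, fp16_tflops):
    """Hourly rate divided by FP16 throughput in units of 1,000 TFLOPS."""
    return rate_per_hour / (fp16_tflops / 1_000)


for name, (rate, tflops) in instances.items():
    value = dollars_per_hour_per_1000_tflops(rate, tflops)
    print(f"{name}: ${value:.2f}")
# p5.48xlarge: $5.40
# inf2.48xlarge: $2.28
# trn1n.32xlarge: $3.06
```

Because these are theoretical peak figures, they are useful for rough comparison only; achieved throughput depends on the workload.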

The following table summarizes Region availability.

Instance Name | Number of AWS Regions Supported In | AWS Regions Supported In | Default Quota Limit
p5.48xlarge | 4 | us-east-2; us-east-1; us-west-2; eu-north-1 | 0
inf2.48xlarge | 13 | us-east-2; us-east-1; us-west-2; ap-south-1; ap-southeast-1; ap-southeast-2; ap-northeast-1; eu-central-1; eu-west-1; eu-west-2; eu-west-3; eu-north-1; sa-east-1 | 0
trn1n.32xlarge | 3 | us-east-2; us-east-1; us-west-2; eu-north-1; ap-northeast-1; ap-south-1; ap-southeast-4 | 0

After a user has selected the EC2 instance type, it can then be combined with AWS services designed to support large-scale accelerated computing use cases, including high-bandwidth networking (Elastic Fabric Adapter), virtualization (AWS Nitro Enclaves), hyper-scale clustering (Amazon EC2 UltraClusters), low-latency storage (Amazon FSx for Lustre), and encryption (AWS Key Management Service), while noting not all services are available for all instances in all Regions.

The following figure shows an example of a large-scale deployment of P5 EC2 instances, which includes UltraCluster support for 20,000 H100 GPUs, with non-blocking petabit-scale networking and high-throughput, low-latency storage. Using the same architecture, UltraCluster supports Trainium scaling to over 60,000 chips.

In summary, we see two general trends in the hardware acceleration space. Firstly, improving price-performance to handle increasing data processing volumes and model sizes, coupled with a need to serve more users, more quickly, and at reduced cost. Secondly, improving security of the associated workloads by preventing unauthorized users from being able to access training data, code, or model weights.

Accelerator software

CPUs and GPUs are designed for different types of workloads. However, CPU workloads can run on GPUs, a process called general-purpose computing on graphics processing units (GPGPU). In order to run a CPU workload on a GPU, the work needs to be reformulated in terms of graphics primitives supported by the GPU. This reformulation can be carried out manually, though it is difficult programming, requiring writing code in a low-level language to map data to graphics, process it, and then map it back. Instead, it is commonly carried out by a GPGPU software framework, allowing the programmer to ignore the underlying graphical concepts, and enabling straightforward coding against the GPU using standard programming languages such as Python. Such frameworks are designed for sequential parallelism against GPUs (or other PBAs) without requiring concurrency or threads. Examples of GPGPU frameworks are the vendor-neutral open source OpenCL and the proprietary NVIDIA CUDA.

For the Amazon PBA chips Inferentia2 and Trainium, the SDK is AWS Neuron. This SDK enables development, profiling, and deployment of workloads onto these PBAs. Neuron has various native integrations with third-party ML frameworks like PyTorch, TensorFlow, and JAX. Additionally, Neuron includes a compiler, runtime driver, as well as debug and profiling utilities. This toolset includes neuron-top for real-time visualization of the NeuronCore and vCPU utilization, host and device memory usage, and a breakdown of memory allocation; neuron-monitor, which makes this information available in JSON format; and neuron-ls for device discovery and topology information. With Neuron, users can use inf2 and trn1n instances with a range of AWS compute services, such as Amazon SageMaker, Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, AWS Batch, and AWS ParallelCluster. The usability, tooling, and integrations of the Neuron SDK have made Amazon PBAs extremely popular with users. For example, over 90% of the top 100 Hugging Face models (now over 100,000 AI models) now run on AWS using Optimum Neuron, which provides native Neuron support for Hugging Face Transformers. In summary, the Neuron SDK allows developers to easily parallelize ML algorithms, such as those commonly found in FSI. The following figure illustrates the Neuron software stack.

The CUDA API and SDK were first released by NVIDIA in 2007. CUDA offers high-level parallel programming concepts that can be compiled to the GPU, giving direct access to the GPU’s virtual instruction set and therefore the ability to specify thread-level parallelism. To achieve this, CUDA added one extension to the C language to let users declare functions that could run and compile on the GPU, and a lightweight way to call those functions. The core idea behind CUDA was to remove programmers’ barrier to entry for coding against GPUs by allowing use of existing skills and tools as much as possible, while being more user friendly than OpenCL. The CUDA platform includes drivers, runtime kernels, compilers, libraries, and developer tools. This includes a wide and impressive range of ML libraries like cuDNN and NCCL. The CUDA platform is used through compiler directives and extensions to standard languages, such as the Python cuNumeric library. CUDA has been continuously optimized over the years, using its proprietary nature to improve performance on NVIDIA hardware relative to vendor-neutral solutions like OpenCL. Over time, the CUDA programming paradigm and stack have become deeply embedded in all aspects of the ML ecosystem, from academia to open source ML repositories.

To date, alternative GPU platforms to CUDA have not seen widespread adoption. There are three key reasons for this. Firstly, CUDA has had a decades-long head start, and benefits from the network effect of its mature ecosystem, from organizational inertia, and from risk aversion to change. Secondly, migrating CUDA code to a different GPU platform can be technically difficult, given the complexity of the ML models typically being accelerated. Thirdly, CUDA has integrations with major third-party ML libraries, such as TensorFlow and PyTorch.

Despite the central role CUDA plays in the AI/ML community, users are moving to diversify their accelerated workflows by adopting a Pythonic programming layer that makes training more open. A number of such efforts are underway, including projects like Triton and OneAPI, and cloud service features such as Amazon SageMaker Neo. Triton is an open source project led by OpenAI that enables developers to use different acceleration hardware using entirely open source code. Triton uses an intermediate compiler to convert models written in supported frameworks into an intermediate representation that can then be lowered into highly optimized code for PBAs. Triton is therefore a hardware-agnostic convergence layer that hides chip differences.

Soon to be released is the AWS Neuron Kernel Interface (NKI). NKI is a Python-based programming environment designed for the compiler, which adopts commonly used Triton-like syntax and tile-level semantics. NKI provides customization capabilities to fully optimize performance by enabling users to write custom kernels, bypassing almost all of the AWS compiler layers.

OneAPI is an open source project led by Intel for a unified API across different accelerators including GPUs, other PBAs, and FPGAs. Intel believes that future competition in this space will happen for inference, where, unlike in the learning phase, there is no software dependency. To this end, OneAPI toolkits support CUDA code migration, analysis, and debug tools. Other efforts are building on top of OneAPI; for example, the Unified Acceleration Foundation’s (UXL) goal is a new open standard accelerator software ecosystem. UXL consortium members include Intel, Google, and ARM.

Amazon SageMaker is an AWS service providing an ML development environment, where the user can select chip type from the service’s fleet of Intel, AMD, NVIDIA, and AWS hardware, offering varied cost-performance-accuracy trade-offs. Amazon contributes to Apache TVM, an open source ML compiler framework for GPUs and PBAs, enabling computations on any hardware backend. SageMaker Neo uses Apache TVM to perform static optimizations on trained models for inference for any given hardware target. Looking to the future, the accelerator software field is likely to evolve; however, this may be slow to happen.

Accelerator supply-demand imbalances

It has been widely reported for the last few years that GPUs are in short supply. Such shortages have led to industry leaders speaking out. For example, Sam Altman said “We’re so short on GPUs the less people use our products the better… we don’t have enough GPUs,” and Elon Musk said “It seems like everyone and their dog is buying GPUs at this point.”

The factors leading to this have been high demand coupled with low supply. High demand has arisen from a range of sectors, including crypto mining, gaming, generic data processing, and AI. Omdia Research estimates 49% of GPUs go to the hyper-clouds (such as AWS or Azure), 27% go to big tech (such as Meta and Tesla), 20% go to GPU clouds (such as Coreweave and Lambda) and 6% go to other companies (such as OpenAI and FSI firms). The State of AI Report gives the size and owners of the largest A100 clusters, the top few being Meta with 21,400, Tesla with 16,000, XTX with 10,000, and Stability AI with 5,408. GPU supply has been limited by factors including a lack of manufacturing competition and ability at all levels in the supply chain, and restricted supply of base components such as rare metals and circuit boards. Additionally, the rate of manufacturing is slow, with an H100 taking 6 months to make. Socio-political events have also caused delays and issues, such as a COVID backlog and the sourcing of inert gases for manufacturing from Russia. A final issue impacting supply is that chip makers strategically allocate their supply to meet their long-term business objectives, which may not always align with end-users’ needs.

Supported workloads

In order to benefit from hardware acceleration, a workload needs to be parallelizable. An entire branch of science is dedicated to parallelizable problems. In The Landscape of Parallel Computing Research, 13 fields (termed dwarfs) are found to be fundamentally parallelizable, including dense and sparse linear algebra, Monte Carlo methods, and graphical models. The authors also call out a series of fields they term “embarrassingly sequential” for which the opposite holds. In FSI, one of the main data structures dealt with is time series, a series of sequential observations. Many time series algorithms have the property where each subsequent observation is dependent on previous observations. This means only some time series workloads can be efficiently computed in parallel. A moving average is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm. Sequential models, such as Recurrent Neural Networks (RNN) and Neural Ordinary Differential Equations, also have parallel implementations. In FSI, non-time series workloads are also underpinned by algorithms that can be parallelized. For example, Markowitz portfolio optimization requires the computationally intensive inversion of large covariance matrices, for which GPU implementations exist.
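To make the moving average remark concrete, the following is a minimal sketch in plain Python. It assumes a simple fixed-window mean and uses a prefix sum (scan) as the parallel-friendly formulation; the function names and the sample prices are illustrative and not taken from any referenced paper. The scan is the only step with a data dependency, and parallel scan algorithms are well established; after it, every output element can be computed independently.

```python
from itertools import accumulate

def moving_average_sequential(xs, window):
    """Rolling mean where each output updates the previous running sum."""
    out = []
    running = 0.0
    for i, x in enumerate(xs):
        running += x
        if i >= window:
            running -= xs[i - window]  # drop the value leaving the window
        if i >= window - 1:
            out.append(running / window)
    return out

def moving_average_prefix_sum(xs, window):
    """Rolling mean computed from a prefix sum.

    Given the prefix sums, each output depends on just two entries,
    so on an accelerator every output could be handled by its own thread.
    """
    prefix = [0.0] + list(accumulate(xs))
    return [(prefix[i + window] - prefix[i]) / window
            for i in range(len(xs) - window + 1)]

# Hypothetical price series, for illustration only.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
seq = moving_average_sequential(prices, 3)
par = moving_average_prefix_sum(prices, 3)
```

Both functions return the same values; the second simply exposes the independence between outputs that a GPU implementation would exploit.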

In computer science, a number can be represented with different levels of precision, such as double precision (FP64), single precision (FP32), and half-precision (FP16). Different chips support different representations, and different representations are suitable for different use cases. The lower the precision, the less storage is required, and the faster the number is to process for a given amount of computational power. FP64 is used in HPC fields, such as the natural sciences and financial modeling, resulting in minimal rounding errors. FP32 provides a balance between accuracy and speed, is used in applications such as graphics, and is the standard for GPUs. FP16 is used in deep learning where computational speed is valued, and the lower precision won’t drastically affect the model’s performance. More recently, other number representations have been developed which aim to improve the balance between acceleration and precision, such as OCP Standard FP8, Google BFloat16, and Posits. An example of a mixed representation use case is the updating of model parameters by gradient descent, part of the backpropagation algorithm, as used in deep learning. Typically this is done using FP32 to reduce rounding errors. However, in order to reduce memory load, the parameters and gradients can be stored in a 16-bit format, meaning there is a conversion requirement. In this case, BFloat16 is a good choice because it prevents float overflow errors while keeping enough precision for the algorithm to work.
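The overflow point can be illustrated with a small sketch using only the Python standard library. FP16 is handled by the `struct` module's half-precision format; BFloat16 is simulated here by truncating an FP32 encoding to its top 16 bits, which is an assumption made for simplicity (real hardware typically rounds rather than truncates). The value 70000.0 is a hypothetical gradient chosen because it exceeds the FP16 maximum of 65504.

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE half precision (FP16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16(x):
    """Simulate BFloat16 by keeping the top 16 bits of the FP32 encoding.

    BFloat16 keeps the 8 exponent bits of FP32 (same range) but only
    7 mantissa bits (less precision). Truncation is a simplification.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

value = 70000.0  # hypothetical value above the FP16 maximum

try:
    to_fp16(value)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True  # FP16 cannot represent the value

bf16_value = to_bfloat16(value)             # representable, but coarser
rel_error = abs(bf16_value - value) / value
```

The FP16 conversion fails outright, whereas the simulated BFloat16 conversion succeeds with a relative error below 1%, which is the trade-off the paragraph describes.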

As lower-precision workloads become more important, hardware and infrastructure trends are changing accordingly. For example, comparing the latest NVIDIA GB200 chip against the previous generation NVIDIA H100 chip, lower representation FP8 performance has increased 505%, but FP64 performance has only increased 265%. Likewise, in the forthcoming Trainium2 chip, the focus has been on lower-bit performance increases, giving a 400% performance increase over the previous generation. Looking to the future, we might expect to see a convergence between HPC and AI workloads, as AI starts to become increasingly important in solving what were traditionally HPC FP64 precision problems.

Accelerator benchmarking

When considering compute services, users benchmark measures such as price-performance, absolute performance, availability, latency, and throughput. Price-performance means how much compute can be done for $1, or what is the equivalent dollar cost for a given number of FP operations. For a perfect system, the price-performance ratio increases linearly as the size of a job scales up. A complicating factor when benchmarking compute grids on AWS is that EC2 instances come in a range of system parameters and a grid might contain more than one instance type, therefore systems are benchmarked at the grid level rather than on a more granular basis. Users often want to complete a job as quickly as possible and at the lowest cost; the constituent details of the system that achieves this aren’t as important.

A second benchmarking measure is absolute performance, meaning how quickly a given job can be completed independent of price. Given linear scaling, job completion time can be reduced by simply adding more compute. However, it might be that the job isn’t infinitely divisible, and that only a single computational unit is required. In this case, the absolute performance of that computational unit is important. In an earlier section, we provided a table with one performance measure, the $/TFLOP ratio based on the chip specifications. However, as a rule of thumb, when such theoretical values are compared against experimental values, only around 45% is realized.
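As a rough illustration of how the 45% rule of thumb changes a price-performance figure, consider the following sketch. The hourly price and peak throughput are hypothetical placeholders, not the price or specification of any real instance or chip, and the helper names are introduced only for this example.

```python
def cost_per_tflops_hour(hourly_price, peak_tflops, realized_fraction=1.0):
    """Dollar cost of sustaining one TFLOP/s for an hour."""
    return hourly_price / (peak_tflops * realized_fraction)

def job_cost(total_tflop, hourly_price, peak_tflops, realized_fraction=1.0):
    """Dollar cost of a job that needs `total_tflop` tera floating-point operations."""
    seconds = total_tflop / (peak_tflops * realized_fraction)
    return hourly_price * seconds / 3600.0

# Hypothetical values, for illustration only.
HOURLY_PRICE = 10.0   # dollars per hour
PEAK_TFLOPS = 100.0   # datasheet TFLOP/s at the chosen precision
REALIZED = 0.45       # rule-of-thumb fraction of peak achieved in practice

theoretical = cost_per_tflops_hour(HOURLY_PRICE, PEAK_TFLOPS)
realized = cost_per_tflops_hour(HOURLY_PRICE, PEAK_TFLOPS, REALIZED)
# A job of 1,000,000 TFLOP (one exaFLOP of work) at the realized rate.
cost_exaflop_job = job_cost(1_000_000.0, HOURLY_PRICE, PEAK_TFLOPS, REALIZED)
```

Under these assumptions the realized cost per unit of throughput is a little over twice the datasheet figure, which is why measured benchmarks matter more than specifications.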

There are a few different ways to calculate price-performance. The first is to use a standard benchmark, such as LINPACK, HPL-MxP, or MFU (Model FLOPS Utilization). These can run a wide range of calculations that are representative of varying use cases, such as general use, HPC, and mixed HPC and AI workloads. From this, the TFLOP/s at a given FP precision for the system can be measured, along with the dollar-cost of running the system. However, it might be that the user has specific use cases in mind. In this case, the best data will come from price-performance data on a more representative benchmark.

Several types of representative benchmark are commonly seen. Firstly, the user can use real production data and applications with the hardware being benchmarked. This option gives the most reliable results, but can be difficult to achieve due to operational and compliance hurdles. Secondly, the user can replicate their existing use case with a synthetic data generator, avoiding the challenges of getting production data into new test systems. Thirdly, the user can employ a third-party benchmark for the use case, if one exists. For example, STAC is a company that coordinates an FSI community called the STAC Benchmark Council, which maintains a selection of accelerator benchmarks, including A2, A3, ML and AI (LLM). A2 is designed for compute-intensive analytic workloads involved in pricing and risk management. Specifically, the A2 workload uses option price discovery by Monte Carlo estimation of Heston-based Greeks for a path-dependent, multi-asset option with early exercise. STAC members can access A2 benchmarking reports, for example EC2 c5.metal, with the oneAPI. STAC-ML benchmarks the latency of NN inference, that is, the time from receiving new input data until the model output is computed. STAC-A3 benchmarks the backtesting of trading algorithms to determine how strategies would have performed on historical data. This benchmark supports accelerator parallelism to run many backtesting experiments simultaneously, for the same security. For each benchmark, there exists a series of software packages (termed STAC Packs), which are accelerator-API specific. For some of the preceding benchmarks, STAC Packs are maintained by providers such as NVIDIA (CUDA) and Intel (oneAPI).

Some FSI market participants are performing in-house benchmarking at the microarchitecture level, in order to optimize performance as far as possible. Citadel has published microbenchmarks for NVIDIA GPU chips, dissecting the microarchitecture to achieve “bare-metal performance tuning,” noting that peak performance is inaccessible to software written in plain CUDA. Jane Street has looked at performance optimization through functional programming techniques, while PDT Partners has supported work on the Nixpkgs repository of ML packages using CUDA.

Some AWS customers have benchmarked the AWS PBAs against other EC2 instance types. ByteDance, the technology company that runs the video-sharing app TikTok, benchmarked Inf1 against a comparable EC2 GPU instance type. With Inf1, they were able to reduce their inference latency by 25%, and costs by 65%. In a second example, Inf2 is benchmarked against a comparable inference-optimized EC2 instance. The benchmark used is the RoBERTa-Base, a popular model used in natural language processing (NLP) applications, that uses the transformer architecture. In the following figure, on the x-axis we plotted throughput (the number of inferences that are completed in a set period of time), and on the y-axis we plotted latency (the time it takes the deep learning model to provide an output). The figure shows that Inf2 gives higher throughput and lower latency than the comparable EC2 instance type.

In a third benchmark example, Hugging Face benchmarked the trn1.32xlarge instance (16 Trainium chips) against two comparable EC2 instance types. For the first instance type, they ran fine-tuning for the BERT Large model on the full Yelp review dataset, using the BF16 data format with the maximum sequence length supported by the model (512). The benchmark results show the Trainium job is five times faster while being only 30% more expensive, resulting in a “huge improvement in cost-performance.” For the latter instance type, they ran three tests: language pretraining with GPT2, token classification with BERT Large, and image classification with the Vision Transformer. These results showed trn1 to be 2–5 times faster and 3–8 times cheaper than the comparable EC2 instance types.

FSI use cases

As with other industry sectors, there are two reasons why FSI uses acceleration. The first is to get a fixed result in the lowest time possible, for example parsing a dataset. The second is to get the best result in a fixed time, for example overnight parameter re-estimation. Use cases for acceleration exist across the FSI, including banking, capital markets, insurance, and payments. However, the most pressing demand comes from capital markets, because acceleration speeds up workloads and time is one of the easiest edges people can get in the financial markets. Put differently, a time advantage in financial services often equates to an informational advantage.

We begin by providing some definitions:

  • Parsing is the process of converting between data formats
  • Analytics is data processing using either deterministic or simple statistical methods
  • ML is the science of learning models from data, using a variety of different methods, and then making decisions and predictions
  • AI is an application able to solve problems using ML

In this section, we review some of the FSI use cases of PBAs. As many FSI activities can be parallelized, most of what is done in FSI can be sped up with PBAs. This includes most modeling, simulations, and optimization problems; currently in FSI, deep learning is only a small part of the landscape. We identify four classes of FSI use cases and look at applications in each class: parsing financial data, analytics on financial data, ML on financial data, and low-latency applications. To try and show how these classes relate to each other, the following figure shows a simplified representation of a typical capital markets workflow. In this figure, acceleration categories have been assigned to the workflow steps. However, in reality, every step in the process may be able to benefit from one or more of the defined acceleration categories.

Parsing

A typical capital markets workflow consists of receiving data and then parsing it into a useable form. This data is commonly market data, as output from a trading venue’s matching engine, or onward from a market data vendor. Market participants who are receiving either live or historical data feeds need to ingest this data and perform one or more steps, such as parsing the message out of a binary protocol, rebuilding the limit order book (LOB), or combining multiple feeds into a single normalized format. Any of these parsing steps that run in parallel could be sped up relative to sequential processing. To give an idea of scale, the largest financial data feed is the consolidated US equity options feed, termed OPRA. This feed comes from 18 different trading venues, with 1.5 million contracts broadcast across 96 channels, with a supported peak message rate of 400 billion messages per day, equating to approximately 12 TB per day, or 3 PB per year. As well as maintaining real-time feeds, participants need to maintain a historical repository, sometimes of several years in size. Processing of historical repositories is done offline, but is often a source of major cost. Overall, a large consumer of market data, such as an investment bank, might consume 200 feeds from across public and private trading venues, vendors, and redistributors.

Any point in this data processing pipeline that can be parallelized can potentially be sped up by acceleration. For example:

  • Trading venues broadcast on channels, which can be groupings of alphabetical tickers or products.
  • On a given channel, update messages for different tickers are broadcast sequentially. These can then be parsed out into unique streams per ticker.
  • For a given LOB, some events might be applicable to individual price levels independently.
  • Historical data is normally (but not always) independent inter-day, meaning that days can be parsed independently.
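The per-ticker case in the list above can be sketched as follows. The message layout, ticker names, and helper functions are hypothetical, and Python threads are used only to show the independence between streams; they stand in for the accelerator threads or processes a production pipeline would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, already-decoded channel messages: (sequence, ticker, price, size).
CHANNEL = [
    (1, "AAA", 10.0, 100),
    (2, "BBB", 20.0, 50),
    (3, "AAA", 10.1, 200),
    (4, "BBB", 19.9, 75),
    (5, "AAA", 10.2, 100),
]

def split_by_ticker(messages):
    """Demultiplex one channel into per-ticker streams, preserving order."""
    streams = defaultdict(list)
    for msg in messages:
        streams[msg[1]].append(msg)
    return dict(streams)

def summarize(stream):
    """Sequential work within a single ticker: last price and total size."""
    last_price = stream[-1][2]
    total_size = sum(m[3] for m in stream)
    return last_price, total_size

streams = split_by_ticker(CHANNEL)
tickers = sorted(streams)

# Streams do not depend on each other, so they can be processed concurrently.
with ThreadPoolExecutor() as pool:
    results = dict(zip(tickers, pool.map(summarize, (streams[t] for t in tickers))))
```

The same pattern applies to the inter-day case: replace the ticker key with a trading date and each day becomes an independent unit of work.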

In GPU Accelerated Data Preparation for Limit Order Book Modeling, the authors describe a GPU pipeline handling data collection, LOB pre-processing, data normalization, and batching into training samples. The authors note their LOB pre-processing relies on the previous LOB state, and must be done sequentially. For LOB building, FPGAs seem to be used more commonly than GPUs because of the fixed nature of the workload; see examples from Xilinx and Algo-Logic. For example code for a build lab, using the AWS FPGA F1 instance type, refer to the following GitHub repo.

An important part of the data pipeline is the production of features, both online and offline. Features (also called alphas, signals, or predictors) are statistical representations of the data, which can then be used in downstream model building. A current trend in the FSI prediction space is the large-scale automation of dataset ingestion, curation, processing, feature extraction, feature combination, and model building. An example of this approach is given by WorldQuant, an algorithmic trading firm. The WSJ reports “a data group scours the globe for interesting and new data sets, including everything from detailed market pricing data to shipping statistics to footfall in stores captured by apps on smartphones”. WorldQuant states that in 2007 it had two data sets and that by 2022 it had more than 1,400. The general idea is that if they could buy, consume, create, and web scrape more data than anyone else, they could create more alphas and find more opportunities. Such an approach is based on performance being proportional to √N, where N is the number of alphas. Therefore, as long as an alpha is not perfectly correlated with another, there is value in adding it to the set. In 2010, WorldQuant was producing several thousand alphas per year; by 2016 it had one million alphas, and by 2022 it had multiple millions, with a stated ambition to get to 100 million alphas. Although traditional quant finance mandates the importance of an economic rationale behind an alpha, the data-driven approach is led purely by the patterns in the data.

After alphas have been produced, they can be intelligently merged together in a time-variant manner. Examples of signal combination methodologies which can benefit from PBA speed-up include Mean Variance Optimization and Bayesian Model Averaging. The same WSJ article states “No one alpha is important. Our edge is putting things together, it’s the implementation…. The idea is that with so many ‘alphas,’ even weak signals can be useful. If counting cars in parking lots next to big box retailers has only a tiny predictive power for those retailers’ stock prices, it can still be used to enhance a bigger prediction if combined with other weak signals. For example, an uptick in cars at Walmart parking lots, itself a relatively weak signal, could combine with similar trends captured by mobile phone apps and credit-card receipts harvested by companies that scan emails to create a more reliable prediction.” The automated process of data ingestion, processing, packaging, combination, and prediction is referred to by WorldQuant as their “alpha factory.”
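The √N argument can be made concrete with a short calculation. The sketch below assumes a deliberately simplified setting that is not taken from the source: every alpha has the same stand-alone signal-to-noise ratio, all pairs share the same correlation, and the alphas are combined with equal weights. Under those assumptions the variance of the average is σ²(1 + (N − 1)ρ)/N, which gives the formula in the code.

```python
import math

def combined_ratio(single_ratio, n_alphas, correlation):
    """Signal-to-noise ratio of an equal-weighted average of n_alphas signals.

    Assumes identical stand-alone ratios and a common pairwise correlation
    (an illustrative simplification, not a model of any firm's process).
    """
    return single_ratio * math.sqrt(n_alphas / (1.0 + (n_alphas - 1) * correlation))

# Hypothetical stand-alone ratio of 0.1 for each of 100 alphas.
uncorrelated = combined_ratio(0.1, 100, 0.0)  # grows as sqrt(N)
partly_corr = combined_ratio(0.1, 100, 0.2)   # still a gain, but well below sqrt(N)
fully_corr = combined_ratio(0.1, 100, 1.0)    # perfectly correlated: no gain
```

With zero correlation the combined ratio scales exactly as √N, and it collapses to the single-alpha value only when the alphas are perfectly correlated, matching the statement that any imperfectly correlated alpha adds value.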

From examples such as those we’ve discussed, it seems clear that parallelization, speed-up and scale-up, of such huge data pipelines is potentially an important differentiator. All the way through this pipeline, activities could be accelerated using PBAs. For example, for use at the signal combination phase, the Shapley value is a metric that can be used to compute the contribution of a given feature to a prediction. Shapley value computation has PBA-acceleration support in the Python XGBoost library.
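For readers unfamiliar with the Shapley value, the following brute-force sketch shows what is being computed. It enumerates every ordering of three hypothetical signals and averages each signal's marginal contribution; the signal names and the toy value function are invented for illustration. This is not the algorithm used by XGBoost or any other library, which rely on far more efficient tree-specific methods, but it shows why the computation parallelizes well: each ordering can be evaluated independently.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:  # each ordering is independent of the others
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: t / len(orderings) for f, t in totals.items()}

def toy_value(coalition):
    """Hypothetical predictive value of a set of signals."""
    base = {"cars": 1.0, "cards": 2.0, "apps": 1.0}
    v = sum(base[f] for f in coalition)
    if {"cars", "cards"} <= coalition:
        v += 1.0  # assumed interaction between the two signals
    return v

SIGNALS = ["cars", "cards", "apps"]
phi = shapley_values(SIGNALS, toy_value)
```

The interaction term is shared equally between the two signals that produce it, and the values sum to the value of the full set, which is the defining property of the method.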

Analytics

In this section, we consider the applicability of accelerator parallelism to analytics workloads. One of the parallelizable dwarfs is Monte Carlo, and for FSI and time series work in general, this is an important method. Monte Carlo is a way to compute expected values by generating random scenarios and then averaging them. By using GPUs, a simulated path can be assigned to each thread, allowing simulation of thousands of paths in parallel.

Since the 2008 credit crunch, new regulations require banks to run credit valuation adjustment (CVA) calculations every 24 hours. CVA is an adjustment to a derivative’s price as charged by a bank to a counterparty. CVA is one of a family of related valuation adjustments collectively known as xVA, which include debt valuation adjustment (DVA), initial margin valuation adjustment (MVA), capital valuation adjustment (KVA), and funding valuation adjustment (FVA). Because this adjustment calculation can happen over large portfolios of complex, non-linear instruments, closed-form analytical solutions aren’t possible, and as such an empirical approximation by a technique such as Monte Carlo is required. The downside of Monte Carlo here is how computationally demanding it is, due to the size of the search space. The advent of this new regulation coincided with the coming of age of GPUs, and as such banks commonly use GPU grids to run their xVA calculations. In XVA principles, nested Monte Carlo strategies, and GPU optimizations, the authors find a nested simulation time of about an hour for a billion scenarios on the bank portfolio, and a GPU speedup of 100 times relative to CPUs. Rather than develop xVA applications internally, banks often use third-party independent software vendor (ISV) solutions to run their xVA calculations, such as Murex M3 or S&P Global XVA. Banking customers can choose to run such ISV software as a service (SaaS) solutions inside their own AWS accounts, and often on AWS accelerated instances.

A second use of PBAs in FSI Monte Carlo is in option pricing, especially for exotic options whose payoff is sometimes too complex to solve in closed-form. The core idea is using a random number generator (RNG) to simulate the stochastic components in a formula and then average the results, leading to the expected value. The more paths that are simulated, the more accurate the result is. In Quasi-Monte Carlo methods for calculating derivatives sensitivities on the GPU, the authors find a speedup of 200 times over CPUs, and additionally develop a number of refinements to reduce variance, leading to fewer paths needing to be simulated. In High Performance Financial Simulation Using Randomized Quasi-Monte Carlo Methods, the authors survey quasi Monte Carlo sequences in GPU libraries and review commercial software tools to help migrate Monte Carlo pricing models to GPU. In GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model, the author computes a volatility measure using Hybrid Monte Carlo (HMC) applied to realized stochastic volatility (RSV), parallelized on a GPU, resulting in a speedup of 17 times. Finally, in Derivatives Sensitivities Computation under Heston Model on GPU, the authors achieve a speedup of 200 times; however, the accuracy of the GPU method is inferior for some Greeks relative to CPU.
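The simulate-and-average idea can be sketched in a few lines of standard-library Python. To keep the result checkable, the example prices a plain European call under geometric Brownian motion, which does have a closed-form (Black-Scholes) value; this is an assumption made for illustration, since the exotic payoffs discussed above generally have no such reference. All parameter values are hypothetical.

```python
import math
import random

def mc_call_price(s0, strike, rate, vol, maturity, n_paths, seed=0):
    """European call by Monte Carlo under geometric Brownian motion."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    total = 0.0
    for _ in range(n_paths):
        # Each path depends only on its own random draw, so on a GPU
        # one path could be assigned to each thread.
        s_t = s0 * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total += max(s_t - strike, 0.0)
    return math.exp(-rate * maturity) * total / n_paths

def bs_call_price(s0, strike, rate, vol, maturity):
    """Closed-form Black-Scholes price, used only as a reference value."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)

# Hypothetical contract: spot 100, strike 100, 5% rate, 20% vol, 1 year.
reference = bs_call_price(100.0, 100.0, 0.05, 0.2, 1.0)
coarse = mc_call_price(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=1_000)
fine = mc_call_price(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=100_000)
```

The standard error of the estimate shrinks with the square root of the number of paths, so the 100,000-path estimate should sit much closer to the reference than the 1,000-path one; that slow convergence is what makes hardware parallelism attractive.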

A third use of PBAs in FSI Monte Carlo is in LOB simulations. We can categorize different types of LOB simulations: replay of the public historical data, replay of the mapped public-private historical data, replay of synthetic LOB data, and replay of a mix of historical and synthetic data to simulate the effects of a feedback loop. For each of these types of simulation, there are multiple ways in which hardware acceleration could occur. For example, for the simple replay case, each accelerator thread could have a different LOB. For the synthetic data case, each thread could have a different version of the same LOB, thereby allowing multiple realizations of a single LOB. In Limit Order Book Simulations: A Review, the authors provide their own simulator classification scheme based on the mathematical modeling technique used—point processes, agent based, deep learning, stochastic differential equations. In JAX-LOB: A GPU-Accelerated limit order book simulator to unlock large scale reinforcement learning for trading, the authors use GPU accelerated training, processing thousands of LOBs in parallel, giving a “notably reduced per message processing time.”

Machine learning

Generative AI is the most topical ML application at this point in time. Generative AI has four main applications: classification, prediction, understanding, and data generation, which in turn map to use cases such as customer experience, knowledge worker productivity, surfacing information and sentiment, and innovation and automation. FSI examples exist for all of these; however, a thorough review of these is beyond the scope of this post. For this post, we remain focused on PBA applicability and look at two of these topics: chatbots and time series prediction.

In 2017, the publication of the paper Attention is all you need resulted in a new wave of interest in ML. The transformer architecture presented in this paper allowed for a highly parallelizable network structure, meaning more data could be processed than before, allowing patterns to be better captured. This has driven impressive real-world performance, as seen by popular public foundation models (FMs) such as OpenAI ChatGPT and Anthropic Claude. These factors in turn have driven new demand for PBAs for training and inference on these models.

FMs, also termed LLMs, or chatbots when text focused, are models that are typically trained on a broad spectrum of generalized and unlabeled data and are capable of performing a wide variety of general tasks. An FSI example is the Bridgewater Associates LLM-powered Investment Analyst Assistant, which generates charts, computes financial indicators, and summarizes results. FSI LLMs are reviewed in Large Language Models in Finance: A Survey and A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges. FMs are often used as base models for developing more specialized downstream applications.

PBAs are used in three different types of FM training. The first is to train an FM from scratch. In BloombergGPT: A Large Language Model for Finance, the training dataset was 51% financial data from their systems and 49% public data, such as Wikipedia and the Pile. SageMaker was used to train and evaluate their FM. Specifically, 64 p4d.24xlarge instances were used, giving a total of 512 A100 GPUs. Also used was SageMaker model parallelism, enabling the automatic distribution of the large model across multiple GPU devices and instances. The authors started with a compute budget of 1.3 million GPU hours, and noted training took approximately 53 days.
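The figures quoted for BloombergGPT can be related to each other with simple arithmetic. The sketch below assumes that all GPUs were busy for the full wall-clock period, which is a simplification introduced here (it ignores restarts, evaluation runs, and idle time), so the result is only an order-of-magnitude check rather than a statement about how the budget was actually spent.

```python
INSTANCES = 64
GPUS_PER_INSTANCE = 8           # implied by 512 GPUs across 64 instances
BUDGET_GPU_HOURS = 1_300_000    # stated compute budget
TRAINING_DAYS = 53              # stated approximate training time

total_gpus = INSTANCES * GPUS_PER_INSTANCE

# GPU-hours consumed if every GPU ran for the whole period (assumption).
gpu_hours_used = total_gpus * TRAINING_DAYS * 24
budget_fraction = gpu_hours_used / BUDGET_GPU_HOURS

# Wall-clock days the full budget would last on this cluster.
days_to_exhaust_budget = BUDGET_GPU_HOURS / total_gpus / 24
```

Under this assumption the 53-day run corresponds to roughly half of the stated budget, and the full budget would cover a little over 100 days on the same cluster.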

The second training approach is to fine-tune an existing FM. This requires using an FM whose model parameters are exposed, and updating them in light of new data. This approach can be effective when the data corpus differs significantly from the FM training data. Fine-tuning is cheaper and quicker than training an FM from scratch, because the volume of data is likely to be much smaller. As with the larger-scale training from scratch, fine-tuning benefits significantly from hardware acceleration. In an FSI example, Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors fine-tune an FM and find that their approach outperforms standard continual pre-training performance with just 10% of the corpus size and cost, without any degradation on open-domain standard tasks.

The third training approach is to perform Retrieval Augmented Generation (RAG). To equip FMs with up-to-date and proprietary information, organizations use RAG, a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. The two-step workflow consists of ingesting data and vectorizing data, followed by runtime orchestration. Although hardware acceleration is less common in RAG applications, latency of search is a key component and as such the inference step of RAG can be hardware optimized. For example, the performance of OpenSearch, a vectorized database available on AWS, can be improved by using PBAs, with both NVIDIA GPUs and Inferentia being supported.

For these three training approaches, the role of PBAs varies. For processing the huge data volumes of FM building, PBAs are essential. Then, as the training volumes reduce, so does the value-add role of the PBA. Independent of how the model has been trained, PBAs have a key role in LLM inference, again because they are optimized for memory bandwidth and parallelism. The specifics of how to optimally use an accelerator depend on the use case. For example, a paid-for-service chatbot might be latency sensitive, whereas for a free version, a delay of a few milliseconds might be acceptable. If a delay is acceptable, then batching the queries together could help make sure a given chip’s processes are saturated, giving better dollar usage of the resource. Dollar costs are particularly important in inference, because unlike training, which is a one-time cost, inference is a recurring cost.

Using ML for financial time series prediction is nothing new; a large body of public research exists on these methods and applications dating to the 1970s and beyond. For approximately the last decade, PBAs have been applied to this field. As discussed earlier, most ML approaches can be accelerated with hardware; however, the attention-based architecture using the transformer model is currently the most topical. We consider three areas of FSI application: time series FMs, NN for securities prediction, and reinforcement learning (RL).

The initial work on LLMs was conducted on text-based models. This was followed by multi-modal models, able to handle images and other data structures. Subsequent to this, publications have started to appear on time series FMs, including Amazon Chronos, Nixtla TimeGEN-1, and Google TimesFM. The behavior of the time series models appears to be similar to that of the language models. For example, in Scaling-laws for Large Time-series Models, the authors observe the models follow the same scaling laws. A review of these models is provided in Foundation Models for Time Series Analysis: A Tutorial and Survey. As with leading LLMs, time series FMs are likely to be successfully trained on large clusters of PBAs. In terms of size, GPT-3 was trained on a cluster of 10,000 V100s. The size of the GPT-4 training cluster is not public, but the model is speculated to have been trained on a cluster of 10,000–25,000 A100s. This is analogous in size to one algorithmic trading firm’s statement, “our dedicated research cluster contains … 25,000 A/V100 GPUs (and growing fast).”

Looking to the future, one possible outcome might be that time series FMs, trained at huge expense by a few large corporates, become the base models for all financial prediction. Financial services firms then modify these FMs through additional training with private data or their own insights. Examples of private labeled data might be knowledge of which orders and executions in the public feed belonged to them, or similarly which (meta)orders and executions had parent-child relationships.

Although such financial time series FMs trained on PBA clusters may offer enhanced predictive capabilities, they also bring risks. For example, the EU’s AI Act, adopted in March 2024, states that if a model has been trained with a total compute in excess of 10^25 FLOPs, then that model is considered to pose “systemic risk” and is subject to enhanced regulation, including fines of 3% of global turnover. On this basis, Meta announced in June 2024 that they will not be enabling some models inside Europe. This legislation assumes that training compute is a direct proxy for model capabilities. EpochAI provides an analysis of the training compute required for a wide range of FMs; for example, GPT-4 took 2.1 × 10^25 FLOPs to train (exceeding the threshold by a factor of 2.1), whereas BloombergGPT took 2.4 × 10^23 FLOPs (about 0.02 times the threshold). It seems possible that in the future, similar legislation may apply to financial FMs, or even to the PBA clusters themselves, with some market participants choosing not to operate in legislative regimes that are subject to such risks.
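The threshold comparison is a one-line ratio, shown here as a minimal sketch. It reads the compute figures in the paragraph above as a threshold of 10^25 FLOPs and training estimates of 2.1 × 10^25 FLOPs for GPT-4 and 2.4 × 10^23 FLOPs for BloombergGPT; the variable and function names are illustrative only.

```python
# Training-compute threshold for "systemic risk", in total FLOPs.
EU_AI_ACT_THRESHOLD_FLOP = 1e25

# Training compute estimates as quoted in the text.
ESTIMATES_FLOP = {
    "GPT-4": 2.1e25,
    "BloombergGPT": 2.4e23,
}

def threshold_ratio(training_flop):
    """Training compute expressed as a multiple of the threshold."""
    return training_flop / EU_AI_ACT_THRESHOLD_FLOP

ratios = {name: threshold_ratio(flop) for name, flop in ESTIMATES_FLOP.items()}
exceeds = {name: ratio > 1.0 for name, ratio in ratios.items()}
```

On these figures GPT-4 sits at about 2.1 times the threshold and BloombergGPT at about 0.02 times, so only the former would fall into the enhanced-regulation category.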

Feature engineering plays a key role in building NN models, because features are fed into the NN model. As seen earlier in this post, some participants have generated large numbers of features. Examples of features derived from market time series data include bid-ask spreads, weighted mid-points, imbalance measures, decompositions, liquidity predictions, trends, change-points, and mean-reversions. Together, the features are called the feature space. A transformer assigns more importance to part of the input feature space, even though it might only be a small part of the data. Learning which part of the data is more important than another depends on the context of the features. The true power of FMs in time series prediction is the ability to capture these conditional probabilities (the context) across the feature space. To give a simple example, based on historical data, trends might reduce in strength as they go on, leading to a change-point, and then reversion to the mean. A transformer potentially offers the ability to recognize this pattern and capture the relationship between the features more accurately than other approaches. An informative visualization of this for the textual case is given by the FT article Generative AI exists because of the transformer. In order to build and train such FMs on PBAs, access to high-quality historical data tightly coupled with scalable compute to generate the features is an essential prerequisite.
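The idea that a transformer weights some features more heavily depending on context can be illustrated with scaled dot-product attention for a single query. The sketch below is a toy in plain Python: the feature names, the two-dimensional key vectors, the values, and the query are all hypothetical numbers chosen for illustration, whereas in a trained model they would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # how much importance each feature receives
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Hypothetical embeddings for three features of the feature space.
FEATURES = ["trend", "change_point", "spread"]
keys = [[1.0, 0.0], [0.9, 0.8], [0.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
query = [2.0, 0.5]  # hypothetical representation of the current context

weights, output = attend(query, keys, values)
most_attended = FEATURES[weights.index(max(weights))]
```

Changing the query changes the weights, which is the sense in which the importance of a feature is conditional on context. The dot products for different keys are independent, which is one reason the architecture maps well onto PBAs.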

Prior to the advent of the transformer, NNs have historically been applied to securities prediction with varying degrees of success. Deep Learning for Limit Order Books uses a cluster of 50 GPUs to predict the sign of the future return by mapping the price levels of the LOB to the visible input layer of a NN, resulting in a trinomial output layer. Conditional on the sign of the return, the magnitude of the return is estimated using regression. Deep Learning Financial Market Data uses raw LOB data pre-processed into discrete, fixed-length features for training a recurrent autoencoder, whose recurrent structure allows learning patterns on different time scales. Inference occurs by generating the decoded LOB, and nearest-matching that to the real-time data.

In Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units, the authors benchmark the performance of Graphcore IPUs against an NVIDIA GPU on an encoder-decoder NN model. Because encoder-decoder models rely on recurrent neural layers, they generally suffer from slow training. The authors find that the IPU offers a significant training speedup over the GPU, 694% on average, analogous to the speedup a transformer architecture would provide. In some examples of post-transformer work in this space, Generative AI for End-to-End Limit Order Book Modelling and A Generative Model Of A Limit Order Book Using Recurrent Neural Networks have trained LLM analogues on historical LOB data, interpreting each LOB event (such as insertions, cancellations, and executions) as a word and predicting the series of events following a given word history. However, the authors find the prediction horizon for LOB dynamics appears to be limited to a few tens of events, possibly because of the high dimensionality of the problem and the presence of long-range correlations in order sign. These results have been improved in the work “Microstructure Modes” — Disentangling the Joint Dynamics of Prices & Order Flow, by down-sampling the data and reducing its dimensionality, allowing identification of stable components.

RL is an ML technique where an algorithm interacts with a dynamic environment that provides feedback to the algorithm, allowing the algorithm to iteratively optimize a reward metric. Because RL closely mimics how human traders interact with the world, there are various areas of applicability in FSI. In JAX-LOB: A GPU-Accelerated limit order book simulator to unlock large scale reinforcement learning for trading, the authors use GPUs for end-to-end RL training. RL agent training with a GPU has a 7-times speedup relative to a CPU-based simulation implementation. The authors then apply this to the problem of optimal trade execution. A second FSI application of RL to optimal trade execution has been reported by JPMorgan in an algorithm called LOXM.

Latency-sensitive, real-time workloads

Being able to transmit, process, and act on data more quickly than others gives an informational advantage. In the financial markets, this is directly equivalent to being able to profit from trading. These real-time, latency-sensitive workloads exist on a spectrum, from the most sensitive to the least sensitive. The specific numbers in the following table are open to debate, but present the general idea.

Band | Latency | Application examples
1 | Less than 1 microsecond | Low-latency trading strategy. Tick 2 trade.
2 | 1–4 microseconds | Feed handler. Raw or normalized format.
3 | 40 microseconds | Normalized format and symbology.
4 | 4–200 milliseconds | Consolidated feed. Full tick.
5 | 1 second to daily | Intraday and EOD. Reference, Corp, FI, derivatives.

The most latency-sensitive use cases are typically handled by FPGAs or custom ASICs. These react to incoming network traffic, like market data, and put triggering logic directly into the network interface controller. Easily reprogrammable PBAs play little to no role in any latency-sensitive work, because their SIMD architecture is designed for parallel processing of large amounts of data, with a bandwidth bottleneck in getting data onto the chip.

However, three factors may be driving change in the role hardware acceleration plays in the low-latency space. Firstly, as PBAs mature, some of their previous barriers are being reduced. For example, NVIDIA’s new NVLink design now enables significantly higher bandwidth relative to previous chip interconnects, meaning that data can get onto the chip far more quickly than before. Comparing the latest NVIDIA GB200 chip against the previous generation NVIDIA H100 chip, NVLink performance has increased fourfold, from 900 GBps to 3.6 TBps.

Secondly, some observers believe the race for speed is shifting to a “race for intelligence.” With only approximately ten major firms competing in the top-tier low-latency space, the barrier to entry seems almost insurmountable for other parties. At some point, low-latency hardware and techniques might slowly diffuse through technology supplier offerings, eventually leveling the playing field, perhaps having been driven by new regulations.

Thirdly, although FPGAs and ASICs undoubtedly provide the fastest performance, they come at the cost of being a drain on resources. Their developers are hard to hire, the work has long deployment cycles, and it results in a significant maintenance burden with bugs that are difficult to diagnose and triage. Firms are keen to identify alternatives.

Although the most latency-sensitive work will remain on FPGA/ASIC, there may be a shift of less latency-sensitive work from FPGA/ASIC to GPUs and other PBAs as users weigh the trade-off between speed and other factors. In comparison with FPGA/ASIC, easily reprogrammable PBAs are now simple to hire for, straightforward to code against and maintain, and allow for relatively rapid innovation. Looking to the future, we may see innovation at the language level, for example, through functional programming with array-languages such as the Co-dfns project, as well as further innovation at the hardware level, with future chips tightly integrating the best components of today’s FPGAs, GPUs, and CPUs.

Key Takeaways

In this section, we present three key takeaways. Firstly, the global supply-demand ratio for GPUs is low, meaning prices can be high and availability low. This can be a constraining factor for end-user businesses wanting to innovate in this space. AWS helps address this on behalf of its customers in three ways:

  • Through economies of scale, AWS is able to offer significant availability of the PBAs, including GPUs.
  • Through in-house research and development, AWS is able to offer its own PBAs, developed and manufactured in-house, which are not subject to the constraints of the wider market, while also having optimized price-performance.
  • AWS innovates at the software level to improve allocation to the end-user. Therefore, although total capacity might be fixed, by using intelligent allocation algorithms, AWS is better able to meet customers’ needs. For example, Amazon EC2 Capacity Blocks for ML enables guaranteed access to the required PBAs at the point in time they are needed.

The second takeaway is that proprietary software can lock users in to a single supplier and end up acting as a barrier to innovation. In the case of PBAs, chips that rely on proprietary software mean that users can’t easily move between chip manufacturers, whereas open source software supports multiple chip manufacturers. Any future supply constraints, such as regional armed conflict, could further exacerbate existing supply-demand imbalances. Although migrating existing legacy workloads from an acceleration chip with proprietary software can be challenging, new greenfield workloads can be built on open source libraries without difficulty. In the FSI space, examples of legacy workloads might include risk calculations, and examples of greenfield workloads might include time series prediction using FMs. In the long term, business leaders need to consider and formulate their strategy for moving away from software lock-in, and enable access to wider acceleration hardware offerings, with the cost benefits that can bring.

The final takeaway is that financial services, and the subsection of capital markets in particular, is subject to constant and evolving competitive pressures. Over time, the industry has seen the race for differentiation move from data access rights, to latency, and now to an increased focus on predictive power. Looking to the future, if the world of financial prediction is based in part on a small number of expensive and complex FMs built and trained by a few large global corporates, where will the differentiation come from? Speculative areas could range from at-scale feature engineering to being able to better handle increased regulatory burdens. Whichever field it comes from, it is certain to include data processing and analytics at its core, and therefore benefit from hardware acceleration.

Conclusion

This post aimed to provide business leaders with a non-technical overview of PBAs and their role within the FSI. With this technology currently being regularly discussed in the mainstream media, it is essential business leaders understand the basis of this technology and its potential future role. Nearly every organization is now looking to a data-centric future, enabled by cloud-based infrastructure and real-time analytics, to support revenue-generating AI and ML use cases. One of the ways organizations will be differentiated in this race will be by making the right strategic decisions about technologies, partners, and approaches. This includes topics such as open source versus closed source, build versus buy, tool complexity and associated ease of use, hiring and retention challenges, and price-performance. Such topics are not just technology decisions within a business, but also cultural and strategic ones.

Business leaders are encouraged to reach out to their AWS point of contact and ask how AWS can help their business win in the long term using PBAs. This might result in a range of outcomes, from a short proof of concept against an existing well-defined business problem, to a written strategy document that can be consumed and debated by peers, to onsite technical workshops and business briefing days. Whatever the outcome, the future of this space is sure to be exciting!

Acknowledgements

I would like to thank the following parties for their kind input and guidance in writing this post: Andrea Rodolico, Alex Kimber, and Shruti Koparkar. Any errors are mine alone.


About the Author

Dr. Hugh Christensen works at Amazon Web Services with a specialization in data analytics. He holds undergraduate and master’s degrees from Oxford University, the latter in computational biophysics, and a PhD in Bayesian inference from Cambridge University. Hugh’s areas of interest include time series data, data strategy, data leadership, and using analytics to drive revenue generation. You can connect with Hugh on LinkedIn.

Anomaly detection in streaming time series data with online learning using Amazon Managed Service for Apache Flink

Time series data is a distinct category that incorporates time as a fundamental element in its structure. In a time series, data points are collected sequentially, often at regular intervals, and they typically exhibit certain patterns, such as trends, seasonal variations, or cyclical behaviors. Common examples of time series data include sales revenue, system performance data (such as CPU utilization and memory usage), credit card transactions, sensor readings, and user activity analytics.

Time series anomaly detection is the process of identifying unexpected or unusual patterns in data that unfold over time. An anomaly, also known as an outlier, occurs when a data point deviates significantly from an expected pattern.

For some time series, like those with well-defined expected ranges such as machine operating temperatures or CPU usage, a threshold-based approach might suffice. However, in areas like fraud detection and sales, where simple rules fall short due to their inability to catch anomalies across complex relationships, more sophisticated techniques are required to identify unexpected occurrences.

In this post, we demonstrate how to build a robust real-time anomaly detection solution for streaming time series data using Amazon Managed Service for Apache Flink and other AWS managed services.

Solution overview

The following diagram illustrates the core architecture of the Anomaly Detection Stack solution.

This solution employs machine learning (ML) for anomaly detection, and doesn’t require users to have prior AI expertise. It offers an AWS CloudFormation template for straightforward deployment in an AWS account. With the CloudFormation template, you can deploy an application stack with the necessary AWS resources required for detecting anomalies. Setting up one stack creates an application with one anomaly detection task or detector. You can set up multiple such stacks to run them simultaneously, with each one analyzing the data and reporting back the anomalies.

The application, once deployed, constructs an ML model using the Random Cut Forest (RCF) algorithm. It initially sources input time series data from Amazon Managed Streaming for Apache Kafka (Amazon MSK), using this live stream for model training. After training, the model continues to process incoming data points from the stream. It evaluates these points against the historical trends of the corresponding time series. The model also generates an initial raw anomaly score while processing and maintains an internal threshold to eliminate noisy data points. Subsequently, the model generates a normalized anomaly score for each data point that the model treats as an anomaly. These scores, ranging from 0–100, indicate the deviation from typical patterns; scores closer to 100 signify higher anomaly levels. You have the flexibility to set a custom threshold on these anomaly scores, allowing you to define what you consider anomalous.

This solution uses a CloudFormation template, which takes inputs such as MSK broker endpoint and topics, AWS Identity and Access Management (IAM) roles, and other parameters related to virtual private cloud (VPC) configuration. The template creates the essential resources like the Apache Flink application and Amazon SageMaker real-time endpoint in the customer account.

To request access to this solution, send an email to anomalydetection-support-canvas@amazon.com.

In this post, we outline how you can build an end-to-end solution with the Anomaly Detection Stack. Consider a hypothetical sales scenario where AnyBooks, an on-campus bookstore at a large university, sells various supplies to college students. Due to the timing of class schedules, their seasonality is such that they sell around 20 Item-A units and 30 Item-B units during even hours, and approximately half that during odd hours throughout the day. Recently, there have been some unexplained spikes in the quantity of items sold, and the management team wants to start tracking these quantity anomalies so that they can better plan their staffing and inventory levels.

The following diagram shows the detailed architecture for the end-to-end solution.

In the following sections, we discuss each layer shown in the preceding diagram.

Ingestion

In the ingestion layer, an AWS Lambda function retrieves sales transactions for the current minute from a PostgreSQL transactional database, transforms each record into a JSON message, and publishes it to an input Kafka topic. This Lambda function is configured to run every minute using Amazon EventBridge Scheduler.
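The transformation step of such a Lambda function can be sketched as follows. This is a minimal illustration, not the code used in the solution: the row keys (product_name, quantity, sold_at) and the message layout are assumptions, and the database query and Kafka publishing calls are deliberately left out.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Tuple


def minute_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) bounds of the minute that contains `now`."""
    start = now.replace(second=0, microsecond=0)
    return start, start + timedelta(minutes=1)


def to_message(row: dict) -> bytes:
    """Turn one sales row into a UTF-8 JSON message for the input topic.

    The row keys are hypothetical; `sold_at` is assumed to be a
    timezone-aware datetime.
    """
    payload = {
        "product_name": row["product_name"],
        "quantity": int(row["quantity"]),
        "timestamp": row["sold_at"].astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


example_row = {
    "product_name": "item-A",
    "quantity": 3,
    "sold_at": datetime(2024, 8, 29, 13, 35, 10, tzinfo=timezone.utc),
}
print(to_message(example_row).decode("utf-8"))
```

In a real function, the window returned by minute_window would bound the query against the transactional database, and each encoded message would be handed to a Kafka producer.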

Anomaly detection stack

The Flink application initiates the process of reading raw data from the input MSK topic, training the model, and commencing the detection of anomalies, ultimately recording them to the MSK output topic. The following code is the output results JSON:

{"detectorName":"canvas-ad-blog-demo-1","measure":"quantity","timeseriesId":"f3c7f14e7a445b79a3a9877dfa02064d56533cc29fb0891945da4512c103e893","anomalyDecisionThreshold":70,"dimensionList":[{"name":"product_name","value":"item-A"}],"aggregatedMeasureValue":14.0,"anomalyScore":0.0,"detectionPeriodStartTime":"2024-08-29 13:35:00","detectionPeriodEndTime":"2024-08-29 13:36:00","processedDataPoints":1261,"anomalyConfidenceScore":80.4674989791107,"anomalyDecision":0,"modelStage":"INFERENCE","expectedValue":0.0}

The following is a brief explanation of the output fields:

  • measure – This represents the metric we are tracking for anomalies. In our case, the measure field is the quantity of sales for Item-A.
  • aggregatedMeasureValue – This represents the aggregated value of quantity in the time window.
  • timeseriesId – This unique identifier corresponds to a combination of unique values for the dimensions and the metric. In this scenario, it’s the product name, Item-A, within the product_name dimension.
  • anomalyConfidenceScore – As the model evolves through learning and inference, this confidence score will progressively improve.
  • anomalyScore – This field represents the score for anomaly detection. With an anomalyDecisionThreshold set at 70, any value exceeding 70 is considered a potential anomaly.
  • modelStage – When the model is in the learning phase, the anomalyScore is 0.0 and the value of this field is set to LEARNING. After the learning is complete, the value of this field changes to INFERENCE.
  • anomalyDecisionThreshold – The decision threshold is provided as input in the CloudFormation stack. If you determine there are too many false positives, you can increase this threshold to change the sensitivity.
  • anomalyDecision – If the anomalyScore exceeds the anomalyDecisionThreshold, this field is set to 1, indicating an anomaly is detected.
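The relationship between these fields can be checked with a few lines of code. The sketch below parses a subset of the sample record shown earlier and recomputes the decision from the score and the threshold; the helper names are illustrative and not part of the solution.

```python
import json

# A subset of the fields from the sample output record shown earlier.
SAMPLE_OUTPUT = """
{"measure": "quantity",
 "anomalyDecisionThreshold": 70,
 "dimensionList": [{"name": "product_name", "value": "item-A"}],
 "aggregatedMeasureValue": 14.0,
 "anomalyScore": 0.0,
 "anomalyDecision": 0,
 "modelStage": "INFERENCE"}
"""


def expected_decision(record: dict) -> int:
    """Return 1 when anomalyScore exceeds anomalyDecisionThreshold, else 0."""
    return int(record["anomalyScore"] > record["anomalyDecisionThreshold"])


def summarize(record: dict) -> str:
    """Build a one-line, human-readable summary of an output record."""
    dims = ", ".join(f"{d['name']}={d['value']}" for d in record["dimensionList"])
    return (
        f"{record['measure']} [{dims}] value={record['aggregatedMeasureValue']} "
        f"score={record['anomalyScore']} stage={record['modelStage']} "
        f"decision={record['anomalyDecision']}"
    )


record = json.loads(SAMPLE_OUTPUT)
print(summarize(record))
```

For the sample record, the score of 0.0 is below the threshold of 70, so the recomputed decision matches the anomalyDecision value of 0 in the output.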

Transform

In the transformation layer, an Amazon Data Firehose stream is configured to consume data from the output Kafka topic and invoke a Lambda function for transformation. The Lambda function flattens the nested JSON data from the Kafka topic. The transformed results are then partitioned by date and stored in an Amazon Simple Storage Service (Amazon S3) bucket in Parquet format. An AWS Glue crawler is used to crawl the data in the Amazon S3 location and catalog it in the AWS Glue Data Catalog, making it ready for querying and analysis.
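As a rough illustration of the flattening step, the following sketch turns the nested dimensionList into top-level columns and derives a date partition value from detectionPeriodStartTime. The dimension_ column prefix and the choice of partition field are assumptions made for this example; the Firehose invocation wrapper around the function is omitted.

```python
from datetime import datetime


def flatten(record: dict) -> dict:
    """Copy scalar fields and lift each dimension into its own column."""
    flat = {key: value for key, value in record.items() if key != "dimensionList"}
    for dim in record.get("dimensionList", []):
        # Hypothetical naming convention: dimension_<name>.
        flat[f"dimension_{dim['name']}"] = dim["value"]
    return flat


def partition_date(record: dict) -> str:
    """Return the YYYY-MM-DD date used to partition the stored results."""
    start = datetime.strptime(record["detectionPeriodStartTime"], "%Y-%m-%d %H:%M:%S")
    return start.strftime("%Y-%m-%d")


nested = {
    "measure": "quantity",
    "dimensionList": [{"name": "product_name", "value": "item-A"}],
    "aggregatedMeasureValue": 14.0,
    "detectionPeriodStartTime": "2024-08-29 13:35:00",
}
print(flatten(nested))
print(partition_date(nested))
```

A flat record like this maps directly onto Parquet columns, which keeps the resulting table straightforward to crawl and query.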

Visualize

To visualize the data, we’ve created an Amazon QuickSight dashboard that connects to the data in Amazon S3 through the Data Catalog and queries it using Amazon Athena. The dashboard can be refreshed to display the latest detected anomalies, as shown in the following screenshot.

In this example, the darker blue line in the line graph represents the seasonality of the quantity measure for Item-A over time, showing higher values during even hours and lower values during odd hours. The pink line represents the anomaly detection score, plotted on the right Y-axis. The anomaly score approaches 100 when the quantity value significantly deviates from its seasonal pattern. The blue line represents the anomaly threshold, set at 70. When anomalyScore exceeds this threshold, anomalyDecision is set to 1.

The “Number of Timeseries Tracked” KPI displays how many time series the model is currently monitoring. In this case, because we’re tracking two products (Item-A and Item-B), the count is 2. The “Number of Datapoints Processed” KPI shows the total number of data points the model has processed, and the “Anomaly Confidence Score” indicates the confidence level in predicting anomalies. Initially, this score is low, but will approach 100 as the model matures over time.

Notification

Although visualization is valuable for investigating anomalies, data analysts often prefer to receive near real-time notifications for critical anomalies. This is achieved by adding a Lambda function that reads results from the output Kafka topic and analyzes them. If the anomalyScore value exceeds the defined threshold, the function invokes an Amazon Simple Notification Service (Amazon SNS) topic to send email or SMS notifications to a designated list, alerting the team about the anomaly in near real time.
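The decision logic of such a notification function might look like the following sketch. It is an assumption-laden illustration rather than the solution’s code: the publish callable is injected so the logic can be exercised without AWS access, and the subject and message formats are hypothetical.

```python
import json
from typing import Callable, Iterable, Optional


def build_alert(record: dict) -> Optional[dict]:
    """Return Subject and Message for an anomalous record, or None otherwise."""
    if record["anomalyScore"] <= record["anomalyDecisionThreshold"]:
        return None
    dims = ", ".join(f"{d['name']}={d['value']}" for d in record.get("dimensionList", []))
    return {
        "Subject": f"Anomaly detected: {record['measure']} ({dims})",
        "Message": json.dumps(record, indent=2),
    }


def notify(records: Iterable[dict], publish: Callable[..., object]) -> int:
    """Publish one alert per anomalous record and return the number sent."""
    sent = 0
    for record in records:
        alert = build_alert(record)
        if alert is not None:
            publish(**alert)
            sent += 1
    return sent


# Demonstration with a stand-in publisher that only collects the alerts.
collected = []
results = [
    {"measure": "quantity", "anomalyScore": 0.0, "anomalyDecisionThreshold": 70,
     "dimensionList": [{"name": "product_name", "value": "item-A"}]},
    {"measure": "quantity", "anomalyScore": 92.5, "anomalyDecisionThreshold": 70,
     "dimensionList": [{"name": "product_name", "value": "item-B"}]},
]
print(notify(results, lambda **kwargs: collected.append(kwargs)))
```

Inside a Lambda function, the publish argument could be the publish method of a boto3 SNS client with the TopicArn bound in advance, for example through functools.partial.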

Conclusion

This post demonstrated how to build a robust real-time anomaly detection solution for streaming time series data using Managed Service for Apache Flink and other AWS services. We walked through an end-to-end architecture that ingests data from a source database, passes it through an Apache Flink application that trains an ML model and detects anomalies, and then lands the anomaly data in an S3 data lake. The anomaly scores and decisions are visualized through a QuickSight dashboard connected to the Amazon S3 data using AWS Glue and Athena. Additionally, a Lambda function analyzes the results and sends notifications in near real time.

With AWS managed services like Amazon MSK, Data Firehose, Lambda, and SageMaker, you can rapidly deploy and scale this anomaly detection solution for your own time series use cases. This allows you to automatically identify unexpected behaviors or patterns in your data streams in real time without manual rules or thresholds.

Give this solution a try, and explore how real-time anomaly detection on AWS can unlock insights and optimize operations across your business!


About the Authors

Noah Soprala is a Solutions Architect based out of Dallas. He is a trusted advisor to his customers and helps them build innovative solutions using AWS technologies. Noah has over 20 years of experience in consulting, development, and solution architecture and delivery.

Dan Sinnreich is a Sr. Product Manager for Amazon SageMaker, focused on expanding no-code / low-code services. He is dedicated to making ML and generative AI more accessible and applying them to solve challenging problems. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.

Syed Furqhan is a Senior Software Engineer for AI and ML at AWS. He was part of many AWS service launches like Amazon Lookout for Metrics, Amazon SageMaker, and Amazon Bedrock. Currently, he is focusing on generative AI initiatives as part of Amazon Bedrock Core Systems. He is a clean code advocate and a subject-matter expert on serverless and event-driven architecture. You can follow him on LinkedIn at syedfurqhan.

Nirmal Kumar is a Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.

Generative AI-powered technology operations

Technology operations (TechOps) refers to the set of processes and activities involved in managing and maintaining an organization’s IT infrastructure and services. Several terms are used for managing information technology operations, including ITOps, SRE, AIOps, DevOps, and SysOps. In this post, we refer to all of them as TechOps. This includes tasks such as managing servers, networks, databases, and applications to maintain reliability, performance, and security of IT systems. However, certain tasks require manual and repetitive efforts such as incident detection and response, analyzing incoming tickets from disparate service providers, finding standard operating procedures for known and unknown issues, and managing support case resolution. In recent years, TechOps has been using AI capabilities (called AIOps) for operational data collection, aggregation, and correlation to generate actionable insights, identify root causes, and more.

This post describes how AWS generative AI solutions (including Amazon Bedrock, Amazon Q Developer, and Amazon Q Business) can further enhance TechOps productivity, reduce time to resolve issues, enhance customer experience, standardize operating procedures, and augment knowledge bases. The ability of generative AI technology to interpret complex situations on a nuanced, case-by-case basis implies that generative AI can solve challenges that other approaches—including traditional artificial intelligence and machine learning (AI/ML)-based pattern matching—couldn’t handle. The following table depicts a few examples of how AWS generative AI services can help with some of the day-to-day TechOps activities.

Amazon Bedrock | Amazon Q Developer | Amazon Q Business
Root cause analysis | Maintenance tasks code generation | Standard operating procedure
Knowledge base creation | Increase productivity and efficiency | Organization policy and procedure
Recurring reporting | - | Customer experience and sentiment analysis
Outbound support case generation | - | Shift handover chatbot
Inbound maintenance notifications formatting | - | -

A typical day in the life of a TechOps team includes issue resolution, root cause analysis, maintenance activities, and updating knowledge bases to provide a positive customer experience. In the following sections, we discuss some of these areas and how generative AI can help enhance TechOps.

Event management

By monitoring systems and analyzing patterns in performance data, an AI model can predict issues before they cause outages or degraded service. When incidents do occur, generative AI models can generate preliminary documentation of the event, including details on impacted systems, potential root causes, and troubleshooting steps. This allows engineers to quickly get up to speed on new incidents and accelerate response efforts.

Generative AI can also generate summary reports of past incidents to help teams identify recurring problems and opportunities for preventative measures. Furthermore, it can help with formatting inbound maintenance notifications from various service providers into a standard format, which can speed up understanding the impact of upcoming maintenance. Similarly, generative AI can automatically generate outbound cases to service providers if it detects an anomaly.

By taking over basic documentation and prediction tasks, generative AI can help infrastructure teams spend less time on repetitive work and more time resolving issues to improve overall system reliability.

To learn more about using Amazon Bedrock for summary tasks, refer to Create summaries of recordings using generative AI with Amazon Bedrock and Amazon Transcribe. To learn how Wiz uses Amazon Bedrock to address security risks, see How Wiz is empowering organizations to remediate security risks faster with Amazon Bedrock. To learn how HappyFox uses Anthropic Claude in Amazon Bedrock, refer to HappyFox Automates Support Agent Responses with Claude in Amazon Bedrock, Increasing Ticket Resolution by 40%.

Knowledge base management

Generative AI has the potential to help engineers automatically create operational documents such as standard operating procedures (SOPs) and supplemental documents covering topics such as server hardening, security policies for external IP allow lists, operating system patching, and more.

Using natural language models trained on large datasets of existing SOPs and similar content, generative AI systems can understand the common structure and language used in these types of documents. Engineers can then provide the system with high-level requirements or parameters for a new procedure, and generative AI can automatically generate a draft document formatted with the appropriate sections, level of detail, and terminology. This allows engineers to spend less time on documentation and more time focused on other engineering tasks. The initial drafts from AI also provide a strong starting point that engineers can refine.

Overall, generative AI offers a more efficient way for engineers to develop standardized procedural content at scale.

To learn how to use Amazon Bedrock to generate product descriptions, see Automating product description generation with Amazon Bedrock. Additionally, refer to How Skyflow creates technical content in days using Amazon Bedrock to learn how Skyflow Inc.—a data privacy company—uses Amazon Bedrock to streamline the creation of technical content, reducing the process from weeks to days while maintaining the highest standards of data privacy and security.

Automation

Generative AI can assist engineers and automate certain tasks that would otherwise require manual work. One area this could help in is script code generation for repetitive automation processes. By training AI models on large datasets of existing code examples for common programming tasks like file operations or system configuration, generative models can learn patterns and syntax.

An Amazon Q customization is a set of elements that enables Amazon Q to provide you with suggestions based on your company’s code base. Engineers can then provide high-level descriptions or specifications of what they need automated, such as “Generate a script to back up and archive files older than 30 days in this directory.” The AI model would be able to produce working code to accomplish this automatically based on its training. This would save engineers considerable time writing and testing scripts for routine jobs, allowing them to focus on more creative and challenging aspects of their work. As generative AI techniques advance, more complex engineering automation may also be achieved.
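For a sense of scale, the example prompt above corresponds to a script of roughly the following size. This is a hand-written sketch of what such a script could look like, not output from Amazon Q; the function name, the gzip tar format, and the choice to leave the original files in place by default are all assumptions.

```python
import tarfile
import time
from pathlib import Path
from typing import List, Optional


def archive_old_files(directory, archive_path, max_age_days: int = 30,
                      remove_originals: bool = False,
                      now: Optional[float] = None) -> List[str]:
    """Archive regular files in `directory` last modified over `max_age_days` ago.

    Returns the archived file names. An existing archive at `archive_path`
    is overwritten. Subdirectories are not traversed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    directory = Path(directory)
    archive_path = Path(archive_path)

    old_files = [
        path for path in sorted(directory.iterdir())
        if path.is_file()
        and path.resolve() != archive_path.resolve()  # never archive the archive itself
        and path.stat().st_mtime < cutoff
    ]
    if not old_files:
        return []

    with tarfile.open(archive_path, "w:gz") as tar:
        for path in old_files:
            tar.add(str(path), arcname=path.name)

    if remove_originals:
        for path in old_files:
            path.unlink()
    return [path.name for path in old_files]
```

A call such as archive_old_files("/var/log/myapp", "/backups/myapp-old.tar.gz") would archive the matching files; both paths here are placeholders. Generated scripts of this kind still need review and testing before they run against production data.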

Refer to Upgrade your Java applications with Amazon Q Code Transformation to learn about the Amazon Q Code Transformation feature. Also, refer to Using Amazon Bedrock Agents to interactively generate infrastructure as code to learn how to configure Amazon Bedrock Agents to generate infrastructure as code. Lastly, refer to TymeX Accelerates Clean Coding by 40% by Implementing Generative AI on AWS to learn how TymeX uses generative AI on AWS.

Customer experience

Generative AI can analyze large volumes of customer service data, like call logs and support tickets, and identify patterns in issues customers frequently report. This insight allows operations teams to proactively address common problems before they severely impact customers. Generative AI assistants can also automate many routine service tasks, freeing up human agents to focus on more complex inquiries that require personalization. With AI assistance, infrastructure services can be restored more quickly when outages occur. This helps make sure operations are more efficient and transparent, directly enhancing the experience for the customers that infrastructure teams aim to support.

Amazon Q Business offers a conversational experience with generative prompts and tasks that can act as a front-line support engineer, answering customer questions and resolving known issues efficiently. The feature can use data from enterprise systems to provide accurate and timely responses, reducing the burden on human engineers and improving customer satisfaction.

With Amazon Bedrock, you can perform sentiment analysis to help analyze customer emotions and provide context to human engineers, enabling them to provide better support and improve customer loyalty, retention, and growth.

Refer to Develop advanced generative AI chat-based assistants by using RAG and ReAct prompting to learn one way to develop generative AI assistants. Refer to Building a Generative AI Contact Center Solution for DoorDash Using Amazon Bedrock, Amazon Connect, and Anthropic’s Claude to learn how DoorDash built a generative AI contact center solution using AWS services. To learn how PGA TOUR built a generative AI virtual assistant, see The journey of PGA TOUR’s generative AI virtual assistant, from concept to development to prototype.

Staff productivity

An all-day infrastructure operations team faces challenges in maintaining staff productivity during off-hours and nights when the volume of support requests is lower. A generative AI assistant can help improve staff productivity in these periods and streamline the shift-handover process.

The assistant can be trained on historical support conversations to understand and resolve a large percentage of routine queries independently. It can communicate with customers on messaging platforms to provide instant assistance. Simple requests that the assistant can address free up the team to focus on complex issues requiring human expertise. The AI system can escalate any queries it can’t resolve on its own to the on-call staff. This allows the night and weekend crew to work with fewer interruptions. They can work through tasks more efficiently knowing the assistant is handling basic support needs independently. Generative AI-powered contact center solutions can improve an agent’s ability to interact with customers more precisely and speed up issue resolution, increasing overall productivity.

To learn how to automate document and data retrieval for AI assistants, see Automate chatbot for document and data retrieval using Amazon Bedrock Agents and Knowledge Bases. Refer to How LeadSquared accelerated chatbot deployments with generative AI using Amazon Bedrock and Amazon Aurora PostgreSQL to learn how LeadSquared uses Amazon Bedrock and Amazon Aurora PostgreSQL-Compatible Edition to deploy generative AI-powered assistants on their Converse platform, which personalize interactions based on customer-specific training data. This integration reduces customer onboarding costs, minimizes manual effort, and improves chatbot responses, transforming customer support and engagement by providing swift and relevant assistance.

Reporting

Generative AI has the potential to help infrastructure operations teams streamline reporting processes. By using ML algorithms trained on past report examples, a generative AI system can automatically generate draft reports based on incoming data from monitoring systems and other operational tools. This can save teams significant time spent compiling information into standardized report formats. The AI-generated reports could include summary data visualizations, descriptive analyses, and recommendations tailored to each recipient.

Teams would still need to review the drafts for accuracy before finalizing and distributing them. However, having an initial version generated automatically could cut down on routine reporting tasks so engineers have more time for higher-value problem-solving and strategic planning work. The use of AI could help infrastructure teams meet their reporting obligations more efficiently.

Amazon Q in QuickSight is your generative AI assistant that makes it straightforward to build and consume insights. For more information, see Amazon Q is now generally available in Amazon QuickSight, bringing Generative BI capabilities to the entire organization. Also, refer to Anthology uses embedded analytics offered by Amazon QuickSight to democratize decision making for higher education to learn how Anthology is using Amazon Q in QuickSight to offer institutions self-serve options for analytics needs that aren’t directly addressed by the central dashboards.

You can explore more customer stories and case studies at Generative AI Customer Stories to learn how customers are using AWS generative AI services. Refer to Derive meaningful and actionable operational insights from AWS Using Amazon Q Business to learn how to use AWS generative AI services, like Amazon Q Business, with AWS Support cases, AWS Trusted Advisor, and AWS Health data to derive actionable insights based on common patterns, issues, and resolutions while using the AWS recommendations and best practices enabled by support data.

Conclusion

Integrating generative AI into TechOps represents a transformative leap in the management and optimization of IT infrastructure and services. By using AWS generative AI solutions such as Amazon Bedrock, Amazon Q Developer, and Amazon Q Business, organizations can significantly enhance productivity, reduce the time required to resolve issues, and improve overall customer experience. Generative AI’s sophisticated capabilities in predicting and preventing outages, automating documentation, and generating actionable insights from operational data position it as a critical tool for modern TechOps teams.

You can unlock unimagined possibilities with generative AI by using the AWS Generative AI Innovation Center program, which pairs you with AWS science and strategy experts with deep experience in AI/ML and generative AI techniques. To get started, contact your AWS Account Manager. If you don’t have an AWS Account Manager, contact AWS Sales.


About the Authors

Raman Pujani is a Solutions Architect at Amazon Web Services, where he helps customers to accelerate their business transformation journey with AWS. He builds simplified and sustainable solutions for complex business problems with innovative technology. Raman has 25+ years of experience in IT Transformation. Besides work, he enjoys spending time with family, vacationing in the mountains, and music.

Rachanee Singprasong is a Principal Customer Solutions Manager in Strategic Accounts at Amazon Web Services. Her role is focused on enabling customers in their cloud adoption and digital transformation journey. She has a Ph.D. in Operations Research and her passion is to solve complex customer challenges using creative solutions.

Vijay Sivaji is a Senior Technical Account Manager in Strategic Accounts at Amazon Web Services. He helps customers in solving architectural, operational and cost optimization challenges. In his spare time he enjoys playing tennis.


Optimizing MLOps for Sustainability

Machine learning operations (MLOps) are a set of practices that automate and simplify machine learning (ML) workflows and deployments. What is MLOps provides a detailed description of this concept. As ML workloads become increasingly complex and consume more energy and resources, a growing number of companies are looking for ways to manage both the costs and the carbon footprint associated with these workloads. AWS published Guidance for Optimizing MLOps for Sustainability on AWS to help customers maximize utilization and minimize waste in their ML workloads.

In this blog post, you will learn how to optimize MLOps for sustainability.

There are three main workflows in the overall process for building, deploying and using ML models, as shown in the following figure. The process begins with data preparation, followed by model training and tuning, and then model deployment and management.

Data preparation

The workflow starts with data preparation, which includes four components: your data stream, an Amazon SageMaker Processing job, Amazon SageMaker Feature Store, and an Amazon Simple Storage Service (Amazon S3) bucket for raw data, as shown in the following figure.

Data preparation is essential for model training and is also the first phase in the MLOps lifecycle. Optimizing the artificial intelligence and machine learning (AI/ML) data preparation workload on AWS with sustainability best practices helps reduce the carbon footprint and the cost.

The data preparation process can be complex and energy-intensive because of the vast amount of data processing and computations involved. This leads to substantial resource consumption. There are a few things to consider that can help reduce energy consumption.

Start with the AWS Region you choose for your workload. If possible, choose a Region that has low carbon intensity or where the electricity is attributed to 100% renewable energy sources. In addition, consider storing data and training models in the same Region if possible. This reduces the data movement and latency across the network, optimizing the networking resources required.

Using a serverless architecture can help further reduce resource consumption and remove maintenance overhead by provisioning resources only when required. It’s also important to avoid duplicating and rerunning code across teams. Services such as Amazon SageMaker Feature Store help achieve this goal. Finally, choosing the right storage type for the data used for model training can limit the carbon impact of your workload.

For example, by using S3 One Zone-Infrequent Access to store data that isn’t frequently accessed, such as test data and training data, you can optimize the carbon impact of the data stored. Also, using S3 Intelligent-Tiering can help move the data to more energy-efficient tiers based on access patterns.
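As a minimal sketch of the storage guidance above, the following Python builds an S3 lifecycle configuration that moves objects under a prefix to S3 One Zone-IA. The helper name, the `training-data/` prefix, and the bucket name in the closing comment are hypothetical; the dictionary follows the shape that the boto3 `put_bucket_lifecycle_configuration` call accepts, and the 30-day minimum reflects the S3 rule for transitions into the infrequent access classes.

```python
import json


def build_lifecycle_configuration(prefix, one_zone_ia_after_days=30):
    """Return an S3 lifecycle configuration that moves objects under
    `prefix` to S3 One Zone-IA after the given number of days."""
    if one_zone_ia_after_days < 30:
        # S3 does not transition objects to One Zone-IA before 30 days.
        raise ValueError("One Zone-IA transitions require at least 30 days")
    rule_id = "move-" + prefix.strip("/").replace("/", "-") + "-to-onezone-ia"
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {
                        "Days": one_zone_ia_after_days,
                        "StorageClass": "ONEZONE_IA",
                    }
                ],
            }
        ]
    }


# Hypothetical prefix holding infrequently accessed training data.
config = build_lifecycle_configuration("training-data/")
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the dictionary
# could be applied like this (bucket name is an assumption):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-ml-bucket", LifecycleConfiguration=config)
```

Setting `StorageClass` to `INTELLIGENT_TIERING` instead would hand the tiering decision to S3 based on observed access patterns.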

Model training and tuning

The second area for you to consider is model training and tuning, shown in the following figure.

While data preparation isn’t unique to AI/ML workloads, the model training and tuning workflow is specific to AI/ML. It’s an important step in making the models functionally useful while also reducing the resources required to run them at scale. There are costs in terms of both operations and sustainability. The good news is that optimizing for sustainability also helps optimize operations.

For example, SageMaker provides the model parallel library to help efficiently distribute and train models on multiple compute nodes. The library has multiple features that can be combined to more efficiently train models from relatively small parameter sets up to sets with hundreds of billions of parameters. The library can also help use the features of Elastic Fabric Adapter (EFA) supported devices to maximize throughput and minimize latency across nodes. Further optimization is possible using SageMaker Training Compiler to compile deep learning models for training on supported GPU instances. SageMaker Training Compiler converts deep learning models from high-level language representation to hardware-optimized instructions. Hardware-optimized instructions can speed up model training by up to 50% by more efficiently using the GPU memory and using a larger batch size per iteration, all without altering the final trained model.

To reduce the time and energy required to tune a model, SageMaker automatic model tuning (AMT) runs multiple training jobs on a given dataset; it then uses the results to converge on a set of hyperparameter values to create the best performing model for a given metric. There are multiple approaches to the process of searching for the right hyperparameter ranges. For example, Bayesian optimization typically requires 10 times fewer jobs to find the best set of values compared to other methods, reducing the resource usage and carbon footprint of the process.
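To illustrate how a Bayesian search can be capped so it doesn't launch more training jobs than needed, here is a rough sketch of the tuning job configuration. The helper name, the `validation:loss` metric, the `learning_rate` range, and the job limits are all assumptions for illustration; the dictionary keys follow the `HyperParameterTuningJobConfig` structure of the SageMaker `CreateHyperParameterTuningJob` API.

```python
def build_tuning_job_config(metric_name, max_jobs=20, max_parallel_jobs=2):
    """Return a tuning job configuration that uses Bayesian search and
    caps the number of training jobs the tuner may launch."""
    if max_parallel_jobs > max_jobs:
        raise ValueError("max_parallel_jobs cannot exceed max_jobs")
    return {
        # Bayesian search uses earlier results to pick the next candidates.
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Minimize",
            "MetricName": metric_name,
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel_jobs,
        },
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {
                    # Hypothetical hyperparameter and range.
                    "Name": "learning_rate",
                    "MinValue": "0.0001",
                    "MaxValue": "0.1",
                    "ScalingType": "Logarithmic",
                }
            ]
        },
        # Let the tuner stop jobs that are unlikely to improve the objective.
        "TrainingJobEarlyStoppingType": "Auto",
    }


tuning_config = build_tuning_job_config("validation:loss")
print(tuning_config["Strategy"], tuning_config["ResourceLimits"])
```

Keeping `MaxParallelTrainingJobs` low gives the Bayesian strategy more completed results to learn from before it selects the next set of values, at the cost of a longer wall-clock tuning run.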

Right-sizing is another method for managing resource usage and minimizing the environmental impact of your workloads. Amazon SageMaker Debugger helps to optimize resource consumption by detecting under-utilization of system resources, identifying training problems, and using built-in rules to monitor and stop training jobs as soon as bugs are detected.

Data pre- and post-processing and model evaluation tasks can be run as Amazon SageMaker Processing jobs. In addition to evaluating the accuracy of your models, processing jobs help you to make informed decisions about the tradeoffs between a model’s accuracy and its carbon footprint. Thus, you can establish performance criteria that support your sustainability goals while meeting your business requirements. SageMaker Processing also provides Amazon CloudWatch logs and metrics that can be used for monitoring and right-sizing jobs based on CPU, memory, GPU, GPU memory, and disk metrics.

Dedicated Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances provide both efficiency and environmental benefits for running your training jobs. These instances use Trainium processors: purpose-built chips designed specifically for deep learning training of models that can exceed 100 billion parameters. Each Trn1 instance provides up to 16 Trainium accelerators, ensuring that jobs will be both efficient and cost optimized. EC2 Trn1 instances offer up to 52% cost-to-train savings compared to comparable EC2 instance types.

Next, you can use governance to share information about the environmental impact of your model. Amazon SageMaker Model Cards provide versioned records documenting various aspects and attributes of your model. This allows you to share the intended uses and assessed carbon impact of a model so that data scientists, ML engineers, and other teams can make informed decisions when choosing and running models.

Model deployment and management

The last area of MLOps is deployment and management, shown in the following figure.

Automating the deployment of ML models provides several sustainability benefits. A deployed model can use a lot of resources each time data or code is updated and the model is retrained. You want to ensure that the deployed model is as efficient as possible to reduce the carbon footprint of the workload.

One approach is to use Amazon SageMaker Model Registry. This feature helps improve sustainability and resource optimization by providing a centralized repository for cataloging ML models and reducing redundancy. This approach improves model reusability by allowing existing models to be fine-tuned, rather than training new models from scratch. Consider running your deployment code using AWS CodePipeline to ensure repeatability and version control and optimize resource utilization by running only the necessary stages in the pipeline. This helps your workloads remove the waste associated with manual processes and supports incremental improvements over time.

If your workloads can tolerate latency, consider deploying your model on Amazon SageMaker Asynchronous Inference with auto-scaling groups. This can help minimize idle resources and reduce the impact of load spikes. This also means you pay for compute only when the endpoint is actively handling inference requests. Alternatively, if you don’t need real-time inference, use batch transform. Unlike persistent endpoints, clusters are decommissioned when a batch transform job is complete. Batch transform automatically partitions large datasets and distributes workloads across compute to ensure efficient resource utilization.

To simplify deployment and management and increase resource utilization, use multi-model endpoints instead of separate endpoints for each model. Candidates for this approach include models with different data formats, such as recommendation systems that process text and images using separate endpoints, or a variety of models built with PyTorch, Scikit-learn, and TensorFlow. Automatic scaling can amplify resource optimization for your hosted models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. This helps you avoid unnecessary cost and consume less energy and fewer resources. If your workload has intermittent or unpredictable traffic with idle periods between traffic peaks and can tolerate cold starts, use Amazon SageMaker Serverless Inference endpoints, which automatically launch compute resources and scale depending on traffic. Optionally, you can use Provisioned Concurrency with Serverless Inference when you have predictable bursts in your traffic.
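The auto scaling behavior described above can be sketched as two request payloads for Application Auto Scaling: one that registers an endpoint variant as a scalable target, and one that attaches a target tracking policy. The helper name, endpoint name, variant name, capacity limits, target value, and cooldowns are illustrative assumptions; the keys mirror the boto3 `application-autoscaling` client's `register_scalable_target` and `put_scaling_policy` parameters.

```python
def build_endpoint_scaling_requests(endpoint_name, variant_name,
                                    min_instances, max_instances,
                                    invocations_per_instance):
    """Return (scalable_target, scaling_policy) request dictionaries for
    a SageMaker endpoint variant."""
    if min_instances > max_instances:
        raise ValueError("min_instances cannot exceed max_instances")
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    scalable_target = {
        **common,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scale to keep invocations per instance near this value.
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            # Cooldowns in seconds; assumed values, tune for your traffic.
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    }
    return scalable_target, scaling_policy


# Hypothetical endpoint and variant names.
target, policy = build_endpoint_scaling_requests(
    "my-endpoint", "AllTraffic", 1, 4, 100)
print(target["ResourceId"])

# With boto3 installed and credentials configured:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```

A longer scale-in cooldown than scale-out cooldown is one common choice: it adds capacity quickly under load but avoids removing instances during brief lulls.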

AWS offers a few different options to better utilize your resources and lower emissions when working with inference workloads. AWS Inferentia is designed to deliver high performance at the lowest cost in EC2 instances for your deep learning and generative AI inference applications. AWS Inferentia is built for sustainability and provides up to 50% better performance per watt over comparable EC2 instances. You can further optimize resource utilization by combining AWS Inferentia and Amazon Elastic Inference to attach the right amount of GPU-powered inference acceleration to any EC2 or SageMaker instance type.

After training a model for high accuracy, developers often turn to more expensive large instances with lots of memory and processing power to achieve better throughput. You can reduce resource usage and avoid the need for more powerful instances by using pre-trained models and compiling them into optimized executables that can be hosted in SageMaker or edge devices for inference with Amazon SageMaker Neo.

Monitoring CPU, memory, and GPU resource utilization is critical to optimize model performance and avoid wasted resources. AWS offers a variety of tools that you can use to optimize MLOps for sustainability, such as CloudWatch, Amazon SageMaker Inference Recommender, and SageMaker Model Monitor. Inference Recommender helps you choose the optimal instance type and configuration for ML models and workloads. You can use SageMaker Model Monitor to automate drift detection of your ML model in production, and only retrain it when prediction performance drops below predetermined key performance indicators (KPIs). This approach improves operational efficiency and retrains the model based on your business metrics.
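The "retrain only when a KPI is breached" idea can be expressed as a small gate in front of the retraining pipeline. The following is a plain Python sketch, not the SageMaker Model Monitor API: the function name, the accuracy scores, the 0.85 threshold, and the minimum sample count are all hypothetical values chosen for illustration.

```python
def should_retrain(recent_scores, kpi_threshold, min_samples=3):
    """Return True only when enough monitoring results exist and their
    average falls below the agreed KPI threshold."""
    if len(recent_scores) < min_samples:
        # Too little evidence; avoid spending compute on a retrain.
        return False
    average = sum(recent_scores) / len(recent_scores)
    return average < kpi_threshold


# Hypothetical accuracy readings from scheduled monitoring runs.
healthy = [0.91, 0.90, 0.92]
degraded = [0.80, 0.82, 0.79]

print(should_retrain(healthy, kpi_threshold=0.85))
print(should_retrain(degraded, kpi_threshold=0.85))
```

Averaging over several monitoring runs, rather than reacting to a single low reading, is one way to keep a noisy metric from triggering retraining jobs that consume resources without improving the model.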

Conclusion

Sustainability and ML are redefining how many companies deliver value for their customers. Incorporating sustainability into the design, development and deployment of ML models is a crucial long-term consideration. AWS is investing in the sustainability of the cloud and providing resources to assist customers in transforming their workloads to be more energy efficient. In this post, we have reviewed the Guidance for Optimizing MLOps for Sustainability on AWS, providing service-specific practices to understand and reduce the environmental impact of these workloads. MLOps consists of several distinct phases that can be independently optimized for sustainability. Regular reviews using tools such as AWS Well-Architected Machine Learning Lens help you identify optimization opportunities and provide a mechanism for you to meet your sustainability goals.


About the Authors

Archana Srinivasan is a Senior Technical Account Manager within Enterprise Support at Amazon Web Services (AWS). Archana provides strategic technical guidance for independent software vendors (ISVs) to innovate and operate their workloads efficiently on AWS.

Chris Procunier is a Senior Technical Account Manager at AWS, based out of Washington DC. He has been managing systems and infrastructure for 25 years as an entrepreneur, IT Director and architect. Outside of work Chris is passionate about family, friends, music, cooking and cycling.

Meghana Reddy is a Technical Account Manager at AWS, where she offers strategic technical guidance to Independent Software Vendors (ISVs) for optimizing their workloads on AWS. She is passionate about environmental sustainability and actively promotes sustainable practices within the cloud.

Steven David is a Principal Solutions Architect at Amazon Web Services (AWS). He has over 20 years of experience designing solutions for large enterprises. Through these engagements he has developed deep expertise in application development technologies and methodologies.
