NVIDIA Open-Sources cuOpt, Ushering in New Era of Decision Optimization

Every second, businesses worldwide are making critical decisions. A logistics company decides which trucks to send where. A retailer figures out how to stock its shelves. An airline scrambles to reroute flights after a storm. These aren’t just routing choices — they’re high-stakes puzzles with millions of variables, and getting them wrong costs money and, sometimes, customers.

That’s changing.

NVIDIA today announced it will open-source cuOpt, an AI-powered decision optimization engine — making the powerful software free for developers to unlock real-time optimization at an unprecedented scale.

Optimization ecosystem leaders COPT, the Xpress team at FICO, HiGHS, IBM and SimpleRose are integrating or evaluating cuOpt, accelerating decision-making across industries.

Gurobi Optimization is evaluating and testing cuOpt solvers to refine first-order algorithms for next-level performance.

NVIDIA is working with the COIN-OR Foundation to release cuOpt through COIN-OR, which is widely regarded as the oldest and largest open-source repository for operations research software.

Meanwhile, researchers at Arizona State University, Cornell Tech, Princeton University, the University of Pavia and the Zuse Institute Berlin are exploring its capabilities, developing next-generation solvers and tackling complex optimization problems with exceptional speed.

With the technology, airlines can reconfigure flight schedules mid-air to prevent cascading delays, power grids can rebalance in real time to avoid blackouts and financial institutions can manage portfolios with up-to-the-moment risk analysis.

Faster Optimization, Smarter Decisions

The best-known AI applications are all about predictions — whether forecasting weather or generating the next word in a sentence. But prediction is only half the challenge. The real power comes from acting on information in real time.

That’s where cuOpt comes in.

cuOpt dynamically evaluates billions of variables — inventory levels, factory output, shipping delays, fuel costs, risk factors and regulations — and delivers the best move in near real time.

As AI agents and large language model-driven simulations take on more decision-making tasks, the need for instant optimization has never been greater. cuOpt, powered by NVIDIA GPUs, accelerates these computations by orders of magnitude.

Unlike traditional optimization methods that navigate solution spaces sequentially or with limited parallelism, cuOpt taps into GPU acceleration to evaluate millions of possibilities simultaneously — finding optimal solutions orders of magnitude faster on many problem instances.

It doesn’t replace existing techniques — it enhances them. By working alongside traditional solvers, cuOpt rapidly identifies high-quality solutions, helping CPU-based models discard bad paths faster.

Why Optimization Is So Hard — and How cuOpt Does It Better

Every decision — where to send a truck, how to schedule workers and when to rebalance power grids — is a puzzle with an exponential number of possible answers.

To put this into perspective, the number of possible ways to schedule 100 nurses in a hospital for the next month is greater than the number of atoms in the observable universe.

Many traditional solvers search for solutions sequentially or with limited parallelism — like navigating a vast maze with a flashlight, one corridor at a time. cuOpt rewrites the rules by intelligently evaluating millions of possibilities at once, accelerating optimization by orders of magnitude.

For years, workforce scheduling, logistics routing and supply-chain planning all took hours — sometimes days — to compute.

NVIDIA cuOpt changes that — the numbers tell the story:

  • Linear programming acceleration: 70x faster on average than a CPU-based PDLP solver on large-scale benchmarks, with a 10x to 3,000x speedup range.
  • Mixed-integer programming (MIP): 60x faster MIP solves, as demonstrated by SimpleRose.
  • Vehicle routing: 240x speedup in dynamic routing, enabling cost-to-serve insights and near-real-time route adjustments, as demonstrated by Lyric.

Decisions that once took hours or days now take seconds.
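To make the class of problem concrete, here is a minimal, hypothetical linear program of the kind these benchmarks measure: a tiny shipping-cost minimization, solved with SciPy's open-source HiGHS backend (one of the ecosystem projects named above). cuOpt's own API is not shown in this article, and the costs and capacities below are invented for illustration.

# Illustrative only: a small linear program of the class these benchmarks measure,
# solved with SciPy's HiGHS backend. All numbers are invented for the example.
import numpy as np
from scipy.optimize import linprog

c = np.array([4.0, 6.0, 9.0])        # cost per unit shipped on three routes
A_ub = np.array([[1.0, 1.0, 0.0],    # truck 1 serves routes 1 and 2
                 [0.0, 1.0, 1.0]])   # truck 2 serves routes 2 and 3
b_ub = np.array([100.0, 80.0])       # truck capacities
A_eq = np.array([[1.0, 1.0, 1.0]])   # total demand must be met exactly
b_eq = np.array([120.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)                # optimal shipment plan and its total cost

Real-world instances involve millions of such variables and constraints; that scale is what GPU-accelerated solvers such as cuOpt target.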

Optimizing for a Better World

Better optimization doesn’t just make businesses more efficient — it makes the world more sustainable, resilient and equitable.

Smarter decision-making leads to less waste. Energy grids can distribute power more efficiently, reducing blackouts and seamlessly integrating renewables like wind and solar. Supply chains can adjust dynamically to minimize excess inventory, cutting both costs and emissions.

Hospitals in underserved regions can allocate beds, doctors and medicine in real time, helping lifesaving treatments reach patients faster. Humanitarian aid groups responding to disasters can instantly recalculate the best way to distribute food, water and medicine, reducing delays in critical moments. And public transit systems can adjust dynamically to demand, reducing congestion and travel times for millions of people.

cuOpt isn’t just about more hardware — it’s about smarter search. Instead of going through every possibility, cuOpt intelligently navigates massive search spaces, focusing on constraint edges to converge faster. By using GPU acceleration, it evaluates multiple solutions in parallel, delivering real-time, high-efficiency optimization.

Industry Support — a New Era for Decision Intelligence

Optimization leaders such as FICO, Gurobi Optimization, IBM and SimpleRose are exploring the benefits of GPU acceleration and evaluating how cuOpt could fit into their workflows, in applications spanning industrial planning, supply chain management and scheduling.

Smarter Decisions, Stronger Systems, Better Outcomes

cuOpt redefines optimization at scale.

For businesses, it means AI-powered optimization can reconfigure schedules, route fleets and reallocate resources in real time — cutting costs and boosting agility.

For developers, it provides a high-performance AI toolkit that can solve decision problems up to 3,000x faster than CPU solvers in complex optimization challenges such as network data routing — optimizing the flow of video, voice, and web traffic to reduce congestion and improve efficiency — or electricity distribution, balancing supply and demand across power grids while minimizing losses and ensuring stable transmission.

For researchers, it’s an open playground for pushing AI-driven decision-making to new frontiers.

cuOpt will be released as open source and freely available for developers, researchers and enterprises later this year.

See cuOpt in Action

Explore real-world applications of cuOpt in sessions at NVIDIA GTC.

For enterprise production deployments, cuOpt is supported as part of the NVIDIA AI Enterprise software platform and can be deployed as an NVIDIA NIM microservice — making it easy to integrate, scale and deploy across cloud, on-premises and edge environments.

With its open-source release, developers will be able to easily access, modify and integrate the cuOpt source code into their own solutions.

Learn more about how companies are already transforming their operations with cuOpt and sign up to be notified when the open-source software is available.


Read More

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

SQL is one of the key languages widely used across businesses, and it requires an understanding of databases and table metadata. This can be overwhelming for nontechnical users who lack proficiency in SQL. Today, generative AI can help bridge this knowledge gap for nontechnical users to generate SQL queries by using a text-to-SQL application. This application allows users to ask questions in natural language and then generates a SQL query for the user’s request.

Large language models (LLMs) are trained to generate accurate SQL queries for natural language instructions. However, off-the-shelf LLMs can’t be used without some modification. First, LLMs don’t have access to enterprise databases, so the models need to be customized to understand the specific database of an enterprise. Additionally, column synonyms and internal business metrics add complexity that a general-purpose model can’t resolve on its own.

The limitation of LLMs in understanding enterprise datasets and human context can be addressed using Retrieval Augmented Generation (RAG). In this post, we explore using Amazon Bedrock to create a text-to-SQL application with RAG. We use Anthropic’s Claude 3.5 Sonnet model to generate SQL queries and Amazon Titan Text Embeddings for vector embeddings, accessing both models through Amazon Bedrock.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

This solution is primarily based on the following services:

  1. Foundation model – We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as our LLM to generate SQL queries for user inputs.
  2. Vector embeddings – We use Amazon Titan Text Embeddings v2 on Amazon Bedrock for embeddings. Embedding is the process by which text, images, and audio are given a numerical representation in a vector space, usually by a machine learning (ML) model. The following diagram provides more details about embeddings; a short code sketch after this list shows one way to generate them.
  3. RAG – We use RAG for providing more context about table schema, column synonyms, and sample queries to the FM. RAG is a framework for building generative AI applications that can make use of enterprise data sources and vector databases to overcome knowledge limitations. RAG works by using a retriever module to find relevant information from an external data store in response to a user’s prompt. This retrieved data is used as context, combined with the original prompt, to create an expanded prompt that is passed to the LLM. The language model then generates a SQL query that incorporates the enterprise knowledge. The following diagram illustrates the RAG framework.
  4. Streamlit – This open source Python library makes it straightforward to create and share beautiful, custom web apps for ML and data science. In just a few minutes you can build powerful data apps using only Python.
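To make the embedding step concrete, the following minimal sketch calls Amazon Titan Text Embeddings v2 directly through the Bedrock Runtime API. The model ID matches the one used later in this post; the request and response field names follow the Titan embeddings schema and should be verified against the current Amazon Bedrock documentation.

# Minimal sketch: turn a schema description into a vector with Titan Text Embeddings v2.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text):
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]  # list of floats representing the text

vector = embed_text("Table schema_a.orders stores orders placed by customers.")
print(len(vector))  # embedding dimensionality (1,024 by default for Titan v2)

In library.py later in this post, the same model is invoked through LangChain’s BedrockEmbeddings wrapper instead.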

The following diagram shows the solution architecture.


We need to provide the LLM with information about the enterprise-specific database. This makes sure the model correctly understands the database and generates responses tailored to the enterprise’s data schema and tables. There are multiple file formats available for storing this information, such as JSON, PDF, TXT, and YAML. In our case, we created JSON files to store table schema, table descriptions, columns with synonyms, and sample queries. JSON’s inherently structured format allows for clear and organized representation of complex data such as table schemas, column definitions, synonyms, and sample queries. This structure facilitates quick parsing and manipulation of data in most programming languages, reducing the need for custom parsing logic.

There can be multiple tables with similar information, which can lower the model’s accuracy. To increase the accuracy, we categorized the tables into four different types based on the schema and created four JSON files to store the different tables. We added a dropdown menu with four choices. Each choice represents one of these four categories and is linked to an individual JSON file. After the user selects a value from the dropdown menu, the relevant JSON file is passed to Amazon Titan Text Embeddings v2, which converts the text into embeddings. These embeddings are stored in a vector database for faster retrieval.

We added the prompt template to the FM to define the roles and responsibilities of the model. You can add additional information such as which SQL engine should be used to generate the SQL queries.

When the user provides the input through the chat prompt, we use similarity search to find the relevant table metadata from the vector database for the user’s query. The user input is combined with the relevant table metadata and the prompt template, and this combined input is passed to the FM. The FM generates the SQL query based on the final input.

To evaluate the model’s accuracy and keep an audit trail, we store every user input and output in Amazon Simple Storage Service (Amazon S3).

Prerequisites

To create this solution, complete the following prerequisites:

  1. Sign up for an AWS account if you don’t already have one.
  2. Enable model access for Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
  3. Create an S3 bucket named simplesql-logs-****, replacing **** with a unique identifier. Bucket names must be globally unique across Amazon S3.
  4. Choose your testing environment. We recommend that you test in Amazon SageMaker Studio, although you can use other available local environments.
  5. Install the following libraries to execute the code:
    pip install streamlit
    pip install jq
    pip install openpyxl
    pip install "faiss-cpu"
    pip install langchain

Procedure

There are three main components in this solution:

  1. JSON files that store the table schema
  2. LLM configuration and vector indexing using Amazon Bedrock (library.py)
  3. A Streamlit front-end UI (app.py)

You can download all three components and code snippets provided in the following section.

Generate the table schema

We use the JSON format to store the table schema. To provide more inputs to the model, we added a table name and its description, columns and their synonyms, and sample queries in our JSON files. Create a JSON file as Table_Schema_A.json by copying the following code into it:

{
  "tables": [
    {
      "separator": "table_1",
      "name": "schema_a.orders",
      "schema": "CREATE TABLE schema_a.orders (order_id character varying(200), order_date timestamp without time zone, customer_id numeric(38,0), order_status character varying(200), item_id character varying(200) );",
      "description": "This table stores information about orders placed by customers.",
      "columns": [
        {
          "name": "order_id",
          "description": "unique identifier for orders.",
          "synonyms": ["order id"]
        },
        {
          "name": "order_date",
          "description": "timestamp when the order was placed",
          "synonyms": ["order time", "order day"]
        },
        {
          "name": "customer_id",
          "description": "Id of the customer associated with the order",
          "synonyms": ["customer id", "userid"]
        },
        {
          "name": "order_status",
          "description": "current status of the order, sample values are: shipped, delivered, cancelled",
          "synonyms": ["order status"]
        },
        {
          "name": "item_id",
          "description": "item associated with the order",
          "synonyms": ["item id"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(order_id) as total_orders from schema_a.orders where customer_id = '9782226' and order_status = 'cancelled'",
          "user_input": "Count of orders cancelled by customer id: 978226"
        }
      ]
    },
    {
      "separator": "table_2",
      "name": "schema_a.customers",
      "schema": "CREATE TABLE schema_a.customers (customer_id numeric(38,0), customer_name character varying(200), registration_date timestamp without time zone, country character varying(200) );",
      "description": "This table stores the details of customers.",
      "columns": [
        {
          "name": "customer_id",
          "description": "Id of the customer, unique identifier for customers",
          "synonyms": ["customer id"]
        },
        {
          "name": "customer_name",
          "description": "name of the customer",
          "synonyms": ["name"]
        },
        {
          "name": "registration_date",
          "description": "registration timestamp when customer registered",
          "synonyms": ["sign up time", "registration time"]
        },
        {
          "name": "country",
          "description": "customer's original country",
          "synonyms": ["location", "customer's region"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(customer_id) as total_customers from schema_a.customers where country = 'India' and to_char(registration_date, 'YYYY') = '2024'",
          "user_input": "The number of customers registered from India in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id where c.customer_name = 'john' and to_char(o.order_date, 'YYYY-MM') = '2024-01'",
          "user_input": "Total orders placed in January 2024 by customer name john"
        }
      ]
    },
    {
      "separator": "table_3",
      "name": "schema_a.items",
      "schema": "CREATE TABLE schema_a.items (item_id character varying(200), item_name character varying(200), listing_date timestamp without time zone );",
      "description": "This table stores the complete details of items listed in the catalog.",
      "columns": [
        {
          "name": "item_id",
          "description": "Id of the item, unique identifier for items",
          "synonyms": ["item id"]
        },
        {
          "name": "item_name",
          "description": "name of the item",
          "synonyms": ["name"]
        },
        {
          "name": "listing_date",
          "description": "listing timestamp when the item was registered",
          "synonyms": ["listing time", "registration time"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(item_id) as total_items from schema_a.items where to_char(listing_date, 'YYYY') = '2024'",
          "user_input": "how many items are listed in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id join schema_a.items i on o.item_id = i.item_id where c.customer_name = 'john' and i.item_name = 'iphone'",
          "user_input": "how many orders are placed for item 'iphone' by customer name john"
        }
      ]
    }
  ]
}

Configure the LLM and initialize vector indexing using Amazon Bedrock

Create a Python file as library.py by following these steps:

  1. Add the following import statements to add the necessary libraries:
    import boto3  # AWS SDK for Python
    from langchain_community.document_loaders import JSONLoader  # Utility to load JSON files
    from langchain.llms import Bedrock  # Base LLM interface for Amazon Bedrock
    from langchain_community.chat_models import BedrockChat  # Chat interface for Bedrock LLM
    from langchain.embeddings import BedrockEmbeddings  # Embeddings for Titan model
    from langchain.memory import ConversationBufferWindowMemory  # Memory to store chat conversations
    from langchain.indexes import VectorstoreIndexCreator  # Create vector indexes
    from langchain.vectorstores import FAISS  # Vector store using FAISS library
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split text into chunks
    from langchain.chains import ConversationalRetrievalChain  # Conversational retrieval chain
    from langchain.callbacks.manager import CallbackManager

  2. Initialize the Amazon Bedrock client and configure Anthropic’s Claude 3.5 Sonnet. You can limit the number of output tokens to optimize the cost:
    # Create a Boto3 client for Bedrock Runtime
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    
    # Function to get the LLM (Large Language Model)
    def get_llm():
        model_kwargs = {  # Configuration for Anthropic model
            "max_tokens": 512,  # Maximum number of tokens to generate
            "temperature": 0.2,  # Sampling temperature for controlling randomness
            "top_k": 250,  # Consider the top k tokens for sampling
            "top_p": 1,  # Consider the top p probability tokens for sampling
            "stop_sequences": ["nnHuman:"]  # Stop sequence for generation
        }
        # Create a callback manager with a default callback handler
        callback_manager = CallbackManager([])
        
        llm = BedrockChat(
            model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # Set the foundation model
            model_kwargs=model_kwargs,  # Pass the configuration to the model
            callback_manager=callback_manager
        )
    
        return llm

  3. Create and return an index for the given schema type. This approach is an efficient way to filter tables and provide relevant input to the model:
    # Function to load the schema file based on the schema type
    def load_schema_file(schema_type):
        if schema_type == 'Schema_Type_A':
            schema_file = "Table_Schema_A.json"  # Path to Schema Type A
        elif schema_type == 'Schema_Type_B':
            schema_file = "Table_Schema_B.json"  # Path to Schema Type B
        elif schema_type == 'Schema_Type_C':
            schema_file = "Table_Schema_C.json"  # Path to Schema Type C
        return schema_file
    
    # Function to get the vector index for the given schema type
    def get_index(schema_type):
        embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                       client=bedrock_runtime)  # Initialize embeddings
    
        db_schema_loader = JSONLoader(
            file_path=load_schema_file(schema_type),  # Load the schema file
            # file_path="Table_Schema_RP.json",  # Uncomment to use a different file
            jq_schema='.',  # Select the entire JSON content
            text_content=False)  # Keep the JSON structure rather than extracting plain text
    
        db_schema_text_splitter = RecursiveCharacterTextSplitter(  # Create a text splitter
            separators=["separator"],  # Split chunks at the "separator" string
            chunk_size=10000,  # Divide into 10,000-character chunks
            chunk_overlap=100  # Allow 100 characters to overlap with previous chunk
        )
    
        db_schema_index_creator = VectorstoreIndexCreator(
            vectorstore_cls=FAISS,  # Use FAISS vector store
            embedding=embeddings,  # Use the initialized embeddings
            text_splitter=db_schema_text_splitter  # Use the text splitter
        )
    
        db_index_from_loader = db_schema_index_creator.from_loaders([db_schema_loader])  # Create index from loader
    
        return db_index_from_loader

  4. Use the following function to create and return memory for the chat session:
    # Function to get the memory for storing chat conversations
    def get_memory():
        memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True)  # Create memory
    
        return memory

  5. Use the following prompt template to generate SQL queries based on user input:
    # Template for the question prompt
    template = """ Read table information from the context. Each table contains the following information:
    - Name: The name of the table
    - Description: A brief description of the table
    - Columns: The columns of the table, listed under the 'columns' key. Each column contains:
      - Name: The name of the column
      - Description: A brief description of the column
      - Type: The data type of the column
      - Synonyms: Optional synonyms for the column name
    - Sample Queries: Optional sample queries for the table, listed under the 'sample_queries' key
    
    Given this structure, your task is to provide a SQL query using Amazon Redshift syntax that retrieves the data for the following question. The produced query should be functional, efficient, and adhere to best practices in SQL query optimization.
    
    Question: {}
    """

  6. Use the following function to get a response from the RAG chat model:
    # Function to get the response from the conversational retrieval chain
    def get_rag_chat_response(input_text, memory, index):
        llm = get_llm()  # Get the LLM
    
        conversation_with_retrieval = ConversationalRetrievalChain.from_llm(
            llm, index.vectorstore.as_retriever(), memory=memory, verbose=True)  # Create conversational retrieval chain
    
        chat_response = conversation_with_retrieval.invoke({"question": template.format(input_text)})  # Invoke the chain
    
        return chat_response['answer']  # Return the answer

Configure Streamlit for the front-end UI

Create the file app.py by following these steps:

  1. Import the necessary libraries:
    import streamlit as st
    import library as lib
    from io import StringIO
    import boto3
    from datetime import datetime
    import csv
    import pandas as pd
    from io import BytesIO

  2. Initialize the S3 client:
    s3_client = boto3.client('s3')
    bucket_name = 'simplesql-logs-****'
    # Replace 'simplesql-logs-****' with your S3 bucket name
    log_file_key = 'logs.xlsx'

  3. Configure Streamlit for UI:
    st.set_page_config(page_title="Your App Name")
    st.title("Your App Name")
    
    # Define the available menu items for the sidebar
    menu_items = ["Home", "How To", "Generate SQL Query"]
    
    # Create a sidebar menu using radio buttons
    selected_menu_item = st.sidebar.radio("Menu", menu_items)
    
    # Home page content
    if selected_menu_item == "Home":
        # Display introductory information about the application
        st.write("This application allows you to generate SQL queries from natural language input.")
        st.write("")
        st.write("**Get Started** by selecting the button Generate SQL Query !")
        st.write("")
        st.write("")
        st.write("**Disclaimer :**")
        st.write("- Model's response depends on user's input (prompt). Please visit How-to section for writing efficient prompts.")
               
    # How-to page content
    elif selected_menu_item == "How To":
        # Provide guidance on how to use the application effectively
        st.write("The model's output completely depends on the natural language input. Below are some examples which you can keep in mind while asking the questions.")
        st.write("")
        st.write("")
        st.write("")
        st.write("")
        st.write("**Case 1 :**")
        st.write("- **Bad Input :** Cancelled orders")
        st.write("- **Good Input :** Write a query to extract the cancelled order count for the items which were listed this year")
        st.write("- It is always recommended to add required attributes, filters in your prompt.")
        st.write("**Case 2 :**")
        st.write("- **Bad Input :** I am working on XYZ project. I am creating a new metric and need the sales data. Can you provide me the sales at country level for 2023 ?")
        st.write("- **Good Input :** Write an query to extract sales at country level for orders placed in 2023 ")
        st.write("- Every input is processed as tokens. Do not provide un-necessary details as there is a cost associated with every token processed. Provide inputs only relevant to your query requirement.") 

  4. Generate the query:
    # SQL-AI page content
    elif selected_menu_item == "Generate SQL Query":
        # Define the available schema types for selection
        schema_types = ["Schema_Type_A", "Schema_Type_B", "Schema_Type_C"]
        schema_type = st.sidebar.selectbox("Select Schema Type", schema_types)

  5. Use the following for SQL generation:
    if schema_type:
            # Initialize or retrieve conversation memory from session state
            if 'memory' not in st.session_state:
                st.session_state.memory = lib.get_memory()
    
            # Initialize or retrieve chat history from session state
            if 'chat_history' not in st.session_state:
                st.session_state.chat_history = []
    
            # Initialize or update vector index based on selected schema type
            if 'vector_index' not in st.session_state or 'current_schema' not in st.session_state or st.session_state.current_schema != schema_type:
                with st.spinner("Indexing document..."):
                    # Create a new index for the selected schema type
                    st.session_state.vector_index = lib.get_index(schema_type)
                    # Update the current schema in session state
                    st.session_state.current_schema = schema_type
    
            # Display the chat history
            for message in st.session_state.chat_history:
                with st.chat_message(message["role"]):
                    st.markdown(message["text"])
    
            # Get user input through the chat interface, set the max limit to control the input tokens.
            input_text = st.chat_input("Chat with your bot here", max_chars=100)
            
            if input_text:
                # Display user input in the chat interface
                with st.chat_message("user"):
                    st.markdown(input_text)
    
                # Add user input to the chat history
                st.session_state.chat_history.append({"role": "user", "text": input_text})
    
                # Generate chatbot response using the RAG model
                chat_response = lib.get_rag_chat_response(
                    input_text=input_text, 
                    memory=st.session_state.memory,
                    index=st.session_state.vector_index
                )
                
                # Display chatbot response in the chat interface
                with st.chat_message("assistant"):
                    st.markdown(chat_response)
    
                # Add chatbot response to the chat history
                st.session_state.chat_history.append({"role": "assistant", "text": chat_response})

  6. Log the conversations to the S3 bucket:
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
                try:
                    # Attempt to download the existing log file from S3
                    log_file_obj = s3_client.get_object(Bucket=bucket_name, Key=log_file_key)
                    log_file_content = log_file_obj['Body'].read()
                    df = pd.read_excel(BytesIO(log_file_content))
    
                except s3_client.exceptions.NoSuchKey:
                    # If the log file doesn't exist, create a new DataFrame
                    df = pd.DataFrame(columns=["User Input", "Model Output", "Timestamp", "Schema Type"])
    
                # Create a new row with the current conversation data
                new_row = pd.DataFrame({
                    "User Input": [input_text], 
                    "Model Output": [chat_response], 
                    "Timestamp": [timestamp],
                    "Schema Type": [schema_type]
                })
                # Append the new row to the existing DataFrame
                df = pd.concat([df, new_row], ignore_index=True)
                
                # Prepare the updated DataFrame for S3 upload
                output = BytesIO()
                df.to_excel(output, index=False)
                output.seek(0)
                
                # Upload the updated log file to S3
                s3_client.put_object(Body=output.getvalue(), Bucket=bucket_name, Key=log_file_key)
    

Test the solution

Open your terminal and invoke the following command to run the Streamlit application.

streamlit run app.py

To visit the application in your browser, navigate to http://localhost:8501 (the default Streamlit port).

To visit the application using SageMaker, copy your notebook URL and replace ‘default/lab’ in the URL with ‘default/proxy/8501/’. It should look something like the following:

https://your_sagemaker_lab_url.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/

Choose Generate SQL Query to open the chat window. Test your application by asking questions in natural language. We tested the application with the following questions, and it generated accurate SQL queries.

Count of orders placed from India last month?
Write a query to extract the canceled order count for the items that were listed this year.
Write a query to extract the top 10 item names having highest order for each country.

Troubleshooting tips

Use the following solutions to address errors:

Error – An error occurred (AccessDeniedException) when calling the InvokeModel operation: You don’t have access to the model with the specified model ID.
Solution – Make sure you have enabled access to the required FMs in Amazon Bedrock: Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet.

Error – app.py does not exist
Solution – Make sure your JSON file and Python files are in the same folder and you’re invoking the command in the same folder.

Error – No module named streamlit
Solution – Open the terminal and install the streamlit module by running the command pip install streamlit

Error – An error occurred (NoSuchBucket) when calling the GetObject operation. The specified bucket doesn’t exist.
Solution – Verify your bucket name in the app.py file and update the name based on your S3 bucket name.

Clean up

Clean up the resources you created to avoid incurring charges. To clean up your S3 bucket, refer to Emptying a bucket.

Conclusion

In this post, we showed how Amazon Bedrock can be used to create a text-to-SQL application based on enterprise-specific datasets. We used Amazon S3 to store the outputs generated by the model for corresponding inputs. These logs can be used to test the accuracy and enhance the context by providing more details in the knowledge base. With the aid of a tool like this, you can create automated solutions that are accessible to nontechnical users, empowering them to interact with data more efficiently.

Ready to get started with Amazon Bedrock? Start learning with these interactive workshops.

For more information on SQL generation, refer to these posts:

We recently launched a managed NL2SQL module to retrieve structured data in Amazon Bedrock Knowledge Bases. To learn more, visit Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the Author

Rajendra Choudhary is a Sr. Business Analyst at Amazon. With 7 years of experience in developing data solutions, he possesses profound expertise in data visualization, data modeling, and data engineering. He is passionate about supporting customers by leveraging generative AI–based solutions. Outside of work, Rajendra is an avid foodie and music enthusiast, and he enjoys swimming and hiking.

Read More

Unleash AI innovation with Amazon SageMaker HyperPod

The rise of generative AI has significantly increased the complexity of building, training, and deploying machine learning (ML) models. It now demands deep expertise, access to vast datasets, and the management of extensive compute clusters. Customers also face the challenges of writing specialized code for distributed training, continuously optimizing models, addressing hardware issues, and keeping projects on track and within budget. To simplify this process, AWS introduced Amazon SageMaker HyperPod during AWS re:Invent 2023, and it has emerged as a pioneering solution, revolutionizing how companies approach AI development and deployment.

As Amazon CEO Andy Jassy recently shared, “One of the most exciting innovations we’ve introduced is SageMaker HyperPod. HyperPod accelerates the training of machine learning models by distributing and parallelizing workloads across numerous powerful processors like AWS’s Trainium chips or GPUs. HyperPod also constantly monitors your infrastructure for problems, automatically repairing them when detected. During repair, your work is automatically saved, ensuring seamless resumption. This innovation is widely adopted, with most SageMaker AI customers relying on HyperPod for their demanding training needs.”

In this post, we show how SageMaker HyperPod and its new features introduced at AWS re:Invent 2024 are designed to meet the demands of modern AI workloads, offering a persistent and optimized cluster tailored for distributed training and accelerated inference at cloud scale with attractive price-performance.

Customers using SageMaker HyperPod

Leading startups like Writer, Luma AI, and Perplexity, as well as major enterprises such as Thomson Reuters and Salesforce, are accelerating model development with SageMaker HyperPod. Amazon itself used SageMaker HyperPod to train its new Amazon Nova models, significantly reducing training costs, enhancing infrastructure performance, and saving months of manual effort that would have otherwise been spent on cluster setup and end-to-end process management.

Today, more organizations are eager to fine-tune popular publicly available models or train their own specialized models to revolutionize their businesses and applications with generative AI. To support this demand, SageMaker HyperPod continues to evolve, introducing new innovations that make it straightforward, faster, and more cost-effective for customers to build, train, and deploy these models at scale.

Deep infrastructure control

SageMaker HyperPod offers persistent clusters with deep infrastructure control, enabling builders to securely connect using SSH to Amazon Elastic Compute Cloud (Amazon EC2) instances for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost), minimizing downtime during critical node replacements.

You can use familiar orchestration tools such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS), along with the libraries built on these tools, to enable flexible job scheduling and compute sharing. Integrating SageMaker HyperPod clusters with Slurm also allows the use of NVIDIA’s Enroot and Pyxis for efficient container scheduling in performant, unprivileged sandboxes. The underlying operating system and software stack are based on the Deep Learning AMI, preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. SageMaker HyperPod is also integrated with Amazon SageMaker AI distributed training libraries, optimized for AWS infrastructure, enabling automatic workload distribution across thousands of accelerators for efficient parallel training.
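For a sense of what provisioning looks like, here is a minimal, hypothetical sketch that creates a small HyperPod cluster with the SageMaker CreateCluster API through boto3. The instance type, counts, IAM role ARN, and lifecycle-script location are placeholders, and the field names should be confirmed against the current API reference before use.

# Hypothetical sketch: provisioning a small SageMaker HyperPod cluster with boto3.
# The role ARN, S3 URI, instance type, and counts are placeholders.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

response = sagemaker.create_cluster(
    ClusterName="demo-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p5.48xlarge",  # placeholder accelerator instance
            "InstanceCount": 2,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",  # placeholder
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        }
    ],
)
print(response["ClusterArn"])

# Check status once provisioning starts
print(sagemaker.describe_cluster(ClusterName="demo-hyperpod-cluster")["ClusterStatus"])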

Builders can use built-in ML tools within SageMaker HyperPod to enhance model performance. For example, Amazon SageMaker with TensorBoard helps visualize model architecture and address convergence issues, as shown in the following screenshot. Integration with observability tools like Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana offers deeper insights into cluster performance, health, and utilization, streamlining development time.

SageMaker HyperPod allows you to implement custom libraries and frameworks, enabling the service to be tailored to specific AI project needs. This level of personalization is essential in the rapidly evolving AI landscape, where innovation often requires experimenting with cutting-edge techniques and technologies. The adaptability of SageMaker HyperPod means that businesses are not constrained by infrastructure limitations, fostering creativity and technological advancement.

Intelligent resource management

As organizations increasingly provision large amounts of accelerated compute capacity for model training, they face challenges in effectively governing resource usage. These compute resources are both expensive and finite, making it crucial to prioritize critical model development tasks and avoid waste or underutilization. Without proper controls over task prioritization and resource allocation, some projects stall due to insufficient resources, while others leave resources underused. This creates a significant burden for administrators, who must constantly reallocate resources, and for data scientists, who struggle to maintain progress. These inefficiencies delay AI innovation and drive up costs.

SageMaker HyperPod addresses these challenges with its task governance capabilities, enabling you to maximize accelerator utilization for model training, fine-tuning, and inference. With just a few clicks, you can define task priorities and set limits on compute resource usage for teams. Once configured, SageMaker HyperPod automatically manages the task queue, making sure the most critical work receives the necessary resources. This reduction in operational overhead lets organizations reallocate valuable human resources toward more innovative and strategic initiatives, reducing model development costs by up to 40%.

For instance, if an inference task powering a customer-facing service requires urgent compute capacity but all resources are currently in use, SageMaker HyperPod reallocates underutilized or non-urgent resources to prioritize the critical task. Non-urgent tasks are automatically paused, checkpoints are saved to preserve progress, and these tasks resume seamlessly when resources become available. This makes sure you maximize your compute investments without compromising ongoing work.

As a fast-growing generative AI startup, Articul8 AI constantly optimizes their compute environment to allocate accelerated compute resources as efficiently as possible. With automated task prioritization and resource allocation in SageMaker HyperPod, they have seen a dramatic improvement in GPU utilization, reducing idle time and accelerating their model development process by optimizing tasks ranging from training and fine-tuning to inference. The ability to automatically shift resources to high-priority tasks has increased their team’s productivity, allowing them to bring new generative AI innovations to market faster than ever before.

At its core, SageMaker HyperPod represents a paradigm shift in AI infrastructure, moving beyond the traditional emphasis on raw computational power to focus on intelligent and adaptive resource management. By prioritizing optimized resource allocation, SageMaker HyperPod minimizes waste, maximizes efficiency, and accelerates innovation—all while reducing costs. This makes AI development more accessible and scalable for organizations of all sizes.

Get started faster with SageMaker HyperPod recipes

Many customers want to customize popular publicly available models, like Meta’s Llama and Mistral, for their specific use cases using their organization’s data. However, optimizing training performance often requires weeks of iterative testing—experimenting with algorithms, fine-tuning parameters, monitoring training impact, debugging issues, and benchmarking performance.

To simplify this process, SageMaker HyperPod now offers over 30 curated model training recipes for some of today’s most popular models, including DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, and Mixtral. These recipes enable you to get started in minutes by automating key steps like loading training datasets, applying distributed training techniques, and configuring systems for checkpointing and recovery from infrastructure failures. This empowers users of all skill levels to achieve better price-performance for model training on AWS infrastructure from the outset, eliminating weeks of manual evaluation and testing.

You can browse the GitHub repo to explore available training recipes, customize parameters to fit your needs, and deploy in minutes. With a simple one-line change, you can seamlessly switch between GPU or AWS Trainium based instances to further optimize price-performance.

Researchers at Salesforce were looking for ways to quickly get started with foundation model (FM) training and fine-tuning, without having to worry about the infrastructure, or spend weeks optimizing their training stack for each new model. With SageMaker HyperPod recipes, researchers at Salesforce can conduct rapid prototyping when customizing FMs. Now, Salesforce’s AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.

Integrating Kubernetes with SageMaker HyperPod

Though the standalone capabilities of SageMaker HyperPod are impressive, its integration with Amazon EKS takes AI workloads to new levels of power and flexibility. Amazon EKS simplifies the deployment, scaling, and management of containerized applications, making it an ideal solution for orchestrating complex AI/ML infrastructure.

By running SageMaker HyperPod on Amazon EKS, organizations can use Kubernetes’s advanced scheduling and orchestration features to dynamically provision and manage compute resources for AI/ML workloads, providing optimal resource utilization and scalability.

“We were able to meet our large language model training requirements using Amazon SageMaker HyperPod,” says John Duprey, Distinguished Engineer, Thomson Reuters Labs. “Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock benefits of LLMs in areas such as legal summarization and classification.”

This integration also enhances fault tolerance and high availability. With self-healing capabilities, HyperPod automatically replaces failed nodes, maintaining workload continuity. Automated GPU health monitoring and seamless node replacement provide reliable execution of AI/ML workloads with minimal downtime, even during hardware failures.

Additionally, running SageMaker HyperPod on Amazon EKS enables efficient resource isolation and sharing using Kubernetes namespaces and resource quotas. Organizations can isolate different AI/ML workloads or teams while maximizing resource utilization across the cluster.
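As an illustration of this namespace-and-quota pattern (generic Kubernetes, not a HyperPod-specific API), the following sketch uses the official Kubernetes Python client to create a team namespace and cap its GPU consumption with a ResourceQuota. The namespace name and limits are arbitrary, and it assumes your kubeconfig already points at the EKS cluster backing the HyperPod compute.

# Illustrative sketch: isolate a team in its own namespace and cap its GPU usage
# with a ResourceQuota. Namespace name and limits are arbitrary examples.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl is already configured for the EKS cluster
v1 = client.CoreV1Api()

namespace = "ml-team-a"
v1.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace)))

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="ml-team-a-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.nvidia.com/gpu": "16",  # cap the team at 16 GPUs
            "pods": "50",
        }
    ),
)
v1.create_namespaced_resource_quota(namespace=namespace, body=quota)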

Flexible training plans help meet timelines and budgets

Although infrastructure innovations help reduce costs and improve training efficiency, customers still face challenges in planning and managing the compute capacity needed to complete training tasks on time and within budget. To address this, AWS is introducing flexible training plans for SageMaker HyperPod.

With just a few clicks, you can specify your desired completion date and the maximum amount of compute resources needed. SageMaker HyperPod then helps acquire capacity and sets up clusters, saving teams weeks of preparation time. This eliminates much of the uncertainty customers encounter when acquiring large compute clusters for model development tasks.


SageMaker HyperPod training plans are now available in US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and SageMaker pricing page.

Hippocratic AI is an AI company that develops the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and the supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. SageMaker HyperPod flexible training plans made it straightforward for them to gain access to EC2 P5 instances.

Developers and data scientists at OpenBabylon, an AI company that customizes LLMs for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources for large-scale experiments. Using the multi-node SageMaker HyperPod distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating the ability of SageMaker HyperPod to successfully deliver complex projects on time and on budget.

Integrating training and inference infrastructures

A key focus area is integrating next-generation AI accelerators like the anticipated AWS Trainium2 release. These advanced accelerators promise unparalleled computational performance, offering 30–40% better price-performance than the current generation of GPU-based EC2 instances, significantly boosting AI model training and deployment efficiency and speed. This will be crucial for real-time applications and processing vast datasets simultaneously. The seamless accelerator integration with SageMaker HyperPod enables businesses to harness cutting-edge hardware advancements, driving AI initiatives forward.

Another pivotal aspect is that SageMaker HyperPod, through its integration with Amazon EKS, enables scalable inference solutions. As real-time data processing and decision-making demands grow, the SageMaker HyperPod architecture efficiently handles these requirements. This capability is essential across sectors like healthcare, finance, and autonomous systems, where timely, accurate AI inferences are critical. Offering scalable inference enables deploying high-performance AI models under varying workloads, enhancing operational effectiveness.

Moreover, integrating training and inference infrastructures represents a significant advancement, streamlining the AI lifecycle from development to deployment and providing optimal resource utilization throughout. Bridging this gap facilitates a cohesive, efficient workflow, reducing transition complexities from development to real-world applications. This holistic integration supports continuous learning and adaptation, which is key for next-generation, self-evolving AI models (continuously learning models, which possess the ability to adapt and refine themselves in real time based on their interactions with the environment).

SageMaker HyperPod uses established open source technologies, including MLflow integration through SageMaker, container orchestration through Amazon EKS, and Slurm workload management, providing users with familiar and proven tools for their ML workflows. By engaging the global AI community and encouraging knowledge sharing, SageMaker HyperPod continuously evolves, incorporating the latest research advancements. This collaborative approach helps SageMaker HyperPod remain at the forefront of AI technology, providing the tools to drive transformative change.

Conclusion

SageMaker HyperPod represents a fundamental change in AI infrastructure, offering a future-fit solution that empowers organizations to unlock the full potential of AI technologies. With its intelligent resource management, versatility, scalability, and forward-thinking design, SageMaker HyperPod enables businesses to accelerate innovation, reduce operational costs, and stay ahead of the curve in the rapidly evolving AI landscape.

Whether it’s optimizing the training of LLMs, processing complex datasets for medical imaging inference, or exploring novel AI architectures, SageMaker HyperPod provides a robust and flexible foundation for organizations to push the boundaries of what is possible in AI.

As AI continues to reshape industries and redefine what is possible, SageMaker HyperPod stands at the forefront, enabling organizations to navigate the complexities of AI workloads with unparalleled agility, efficiency, and innovation. With its commitment to continuous improvement, strategic partnerships, and alignment with emerging technologies, SageMaker HyperPod is poised to play a pivotal role in shaping the future of AI, empowering organizations to unlock new realms of possibility and drive transformative change.

Take the first step towards revolutionizing your AI initiatives by scheduling a consultation with our experts. Let us guide you through the process of harnessing the power of SageMaker HyperPod and unlock a world of possibilities for your business.


About the authors

Ilan Gleiser is a Principal GenAI Specialist at AWS WWSO Frameworks team focusing on developing scalable Artificial General Intelligence architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 20 blogs and delivered 100+ prototypes globally over the last 5 years. Ilan holds a Master’s degree in mathematical economics.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Shubha Kumbadakone is a Sr. Mgr on the AWS WWSO Frameworks team focusing on Foundation Model Builders and self-managed machine learning with a focus on open-source software and tools. She has more than 19 years of experience in cloud infrastructure and machine learning and is helping customers build their distributed training and inference at scale for their ML models on AWS. She also holds a patent on a caching algorithm for rapid resume from hibernation for mobile systems.

Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelors degree from University of Virginia and is based in Boston, Massachusetts.

Read More

Revolutionizing clinical trials with the power of voice and AI

In the rapidly evolving healthcare landscape, patients often find themselves navigating a maze of complex medical information, seeking answers to their questions and concerns. However, accessing accurate and comprehensible information can be a daunting task, leading to confusion and frustration. This is where the integration of cutting-edge technologies, such as audio-to-text translation and large language models (LLMs), holds the potential to revolutionize the way patients receive, process, and act on vital medical information.

As the healthcare industry continues to embrace digital transformation, solutions that combine advanced technologies like audio-to-text translation and LLMs will become increasingly valuable in addressing key challenges, such as patient education, engagement, and empowerment. By taking advantage of these innovative technologies, healthcare providers can deliver more personalized, efficient, and effective care, ultimately improving patient outcomes and driving progress in the life sciences domain.

For instance, envision a voice-enabled virtual assistant that not only understands your spoken queries, but also transcribes them into text with remarkable accuracy. This transcription then serves as the input for a powerful LLM, which draws upon its vast knowledge base to provide personalized, context-aware responses tailored to your specific situation. This solution can transform the patient education experience, empowering individuals to make informed decisions about their healthcare journey.

In this post, we discuss possible use cases for combining speech recognition technology with LLMs, and how the solution can revolutionize clinical trials.

By combining speech recognition technology with LLMs, the solution can accurately transcribe a patient’s spoken queries into text, enabling the LLM to understand and analyze the context of the question. The LLM can then use its extensive knowledge base, which can be regularly updated with the latest medical research and clinical trial data, to provide relevant and trustworthy responses tailored to the patient’s specific situation.

Some of the potential benefits of this integrated approach are that patients can receive instant access to reliable information, empowering them to make more informed decisions about their healthcare. Additionally, the solution can help alleviate the burden on healthcare professionals by providing patients with a convenient and accessible source of information, freeing up valuable time for more critical tasks. Furthermore, the voice-enabled interface can enhance accessibility for patients with disabilities or those who prefer verbal communication, making sure that no one is left behind in the pursuit of better health outcomes.

Use cases overview

In this section, we discuss several possible use cases for this solution.

Use case 1: Audio-to-text translation and LLM integration for clinical trial patient interactions

In the domain of clinical trials, effective communication between patients and physicians is crucial for gathering accurate data, supporting patient adherence, and maintaining study integrity. This use case demonstrates how audio-to-text translation combined with LLM capabilities can streamline and enhance the process of capturing and analyzing patient-physician interactions during clinical trial visits and telemedicine sessions.

The process flow consists of the following steps:

  1. Audio capture – During patient visits or telemedicine sessions, the audio of the patient-physician interaction is recorded securely, with appropriate consent and privacy measures in place.
  2. Audio-to-text translation – The recorded audio is processed through an advanced speech recognition (ASR) system, which converts the audio into text transcripts. This step provides an accurate and efficient conversion of spoken words into a format suitable for further analysis.
  3. Text preprocessing – The transcribed text undergoes preprocessing steps, such as removing identifying information (see the de-identification sketch after this list), formatting the data, and verifying compliance with relevant data privacy regulations.
  4. LLM integration – The preprocessed text is fed into a powerful LLM tailored for the healthcare and life sciences (HCLS) domain. The LLM analyzes the text, identifying key information relevant to the clinical trial, such as patient symptoms, adverse events, medication adherence, and treatment responses.
  5. Intelligent insights and recommendations – Using its large knowledge base and advanced natural language processing (NLP) capabilities, the LLM provides intelligent insights and recommendations based on the analyzed patient-physician interaction. These insights can include:
    1. Potential adverse event detection and reporting.
    2. Identification of protocol deviations or non-compliance.
    3. Recommendations for personalized patient care or adjustments to treatment regimens.
    4. Extraction of relevant data points for electronic health records (EHRs) and clinical trial databases.
  6. Data integration and reporting – The extracted insights and recommendations are integrated into the relevant clinical trial management systems, EHRs, and reporting mechanisms. This streamlines the process of data collection, analysis, and decision-making for clinical trial stakeholders, including investigators, sponsors, and regulatory authorities.
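
The post does not prescribe a specific de-identification tool for step 3; as one illustration, the following minimal sketch uses Amazon Comprehend Medical's DetectPHI operation to redact identifying spans from a transcript before it reaches the LLM. The example transcript is hypothetical.

    import boto3

    comprehend_medical = boto3.client("comprehendmedical")

    def redact_phi(transcript: str) -> str:
        """Replace spans that Amazon Comprehend Medical flags as PHI with a type placeholder."""
        response = comprehend_medical.detect_phi(Text=transcript)
        # Redact from the end of the string so earlier offsets remain valid
        for entity in sorted(response["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
            start, end = entity["BeginOffset"], entity["EndOffset"]
            transcript = transcript[:start] + f"[{entity['Type']}]" + transcript[end:]
        return transcript

    # Hypothetical transcript snippet
    print(redact_phi("Patient John Smith reported mild nausea after the morning dose on March 3."))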

The solution offers the following potential benefits:

  • Improved data accuracy – By accurately capturing and analyzing patient-physician interactions, this approach minimizes the risks of manual transcription errors and provides high-quality data for clinical trial analysis and decision-making.
  • Enhanced patient safety – The LLM’s ability to detect potential adverse events and protocol deviations can help identify and mitigate risks, improving patient safety and study integrity.
  • Personalized patient care – Using the LLM’s insights, physicians can provide personalized care recommendations, tailored treatment plans, and better manage patient adherence, leading to improved patient outcomes.
  • Streamlined data collection and analysis – Automating the process of extracting relevant data points from patient-physician interactions can significantly reduce the time and effort required for manual data entry and analysis, enabling more efficient clinical trial management.
  • Regulatory compliance – By integrating the extracted insights and recommendations into clinical trial management systems and EHRs, this approach facilitates compliance with regulatory requirements for data capture, adverse event reporting, and trial monitoring.

This use case demonstrates the potential of combining audio-to-text translation and LLM capabilities to enhance patient-physician communication, improve data quality, and support informed decision-making in the context of clinical trials. By using advanced technologies, this integrated approach can contribute to more efficient, effective, and patient-centric clinical research processes.

Use case 2: Intelligent site monitoring with audio-to-text translation and LLM capabilities

In the HCLS domain, site monitoring plays a crucial role in maintaining the integrity and compliance of clinical trials. Site monitors conduct on-site visits, interview personnel, and verify documentation to assess adherence to protocols and regulatory requirements. However, this process can be time-consuming and prone to errors, particularly when dealing with extensive audio recordings and voluminous documentation.

By integrating audio-to-text translation and LLM capabilities, we can streamline and enhance the site monitoring process, leading to improved efficiency, accuracy, and decision-making support.

The process flow consists of the following steps:

  1. Audio capture and transcription – During site visits, monitors record interviews with site personnel, capturing valuable insights and observations. These audio recordings are then converted into text using ASR and audio-to-text translation technologies.
  2. Document ingestion – Relevant site documents, such as patient records, consent forms, and protocol manuals, are digitized and ingested into the system.
  3. LLM-powered data analysis – The transcribed interviews and ingested documents are fed into a powerful LLM, which can understand and correlate the information from multiple sources. The LLM can identify key insights, potential issues, and areas of non-compliance by analyzing the content and context of the data.
  4. Case report form generation – Based on the LLM’s analysis, a comprehensive case report form (CRF) is generated, summarizing the site visit findings, identifying potential risks or deviations, and providing recommendations for corrective actions or improvements.
  5. Decision support and site selection – The CRFs and associated data can be further analyzed by the LLM to identify patterns, trends, and potential risks across multiple sites. This information can be used to support decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and data analysis processes, site monitors can save significant time and effort, allowing them to focus on more critical tasks and cover more sites within the same time frame.
  • Enhanced accuracy – LLMs can identify and correlate subtle patterns and nuances within the data, reducing the risk of overlooking critical information or making erroneous assumptions.
  • Comprehensive documentation – The generated CRFs provide a standardized and detailed record of site visits, facilitating better communication and collaboration among stakeholders.
  • Regulatory compliance – The LLM-powered analysis can help identify potential areas of non-compliance, enabling proactive measures to address issues and mitigate risks.
  • Informed decision-making – The insights derived from the LLM’s analysis can support data-driven decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

By combining audio-to-text translation and LLM capabilities, this integrated approach offers a powerful solution for intelligent site monitoring in the HCLS domain, supporting improved efficiency, accuracy, and decision-making while promoting regulatory compliance and quality assurance.

Use case 3: Enhancing adverse event reporting in clinical trials with audio-to-text and LLMs

Clinical trials are crucial for evaluating the safety and efficacy of investigational drugs and therapies. Accurate and comprehensive adverse event reporting is essential for identifying potential risks and making informed decisions. By combining audio-to-text translation with LLM capabilities, we can streamline and augment the adverse event reporting process, leading to improved patient safety and more efficient clinical research.

The process flow consists of the following steps:

  1. Audio data collection – During clinical trial visits or follow-ups, audio recordings of patient-doctor interactions are captured, capturing detailed descriptions of adverse events or symptoms experienced by the participants. These audio recordings can be obtained through various channels, such as in-person visits, telemedicine consultations, or dedicated voice reporting systems.
  2. Audio-to-text transcription – The audio recordings are processed through an audio-to-text translation system, converting the spoken words into written text format. ASR and NLP techniques provide accurate transcription, accounting for factors like accents, background noise, and medical terminology.
  3. Text data integration – The transcribed text data is integrated with other sources of adverse event reporting, such as electronic case report forms (eCRFs), patient diaries, and medication logs. This comprehensive dataset provides a holistic view of the adverse events reported across multiple data sources.
  4. LLM analysis – The integrated dataset is fed into an LLM specifically trained on medical and clinical trial data. The LLM analyzes the textual data, identifying patterns, extracting relevant information, and generating insights related to adverse event occurrences, severity, and potential causal relationships.
  5. Intelligent reporting and decision support – The LLM generates detailed adverse event reports, highlighting key findings, trends, and potential safety signals. These reports can be presented to clinical trial teams, regulatory bodies, and safety monitoring committees, supporting informed decision-making processes. The LLM can also provide recommendations for further investigation, protocol modifications, or risk mitigation strategies based on the identified adverse event patterns.

The solution offers the following potential benefits:

  • Improved data capture – By using audio-to-text translation, valuable information from patient-doctor interactions can be captured and included in adverse event reporting, reducing the risk of missed or incomplete data.
  • Enhanced accuracy and completeness – The integration of multiple data sources, combined with the LLM’s analysis capabilities, provides a comprehensive and accurate understanding of adverse events, reducing the potential for errors or omissions.
  • Efficient data analysis – The LLM can rapidly process large volumes of textual data, identifying patterns and insights that might be difficult or time-consuming for human analysts to detect manually.
  • Timely decision support – Real-time adverse event reporting and analysis enable clinical trial teams to promptly identify and address potential safety concerns, mitigating risks and protecting participant well-being.
  • Regulatory compliance – Comprehensive adverse event reporting and detailed documentation facilitate compliance with regulatory requirements and support transparent communication with regulatory agencies.

By integrating audio-to-text translation with LLM capabilities, this approach addresses the critical need for accurate and timely adverse event reporting in clinical trials, ultimately enhancing patient safety, improving research efficiency, and supporting informed decision-making in the HCLS domain.

Use case 4: Audio-to-text and LLM integration for enhanced patient care

In the healthcare domain, effective communication and accurate data capture are crucial for providing personalized and high-quality care. By integrating audio-to-text translation capabilities with LLM technology, we can streamline processes and unlock valuable insights, ultimately improving patient outcomes.

The process flow consists of the following steps:

  1. Audio input collection – Caregivers or healthcare professionals can record audio updates on a patient’s condition, mood, or relevant observations using a secure and user-friendly interface. This could be done through mobile devices, dedicated recording stations, or during virtual consultations.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a speech-to-text engine, which converts the spoken words into text format. Advanced NLP techniques provide accurate transcription, handling accents, medical terminology, and background noise.
  3. Text processing and contextualization – The transcribed text is then fed into an LLM trained on various healthcare datasets, including medical literature, clinical guidelines, and deidentified patient records. The LLM processes the text, identifies key information, and extracts relevant context and insights.
  4. LLM-powered analysis and recommendations – Using its sizeable knowledge base and natural language understanding capabilities, the LLM can perform various tasks, such as:
    1. Identifying potential health concerns or risks based on the reported symptoms and observations.
    2. Suggesting personalized care plans or treatment options aligned with evidence-based practices.
    3. Providing recommendations for follow-up assessments, diagnostic tests, or specialist consultations.
    4. Flagging potential drug interactions or contraindications based on the patient’s medical history.
    5. Generating summaries or reports in a structured format for efficient documentation and communication.
  5. Integration with EHRs – The analyzed data and recommendations from the LLM can be seamlessly integrated into the patient’s EHR, providing a comprehensive and up-to-date medical profile. This enables healthcare professionals to access relevant information promptly and make informed decisions during consultations or treatment planning.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and analysis process, healthcare professionals can save time and focus on providing personalized care, rather than spending extensive hours on documentation and data entry.
  • Enhanced accuracy – ASR and NLP techniques provide accurate transcription, reducing errors and improving data quality.
  • Comprehensive patient insights – The LLM’s ability to process and contextualize unstructured audio data provides a more holistic understanding of the patient’s condition, enabling better-informed decision-making.
  • Personalized care plans – By using the LLM’s knowledge base and analytical capabilities, healthcare professionals can develop tailored care plans aligned with the patient’s specific needs and medical history.
  • Streamlined communication – Structured reports and summaries generated by the LLM facilitate efficient communication among healthcare teams, making sure everyone has access to the latest patient information.
  • Continuous learning and improvement – As more data is processed, the LLM can continuously learn and refine its recommendations, improving its performance over time.

By integrating audio-to-text translation and LLM capabilities, healthcare organizations can unlock new efficiencies, enhance patient-provider communication, and ultimately deliver superior care while staying at the forefront of technological advancements in the industry.

Use case 5: Audio-to-text translation and LLM integration for clinical trial protocol design

Efficient and accurate protocol design is crucial for successful study execution and regulatory compliance. By combining audio-to-text translation capabilities with the power of LLMs, we can streamline the protocol design process, using diverse data sources and AI-driven insights to create high-quality protocols in a timely manner.

The process flow consists of the following steps:

  1. Audio input collection – Clinical researchers, subject matter experts, and stakeholders provide audio inputs, such as recorded meetings, discussions, or interviews, related to the proposed clinical trial. These audio files can capture valuable insights, requirements, and domain-specific knowledge.
  2. Audio-to-text transcription – Using ASR technology, the audio inputs are converted into text transcripts with high accuracy. This step makes sure that valuable information is captured and transformed into a format suitable for further processing by LLMs.
  3. Data integration – Relevant data sources, such as previous clinical trial protocols, regulatory guidelines, scientific literature, and medical databases, are integrated into the workflow. These data sources provide contextual information and serve as a knowledge base for the LLM.
  4. LLM processing – The transcribed text, along with the integrated data sources, is fed into a powerful LLM. The LLM uses its knowledge base and NLP capabilities to analyze the inputs, identify key elements, and generate a draft clinical trial protocol.
  5. Protocol refinement and review – The draft protocol generated by the LLM is reviewed by clinical researchers, medical experts, and regulatory professionals. They provide feedback, make necessary modifications, and enforce compliance with relevant guidelines and best practices.
  6. Iterative improvement – As the AI system receives feedback and correlated outcomes from completed clinical trials, it continuously learns and refines its protocol design capabilities. This iterative process enables the LLM to become more accurate and efficient over time, leading to higher-quality protocol designs.

The solution offers the following potential benefits:

  • Efficiency – By automating the initial protocol design process, researchers can save valuable time and resources, allowing them to focus on more critical aspects of clinical trial execution.
  • Accuracy and consistency – LLMs can use vast amounts of data and domain-specific knowledge, reducing the risk of errors and providing consistency across protocols.
  • Knowledge integration – The ability to seamlessly integrate diverse data sources, including audio recordings, scientific literature, and regulatory guidelines, enhances the quality and comprehensiveness of the protocol design.
  • Continuous improvement – The iterative learning process allows the AI system to adapt and improve its protocol design capabilities based on real-world outcomes, leading to increasingly accurate and effective protocols over time.
  • Decision-making support – By providing well-structured and comprehensive protocols, the AI-driven approach enables better-informed decision-making for clinical researchers, sponsors, and regulatory bodies.

This integrated approach using audio-to-text translation and LLM capabilities has the potential to revolutionize the clinical trial protocol design process, ultimately contributing to more efficient and successful clinical trials, accelerating the development of life-saving treatments, and improving patient outcomes.

Use case 6: Voice-enabled clinical trial and disease information assistant

In the HCLS domain, effective communication and access to accurate information are crucial for patients, caregivers, and healthcare professionals. This use case demonstrates how audio-to-text translation combined with LLM capabilities can address these needs by providing an intelligent, voice-enabled assistant for clinical trial and disease information.

The process flow consists of the following steps:

  1. Audio input – The user, whether a patient, caregiver, or healthcare professional, can initiate the process by providing a voice query related to a specific disease or clinical trial. This could include questions about the disease itself, treatment options, ongoing trials, eligibility criteria, or other relevant information.
  2. Audio-to-text translation – The audio input is converted into text using state-of-the-art speech recognition technology. This step makes sure that the user’s query is accurately transcribed and ready for further processing by the LLM.
  3. Data integration – The system integrates various data sources, including clinical trial data, disease-specific information from reputable sources (such as PubMed or WebMD), and other relevant third-party resources. This comprehensive data integration makes sure that the LLM has access to a large knowledge base for generating accurate and comprehensive responses.
  4. LLM processing – The transcribed query is fed into the LLM, which uses its natural language understanding capabilities to comprehend the user’s intent and extract relevant information from the integrated data sources. The LLM can provide intelligent responses, insights, and recommendations based on the query and the available data.
  5. Response generation – The LLM generates a detailed, context-aware response addressing the user’s query. This response can be presented in various formats, such as text, audio (using text-to-speech technology), or a combination of both, depending on the user’s preferences and accessibility needs.
  6. Feedback and continuous improvement – The system can incorporate user feedback mechanisms to improve its performance over time. This feedback can be used to refine the LLM’s understanding, enhance the data integration process, and make sure that the system remains up to date with the latest clinical trial and disease information.

The solution offers the following potential benefits:

  • Improved access to information – By using voice input and NLP capabilities, the system empowers patients, caregivers, and healthcare professionals to access accurate and comprehensive information about diseases and clinical trials, regardless of their technical expertise or literacy levels.
  • Enhanced communication – The voice-enabled interface facilitates seamless communication between users and the system, enabling them to ask questions and receive responses in a conversational manner, mimicking human-to-human interaction.
  • Personalized insights – The LLM can provide personalized insights and recommendations based on the user’s specific query and context, enabling more informed decision-making and tailored support for individuals.
  • Time and efficiency gains – By automating the process of information retrieval and providing intelligent responses, the system can significantly reduce the time and effort required for healthcare professionals to manually search and synthesize information from multiple sources.
  • Improved patient engagement – By offering accessible and user-friendly access to disease and clinical trial information, the system can empower patients and caregivers to actively participate in their healthcare journey, fostering better engagement and understanding.

This use case highlights the potential of integrating audio-to-text translation with LLM capabilities to address real-world challenges in the HCLS domain. By using cutting-edge technologies, this solution can improve information accessibility, enhance communication, and support more informed decision-making for all stakeholders involved in clinical trials and disease management.

For demonstration purposes, we will focus on the following use case:

Use case overview: Patient reporting and analysis in clinical trials

In clinical trials, it’s crucial to gather accurate and comprehensive patient data to assess the safety and efficacy of investigational drugs or therapies. Traditional methods of collecting patient reports can be time-consuming, prone to errors, and might result in incomplete or inconsistent data. By combining audio-to-text translation with LLM capabilities, we can streamline the patient reporting process and unlock valuable insights to support decision-making.

The process flow consists of the following steps:

  1. Audio input – Patients participating in clinical trials can provide their updates, symptoms, and feedback through voice recordings using a mobile application or a dedicated recording device.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a cloud-based infrastructure, where they undergo automated transcription using ASR technology. The audio is converted into text, providing accurate and verbatim transcripts.
  3. Data consolidation – The transcribed patient reports are consolidated into a structured database, enabling efficient storage, retrieval, and analysis.
  4. LLM processing – The consolidated textual data is then processed by an LLM trained on biomedical and clinical trial data. The LLM can perform various tasks, including:
    1. Natural language processing – Extracting relevant information and identifying key symptoms, adverse events, or treatment responses from the patient reports.
    2. Sentiment analysis – Analyzing the emotional and psychological state of patients based on their language and tone, which can provide valuable insights into their overall well-being and treatment experience.
    3. Pattern recognition – Identifying recurring themes, trends, or anomalies across multiple patient reports, enabling early detection of potential safety concerns or efficacy signals.
    4. Knowledge extraction – Using the LLM’s understanding of biomedical concepts and clinical trial protocols to derive meaningful insights and recommendations from the patient data.
  5. Insights and reporting – The processed data and insights derived from the LLM are presented through interactive dashboards, visualizations, and reports. These outputs can be tailored to different stakeholders, such as clinical researchers, medical professionals, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data quality – By using audio-to-text transcription, the risk of errors and inconsistencies associated with manual data entry is minimized, providing high-quality patient data.
  • Time and cost-efficiency – Automated transcription and LLM-powered analysis can significantly reduce the time and resources required for data collection, processing, and analysis, leading to faster decision-making and cost savings.
  • Enhanced patient experience – Patients can provide their updates conveniently through voice recordings, reducing the burden of manual data entry and enabling more natural communication.
  • Comprehensive analysis – The combination of NLP, sentiment analysis, and pattern recognition capabilities offered by LLMs allows for a holistic understanding of patient experiences, treatment responses, and potential safety signals.
  • Regulatory compliance – Accurate and comprehensive patient data, coupled with robust analysis, can support compliance with regulatory requirements for clinical trial reporting and data documentation.

By integrating audio-to-text translation and LLM capabilities, clinical trial sponsors and research organizations can benefit from streamlined patient reporting, enhanced data quality, and powerful insights to support informed decision-making throughout the clinical development process.

Solution overview

The following diagram illustrates the solution architecture.

Solution overview: patient reporting and analysis in clinical trials

Key AWS services used in this solution include Amazon Simple Storage Service (Amazon S3), AWS HealthScribe, Amazon Transcribe, and Amazon Bedrock.

Prerequisites

This solution requires an AWS account with permissions to deploy resources through AWS CloudFormation and access to the AWS services used in this post, including Amazon S3, AWS Lambda, Amazon Transcribe, and Amazon Bedrock with Anthropic's Claude 3 Sonnet model enabled.

Data samples

To illustrate the concept and provide a practical understanding, we have curated a collection of audio samples. These samples serve as representative examples, simulating site interviews conducted by researchers at clinical trial sites with patient participants.

The audio recordings offer a glimpse into the type of data typically encountered during such interviews. We encourage you to listen to these samples to gain a better appreciation of the data and its context.

These samples are for demonstration purposes only and don’t contain any real patient information or sensitive data. They are intended solely to provide a sample structure and format for the audio recordings used in this particular use case.

The sample data includes the following site interview audio files:

  • Site interview 1
  • Site interview 2
  • Site interview 3
  • Site interview 4
  • Site interview 5

Prompt templates

Prior to deploying and executing this solution, it’s essential to comprehend the input prompts and the anticipated output from the LLM. Although this is merely a sample, the potential outcomes and possibilities can be vastly expanded by crafting creative prompts.

We use the following input prompt template:

You are an expert medical research analyst for clinical trials of medicines.

You will be provided with a dictionary containing text transcriptions of clinical trial interviews conducted between patients and interviewers.

The dictionary keys represent the interview_id, and the values contain the interview transcripts.

<interview_transcripts>add_interview_transcripts</interview_transcripts>

Your task is to analyze all the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.

The response from Amazon Bedrock will be similar to the following:

Based on the interview transcripts provided, here is a comprehensive report summarizing the key findings and conclusions from the clinical trial:

Introduction:

This report analyzes transcripts from interviews conducted with patients participating in a clinical trial for a new investigational drug. The interviews cover various aspects of the trial, including the informed consent process, randomization procedures, dosing schedules, follow-up visits, and patient experiences with potential side effects.

Key Findings:

1. Informed Consent Process:

– The informed consent process was thorough, with detailed explanations provided to patients about the trial’s procedures, potential risks, and benefits (Transcript 5).

– Patients were given ample time to review the consent documents, discuss them with family members, and have their questions addressed satisfactorily by the study team (Transcript 5).

– Overall, patients felt they fully understood the commitments and requirements of participating in the trial (Transcript 5).

2. Randomization and Blinding:

– Patients were randomized to either receive the investigational drug or a placebo, as part of a placebo-controlled study design (Transcript 2).

– The randomization process was adequately explained to patients, and they understood the rationale behind blinding, which is to prevent bias in the results (Transcript 2).

– Patients expressed acceptance of the possibility of receiving a placebo, recognizing its importance for the research (Transcript 2).

3. Dosing Schedule and Adherence:

– The dosing schedule involved taking the medication twice daily, in the morning and evening (Transcript 4).

– Some patients reported occasional difficulties in remembering the evening dose but implemented strategies like setting reminders on their phones to improve adherence (Transcript 4).

4. Follow-up Visits and Assessments:

– Follow-up visits were scheduled at specific intervals, such as 30 days, 3 months, and 6 months after the last dose (Transcripts 1 and 3).

– During these visits, various assessments were conducted, including blood tests, physical exams, ECGs, and evaluation of patient-reported outcomes like pain levels (Transcripts 1 and 3).

– Patients were informed that they would receive clinically significant findings from these assessments (Transcript 3).

5. Patient-Reported Side Effects:

– Some patients reported experiencing mild side effects, such as headaches, nausea, and joint pain improvement (Transcripts 3 and 4).

– The study team diligently documented and monitored these side effects, noting them in case report forms for further evaluation (Transcript 4).

6. Study Conduct and Communication:

– The study team provided 24/7 contact information, allowing patients to reach out with concerns between scheduled visits (Transcript 1).

– Patients were informed that they would receive information about the overall study results once available (Transcript 1).

– Patients were made aware of their ability to withdraw from the study at any time if they became uncomfortable (Transcript 2).

Conclusions:

Based on the interview transcripts, the clinical trial appears to have been conducted in a thorough and ethical manner, adhering to principles of informed consent, randomization, and blinding. Patients were adequately informed about the trial procedures, potential risks, and their rights as participants. The study team diligently monitored patient safety, documented adverse events, and maintained open communication channels. Overall, the transcripts suggest a well-managed clinical trial with a focus on patient safety, data integrity, and adherence to research protocols.
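
The report above was generated by Anthropic's Claude 3 Sonnet on Amazon Bedrock. As a minimal sketch of that call outside the deployed Lambda function, the transcripts dictionary might be substituted into the prompt template and submitted as follows; the message format follows the Anthropic Messages API on Amazon Bedrock, and the transcripts variable is assumed to already hold the interview texts keyed by interview_id.

    import json
    import boto3

    bedrock_runtime = boto3.client("bedrock-runtime")

    # Assumed to be populated from the transcription step, keyed by interview_id
    transcripts = {"interview_1": "...", "interview_2": "..."}

    prompt = (
        "You are an expert medical research analyst for clinical trials of medicines.\n"
        "You will be provided with a dictionary containing text transcriptions of clinical trial "
        "interviews conducted between patients and interviewers.\n"
        "The dictionary keys represent the interview_id, and the values contain the interview transcripts.\n"
        f"<interview_transcripts>{json.dumps(transcripts)}</interview_transcripts>\n"
        "Your task is to analyze all the transcripts and generate a comprehensive report "
        "summarizing the key findings and conclusions from the clinical trial."
    )

    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2000,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )

    print(json.loads(response["body"].read())["content"][0]["text"])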

Deploy resources with AWS CloudFormation

To deploy the solution, launch the AWS CloudFormation template provided with this post.

Test the application

To test the application, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Locate your bucket starting with blog-hcls-assets-*.
  3. Navigate to the S3 prefix hcls-framework/samples-input-audio/. You will see sample audio files, which we reviewed earlier in this post.
  4. Select these files, and on the Actions menu, choose Copy.
  5. For Destination, choose Browse S3 and navigate to the S3 path hcls-framework/input-audio/.

Copying these sample files will trigger an S3 event invoking the AWS Lambda function audio-to-text. To review the invocations of the Lambda function on the AWS Lambda console, navigate to the audio-to-text function and then the Monitor tab, which contains detailed logs.
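
The code of the deployed audio-to-text function isn't reproduced in this post; the following is a minimal sketch of what such an S3-triggered handler could look like. The MP3 media format, the reuse of the source bucket for output, and the job-name derivation are assumptions made for illustration and may differ from the deployed solution.

    import os
    import time
    import boto3

    transcribe = boto3.client("transcribe")

    def handler(event, context):
        """Start an Amazon Transcribe job for each audio file dropped into the input prefix."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Simplistic job-name derivation for illustration only
            job_name = f"hcls-{int(time.time())}-{os.path.basename(key)}".replace(".", "-")
            transcribe.start_transcription_job(
                TranscriptionJobName=job_name,
                Media={"MediaFileUri": f"s3://{bucket}/{key}"},
                MediaFormat="mp3",                       # assumption: sample audio is MP3
                LanguageCode="en-US",
                OutputBucketName=bucket,                 # assumption: transcripts go back to the same bucket
                OutputKey="hcls-framework/input-text/",  # prefix used later in this post
            )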

Review AWS Lambda execution logs

You can review the status of the Amazon Transcribe jobs on the Amazon Transcribe console.

At this point, the interview transcripts are ready. They should be available in Amazon S3 under the prefix hcls-framework/input-text/.

You can download a sample file and review the contents. The file content is JSON, with the text transcript available under the key transcripts, along with other metadata.
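
As a quick check outside the console, the same inspection can be scripted. This minimal sketch assumes a hypothetical bucket and object key and the JSON layout described above.

    import json
    import boto3

    s3 = boto3.client("s3")

    bucket = "blog-hcls-assets-example"  # hypothetical; use your bucket starting with blog-hcls-assets-
    key = "hcls-framework/input-text/site-interview-1.json"  # hypothetical object key

    obj = s3.get_object(Bucket=bucket, Key=key)
    document = json.loads(obj["Body"].read())

    # The text transcript sits under the "transcripts" key, alongside other metadata
    print(document["transcripts"])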

Now let’s run Anthropic’s Claude 3 Sonnet using the Lambda function hcls_clinical_trial_analysis to analyze the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.

  1. On the Lambda console, navigate to the function named hcls_clinical_trial_analysis.
  2. Choose Test.
  3. If the console prompts you to create a new test event, create one with default or empty input.
  4. Run the test event.

To review the output, open the Lambda console, navigate to the function named hcls_clinical_trial_analysis, and on the Monitor tab choose View CloudWatch Logs for detailed logs. In the logs, you will see your comprehensive report on the clinical trial.

So far, we have completed a process involving:

  • Collecting audio interviews from clinical trials
  • Transcribing the audio to text
  • Compiling transcripts into a dictionary
  • Using Amazon Bedrock (Anthropic’s Claude 3 Sonnet) to generate a comprehensive summary

Although we focused on summarization, this approach can be extended to other applications such as sentiment analysis, extracting key learnings, identifying common complaints, and more.
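
For example, only the final instruction of the prompt needs to change to repurpose the same pipeline; the variation below is illustrative and not part of the deployed solution.

    sentiment_task = (
        "Your task is to analyze all the transcripts and, for each interview_id, classify the "
        "patient's overall sentiment about the trial experience as positive, neutral, or negative, "
        "citing the phrases that support each classification."
    )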

Summary

Healthcare patients often find themselves in need of reliable information about their conditions, clinical trials, or treatment options. However, accessing accurate and up-to-date medical knowledge can be a daunting task. Our innovative solution integrates cutting-edge audio-to-text translation and LLM capabilities to revolutionize how patients receive vital healthcare information. By using speech recognition technology, we can accurately transcribe patients’ spoken queries, allowing our LLM to comprehend the context and provide personalized, evidence-based responses tailored to their specific needs. This empowers patients to make informed decisions, enhances accessibility for those with disabilities or preferences for verbal communication, and alleviates the workload on healthcare professionals, ultimately improving patient outcomes and driving progress in the HCLS domain.

Take charge of your healthcare journey with our innovative voice-enabled virtual assistant. Empower yourself with accurate and personalized information by simply asking your questions aloud. Our cutting-edge solution integrates speech recognition and advanced language models to provide reliable, context-aware responses tailored to your specific needs. Embrace the future of healthcare today and experience the convenience of instantaneous access to vital medical information.


About the Authors

Vrinda Dabke leads AWS Professional Services North America Delivery. Prior to joining AWS, Vrinda held a variety of leadership roles in Fortune 100 companies like UnitedHealth Group, The Hartford, Aetna, and Pfizer. Her work has focused on business intelligence, analytics, and AI/ML. She is a motivational people leader with experience in leading and managing high-performing global teams in complex matrix organizations.

Kannan Raman leads the North America Delivery for the AWS Professional Services Healthcare and Life Sciences practice at AWS. He has over 24 years of healthcare and life sciences experience and provides thought leadership in digital transformation. He works with C-level customer executives to help them with their digital transformation agenda.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Read More

Introducing KBLaM: Bringing plug-and-play external knowledge to LLMs

Introducing KBLaM: Bringing plug-and-play external knowledge to LLMs

Figure: An overview of KBLaM. Documents are used to construct and summarize a knowledge base offline; the summarized knowledge base is encoded, a prompt passes through the tokenizer and rectangular attention into the LLM, and the LLM retrieves information from the encoded knowledge base to generate an answer.

Large language models (LLMs) have demonstrated remarkable capabilities in reasoning, language understanding, and even creative tasks. Yet, a key challenge persists: how to efficiently integrate external knowledge.

Traditional methods such as fine-tuning and Retrieval-Augmented Generation (RAG) come with trade-offs—fine-tuning demands costly retraining, while RAG introduces separate retrieval modules that increase complexity and prevent seamless, end-to-end training. In-context learning, on the other hand, becomes increasingly inefficient as knowledge bases grow, facing quadratic computational scaling that hinders its ability to handle large repositories. A comparison of these approaches can be seen in Figure 1.

A new way to integrate knowledge

To address these challenges, we introduce the Knowledge Base-Augmented Language Model (KBLaM) —a novel approach that integrates structured knowledge bases into pre-trained LLMs. Instead of relying on external retrieval modules or costly fine-tuning, KBLaM encodes knowledge into continuous key-value vector pairs, efficiently embedding them within the model’s attention layers using a specialized rectangular attention mechanism, which implicitly performs retrieval in an integrated manner.

We use structured knowledge bases to represent the data, allowing us to consolidate knowledge and leverage structure. This design allows it to scale linearly with the size of the knowledge base while maintaining dynamic updates without retraining, making it far more efficient than existing methods.

Scalable, efficient, and future-ready

At its core, KBLaM is designed to integrate structured knowledge into LLMs, making them more efficient and scalable. It achieves this by converting external knowledge bases—collections of facts structured as triples consisting of an entity, a property, and a value—into a format that LLMs can process naturally.  Such knowledge bases allow for consolidated, reliable sources of knowledge.

To create these knowledge bases, we first extract structured data in JSON format using small language models. We then apply Project Alexandria’s probabilistic clustering. Once we have this structured knowledge base, KBLaM follows a three-step pipeline:

  1. Knowledge Encoding: Each knowledge triple is mapped into a key-value vector pair using a pre-trained sentence encoder with lightweight linear adapters. The key vector, derived from the entity name and property, encodes “index information,” while the value vector captures the corresponding property value. This allows us to create continuous, learnable key-value representations.
  2. Integration with LLMs: These key-value pairs, or knowledge tokens, are augmented into the model’s attention layers using a specialized rectangular attention structure. Unlike traditional transformer models that process all tokens equally and come with quadratic cost—such as GPT-4, Phi, and Llama—rectangular attention enables the model to attend over knowledge with linear cost, as illustrated in Figure 2. Compared to standard attention mechanisms in generative language models, where each token attends to all preceding tokens, our approach introduces a more efficient structure. In this setup, language tokens (such as those from a user’s question) attend to all knowledge tokens. However, knowledge tokens do not attend to one another, nor do they attend back to the language tokens. This selective attention pattern significantly reduces computational cost while preserving the model’s ability to incorporate external knowledge effectively.

    This linear cost, which is crucial for the efficiency of KBLaM, effectively amounts to treating each fact independently—an assumption that holds for most facts. For example, the model’s name, KBLaM, and the fact that the research was conducted at Microsoft Research are very weakly correlated. This rectangular attention is implemented as an extension of standard attention. During training, we keep the base model’s weights frozen, ensuring that when no knowledge tokens are provided, the model functions exactly as it did originally.

  3. Efficient Knowledge Retrieval: Through this rectangular attention, the model learns to dynamically retrieve relevant knowledge tokens during inference, eliminating the need for separate retrieval steps.
Figure 1: KBLaM allows for attention over the entire knowledge base instead of having an external retriever. Whereas RAG retrieves relevant documents with a separate retriever module and appends them to the context, and in-context learning places the entire corpus in the context at quadratic cost, KBLaM builds a structured knowledge base offline and attends over all of it with rectangular attention, so cost grows linearly with the size of the knowledge base and retrieval requires only a single, trainable component.

Figure 2: By having the user’s question attend to the knowledge base, while treating facts in the knowledge base independently, KBLaM scales efficiently and linearly with the size of the knowledge base. Unlike regular attention, the attention matrix is not square, because the entries where the knowledge base would attend over itself are removed.
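
To make the attention pattern concrete, the following is a minimal sketch (not the released implementation) of how a rectangular attention mask could be built for a sequence of knowledge tokens followed by language tokens. Letting each knowledge token attend to itself is an assumption made here to keep every attention row well defined.

    import numpy as np

    def rectangular_attention_mask(num_kb: int, num_lang: int) -> np.ndarray:
        """Boolean mask where mask[i, j] is True if query token i may attend to key token j.

        Layout: the first num_kb positions are knowledge tokens, followed by num_lang language tokens.
        """
        n = num_kb + num_lang
        mask = np.zeros((n, n), dtype=bool)

        # Knowledge tokens attend only to themselves (assumption), not to one another
        # and not to the language tokens, so each fact is treated independently.
        mask[:num_kb, :num_kb] = np.eye(num_kb, dtype=bool)

        # Language tokens attend to every knowledge token...
        mask[num_kb:, :num_kb] = True
        # ...and causally to preceding language tokens, as in standard autoregressive attention.
        mask[num_kb:, num_kb:] = np.tril(np.ones((num_lang, num_lang), dtype=bool))

        return mask

    # The language-to-knowledge block grows linearly in num_kb for a fixed prompt length.
    print(rectangular_attention_mask(num_kb=4, num_lang=3).astype(int))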

Unlike RAG, which appends retrieved document chunks to prompts, KBLaM allows for direct integration of knowledge into the model. Compared to in-context learning,  KBLaM’s rectangular attention maintains a linear memory footprint, making it vastly more scalable for large knowledge bases. 

Its efficiency is a game-changer. While traditional in-context learning methods struggle with quadratic memory growth due to self-attention overhead, KBLaM’s linear overhead means we can store much more knowledge in the context. In practice, this means KBLaM can store and process over 10,000 knowledge triples, the equivalent of approximately 200,000 text tokens on a single GPU—a feat that would be computationally prohibitive with conventional in-context learning. The results across a wide range of knowledge base sizes can be seen in Figure 3. Remarkably, it achieves this while extending a base model that has a context length of only 8K tokens. Additionally, KBLaM enables dynamic updates: modifying a single knowledge triple does not require retraining or re-computation of the entire knowledge base.

Figure 3: KBLaM is much faster and uses much less memory than adding the equivalent number of triples in the context using conventional RAG-like approaches. Time to first token remains relatively constant across a large range of knowledge base sizes and is lower with 4,096 triples in the context with KBLaM than with 5 triples in the context using a conventional RAG-like approach; memory usage is also much lower, with KBLaM at 512 triples using roughly as much memory as RAG at 5 triples.

Enhancing interpretability and reliability

Another major benefit of KBLaM is its interpretability. Unlike in-context learning, where knowledge injection is opaque, KBLaM’s attention weights provide clear insights into how the model utilizes knowledge tokens. Experiments show that KBLaM assigns high attention scores to relevant knowledge triples, effectively mimicking a soft retrieval process.

Furthermore, KBLaM enhances model reliability by learning through its training examples when not to answer a question if the necessary information is missing from the knowledge base. In particular, with knowledge bases larger than approximately 200 triples, we found that the model refuses to answer questions it has no knowledge about more precisely than a model given the information as text in context. This feature helps reduce hallucinations, a common problem in LLMs that rely on internal knowledge alone, making responses more accurate and trustworthy.

The future of knowledge-augmented AI

KBLaM represents a major step forward in integrating structured knowledge into LLMs. By offering a scalable, efficient, and interpretable alternative to existing techniques, it paves the way for AI systems that can stay up to date and provide reliable, knowledge-driven responses. In fields where accuracy and trust are critical—such as medicine, finance, and scientific research—this approach has the potential to transform how language models interact with real-world information.

As AI systems increasingly rely on dynamic knowledge rather than static model parameters, we hope KBLaM will serve as a bridge between raw computational power and real-world understanding.

However, there is still work to be done before it can be deployed at scale. Our current model has been trained primarily on factual question-answer pairs, and further research is needed to expand its capabilities across more complex reasoning tasks and diverse knowledge domains.

To accelerate progress, we are releasing KBLaM’s code and datasets to the research community, and we are planning integrations with the Hugging Face transformers library. By making these resources available, we hope to inspire further research and adoption of scalable, efficient knowledge augmentation for LLMs. The future of AI isn’t just about generating text—it’s about generating knowledge that is accurate, adaptable, and deeply integrated with the evolving world. KBLaM is a step in that direction.

The post Introducing KBLaM: Bringing plug-and-play external knowledge to LLMs appeared first on Microsoft Research.

Read More

Intelligent healthcare assistants: Empowering stakeholders with personalized support and data-driven insights

Intelligent healthcare assistants: Empowering stakeholders with personalized support and data-driven insights

Large language models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like text with remarkable accuracy. However, despite their impressive language capabilities, LLMs are inherently limited by the data they were trained on. Their knowledge is static, frozen at the time of training, which becomes problematic when dealing with dynamic and constantly evolving domains like healthcare.

The healthcare industry is a complex, ever-changing landscape with a vast and rapidly growing body of knowledge. Medical research, clinical practices, and treatment guidelines are constantly being updated, rendering even the most advanced LLMs quickly outdated. Additionally, patient data, including electronic health records (EHRs), diagnostic reports, and medical histories, are highly personalized and unique to each individual. Relying solely on an LLM’s pre-trained knowledge is insufficient for providing accurate and personalized healthcare recommendations.

Furthermore, healthcare decisions often require integrating information from multiple sources, such as medical literature, clinical databases, and patient records. LLMs lack the ability to seamlessly access and synthesize data from these diverse and distributed sources. This limits their potential to provide comprehensive and well-informed insights for healthcare applications.

Overcoming these challenges is crucial for using the full potential of LLMs in the healthcare domain. Patients, healthcare providers, and researchers require intelligent agents that can provide up-to-date, personalized, and context-aware support, drawing from the latest medical knowledge and individual patient data.

Enter LLM function calling, a powerful capability that addresses these challenges by allowing LLMs to interact with external functions or APIs, enabling them to access and use additional data sources or computational capabilities beyond their pre-trained knowledge. By combining the language understanding and generation abilities of LLMs with external data sources and services, LLM function calling opens up a world of possibilities for intelligent healthcare agents.

In this blog post, we will explore how Mistral models on Amazon Bedrock can address these challenges and enable the development of intelligent healthcare agents with LLM function calling capabilities, while maintaining robust data security and privacy through Amazon Bedrock Guardrails.

Healthcare agents equipped with LLM function calling can serve as intelligent assistants for various stakeholders, including patients, healthcare providers, and researchers. They can assist patients by answering medical questions, interpreting test results, and providing personalized health advice based on their medical history and current conditions. For healthcare providers, these agents can help with tasks such as summarizing patient records, suggesting potential diagnoses or treatment plans, and staying up to date with the latest medical research. Additionally, researchers can use LLM function calling to analyze vast amounts of scientific literature, identify patterns and insights, and accelerate discoveries in areas such as drug development or disease prevention.

Benefits of LLM function calling

LLM function calling offers several advantages for enterprise applications, including enhanced decision-making, improved efficiency, personalized experiences, and scalability. By combining the language understanding capabilities of LLMs with external data sources and computational resources, enterprises can make more informed and data-driven decisions, automate and streamline various tasks, provide tailored recommendations and experiences for individual users or customers, and handle large volumes of data and process multiple requests concurrently.

Potential use cases for LLM function calling in the healthcare domain include patient triage, medical question answering, and personalized treatment recommendations. LLM-powered agents can assist in triaging patients by analyzing their symptoms, medical history, and risk factors, and providing initial assessments or recommendations for seeking appropriate care. Patients and healthcare providers can receive accurate and up-to-date answers to medical questions by using LLMs’ ability to understand natural language queries and access relevant medical knowledge from various data sources. Additionally, by integrating with electronic health records (EHRs) and clinical decision support systems, LLM function calling can provide personalized treatment recommendations tailored to individual patients’ medical histories, conditions, and preferences.

Amazon Bedrock supports a variety of foundation models. In this post, we will be exploring how to perform function calling using Mistral models on Amazon Bedrock. Mistral supports function calling, which allows agents to invoke external functions or APIs from within a conversation flow. This capability enables agents to retrieve data, perform calculations, or use external services to enhance their conversational abilities. Function calling in Mistral is achieved through the use of specific function call blocks that define the external function to be invoked and handle the response or output.
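
As a minimal sketch of what this looks like in practice, the Amazon Bedrock Converse API accepts a tool configuration alongside the conversation. The tool name, input schema, and model ID below are illustrative assumptions rather than the configuration deployed in this post.

    import boto3

    bedrock_runtime = boto3.client("bedrock-runtime")

    # Hypothetical tool the agent may call to look up a member's insurance coverage
    tool_config = {
        "tools": [
            {
                "toolSpec": {
                    "name": "get_insurance_coverage",
                    "description": "Look up a member's insurance coverage details by member ID.",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "member_id": {"type": "string", "description": "Unique member identifier"}
                            },
                            "required": ["member_id"],
                        }
                    },
                }
            }
        ]
    }

    response = bedrock_runtime.converse(
        modelId="mistral.mistral-large-2402-v1:0",  # assumption: a Mistral model with tool-use support
        messages=[{
            "role": "user",
            "content": [{"text": "What does my plan cover for physical therapy? My member ID is M-1001."}],
        }],
        toolConfig=tool_config,
    )

    # If the model decides to call the tool, the response contains a toolUse block
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            print(block["toolUse"]["name"], block["toolUse"]["input"])

The application then runs the requested function and returns its result to the model as a toolResult message so the conversation can continue.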

Solution overview

LLM function calling typically involves integrating an LLM model with an external API or function that provides access to additional data sources or computational capabilities. The LLM model acts as an interface, processing natural language inputs and generating responses based on its pre-trained knowledge and the information obtained from the external functions or APIs. The architecture typically consists of the LLM model, a function or API integration layer, and external data sources and services.

Healthcare agents can integrate LLM models and call external functions or APIs through a series of steps: natural language input processing, self-correction, chain of thought, function or API calling through an integration layer, data integration and processing, and persona adoption. The agent receives natural language input, processes it through the LLM model, calls relevant external functions or APIs if additional data or computations are required, combines the LLM model’s output with the external data or results, and provides a comprehensive response to the user.

High-level architecture

Figure: High-level architecture of the healthcare assistant

The architecture for the Healthcare Agent is shown in the preceding figure and is as follows:

  1. Consumers interact with the system through Amazon API Gateway.
  2. An AWS Lambda orchestrator, configured with tool definitions and prompts, coordinates the conversation and invokes the Mistral model on Amazon Bedrock.
  3. Agent function calling allows the agent to invoke Lambda functions to retrieve data, perform computations, or use external services.
  4. Task-specific Lambda functions, such as insurance, claims, and pre-fill functions, handle individual tasks.
  5. Conversation history is persisted, a member database (MemberDB) stores member information, and a knowledge base holds the static documents used by the agent.
  6. AWS CloudTrail, AWS Identity and Access Management (IAM), and Amazon CloudWatch provide auditing, access control, and monitoring.
  7. AWS Glue, Amazon SageMaker, and Amazon Simple Storage Service (Amazon S3) support data processing and storage.

Sample code for function calling with the Mistral LLM can be found in the mistral-on-aws repository.
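
Pending that full sample, the minimal sketch below illustrates steps 2 through 4: the orchestrator calls the Mistral model through the Converse API and, when the model requests a tool, invokes the corresponding Lambda function and returns the result so the model can compose its final answer. The function naming convention, payload format, and model ID are illustrative assumptions rather than the exact code from the repository.

    import json
    import boto3

    bedrock_runtime = boto3.client("bedrock-runtime")
    lambda_client = boto3.client("lambda")

    MODEL_ID = "mistral.mistral-large-2402-v1:0"  # illustrative Mistral model ID

    def handle_turn(messages, tool_config):
        """One orchestration turn: call the model and, if it requests a tool,
        invoke the backing Lambda function and feed the result back to the model."""
        response = bedrock_runtime.converse(
            modelId=MODEL_ID, messages=messages, toolConfig=tool_config
        )
        message = response["output"]["message"]
        messages.append(message)

        for block in message["content"]:
            if "toolUse" in block:  # handles a single tool call per turn for brevity
                tool_use = block["toolUse"]
                # Invoke the Lambda function that backs the requested tool
                # (here the tool name is assumed to match the function name).
                result = lambda_client.invoke(
                    FunctionName=tool_use["name"],
                    Payload=json.dumps(tool_use["input"]),
                )
                payload = json.loads(result["Payload"].read())
                # Return the tool result so the model can compose the final answer.
                messages.append({"role": "user", "content": [{
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"json": payload}],
                    }
                }]})
                return handle_turn(messages, tool_config)

        return message  # no tool call: this is the final response to the consumer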

Security and privacy considerations

Data privacy and security are of utmost importance in the healthcare sector because of the sensitive nature of protected health information (PHI) and the potential consequences of data breaches or unauthorized access. Compliance with regulations such as HIPAA and GDPR is crucial for healthcare organizations handling patient data. To maintain robust data protection and regulatory compliance, healthcare organizations can use Amazon Bedrock Guardrails, a set of configurable safeguards for generative AI applications provided by Amazon Web Services (AWS).

Amazon Bedrock Guardrails, used together with the broader security controls of Amazon Bedrock and AWS, supports a multi-layered approach to data security, including encryption at rest and in transit, access controls, audit logging, contextual grounding checks, and incident response mechanisms. Because Amazon Bedrock is a regional service, organizations can also choose the AWS Regions where their data is stored and processed, helping maintain compliance with local data privacy laws.

When using LLM function calling in the healthcare domain, it’s essential to implement robust security measures and follow best practices for handling sensitive patient information. Amazon Bedrock Guardrails can play a crucial role in this regard by helping to provide a secure foundation for deploying and operating healthcare applications and services that use LLM capabilities.

Some key security measures enabled by Amazon Bedrock Guardrails and the surrounding AWS services are:

  • Data encryption: Patient data processed by LLM functions can be encrypted at rest and in transit, so sensitive information remains protected in the event of unauthorized access or data breaches.
  • Access controls: IAM and Amazon Bedrock enable granular access controls, allowing healthcare organizations to define and enforce strict permissions for who can access, modify, or process patient data through LLM functions.
  • Secure data storage: Patient data can be stored in secure, encrypted storage services such as Amazon S3 or Amazon Elastic File System (Amazon EFS), making sure that sensitive information remains protected at rest.
  • Anonymization and pseudonymization: The sensitive information filters in Amazon Bedrock Guardrails can redact or anonymize personally identifiable information (PII) in model inputs and outputs, helping keep patient data handled by LLM functions free of identifying details.
  • Audit logging and monitoring: AWS CloudTrail and Amazon CloudWatch provide comprehensive audit logging and monitoring, enabling healthcare organizations to track access to and usage of patient data by LLM functions and to detect and respond to potential security incidents in a timely manner.
  • Regular security audits and assessments: Regular security audits and assessments help keep the organization’s data protection measures up to date and effective in the face of evolving security threats and regulatory requirements.

By using Amazon Bedrock Guardrails, healthcare organizations can confidently deploy LLM function calling in their applications and services, maintaining robust data security, privacy protection, and regulatory compliance while enabling the transformative benefits of AI-powered healthcare assistants.
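
To make the anonymization controls above concrete, the minimal sketch below creates a guardrail that anonymizes common PII types and attaches it to a model invocation through the Converse API. The guardrail name, the selected PII entity types, and the model ID are illustrative; a production deployment would tune these to its own compliance requirements.

    import boto3

    bedrock = boto3.client("bedrock")                  # control plane: create the guardrail
    bedrock_runtime = boto3.client("bedrock-runtime")  # data plane: apply it at inference time

    # Illustrative guardrail: anonymize common PII types in model inputs and outputs.
    guardrail = bedrock.create_guardrail(
        name="healthcare-assistant-guardrail",
        description="Masks PII in healthcare assistant conversations.",
        sensitiveInformationPolicyConfig={
            "piiEntitiesConfig": [
                {"type": "NAME", "action": "ANONYMIZE"},
                {"type": "EMAIL", "action": "ANONYMIZE"},
                {"type": "PHONE", "action": "ANONYMIZE"},
            ]
        },
        blockedInputMessaging="Sorry, this request cannot be processed.",
        blockedOutputsMessaging="Sorry, this response cannot be returned.",
    )

    # Attach the guardrail to a model invocation through the Converse API.
    response = bedrock_runtime.converse(
        modelId="mistral.mistral-large-2402-v1:0",
        messages=[{"role": "user", "content": [{"text": "Summarize the latest visit notes."}]}],
        guardrailConfig={
            "guardrailIdentifier": guardrail["guardrailId"],
            "guardrailVersion": "DRAFT",
        },
    )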

Case studies and real-world examples

3M Health Information Systems is collaborating with AWS to accelerate AI innovation in clinical documentation by using AWS machine learning (ML) services, compute power, and LLM capabilities. This collaboration aims to enhance 3M’s natural language processing (NLP) and ambient clinical voice technologies, enabling intelligent healthcare agents to capture and document patient encounters more efficiently and accurately. These agents, powered by LLMs, can understand and process natural language inputs from healthcare providers, such as spoken notes or queries, and use LLM function calling to access and integrate relevant medical data from EHRs, knowledge bases, and other data sources. By combining 3M’s domain expertise with AWS ML and LLM capabilities, the companies can improve clinical documentation workflows, reduce administrative burdens for healthcare providers, and ultimately enhance patient care through more accurate and comprehensive documentation.

GE Healthcare developed Edison, a secure intelligence solution running on AWS, to ingest and analyze data from medical devices and hospital information systems. This solution uses AWS analytics, ML, and Internet of Things (IoT) services to generate insights and analytics that can be delivered through intelligent healthcare agents powered by LLMs. These agents, equipped with LLM function calling capabilities, can seamlessly access and integrate the insights and analytics generated by Edison, enabling them to assist healthcare providers in improving operational efficiency, enhancing patient outcomes, and supporting the development of new smart medical devices. By using LLM function calling to retrieve and process relevant data from Edison, the agents can provide healthcare providers with data-driven recommendations and personalized support, ultimately enabling better patient care and more effective healthcare delivery.

Future trends and developments

Future advancements in LLM function calling for healthcare might include more advanced natural language processing capabilities, such as improved context understanding, multi-turn conversational abilities, and better handling of ambiguity and nuances in medical language. Additionally, the integration of LLM models with other AI technologies, such as computer vision and speech recognition, could enable multimodal interactions and analysis of various medical data formats.

Emerging technologies such as multimodal models, which can process and generate text, images, and other data formats simultaneously, could enhance LLM function calling in healthcare by enabling more comprehensive analysis and visualization of medical data. Personalized language models, trained on individual patient data, could provide even more tailored and accurate responses. Federated learning techniques, which allow model training on decentralized data while preserving privacy, could address data-sharing challenges in healthcare.

These advancements and emerging technologies could shape the future of healthcare agents by making them more intelligent, adaptive, and personalized. Agents could seamlessly integrate multimodal data, such as medical images and lab reports, into their analysis and recommendations. They could also continuously learn and adapt to individual patients’ preferences and health conditions, providing truly personalized care. Additionally, federated learning could enable collaborative model development while maintaining data privacy, fostering innovation and knowledge sharing across healthcare organizations.

Conclusion

LLM function calling has the potential to revolutionize the healthcare industry by enabling intelligent agents that can understand natural language, access and integrate various data sources, and provide personalized recommendations and insights. By combining the language understanding capabilities of LLMs with external data sources and computational resources, healthcare organizations can enhance decision-making, improve operational efficiency, and deliver superior patient experiences. However, addressing data privacy and security concerns is crucial for the successful adoption of this technology in the healthcare domain.

As the healthcare industry continues to embrace digital transformation, we encourage readers to explore and experiment with LLM function calling in their respective domains. By using this technology, healthcare organizations can unlock new possibilities for improving patient care, advancing medical research, and streamlining operations. With a focus on innovation, collaboration, and responsible implementation, the healthcare industry can harness the power of LLM function calling to create a more efficient, personalized, and data-driven future. AWS can help organizations use LLM function calling and build intelligent healthcare assistants through its AI/ML services, including Amazon Bedrock, Amazon Lex, and Lambda, while maintaining robust security and compliance using Amazon Bedrock Guardrails. To learn more, see AWS for Healthcare & Life Sciences.


About the Authors

Laks Sundararajan is a seasoned Enterprise Architect helping companies reset, transform and modernize their IT, digital, cloud, data and insight strategies. A proven leader with significant expertise around Generative AI, Digital, Cloud and Data/Analytics Transformation, Laks is a Sr. Solutions Architect with Healthcare and Life Sciences (HCLS).

Subha Venugopal is a Senior Solutions Architect at AWS with over 15 years of experience in the technology and healthcare sectors. Specializing in digital transformation, platform modernization, and AI/ML, she leads AWS Healthcare and Life Sciences initiatives. Subha is dedicated to enabling equitable healthcare access and is passionate about mentoring the next generation of professionals.

Read More

Explaining Tokens — the Language and Currency of AI

Under the hood of every AI application are algorithms that churn through data in their own language, one based on a vocabulary of tokens.

Tokens are tiny units of data that come from breaking down bigger chunks of information. AI models process tokens to learn the relationships between them and unlock capabilities including prediction, generation and reasoning. The faster tokens can be processed, the faster models can learn and respond.

AI factories — a new class of data centers designed to accelerate AI workloads — efficiently crunch through tokens, converting them from the language of AI to the currency of AI, which is intelligence.

With AI factories, enterprises can take advantage of the latest full-stack computing solutions to process more tokens at lower computational cost, creating additional value for customers. In one case, integrating software optimizations and adopting the latest generation NVIDIA GPUs reduced cost per token by 20x compared to unoptimized processes on previous-generation GPUs — delivering 25x more revenue in just four weeks.

By efficiently processing tokens, AI factories are manufacturing intelligence — the most valuable asset in the new industrial revolution powered by AI.

What Is Tokenization? 

Whether a transformer AI model is processing text, images, audio clips, videos or another modality, it will translate the data into tokens. This process is known as tokenization.

Efficient tokenization helps reduce the amount of computing power required for training and inference. There are numerous tokenization methods — and tokenizers tailored for specific data types and use cases can require a smaller vocabulary, meaning there are fewer tokens to process.

For large language models (LLMs), short words may be represented with a single token, while longer words may be split into two or more tokens.

The word darkness, for example, would be split into two tokens, “dark” and “ness,” with each token bearing a numerical representation, such as 217 and 655. The opposite word, brightness, would similarly be split into “bright” and “ness,” with corresponding numerical representations of 491 and 655.

In this example, the shared numerical value associated with “ness” can help the AI model understand that the words may have something in common. In other situations, a tokenizer may assign different numerical representations for the same word depending on its meaning in context.
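
To see this kind of subword splitting in practice, the short sketch below runs both words through the open-source tiktoken library. The exact pieces and token IDs depend on the tokenizer's vocabulary, so the printed values will differ from the illustrative 217, 491 and 655 above.

    # Requires the open-source tiktoken library (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a general-purpose BPE tokenizer

    for word in ["darkness", "brightness"]:
        token_ids = enc.encode(word)
        pieces = [enc.decode([t]) for t in token_ids]
        print(f"{word!r} -> pieces={pieces}, token_ids={token_ids}")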

For example, the word “lie” could refer to a resting position or to saying something untruthful. During training, the model would learn the distinction between these two meanings and assign them different token numbers.

For visual AI models that process images, video or sensor data, a tokenizer can help map visual inputs like pixels or voxels into a series of discrete tokens.

Models that process audio may turn short clips into spectrograms — visual depictions of sound waves over time that can then be processed as images. Other audio applications may instead focus on capturing the meaning of a sound clip containing speech, and use another kind of tokenizer that captures semantic tokens, which represent language or context data instead of simply acoustic information.

How Are Tokens Used During AI Training?

Training an AI model starts with the tokenization of the training dataset.

Based on the size of the training data, the number of tokens can number in the billions or trillions — and, per the pretraining scaling law, the more tokens used for training, the better the quality of the AI model.

As an AI model is pretrained, it’s tested by being shown a sample set of tokens and asked to predict the next token. Based on whether or not its prediction is correct, the model updates itself to improve its next guess. This process is repeated until the model learns from its mistakes and reaches a target level of accuracy, known as model convergence.
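
As a minimal sketch of this objective, the toy example below shifts a token sequence by one position so that every position is trained to predict the token that follows it. The tiny embedding-plus-linear stand-in for a model, the vocabulary size, and the sequence length are all illustrative and bear no relation to a production LLM.

    import torch
    import torch.nn.functional as F

    # Toy setup: a vocabulary of 1,000 tokens and one training sequence of 8 token IDs.
    vocab_size = 1000
    token_ids = torch.randint(0, vocab_size, (1, 8))       # (batch, sequence)

    # Stand-in "model": an embedding plus a linear output head instead of a real LLM.
    embed = torch.nn.Embedding(vocab_size, 64)
    head = torch.nn.Linear(64, vocab_size)

    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # shift by one position
    logits = head(embed(inputs))                           # (batch, sequence - 1, vocab)

    # Next-token loss: how well each position predicts the token that follows it.
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()  # a real training loop would now take an optimizer step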

After pretraining, models are further improved by post-training, where they continue to learn on a subset of tokens relevant to the use case where they’ll be deployed. These could be tokens with domain-specific information for an application in law, medicine or business — or tokens that help tailor the model to a specific task, like reasoning, chat or translation. The goal is a model that generates the right tokens to deliver a correct response based on a user’s query — a skill better known as inference.

How Are Tokens Used During AI Inference and Reasoning? 

During inference, an AI receives a prompt — which, depending on the model, may be text, image, audio clip, video, sensor data or even gene sequence — that it translates into a series of tokens. The model processes these input tokens, generates its response as tokens and then translates it to the user’s expected format.

Input and output languages can be different, such as in a model that translates English to Japanese, or one that converts text prompts into images.

To understand a complete prompt, AI models must be able to process multiple tokens at once. Many models have a specified limit, referred to as a context window — and different use cases require different context window sizes.

A model that can process a few thousand tokens at once might be able to process a single high-resolution image or a few pages of text. With a context length of tens of thousands of tokens, another model might be able to summarize a whole novel or an hourlong podcast episode. Some models even provide context lengths of a million or more tokens, allowing users to input massive data sources for the AI to analyze.

Reasoning AI models, the latest advancement in LLMs, can tackle more complex queries by treating tokens differently than before. Here, in addition to input and output tokens, the model generates a host of reasoning tokens over minutes or hours as it thinks about how to solve a given problem.

These reasoning tokens allow for better responses to complex questions, just like how a person can formulate a better answer given time to work through a problem. The corresponding increase in tokens per prompt can require over 100x more compute compared with a single inference pass on a traditional LLM — an example of test-time scaling, aka long thinking.

How Do Tokens Drive AI Economics? 

During pretraining and post-training, tokens equate to investment into intelligence, and during inference, they drive cost and revenue. So as AI applications proliferate, new principles of AI economics are emerging.

AI factories are built to sustain high-volume inference, manufacturing intelligence for users by turning tokens into monetizable insights. That’s why a growing number of AI services are measuring the value of their products based on the number of tokens consumed and generated, offering pricing plans based on a model’s rates of token input and output.

Some token pricing plans offer users a set number of tokens shared between input and output. Based on these token limits, a customer could use a short text prompt that uses just a few tokens for the input to generate a lengthy, AI-generated response that took thousands of tokens as the output. Or a user could spend the majority of their tokens on input, providing an AI model with a set of documents to summarize into a few bullet points.
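
As a back-of-the-envelope sketch, the snippet below compares the cost of an output-heavy request with that of an input-heavy request under a simple token-metered plan; the per-token rates are made up for illustration and do not reflect any provider's actual pricing.

    # Hypothetical pay-per-token plan; the rates are illustrative only.
    PRICE_PER_1K_INPUT_TOKENS = 0.003   # USD
    PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD

    def request_cost(input_tokens: int, output_tokens: int) -> float:
        """Cost of a single request under a simple token-metered plan."""
        return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
             + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

    # A short prompt with a long answer vs. a long document summarized into bullets.
    print(request_cost(input_tokens=50, output_tokens=4000))    # output-heavy request
    print(request_cost(input_tokens=20000, output_tokens=300))  # input-heavy request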

To serve a high volume of concurrent users, some AI services also set token limits, the maximum number of tokens per minute generated for an individual user.

Tokens also define the user experience for AI services. Time to first token, the latency between a user submitting a prompt and the AI model starting to respond, and inter-token or token-to-token latency, the rate at which subsequent output tokens are generated, determine how an end user experiences the output of an AI application.
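
As a rough rule of thumb, the total time a user waits is approximately the time to first token plus the inter-token latency multiplied by the number of remaining output tokens, as in the hypothetical sketch below.

    # Hypothetical latency figures, for illustration only.
    time_to_first_token = 0.4    # seconds until the first output token appears
    inter_token_latency = 0.02   # seconds between subsequent output tokens
    output_tokens = 500

    # Rough end-to-end generation time as experienced by the user.
    total_time = time_to_first_token + (output_tokens - 1) * inter_token_latency
    print(f"{total_time:.1f} s total, about {output_tokens / total_time:.0f} tokens/s")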

There are tradeoffs involved for each metric, and the right balance is dictated by use case.

For LLM-based chatbots, shortening the time to first token can help improve user engagement by maintaining a conversational pace without unnatural pauses. Optimizing inter-token latency can enable text generation models to match the reading speed of an average person, or video generation models to achieve a desired frame rate. For AI models engaging in long thinking and research, more emphasis is placed on generating high-quality tokens, even if it adds latency.

Developers have to strike a balance between these metrics to deliver high-quality user experiences with optimal throughput, the number of tokens an AI factory can generate.

To address these challenges, the NVIDIA AI platform offers a vast collection of software, microservices and blueprints alongside powerful accelerated computing infrastructure — a flexible, full-stack solution that enables enterprises to evolve, optimize and scale AI factories to generate the next wave of intelligence across industries.

Understanding how to optimize token usage across different tasks can help developers, enterprises and even end users reap the most value from their AI applications.

Learn more in this ebook and get started at build.nvidia.com.

Read More