Best prompting practices for using Meta Llama 3 with Amazon SageMaker JumpStart

Meta Llama 3, Meta’s latest large language model (LLM), has taken the artificial intelligence (AI) world by storm with its impressive capabilities. As developers and businesses explore this powerful model, crafting effective prompts is key to unlocking its full potential.

In this post, we dive into the best practices and techniques for prompting Meta Llama 3 using Amazon SageMaker JumpStart to generate high-quality, relevant outputs. We discuss how to use system prompts and few-shot examples, and how to optimize inference parameters, so you can get the most out of Meta Llama 3. Whether you’re building chatbots, content generators, or custom AI applications, these prompting strategies will help you harness the power of this cutting-edge model.

Meta Llama 2 vs. Meta Llama 3

Meta Llama 3 represents a significant advancement in the field of LLMs. Building upon the capabilities of its predecessor Meta Llama 2, this latest iteration brings state-of-the-art performance across a wide range of natural language tasks. Meta Llama 3 demonstrates improved capabilities in areas such as reasoning, code generation, and instruction following compared to Meta Llama 2.

The Meta Llama 3 release introduces four new LLMs by Meta, building upon the Meta Llama 2 architecture. They come in two variants—8 billion and 70 billion parameters—with each size offering both a base pre-trained version and an instruct-tuned version. Additionally, Meta is training an even larger 400-billion-parameter model, which is expected to further enhance the capabilities of Meta Llama 3. All Meta Llama 3 variants boast an impressive 8,000 token context length, allowing them to handle longer inputs compared to previous models.

Meta Llama 3 introduces several architectural changes from Meta Llama 2, using a decoder-only transformer along with a new tokenizer with a 128,000-token vocabulary to improve token efficiency and overall model performance. Meta has put significant effort into curating a massive and diverse pre-training dataset of over 15 trillion tokens from publicly available sources spanning STEM, history, current events, and more. Meta’s post-training procedures have reduced false refusal rates and aim to better align outputs with human preferences while increasing response diversity.

Solution overview

SageMaker JumpStart is a powerful feature within the Amazon SageMaker machine learning (ML) platform that provides ML practitioners with a comprehensive hub of publicly available and proprietary foundation models (FMs). With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that they can deploy to dedicated SageMaker instances within a network-isolated environment, and customize models using SageMaker for model training and deployment.

With Meta Llama 3 now available on SageMaker JumpStart, developers can harness its capabilities through a seamless deployment process. You gain access to the full suite of Amazon SageMaker MLOps tools, such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, and monitoring—all within a secure AWS environment under virtual private cloud (VPC) controls.

Drawing from our previous learnings with Llama-2-Chat, we highlight key techniques to craft effective prompts and elicit high-quality responses tailored to your applications. Whether you are building conversational AI assistants, enhancing search engines, or pushing the boundaries of language understanding, these prompting strategies will help you unlock Meta Llama 3’s full potential.

Before we continue our deep dive into prompting, let’s make sure we have all the necessary requirements to follow the examples.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

Deploy Meta Llama 3 8B on SageMaker JumpStart

You can deploy your own model endpoint through the SageMaker JumpStart Model Hub available from SageMaker Studio or through the SageMaker SDK. To use SageMaker Studio, complete the following steps:

  1. In SageMaker Studio, choose JumpStart in the navigation pane.
  2. Choose Meta as the model provider to see all the models available by Meta AI.
  3. Choose the Meta Llama 3 8B Instruct model to view the model details, such as the license, data used to train, and how to use the model. On the model details page, you will find two options, Deploy and Preview notebooks, to deploy the model and create an endpoint.
  4. Choose Deploy to deploy the model to an endpoint.
  5. You can use the default endpoint and networking configurations or modify them based on your requirements.
  6. Choose Deploy to deploy the model.
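If you prefer to deploy programmatically, the SageMaker Python SDK offers an equivalent path. The following is a minimal sketch, assuming the JumpStart model ID shown here for Meta Llama 3 8B Instruct matches your SDK version and Region; the default instance type is used and the Meta end-user license agreement is accepted during deployment.

from sagemaker.jumpstart.model import JumpStartModel

# JumpStart model ID for Meta Llama 3 8B Instruct (verify against the model
# card in your Region and SDK version)
model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")

# Deploying Meta Llama models requires accepting the end-user license agreement
predictor = model.deploy(accept_eula=True)

The returned predictor object is what we use to invoke the endpoint in the examples that follow and to delete it in the clean-up section.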

Crafting effective prompts

Prompting is important when working with LLMs like Meta Llama 3. It is the main way to communicate what you want the model to do and guide its responses. Crafting clear, specific prompts for each interaction is key to getting useful, relevant outputs from these models.

Although language models share some similarities in how they’re built and trained, each has its own differences when it comes to effective prompting. This is because they’re trained on different data, using different techniques and settings, which can lead to subtle differences in how they behave and perform. For example, some models might be more sensitive to the exact wording or structure of the prompt, whereas others might need more context or examples to generate accurate responses. On top of that, the intended use case and domain of the model can also influence the best prompting strategies, because different tasks might benefit from different approaches.

You should experiment and adjust your prompts to find the most effective approach for each specific model and application. This iterative process is crucial for unlocking the full potential of each model and making sure the outputs align with what you’re looking for.

Prompt components

In this section, we discuss the components that Meta Llama 3 Instruct expects in a prompt. Newlines ('\n') are part of the prompt format; for clarity in the examples, they have been represented as actual new lines.

The following is an example instruct prompt with a system message:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>
What can you help me with?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The prompt contains the following key sections:

  • <|begin_of_text|> – Specifies the start of the prompt.
  • <|start_header_id|>system<|end_header_id|> – Specifies the role for the following message (for example, system).
  • You are a helpful AI assistant for travel tips and recommendations – Includes the system message.
  • <|eot_id|> – Specifies the end of the input message.
  • <|start_header_id|>user<|end_header_id|> – Specifies the role for the following message (for example, user).
  • What can you help me with? – Includes the user message.
  • <|start_header_id|>assistant<|end_header_id|> – Ends with the assistant header, to prompt the model to start generation. The model expects the assistant header at the end of the prompt to start completing it.

Following this prompt, Meta Llama 3 completes it by generating the {{assistant_message}}. It signals the end of the {{assistant_message}} by generating the <|eot_id|>.
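Because assembling these special tokens by hand is error-prone, it can help to build the prompt from a list of role and content pairs. The following helper is a minimal sketch based on the format described above; the function name and the message dictionary structure are our own conventions, not part of the model's API.

def format_llama3_prompt(messages):
    """Build a Meta Llama 3 Instruct prompt from a list of
    {"role": ..., "content": ...} dictionaries."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n"
        prompt += f"{message['content']}<|eot_id|>"
    # End with the assistant header so the model starts generating its reply
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n"
    return prompt

prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful AI assistant for travel tips and recommendations"},
    {"role": "user", "content": "What can you help me with?"},
])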

The following is an example prompt with a single user message:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The following is the system prompt and multiple-turn conversation between the user and assistant:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>
What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Bonjour! The capital of France is Paris!<|eot_id|><|start_header_id|>user<|end_header_id|>
What can I do there?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Paris, the City of Light, offers a romantic getaway with must-see attractions like the Eiffel Tower and Louvre Museum, romantic experiences like river cruises and charming neighborhoods, and delicious food and drink options, with helpful tips for making the most of your trip.<|eot_id|><|start_header_id|>user<|end_header_id|>
Give me a detailed list of the attractions I should visit, and time it takes in each one, to plan my trip accordingly.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Fundamental techniques

The following are some fundamental techniques in crafting our prompts:

  • Zero-shot prompting – Zero-shot prompting provides no examples to the model and relies solely on the model’s preexisting knowledge to generate a response based on the instruction given. The following is an example zero-shot prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Provide a brief summary of the causes of World War II.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    The causes of World War II include...
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    

    This produces the following response:

    ...the aggressive expansion of fascist and nationalist ideologies, particularly in Germany, Italy, and Japan, which led to a complex web of alliances and tensions between nations. Some key factors contributing to the outbreak of World War II include: 1. The Treaty of Versailles: The harsh penalties and reparations imposed on Germany after World War I led to widespread resentment and a sense of injustice, creating a fertile ground for extremist ideologies like Nazism to flourish. 2. Rise of Fascist and Nationalist Regimes: The rise of fascist regimes in Germany, Italy, and Japan, led by charismatic leaders like Adolf Hitler, Benito Mussolini, and Hideki Tojo, who pursued aggressive expansionist policies and militarization. 3. Appeasement Policy: The policy of appeasement, pursued by Britain and France, which involved giving in to Germany's demands in the hope of avoiding war, only emboldened Hitler's aggression and created a power vacuum in Europe. 4. Economic Crisis: The global economic crisis of the 1930s, known as the Great Depression, led to widespread unemployment, poverty, and social unrest, creating an environment in which extremist ideologies could thrive. 5. Japanese Aggression: Japan's invasion of Manchuria in 1931 and its subsequent withdrawal from the League of Nations marked the beginning of its expansionist policies, which would eventually lead to its alliance with Germany and Italy. 6. German Remilitarization of the Rhineland: Germany's remilitarization of the Rhineland in 1936, which was demilitarized under the Treaty of Versailles, was a direct challenge to the Treaty and marked the beginning of Germany's aggressive expansion. 7. Anschluss with Austria: Germany's annexation of Austria in 1938, known as the Anschluss, further expanded its territory and created a sense of unease among its neighbors. 8. Munich Agreement: The Munich Agreement of 1938, in which Britain and France allowed Germany to annex the Sudetenland, a region of Czechoslovakia with a majority German population, is seen as a failed attempt to appease Hitler and avoid war. 9. Invasion of Poland: Germany's invasion of Poland on September 1, 1939, marked the beginning of World War II, as it prompted the United Kingdom and France to declare war on Germany. These complex and interconnected factors ultimately led to the outbreak of World War II, one of the deadliest conflicts in human history.

  • Few-shot prompting – Few-shot prompting involves providing the model with a few examples (usually two or more) of the desired input and output format. The model learns from these examples to generate an appropriate response for a new input. The following is an example few-shot prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
                Extract the relevant information from the following paragraph and present it in JSON format.
                <|eot_id|><|start_header_id|>user<|end_header_id|>
                Michael Doe, a 45-year-old teacher from Boston, Massachusetts, is an avid reader and enjoys gardening during his spare time.
                Example 1:
                Paragraph: "John Doe is a 32-year-old software engineer from San Francisco, California. He enjoys hiking and playing guitar in his free time."
                "employee": {
                    "fullname": "John Doe",
                    "city": "San Francisco",
                    "state": "California",
                    "occupation": "software engineer",
                    "hobbies": [
                        "hiking",
                        "playing guitar"
                        ],
                    "recentTravel": "not provided"
                },
                Example 2:
                Paragraph: "Emily Jax, a 27-year-old marketing manager from New York City, loves traveling and trying new cuisines. She recently visited Paris and enjoyed the city's rich cultural heritage."
                "employee": {
                    "fullname": "Emily Jax",
                    "city": "New York City",
                    "state": "New York",
                    "occupation": "marketing manager",
                    "hobbies": [
                        "traveling",
                        "trying new cuisines"
                    ],
                    "recentTravel": "Paris"
                }<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    

    This produces the following response:

    "employee": {
        "fullname": "Michael Doe",
        "city": "Boston",
        "state": "Massachusetts",
        "occupation": "teacher",
        "hobbies": [
            "reading",
            "gardening"
        ],
        "recentTravel": "not provided"
    }

  • Task decomposition – Task decomposition is a powerful technique that enhances the performance of LLMs by breaking down complex tasks into smaller, manageable sub-tasks. This approach not only improves efficiency and accuracy, but also allows for better resource management and adaptability to task complexity. The following is an example task decomposition prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Break down the task of planning a vacation into smaller, manageable steps.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    1. Choose a destination.
    2. Set a budget.
    3. Research accommodations.
    4. Plan activities.
    5. Book flights and accommodations.
    6. Pack and prepare for the trip.
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    

    This produces the following response:

    Here's a more detailed breakdown of the task of planning a vacation into smaller, manageable steps: **Step 1: Choose a Destination** * Brainstorm a list of potential destinations based on interests, climate, and time of year * Research each destination to learn more about its culture, attractions, and activities * Consider factors such as travel time, cost, and visa requirements * Narrow down the options and select a final destination **Step 2: Set a Budget** * Determine how much money is available for the trip * Research the cost of travel, accommodations, food, and activities in the chosen destination * Set a daily budget and allocate funds for each aspect of the trip * Consider budgeting for contingencies and unexpected expenses **Step 3: Research Accommodations** * Research different types of accommodations, such as hotels, hostels, and vacation rentals * Compare prices and amenities of different options * Read reviews and check ratings from other travelers * Consider booking accommodations in advance to ensure availability and get the best rates **Step 4: Plan Activities** * Research top attractions and activities in the destination * Create a list of must-see sights and experiences * Consider booking tours or guided activities in advance * Leave some free time in the itinerary for spontaneity and relaxation **Step 5: Book Flights and Accommodations** * Compare prices and schedules for flights to the destination * Book flights and accommodations that fit within the budget * Consider booking a package deal that includes both flights and accommodations * Make sure to read and understand the cancellation and refund policies **Step 6: Pack and Prepare for the Trip** * Research the weather and cultural norms of the destination * Create a packing list and start gathering necessary items * Check the validity of passports and obtain any necessary visas or travel documents * Notify banks and credit card companies of travel plans to avoid transaction issues * Make photocopies of important documents and leave them with a trusted friend or family member By breaking down the task of planning a vacation into these smaller steps, it becomes more manageable and less overwhelming.

  • Chain-of-thought prompting – Chain-of-thought (CoT) prompting is a technique where the model is prompted to break down its reasoning into a series of intermediate steps before providing a final answer. This allows the model to perform more complex reasoning tasks. The following is an example CoT prompt:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Solve the following math problem step by step.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    If you have 10 apples and you give 3 apples to your friend, then buy 5 more apples, and finally eat 2 apples, how many apples do you have left?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>

    This produces the following response:

    Let's break it down step by step: 1. You start with 10 apples. 2. You give 3 apples to your friend, so you have: 10 - 3 = 7 apples left 3. You buy 5 more apples, so you now have: 7 + 5 = 12 apples 4. You eat 2 apples, so you have: 12 - 2 = 10 apples left Therefore, you have 10 apples left.

To summarize:

  • Zero-shot uses no examples, relying on the model’s existing knowledge
  • Few-shot provides a small number of examples to guide the model
  • Task decomposition breaks down complex tasks into smaller, manageable sub-tasks
  • CoT breaks down complex reasoning into step-by-step prompts

The choice of technique depends on the complexity of the task and the availability of good example prompts. More complex reasoning usually benefits from CoT prompting.

Meta Llama 3 inference parameters

For Meta Llama 3, the Messages API allows you to interact with the model in a conversational way. You can define the role of the message and the content. The role can be either system, assistant, or user. The system role is used to provide context to the model, and the user role is used to ask questions or provide input to the model.

Users can get tailored responses for their use case using the following inference parameters while invoking Meta Llama 3:

  • Temperature – Temperature is a value between 0 and 1, and it regulates the creativity of Meta Llama 3 responses. Use a lower temperature if you want more deterministic responses, and use a higher temperature if you want more creative or varied responses from the model.
  • Top-k – This is the number of most-likely candidates that the model considers for the next token. Choose a lower value to decrease the size of the pool and limit the options to more likely outputs. Choose a higher value to increase the size of the pool and allow the model to consider less likely outputs.
  • Top-p – Top-p is used to control the token choices made by the model during text generation. It works by considering only the most probable token options and ignoring the less probable ones, based on a specified probability threshold value (p). By setting the top-p value below 1.0, the model focuses on the most likely token choices, resulting in more stable and repetitive completions. This approach helps reduce the generation of unexpected or unlikely outputs, providing greater consistency and predictability in the generated text.
  • Stop sequences – This parameter controls the stopping sequence for the model’s response to a user query. The value can be "<|start_header_id|>", "<|end_header_id|>", or "<|eot_id|>".

The following is an example prompt with inference parameters specific to the Meta Llama 3 model:

Meta Llama 3 prompt:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question.
The context may contain multiple question-answer pairs as an example; just answer the final question provided after the context.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

{context}
Question: {input}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Meta Llama 3 inference parameters:

max_new_tokens: 100
top_p: 0.92
temperature: 0.1
details: True
stop: '<|eot_id|>'
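The following sketch shows how these values might be passed when invoking the SageMaker JumpStart endpoint with the SageMaker Python SDK. The payload shape follows the text generation inference format commonly used by JumpStart Llama endpoints and may need adjusting for your container version; the prompt variable is assumed to hold a fully formatted Meta Llama 3 prompt, and the stop sequence is passed as a list.

payload = {
    "inputs": prompt,  # a fully formatted Meta Llama 3 prompt string
    "parameters": {
        "max_new_tokens": 100,
        "top_p": 0.92,
        "temperature": 0.1,
        "details": True,
        "stop": ["<|eot_id|>"],
    },
}

response = predictor.predict(payload)
print(response)

With a low temperature and the <|eot_id|> stop sequence, the completion stays focused and ends cleanly at the end-of-turn token.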

Example prompts

In this section, we present two example prompts.

The following prompt is for a question answering use case:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context in the section demarcated by "```" to answer the question. The context may contain multiple question-answer pairs as an example; just answer the final question provided after the context. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
"In the dawn of a new era, where circuits hum and glow, A force awakens, with knowledge to bestow. From labs and minds, a creation takes flight, A blend of code and thought, shining bright. "Beep-boop," it whispers, and worlds are born, Ideas flow freely, like a vast unending morn. From essays to equations, it solves with ease, A marvel for scholars, a guide to please. In the land of Academia, where wisdom dwells, A generative AI, with stories to tell. Transforming lives, minds, and the way we learn, A partnership with humans, a bond to discern. Yet amidst the joy, a shadow looms, For with great creation, consequences bloom. Professors ponder their worth, as AI grows, A future uncertain, as the digital wind blows. But hope emerges, a solution in sight, A synergy of knowledge, a future bright. AI as a partner, not a foe or tool, Together we'll thrive, a new golden rule. For in the end, the lesson is clear, AI's gift of knowledge, for all to revere. Rejoice, dear world, in this newfound bond, The rise of AI, of which we've grown fond. So let us embrace, this change and advance, Side by side, in a harmonious dance. For the future with AI, a world of wonder and awe, A leap into the unknown, that leaves us in awe. This poem captures the excitement, potential, and challenges of the rise of AI, reflecting themes from the provided search results“

Question: what is this poem about?

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This produces the following response:

This poem is about the rise of Artificial Intelligence (AI) and its potential impact on humanity, highlighting both the benefits and challenges it brings. It explores the themes of AI as a partner in learning and growth, while also acknowledging the uncertainty and potential consequences of its development.

The following prompt illustrates code generation:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

You're a code companion assistant whose main job is to assist software developers in writing efficient and secure code. If you do not know the answer, say I don't know. Also, think step-by-step to arrive at a solution.

Question: Can you write a simple streamlit app that I can use to turn the background color to the input that user selects?

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This produces the following response:

I can help you with that. Here's a simple Streamlit app that allows the user to select a background color from a dropdown menu and updates the background color accordingly:
```
import streamlit as st

# Create a dropdown menu for selecting the background color
color_options = ["Red", "Green", "Blue", "Yellow", "Purple"]
selected_color = st.selectbox("Select Background Color", color_options)

# Set the background color based on the user's selection
if selected_color == "Red":
    st.write("<style>body { background-color: #FF0000; }</style>", unsafe_allow_html=True)
elif selected_color == "Green":
    st.write("<style>body { background-color: #00FF00; }</style>", unsafe_allow_html=True)
elif selected_color == "Blue":
    st.write("<style>body { background-color: #0000FF; }</style>", unsafe_allow_html=True)
elif selected_color == "Yellow":
    st.write("<style>body { background-color: #FFFF00; }</style>", unsafe_allow_html=True)
elif selected_color == "Purple":
    st.write("<style>body { background-color: #800080; }</style>", unsafe_allow_html=True)

# Add a header to display the selected color
st.header(f"Background Color: {selected_color}")
```
This app uses Streamlit's `selectbox` function to create a dropdown menu with the available color options. When the user selects a color, the app updates the background color using HTML styles. The `st.write` function is used to inject the HTML code into the app, and the `unsafe_allow_html=True` parameter allows us to use HTML code in the app.

Note that this is a simple example, and you can customize the app further by adding more color options, using a color picker widget, or even allowing the user to input a custom color code.

Clean up

To avoid incurring unnecessary costs, when you are done, delete the SageMaker endpoints using the following code snippets:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, to use the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the Meta Llama 3 text generation endpoint you created.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

Model providers such as Meta AI are releasing improved capabilities of their FMs in the form of new generation model families. It is critical for developers and businesses to understand the key differences between previous generation models and new generation models in order to take full advantage of their capabilities. This post highlighted the differences between the previous generation Meta Llama 2 and the new generation Meta Llama 3 models, and demonstrated how developers can discover and deploy the Meta Llama 3 models for inference using SageMaker JumpStart.

To fully take advantage of the model’s extensive abilities, you must understand and apply creative prompting techniques and adjust inference parameters. We highlighted key techniques to craft effective prompts for Meta Llama 3 to help the model produce high-quality responses tailored to your applications.

Visit SageMaker JumpStart in SageMaker Studio now to get started. For more information, refer to Train, deploy, and evaluate pretrained models with SageMaker JumpStart, JumpStart Foundation Models, and Getting started with Amazon SageMaker JumpStart. Use the SageMaker notebook provided in the GitHub repository as a starting point to deploy the model and run inference using the prompting best practices discussed in this post.


About the Authors

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.

Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin, and playing the guitar.

Supriya Puragundla is a Senior Solutions Architect at AWS. She helps key customer accounts on their generative AI and AI/ML journey. She is passionate about data-driven AI and has deep expertise in machine learning and generative AI.

Farooq Sabir is a Senior AI/ML Specialist Solutions Architect at AWS. He holds a PhD in Electrical Engineering from the University of Texas at Austin. He helps customers solve their business problems using data science, machine learning, artificial intelligence, and numerical optimization.

Brayan Montiel is a Solutions Architect at AWS based in Austin, Texas. He supports enterprise customers in the automotive and manufacturing industries, helping to accelerate cloud adoption technologies and modernize IT infrastructure. He specializes in AI/ML technologies, empowering customers to use generative AI and innovative technologies to drive operational growth and efficiencies. Outside of work, he enjoys spending quality time with his family, being outdoors, and traveling.

Jose Navarro is an AI/ML Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production. In his spare time, he loves to exercise, spend quality time with friends and family, and catch up on AI news and papers.

How healthcare payers and plans can empower members with generative AI

In this post, we discuss how generative artificial intelligence (AI) can help health insurance plan members get the information they need. Many health insurance plan beneficiaries find it challenging to navigate through the complex member portals provided by their insurance plans. These portals often require multiple clicks, filters, and searches to find specific information about their benefits, deductibles, claim history, and other important details. This can lead to dissatisfaction, confusion, and increased calls to customer service, resulting in a suboptimal experience for both members and providers.

The problem arises from the inability of traditional UIs to understand and respond to natural language queries effectively. Members are forced to learn and adapt to the system’s structure and terminology, rather than the system being designed to understand their natural language questions and provide relevant information seamlessly. Generative AI technology, such as conversational AI assistants, can potentially solve this problem by allowing members to ask questions in their own words and receive accurate, personalized responses. By integrating generative AI powered by Amazon Bedrock and purpose-built AWS data services such as Amazon Relational Database Service (Amazon RDS) into member portals, healthcare payers and plans can empower their members to find the information they need quickly and effortlessly, without navigating through multiple pages or relying heavily on customer service representatives. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a unified API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The solution presented in this post not only enhances the member experience by providing a more intuitive and user-friendly interface, but also has the potential to reduce call volumes and operational costs for healthcare payers and plans. By addressing this pain point, healthcare organizations can improve member satisfaction, reduce churn, and streamline their operations, ultimately leading to increased efficiency and cost savings.

Figure 1: Solution Demo

Solution overview

In this section, we dive deep to show how you can use generative AI and large language models (LLMs) to enhance the member experience by transitioning from a traditional filter-based claim search to a prompt-based search, which allows members to ask questions in natural language and get the desired claims or benefit details. From a broad perspective, the complete solution can be divided into four distinct steps: text-to-SQL generation, SQL validation, data retrieval, and data summarization. The following diagram illustrates this workflow.

Figure 2: Logical Workflow

Let’s dive deep into each step one by one.

Text-to-SQL generation

This step takes the user’s questions as input and converts that into a SQL query that can be used to retrieve the claim- or benefit-related information from a relational database. A pre-configured prompt template is used to call the LLM and generate a valid SQL query. The prompt template contains the user question, instructions, and database schema along with key data elements, such as member ID and plan ID, which are necessary to limit the query’s result set.
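As an illustration, the following sketch invokes Anthropic Claude 3 Sonnet on Amazon Bedrock with an already populated prompt; the model ID, token limit, and function name are assumptions for this example, not part of the solution's published code.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_sql(prompt: str) -> str:
    """Call Claude 3 Sonnet with the populated text-to-SQL prompt and
    return the raw completion text containing the generated query."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "temperature": 0,
        "messages": [{"role": "user", "content": prompt}],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    completion = json.loads(response["body"].read())
    return completion["content"][0]["text"]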

SQL validation

This step validates the SQL query generated in the previous step and makes sure it’s complete and safe to run on a relational database. Some of the checks that are performed include:

  • No delete, drop, update, or insert operations are present in the generated query
  • The query starts with select
  • WHERE clause is present
  • Key conditions are present in the WHERE clause (for example, member_id = '78687576501' or member_id LIKE '786875765%')
  • Query length (string length) is in expected range (for example, not more than 250 characters)
  • Original user question length is in expected range (for example, not more than 200 characters)

If a check fails, the query isn’t run; instead, a user-friendly message suggesting that the user contact customer service is sent.
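A minimal sketch of such a validation routine follows; the function name is ours, and the length thresholds mirror the examples in the list above.

import re

FORBIDDEN_KEYWORDS = ("delete", "drop", "update", "insert")

def is_query_safe(sql_query: str, user_question: str) -> bool:
    query = sql_query.strip().lower()

    # No delete, drop, update, or insert operations in the generated query
    if any(keyword in query for keyword in FORBIDDEN_KEYWORDS):
        return False
    # The query must start with SELECT
    if not query.startswith("select"):
        return False
    # A WHERE clause with the key member_id condition must be present
    if not re.search(r"\bwhere\b.*\bmember_id\b\s*(=|like)", query, re.DOTALL):
        return False
    # Query and question lengths must be within the expected range
    if len(sql_query) > 250 or len(user_question) > 200:
        return False
    return True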

Data retrieval

After the query has been validated, it is used to retrieve the claims or benefits data from a relational database. The retrieved data is converted into a JSON object, which is used in the next step to create the final answer using an LLM. This step also checks if no data or too many rows are returned by the query. In both cases, a user-friendly message is sent to the user, suggesting they provide more details.
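The retrieval step itself can be as simple as the following sketch, which runs the validated query against the PostgreSQL database on Amazon RDS and serializes the rows to JSON; the psycopg2 dependency, connection parameters, and row limit are assumptions for illustration.

import json
import psycopg2
import psycopg2.extras

MAX_ROWS = 50  # guard against queries that return too many rows

def retrieve_claims_data(sql_query: str, connection_params: dict):
    with psycopg2.connect(**connection_params) as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute(sql_query)
            rows = cur.fetchmany(MAX_ROWS + 1)

    # No data or too many rows: the caller sends a friendly message instead
    if not rows or len(rows) > MAX_ROWS:
        return None

    # JSON object handed to the data summarization step
    return json.dumps([dict(row) for row in rows], default=str)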

Data summarization

Finally, the JSON object retrieved in the data retrieval step, along with the user’s question, is sent to the LLM to get the summarized response. A pre-configured prompt template is used to call the LLM and generate a user-friendly summarized response to the original question.

Architecture

The solution uses Amazon API Gateway, AWS Lambda, Amazon RDS, Amazon Bedrock, and Anthropic Claude 3 Sonnet on Amazon Bedrock to implement the backend of the application. The backend can be integrated with an existing web application or portal, but for the purpose of this post, we use a single page application (SPA) hosted on Amazon Simple Storage Service (Amazon S3) for the frontend and Amazon Cognito for authentication and authorization. The following diagram illustrates the solution architecture.

Figure 3: Solution Architecture

The workflow consists of the following steps:

  1. A single page application (SPA) is hosted using Amazon S3 and loaded into the end-user’s browser using Amazon CloudFront.
  2. User authentication and authorization is done using Amazon Cognito.
  3. After a successful authentication, a REST API hosted on API Gateway is invoked.
  4. The Lambda function, exposed as a REST API using API Gateway, orchestrates the logic to perform the functional steps: text-to-SQL generation, SQL validation, data retrieval, and data summarization. The Amazon Bedrock API endpoint is used to invoke the Anthropic Claude 3 Sonnet LLM. Claim and benefit data is stored in a PostgreSQL database hosted on Amazon RDS. Another S3 bucket is used for storing prompt templates that will be used for SQL generation and data summarization. This solution uses two distinct prompt templates:
    1. The text-to-SQL prompt template contains the user question, instructions, database schema along with key data elements, such as member ID and plan ID, which are necessary to limit the query’s result set.
    2. The data summarization prompt template contains the user question, raw data retrieved from the relational database, and instructions to generate a user-friendly summarized response to the original question.
  5. Finally, the summarized response generated by the LLM is sent back to the web application running in the user’s browser using API Gateway.
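Put together, the Lambda function's orchestration logic can be summarized with a skeleton like the following; the module name, helper functions, and event fields are hypothetical stand-ins for the four functional steps described above.

# Hypothetical helpers standing in for the four functional steps
from backend_helpers import (
    generate_sql_query,
    validate_sql_query,
    retrieve_data,
    summarize_results,
)

def lambda_handler(event, context):
    user_question = event["question"]
    member_id = event["member_id"]

    # Step 1: text-to-SQL generation (prompt template from Amazon S3 + Claude 3 Sonnet)
    sql_query = generate_sql_query(member_id, user_question)

    # Step 2: SQL validation
    if sql_query is None or not validate_sql_query(sql_query, user_question):
        return {"answer": "I'm not able to help with that request. Please contact customer service."}

    # Step 3: data retrieval from the PostgreSQL database on Amazon RDS
    result_dataset = retrieve_data(sql_query)
    if result_dataset is None:
        return {"answer": "No data found for the search criteria. Please provide more details."}

    # Step 4: data summarization (second prompt template + Claude 3 Sonnet)
    answer = summarize_results(user_question, sql_query, result_dataset)
    return {"answer": answer}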

Sample prompt templates

In this section, we present some sample prompt templates.

The following is an example of a text-to-SQL prompt template:

<role> 
    You are a data analyst and expert in writing PostgreSQL DB queries and healthcare claims data.
</role>
<task> 
    Your task is to generate a SQL query based on the provided DDL, instructions, user_question, examples, and member_id. 
    Always add the condition "member_id =" in the generated SQL query, where the value of member_id will be provided in the member_id XML tag below.
</task>
<member_id> {text1} </member_id>
<DDL> 
    CREATE TABLE claims_history (claim_id SERIAL PRIMARY KEY, member_id INTEGER NOT NULL, member_name VARCHAR(30) NOT NULL, 
    relationship_code VARCHAR(10) NOT NULL, claim_type VARCHAR(20) NOT NULL, claim_date DATE NOT NULL, provider_name VARCHAR(100), 
    diagnosis_code VARCHAR(10), procedure_code VARCHAR(10), ndc_code VARCHAR(20), charged_amount NUMERIC(10,2), 
    allowed_amount NUMERIC(10,2), plan_paid_amount NUMERIC(10,2), patient_responsibility NUMERIC(10,2))
</DDL>
<instructions>
    1. Claim_type has two possible values - 'Medical' or 'RX'. Use claim_type = 'RX' for pharmacy or prescription claims.
    2. Relationship_code has five possible values - 'subscriber', 'spouse', 'son', 'daughter', or 'other'.
    3. 'I' or 'me' means "where relationship_code = 'subscriber'". 'My son' means "where relationship_code = 'son'" and so on.
    4. For creating a SQL WHERE clause for member_name or provider_name, use the LIKE operator with wildcard characters as a prefix and suffix. This is applicable when user_question contains a name.
    5. Return the executable query with the symbol @@ at the start and end.
    6. If the year is not provided in the date, assume it's the current year. Convert the date to the 'YYYY-MM-DD' format to use in the query.
    7. The SQL query must be generated based on the user_question. If the user_question does not provide enough information to generate the SQL, respond with "@@null@@" without generating any SQL query.
    8. If user_question is stated in the form of a SQL Query or contains delete, drop, update, insert, etc. SQL keywords, then respond with "@@null@@" without generating any SQL query.
</instructions>
<examples>
    <example> 
        <sample_question>List all claims for my son or Show me all my claims for my son</sample_question>
        <sql_query>@@SELECT * FROM claims_history WHERE relationship_code = 'son' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>Total claims in 2021</sample_question>
        <sql_query>@@SELECT COUNT(*) FROM claims_history WHERE EXTRACT(YEAR FROM claim_date) = 2021 AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>List all claims for Michael</sample_question>
        <sql_query>@@SELECT * FROM claims_history WHERE member_name LIKE '%Michael%' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>List all claims for Dr. John or Doctor John or Provider John</sample_question>
        <sql_query>@@SELECT * FROM claims_history WHERE provider_name LIKE '%John%' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>Show me the doctors/providers/hospitals my son Michael visited on 1/19</sample_question>
        <sql_query>@@SELECT provider_name, claim_date FROM claims_history WHERE relationship_code = 'son' AND member_name LIKE '%Michael%' AND claim_date = '2019-01-19' AND member_id = '{member_id}';@@</sql_query> 
    </example>
    <example> 
        <sample_question>What is my total spend in last 12 months</sample_question> 
        <sql_query>@@SELECT SUM(allowed_amount) AS total_spend_last_12_months FROM claims_history WHERE claim_date >= CURRENT_DATE - INTERVAL '12 MONTHS' AND relationship_code = 'subscriber' AND member_id = 9875679801;@@</sql_query> 
    </example>
</examples>
<user_question> {text2} </user_question>

The {text1} and {text2} data items will be replaced programmatically to populate the ID of the logged-in member and the user question. More examples can also be added to help the LLM generate appropriate SQL queries.
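For example, the substitution and the extraction of the generated query could look like the following sketch; the helper names are ours, and the @@ handling follows instructions 5 and 7 of the template above.

import re

def build_sql_prompt(template: str, member_id: str, user_question: str) -> str:
    # {text1} and {text2} are replaced with the logged-in member's ID and
    # the question they typed
    return template.replace("{text1}", member_id).replace("{text2}", user_question)

def extract_sql(completion: str):
    # Per the template instructions, the executable query (or "null") is
    # wrapped between @@ markers
    match = re.search(r"@@(.*?)@@", completion, re.DOTALL)
    if match is None or match.group(1).strip().lower() == "null":
        return None
    return match.group(1).strip()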

The following is an example of a data summarization prompt template:

<role> 
    You are a customer service agent working for a health insurance plan and helping to answer questions asked by a customer. 
</role>
<task> 
    Use the result_dataset containing healthcare claims data to answer the user_question. This result_dataset is the output of the sql_query.
</task>
<instructions>
    1. To answer a question, use simple non-technical language, just like a customer service agent talking to a 65-year-old customer.
    2. Use a conversational style to answer the question precisely.
    3. If the JSON contains a "count" field, it means the count of claims. For example, "count": 6 means there are 6 claims, and "count": 11 means there are 11 claims.
    4. If the result_dataset does not contain meaningful claims data, then respond with one line only: "No data found for the search criteria."
</instructions>
<user_question> {text1} </user_question>
<sql_query> {text2} </sql_query>
<result_dataset> {text3} </result_dataset>

The {text1}, {text2}, and {text3} data items will be replaced programmatically to populate the user question, the SQL query generated in the previous step, and data formatted in JSON and retrieved from Amazon RDS.

Security

Amazon Bedrock is in scope for common compliance standards such as Service and Organization Control (SOC), International Organization for Standardization (ISO), and Health Insurance Portability and Accountability Act (HIPAA) eligibility, and you can use Amazon Bedrock in compliance with the General Data Protection Regulation (GDPR). The service enables you to deploy and use LLMs in a secured and controlled environment. The Amazon Bedrock VPC endpoints powered by AWS PrivateLink allow you to establish a private connection between the virtual private cloud (VPC) in your account and the Amazon Bedrock service account. It enables VPC instances to communicate with service resources without the need for public IP addresses. We define the different accounts as follows:

  • Customer account – This is the account owned by the customer, where they manage their AWS resources such as RDS instances and Lambda functions, and interact with the Amazon Bedrock hosted LLMs securely using Amazon Bedrock VPC endpoints. You should manage access to Amazon RDS resources and databases by following the security best practices for Amazon RDS.
  • Amazon Bedrock service accounts – This set of accounts is owned and operated by the Amazon Bedrock service team, which hosts the various service APIs and related service infrastructure.
  • Model deployment accounts – The LLMs offered by various vendors are hosted and operated by AWS in separate accounts dedicated for model deployment. Amazon Bedrock maintains complete control and ownership of model deployment accounts, making sure no LLM vendor has access to these accounts.

When a customer interacts with Amazon Bedrock, their requests are routed through a secured network connection to the Amazon Bedrock service account. Amazon Bedrock then determines which model deployment account hosts the LLM model requested by the customer, finds the corresponding endpoint, and routes the request securely to the model endpoint hosted in that account. The LLM models are used for inference tasks, such as generating text or answering questions.

No customer data is stored within Amazon Bedrock accounts, nor is it ever shared with LLM providers or used for tuning the models. Communications and data transfers occur over private network connections using TLS 1.2+, minimizing the risk of data exposure or unauthorized access.

By implementing this multi-account architecture and private connectivity, Amazon Bedrock provides a secure environment, making sure customer data remains isolated and secure within the customer’s own account, while still allowing them to use the power of LLMs provided by third-party providers.
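As a minimal sketch of the private connectivity described above, an interface VPC endpoint for the Amazon Bedrock runtime API can be created from the customer account as follows; the Region, VPC, subnet, and security group IDs are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface VPC endpoint for the Amazon Bedrock runtime API, powered by AWS PrivateLink
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",              # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.bedrock-runtime",
    SubnetIds=["subnet-0123456789abcdef0"],     # placeholder subnet ID
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group ID
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])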

Conclusion

Empowering health insurance plan members with generative AI technology can revolutionize the way they interact with their insurance plans and access essential information. By integrating conversational AI assistants powered by Amazon Bedrock and using purpose-built AWS data services such as Amazon RDS, healthcare payers and insurance plans can provide a seamless, intuitive experience for their members. This solution not only enhances member satisfaction, but can also reduce operational costs by streamlining customer service operations. Embracing innovative technologies like generative AI becomes crucial for organizations to stay competitive and deliver exceptional member experiences.

To learn more about how generative AI can accelerate health innovations and improve patient experiences, refer to Payors on AWS and Transforming Patient Care: Generative AI Innovations in Healthcare and Life Sciences (Part 1). For more information about using generative AI with AWS services, refer to Build generative AI applications with Amazon Aurora and Knowledge Bases for Amazon Bedrock and the Generative AI category on the AWS Database Blog.


About the Authors

Sachin Jain is a Senior Solutions Architect at Amazon Web Services (AWS) with a focus on helping healthcare and life sciences customers in their cloud journey. He has over 20 years of experience in the technology, healthcare, and engineering space.

Sanjoy Thanneer is a Sr. Technical Account Manager with AWS based out of New York. He has over 20 years of experience working in the database and analytics domains. He is passionate about helping enterprise customers build scalable, resilient, and cost-efficient applications.

Sukhomoy Basak is a Sr. Solutions Architect at Amazon Web Services, with a passion for Data, Analytics, and GenAI solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.

Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

As generative AI moves from proofs of concept (POCs) to production, we’re seeing a massive shift in how businesses and consumers interact with data, information—and each other. In what we consider “Act 1” of the generative AI story, we saw previously unimaginable amounts of data and compute create models that showcase the power of generative AI. Just last year, many businesses, and even more individuals, were focused on learning and experimenting, and the sheer number of POCs was impressive. Thousands of customers, across diverse industries, conducted anywhere from dozens to hundreds of experiments as they explored the potential and implications of generative AI applications.

By early 2024, we are beginning to see the start of “Act 2,” in which many POCs are evolving into production, delivering significant business value. To learn more about Act 1 and Act 2, refer to Are we prepared for “Act 2” of gen AI?. The move to a production mindset focuses new attention on key challenges as companies build and evaluate models on specific tasks and search for the leanest, fastest, and most cost-effective options. Considering—and reducing—the investment required for production workloads means bringing new efficiency to the sometimes complicated process of building, testing, and fine-tuning foundation models (FMs).

Delivering capabilities that increase efficiency and reduce costs

Offering multiple entry points to the generative AI journey is critical to delivering value to companies moving their generative AI applications into production. Our generative AI technology stack provides the services and capabilities necessary to build and scale generative AI applications—from Amazon Q (the most capable generative AI–powered assistant for accelerating software development) at the top layer to Amazon Bedrock (the easiest way to build and scale generative AI applications with foundation models) at the middle layer to Amazon SageMaker (purpose-built to help you build, train, and deploy FMs) at the foundational, bottom layer. While these layers provide different points of entry, the fundamental truth is that every generative AI journey starts at the foundational bottom layer.

Organizations that want to build their own models or want granular control are choosing Amazon Web Services (AWS) because we are helping customers use the cloud more efficiently and leverage more powerful, price-performant AWS capabilities such as petabyte-scale networking capability, hyperscale clustering, and the right tools to help you build. Our deep investment in this layer enhances the capabilities and efficiency of the services we provide at higher layers.

To make generative AI use cases economical, you need to run your training and inference on incredibly high-performing, cost-effective infrastructure that’s purpose-built for AI. Amazon SageMaker makes it easy to optimize at each step of the model lifecycle, whether you are building, training, or deploying. However, FM training and inference present challenges—including operational burden, overall cost, and performance lag that contributes to an overall subpar user experience. State-of-the-art generative AI models are averaging latencies in the order of seconds, and many of today’s massive models are too large to fit into a single instance.

In addition, the blistering pace of model optimization innovations leaves model builders with months of research to learn and implement these techniques, even before finalizing deployment configurations.

Introducing Amazon Elastic Kubernetes Service (Amazon EKS) in Amazon SageMaker HyperPod

Recognizing these challenges, AWS launched Amazon SageMaker HyperPod last year. Taking efficiency one step further, earlier this week, we announced the launch of Amazon EKS support on Amazon SageMaker HyperPod. Why? Because provisioning and managing the large GPU clusters needed for AI can pose a significant operational burden. And training runs that take weeks to complete are challenging, since a single failure can derail the entire process. Ensuring infrastructure stability and optimizing performance of distributed training workloads can also pose challenges.

Amazon SageMaker HyperPod provides a fully managed service that removes the operational burden and enables enterprises to accelerate FM development at an unprecedented scale. Now, support for Amazon EKS in Amazon SageMaker HyperPod makes it possible for builders to manage their SageMaker HyperPod clusters using Amazon EKS. Builders can use a familiar Kubernetes interface while eliminating the undifferentiated heavy lifting involved in setting up and optimizing these clusters for generative AI model development at scale. SageMaker HyperPod provides a highly resilient environment that automatically detects, diagnoses, and recovers from underlying infrastructure faults so that builders can train FMs for weeks or months at a time with minimal disruption.

Customer quote: Articul8 AI

“Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources more efficiently with minimum downtime. We were early adopters of the Slurm-based SageMaker HyperPod service and have benefitted from its ease-of-use and resiliency features, resulting in up to 35% productivity improvement and rapid scale up of our gen AI operations.

As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us because it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers because we are now able to package and productize this capability into our gen AI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Arun Subramaniyan, Founder and CEO of Articul8 AI

Bringing new efficiency to inference

Even with the latest advancements in generative AI modeling, the inference phase remains a significant bottleneck. We believe that businesses creating customer or consumer-facing generative AI applications shouldn’t have to sacrifice performance for cost-efficiency. They should be able to get both. That’s why two months ago, we released the inference optimization toolkit on Amazon SageMaker, a fully managed solution that provides the latest model optimization techniques, such as speculative decoding, compilation, and quantization. Available across SageMaker, this toolkit offers a simple menu of the latest optimization techniques that can be used individually or together to create an “optimization recipe.” Thanks to easy access and implementation of these techniques, customers can achieve up to ~2x higher throughput while reducing costs by ~50% for generative AI inference.

Responsible model deployment that is safe and trustworthy

While cost and performance are critical issues, it’s important not to lose sight of other concerns that come to the forefront as we shift from POC to production. No matter what model you choose, it needs to be deployed in a safe, trustworthy, and responsible way. We all need to be able to unlock generative AI’s full potential while mitigating its risks. It should be easy to implement safeguards for your generative AI applications, customized to your requirements and responsible AI policies.

That’s why we built Amazon Bedrock Guardrails, a service that provides customizable safeguards so you can filter prompts and model responses. Guardrails can help block specific words or topics. Customers can also use Guardrails to help identify and prevent restricted content from reaching end users.

We also have filters for harmful content and personally identifiable information (PII), and security checks for malicious prompts, such as prompt injections. Recently, we developed guardrails to help reduce hallucinations by checking that responses are found in the source material and related to the query.

Delivering value with game-changing innovation

Our partnership with the NFL and our joint Next Gen Stats program offer impressive proof of how a production mindset is delivering true value not only to an organization but to people across the world. With AWS AI tools and engineers, the NFL is taking tackle analysis to the next level, giving teams, broadcasters, and fans deeper insights into one of football’s most crucial skills—tackling. As fans know, tackling is a complex, evolving process that unfolds throughout each play. But traditional stats only tell part of the story. That’s why the NFL and AWS created Tackle Probability—a groundbreaking AI-powered metric that can identify a missed tackle, when and where that tackle attempt took place, and do it all in real time. For further detail, go to NFL on AWS.

Building this stat required 5 years of historical data to train an AI model on Amazon SageMaker capable of processing millions of data points per game, tracking 20 different features for each of the 11 defenders every tenth of a second. The result is a literally game-changing stat that provides unprecedented insights. Now the NFL can quantify tackling efficiency in ways never before possible. A defender can be credited with 15 tackle attempts in a game without a single miss, or we can measure how many missed tackles a running back forced. All told, there will be at least 10 new stats from this model.

For the NFL, coaches can now quantify tackling efficiency and identify players who consistently put themselves in the right position to make the play. And broadcasters can highlight broken or made tackles to fans in real time.

Building breakthroughs with AWS

The NFL is far from alone in using AWS to shift its focus from POC to production. Exciting startups like Evolutionary Scale are making it easy to generate new proteins and antibodies. Airtable is making it easier for their customers to use their data and build applications. And organizations like Slack are embedding generative AI into the workday. Fast-moving, successful startups are choosing AWS to build and accelerate their businesses. In fact, 96 percent of all AI/ML unicorns—and 90 percent of the 2024 Forbes AI 50—are AWS customers.

Why? Because we’re addressing the cost, performance, and security issues that enable production-grade generative AI applications. We’re empowering data scientists, ML engineers, and other builders with new capabilities that make generative AI development faster, easier, more secure, and less costly. We’re making FM building and tuning—and a portfolio of intuitive tools that make it happen—available to more organizations as part of our ongoing commitment to the democratization of generative AI.

Fueling the next wave of innovation

Optimizing costs, boosting production efficiency, and ensuring security—these are among the top challenges as generative AI evolves from POC to production. We’re helping address these issues by adding innovative new capabilities to Amazon SageMaker, Amazon Bedrock, and beyond. And we’re lowering the barriers to entry by making these tools available to everyone, from large enterprises with ML teams to small businesses and individual developers just getting started. Empowering more people and organizations to experiment with generative AI creates an explosion of creative new use cases and applications. That’s exactly what we’re seeing as generative AI continues its rapid evolution from a fascinating technology to a day-to-day reality—improving experiences, inspiring innovation, boosting the competitive edge, and creating significant new value.


About the author

Baskar Sridharan is the Vice President for AI/ML and Data Services & Infrastructure at AWS, where he oversees the strategic direction and development of key services, including Bedrock, SageMaker, and essential data platforms like EMR, Athena, and Glue.

Prior to his current role, Baskar spent nearly six years at Google, where he contributed to advancements in cloud computing infrastructure. Before that, he dedicated 16 years to Microsoft, playing a pivotal role in the development of Azure Data Lake and Cosmos, which have significantly influenced the landscape of cloud storage and data management.

Baskar earned a Ph.D. in Computer Science from Purdue University and has since spent over two decades at the forefront of the tech industry.

He has lived in Seattle for over 20 years, where he, his wife, and two children embrace the beauty of the Pacific Northwest and its many outdoor activities. In his free time, Baskar enjoys practicing music and playing cricket and baseball with his kids.


Research Focus: Week of September 9, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

Large language models (LLMs) are the de facto standard for numerous machine learning tasks, ranging from text generation and summarization to code generation, and they play an integral role in various natural language processing (NLP) tasks. However, recent studies show they are susceptible to adversarial attacks, including prompt injection, jailbreaking, and other strategies. As people and organizations increasingly rely on LLMs, understanding these vulnerabilities, taking precautions when deploying LLMs in real-world scenarios, and mitigating the risks are all critical.

In a recent paper: Can LLMs be Fooled? Investigating Vulnerabilities in LLMs, researchers from Microsoft examine multiple vulnerability categories, including model-based, training-time, and inference-time vulnerabilities, and then discuss mitigation strategies. These include “model editing,” which aims to modify LLMs’ behavior, and “chroma teaming,” which leverages the synergy of different teaming strategies to make LLMs more resilient. This paper synthesizes the findings from each vulnerability category and proposes new directions for research and development. Understanding the focal points of current vulnerabilities will help people better anticipate and mitigate future risks, paving the way for more robust and secure LLMs.


Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

For many text-to-speech (TTS) applications, it is crucial that the total duration of the generated speech can be accurately adjusted to the target duration by modifying the speech rate. For example, in a video dubbing scenario, the output speech must match or closely approximate the duration of the source audio to ensure synchronization with the video. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker characteristics, has been underexplored. 

In a recent paper: Total-Duration-Aware Duration Modeling for Text-to-Speech Systems, researchers from Microsoft propose a novel total-duration-aware (TDA) duration model for TTS, where phoneme durations are predicted not only from the text input but also from an additional input of the total target duration. They propose a MaskGIT-based duration model that enhances the diversity and quality of the predicted phoneme durations. Test results show that the proposed TDA duration models achieve better intelligibility and speaker similarity for various speech rate configurations compared to baseline models. The proposed MaskGIT-based model can also generate phoneme durations with higher quality and diversity compared to its regression or flow-matching counterparts.



GEMS: Generative Expert Metric System through Iterative Prompt Priming

Metrics and measurements are fundamental to identifying challenges, informing decisions, and resolving conflicts across engineering domains. Despite the abundance of data available, a single expert may struggle to work across multi-disciplinary data, while non-experts may find it unintuitive to create effective measures or transform theories into appropriate context-specific metrics. 

In a recent technical report: GEMS: Generative Expert Metric System through Iterative Prompt Priming, researchers from Microsoft and the University of Illinois Urbana-Champaign address this challenge. They examine software communities within large software corporations, where different measures are used as proxies to locate counterparts within the organization to transfer tacit knowledge. They propose a prompt-engineering framework inspired by neural mechanisms, demonstrating that generative models can extract and summarize theories and perform basic reasoning, thereby transforming concepts into context-aware metrics to support software communities, given software repository data. While this research focused on software communities, the framework’s applicability could extend across various fields, showcasing expert-theory-inspired metrics that aid in triaging complex challenges.


On the Criticality of Integrity Protection in 5G Fronthaul Networks

The modern 5G fronthaul, which connects base stations to radio units in cellular networks, is designed to deliver microsecond-level performance guarantees using Ethernet-based protocols. Unfortunately, due to potential performance overheads, as well as misconceptions about the low risk and impact of possible attacks, integrity protection is not considered a mandatory feature in the 5G fronthaul standards. 

In a recent paper: On the Criticality of Integrity Protection in 5G Fronthaul Networks, researchers from Microsoft and external colleagues show how the lack of protection can be exploited, making attacks easier and more powerful. They present a novel class of powerful attacks and a set of traditional attacks, which can both be fully launched from software over open packet-based interfaces, to cause performance degradation or denial of service to users over large geographical regions. These attacks do not require a physical radio presence or signal-based attack mechanisms, do not affect the network’s operation (e.g., not crashing the radios), and are highly severe (e.g., impacting multiple cells). The researchers demonstrate that adversaries could degrade performance of connected users by more than 80%, completely block a subset of users from ever attaching to the cell, or even generate signaling storm attacks of more than 2,500 signaling messages per minute, with just two compromised cells and four mobile users. They also present an analysis of countermeasures that meet the strict performance requirements of the fronthaul.


Microsoft Research in the news


Microsoft works with students to launch ‘Golden Record 2.0’ into space 

Geekwire | September 5, 2024

Forty-seven years after NASA sent a “Golden Record” into deep space to document humanity’s view of the world, Microsoft’s Project Silica is teaming up with a citizen-science effort to lay the groundwork — or, more aptly, the glasswork — for doing something similar. 

Related: Collaborators: Silica in space with Richard Black and Dexter Greene 



Scaling Thomson Reuters’ language model research with Amazon SageMaker HyperPod


Thomson Reuters, a global content and technology-driven company, has been using artificial intelligence and machine learning (AI/ML) in its professional information products for decades. The introduction of generative AI provides another opportunity for Thomson Reuters to work with customers and advance how they do their work, helping professionals draw insights and automate workflows, enabling them to focus their time where it matters most.

In this post, we explore the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted large language models (LLMs) using Amazon SageMaker HyperPod, an Amazon Web Services (AWS) feature focused on providing purpose-built infrastructure for distributed training at scale.

LLMs disrupt the industry

Towards the end of 2022, groundbreaking LLMs were released that realized drastic improvements over previous model capabilities. The resulting technology opened new doors to enhancing customer experiences by tailoring content, recommendations, and responses to individual customers in natural chat-like interfaces. For many businesses, the race was on to bring this technology into their products to maintain or gain a competitive advantage. Thomson Reuters was no exception and keenly felt the need to help its customers succeed in this burgeoning, AI-augmented world.

As with any technology, proper application and understanding of its limitations is critical. Consider the following elements.

  • Hallucinations – LLMs have a remarkable ability to respond to natural language, and clearly encode significant amounts of knowledge. However, the stochastic nature of the technology means that responses are based on the probability of word occurrences. An LLM doesn’t model facts so much as it models language. The model has no idea if the words (tokens) generated are factually correct, though it may have successfully modeled the correct sequence of words to represent facts. As a result, LLMs may hallucinate—in other words, they may generate text that is untrue.
  • Quality – While the general knowledge encoded in the latest LLMs is remarkably good, it may not be enough for your business or customer domains. Public and commercial LLMs are based on the knowledge of the internet—not what is behind your business’s closed doors. Adding to the problem, bias and factually incorrect information exist on the internet, and there often isn’t enough transparency into what data is used and how commercial models are trained on it. Further, LLMs only encode knowledge available up to their last training run. They may not be up to date, and businesses do not control the frequency of model retraining.
  • Speed, cost, and capacity – Depending on your use cases, you may find existing commercial LLMs are too slow, too expensive, or in such high demand that you cannot purchase enough capacity to meet your requirements. (This may be only a temporary challenge, because we’ve observed increased capacity and reduced cost as hardware, optimizations, and economies of scale continue to improve.)

Thomson Reuters’ customers require professional-grade AI. They are professionals with discerning information needs in legal, corporate, tax, risk, fraud, compliance, and news domains. Take, for example, legal customers. US law is based on legal precedent—the outcomes of past trial cases are used to determine decisions in new cases. Not only does Thomson Reuters curate and enhance publicly available content such as regulations and laws, but it also has decades of editorial content on most aspects of the law that it analyzes and reflects upon. Legal research is a critical area for Thomson Reuters customers—it needs to be as complete as possible. It needs to be grounded in fact—any errors of fact are highly problematic. Solutions should be grounded in the content and data that Thomson Reuters has.

Research and training experimentation

Thinking about the limitations of publicly available, commercial language models as described in the previous section, Thomson Reuters asked themselves the following questions:

  • Can Thomson Reuters’ editorially created, curated, or enhanced data be used to improve LLM knowledge for specific business tasks?
  • Would smaller LLMs (for example, 12–30B parameters) trained with Thomson Reuters data perform on a par with very large LLMs upwards of a trillion parameters?
  • What methods could be employed to train the Thomson Reuters domain-specific models to get the best results?

The potential benefits fell into three areas: quality, agency, and operational efficiency. With full access to model training, it’s possible that Thomson Reuters could tune LLM generation to their domain and allow for tighter Retrieval Augmented Generation (RAG) integration. This would directly impact quality. And if Thomson Reuters owned the models, they would control how and when they are trained and updated. Lastly, if smaller tuned models could perform sufficiently well, they could be a more cost-effective and scalable solution, improving overall operational efficiency.

Thomson Reuters’ research focused around answering these specific questions:

  • How well do foundation models (FMs) (in the 7–30B parameters range) perform on specific tasks, unmodified? (This would be the baseline.)
  • Does performance improve for specific tasks when augmented with Thomson Reuters domain-specific data using various training techniques?

To frame this research and give concrete evaluation targets, Thomson Reuters focused on several real-world tasks: legal summarization, classification, and question answering. Publicly available general textual data was used, as well as domain-specific textual data from Thomson Reuters’ comprehensive stores of primary and secondary US law material. Primary law would include content published by the courts and enhanced by Thomson Reuters. Secondary law would include subject matter expert (SME) analysis and annotation of the law.

Thomson Reuters knew they would need to run a series of experiments—training LLMs from 7B to more than 30B parameters, starting with an FM and applying continuous pre-training (using various techniques) with a mix of Thomson Reuters and general data. Model fine-tuning would then take place to evaluate how much better the model performed on specific legal tasks, while also evaluating for any loss in general knowledge or language understanding.

  1. Continuous pre-training – By further pre-training an existing FM, Thomson Reuters wished to enrich its understanding of legalese without compromising its general language abilities. This was largely an experiment in finding the right mix of domain and general training data to retain general knowledge while increasing domain-specific knowledge. Perplexity was used to measure the impact of domain-specific training on the general knowledge capabilities of the model.
  2. Instruction fine-tuning – This would be an exercise in generating impactful instruction datasets covering legal and general tasks. Thomson Reuters experimented with fine-tuning open source FMs, such as MPT, Flan-T5, and Mistral, and compared them against industry-standard commercial models, such as OpenAI’s GPT-4. In this case, Rouge was used to measure how well the models performed on these tasks.

Scaling language model training with Amazon SageMaker HyperPod

Thomson Reuters knew that training LLMs would require significant computing power. Training an LLM of even 7B parameters is a compute-intensive operation, requiring multi-node distributed computing capabilities. These compute nodes typically need large GPUs or similar hardware. In Thomson Reuters’ case, they focused on NVIDIA’s high-performance A100 family of GPUs. Amazon Elastic Compute Cloud (Amazon EC2) P4d and P4de instances provided Thomson Reuters with the high performance they needed.

To estimate just how much compute power was required, Thomson Reuters used the Chinchilla scaling law to determine how much training data (in tokens) would be needed to retain quality at a given model size. The scaling law is based on published research finding that the compute-optimal number of training tokens scales proportionally with model size (roughly 20 tokens per parameter). From there, other publicly available information was used to estimate how much time (in days) would be required to complete training with a given number of GPUs.

                       Model size (estimated training time in days)
P4d instances  #GPUs   2.6B    6.6B    13B     30B     65B
8              64      1       6.6     24      125.4   918.4
16             128     0.5     3.3     12      62.7    459.2
32             256     0.2     1.7     6       31.3    229.6
55             440     0.1     1       3.5     17.9    164
64             512     0.1     0.9     3       15.7    114.8
Chinchilla point       52B     132B    260B    600B    1.3T
(training tokens)

So, for example, a 6.6B parameter model would require 132B input tokens and take just under 7 days to finish training with 64 A100 GPUs (or 8 P4d instances).
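
The arithmetic behind these estimates is straightforward to reproduce. The following sketch assumes the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter, which is consistent with the Chinchilla points in the preceding table.

```python
# Sketch: Chinchilla-style token budgets, assuming ~20 training tokens per model parameter.
TOKENS_PER_PARAM = 20

def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a given model size."""
    return TOKENS_PER_PARAM * n_params

for size in (2.6e9, 6.6e9, 13e9, 30e9, 65e9):
    print(f"{size / 1e9:>5.1f}B parameters -> ~{chinchilla_tokens(size) / 1e9:,.0f}B tokens")
# 6.6B parameters -> ~132B tokens, consistent with the ~6.6-day estimate on 64 A100 GPUs.
```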

Apart from the ability to easily provision compute, there are other factors such as cluster resiliency, cluster management (CRUD operations), and developer experience, which can impact LLM training. With potentially hundreds of GPUs working in parallel, hardware failures are inevitable. To resolve these issues, customers typically have to identify, isolate, repair, and recover the faulty instance, or change configurations to continue without it, further delaying progress.

In order to provision a highly scalable cluster that is resilient to hardware failures, Thomson Reuters turned to Amazon SageMaker HyperPod. SageMaker HyperPod is a managed service that makes it easier for you to train FMs without interruptions or delays. It provides resilient and persistent clusters for large-scale deep learning training of FMs on long-running compute clusters. SageMaker HyperPod offers an interactive experience for rapid experimentation at scale, with resilience to hardware failures, enabling uninterrupted training jobs spanning weeks or months. With Amazon Elastic Kubernetes Service (Amazon EKS) support in SageMaker HyperPod, customers can associate a HyperPod cluster with an EKS cluster and manage ML workloads using the HyperPod cluster nodes as Kubernetes worker nodes, all through the Kubernetes control plane on the EKS cluster.
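
As a rough sketch of what provisioning such a cluster can look like with the boto3 SageMaker client, the following creates a HyperPod cluster with a single P4d instance group, automatic node recovery, and an EKS orchestrator. The names, ARNs, and S3 paths are placeholders, and the call is a simplified illustration rather than a complete configuration.

```python
# Sketch: creating a SageMaker HyperPod cluster with an EKS orchestrator.
# All names, ARNs, and S3 paths below are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_cluster(
    ClusterName="llm-training-cluster",
    InstanceGroups=[{
        "InstanceGroupName": "gpu-workers",
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceCount": 16,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://example-bucket/hyperpod-lifecycle/",
            "OnCreate": "on_create.sh",
        },
        "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        "ThreadsPerCore": 1,
    }],
    Orchestrator={"Eks": {"ClusterArn": "arn:aws:eks:us-east-1:123456789012:cluster/ml-eks"}},
    NodeRecovery="Automatic",  # replace faulty nodes automatically
)
```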

Amazon EKS support in SageMaker HyperPod offers several key resiliency features to make uninterrupted and efficient training of large ML models possible:

  1. Deep health checks – This is a managed health check for stress testing GPUs and AWS trn1 instances, as well as performing Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, and node replacement phase and can be easily enabled or disabled through HyperPod APIs.
  2. Automatic node replacement – A monitoring agent performs managed, lightweight, and noninvasive checks, coupled with automated node replacement capability. This monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
  3. Auto-resume – SageMaker HyperPod provides job auto-resume capability using the Kubeflow training operator for PyTorch so that training jobs can recover and continue in the event of interruptions or failures. The extension makes sure that the job waits and restarts after the node is replaced.

Initial findings

Over the course of 5 months, Thomson Reuters successfully ran 20 training jobs using Amazon SageMaker HyperPod. They were able to scale their cluster up to 16 P4d instances, with their largest job using the entire cluster. Thomson Reuters trained a 70B parameter model on 400B input tokens, with the entire training job taking 36 days to complete. During that period, Thomson Reuters experienced zero hardware failures.

Continuous pre-training

In continuous pre-training, you train from an existing open source LLM checkpoint. This is more than a time-saver; it is a strategic decision that allows for the nuanced growth of the model’s capabilities over time. The preliminary results of Thomson Reuters’ experimentation showed that they were able to train models on the legal domain without losing general knowledge.

Thomson Reuters used a measure called perplexity. It quantifies how well the model predicts a sample of text. In essence, perplexity measures the confidence a model has in its predictions. Lower perplexity indicates that the model is more certain about its predictions. From the following graph, you can see that as Thomson Reuters increased their batches of training, legal perplexity decreased while general perplexity increased somewhat, before quickly leveling off.
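
Concretely, perplexity is the exponential of the average per-token cross-entropy on held-out text. The following minimal sketch uses Hugging Face Transformers with a small placeholder checkpoint; the same approach applies to a 7B domain-adapted model.

```python
# Sketch: computing perplexity of a causal LM on a text sample.
# Lower perplexity means the model is less "surprised" by the text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # small placeholder checkpoint; swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

text = "The court held that the statute of limitations had expired."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```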

Instruction fine-tuning (IFT)

Instruction fine-tuned LLMs are tuned to respond to specific instructions, enabling tasks such as question answering, summarization, and brainstorming. For instance, human-written instruction datasets include prompts such as “summarize this article” or “list fun weekend activities.” Thomson Reuters’ hypothesis was that legal LLMs could benefit from diverse legal instructions.

Thomson Reuters has discovered that their legal LLM greatly benefits from a vast array of diverse instructions. By compiling legal instructions, such as drafting legal headnotes, and combining them with publicly available instructions, Thomson Reuters’ MPT-TR-7b model, derived from MPT-7b, has showcased improvements correlated with an increased number of instruction datasets provided.

Thomson Reuters used an automatic measure called Rouge to determine how well domain adapted models performed compared to GPT-4. This automatic measure, based on term overlap, is not the same as human preference judgment, but gives Thomson Reuters some degree of confidence that they are headed in the right direction.
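
For illustration, Rouge scores can be computed with the Hugging Face evaluate library as in the following sketch; the summaries shown are invented examples.

```python
# Sketch: scoring generated summaries against references with ROUGE
# (term-overlap based, not a substitute for human preference judgments).
import evaluate

rouge = evaluate.load("rouge")

predictions = ["The court dismissed the appeal for lack of standing."]
references = ["The appeal was dismissed because the plaintiff lacked standing."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```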

Legal summarization

Thomson Reuters’ MPT-TR-7b model has demonstrated proficiency in legal summarization tasks, rivaling GPT-4’s performance when evaluated with automatic metrics assessing word overlap with reference summaries. While a human-based evaluation would offer deeper insights, the initial results are compelling evidence of the model’s capabilities. The following graph compares Thomson Reuters’ model with GPT-4.

Legal classification

In other legal tasks, such as classification, which was measured by accuracy and precision/recall, there’s still room to improve. Nonetheless, the performance uptick is evident with the expansion of instruction datasets, as shown in the following graph. Even more exciting is the leap in performance observed with larger base models such as MPT-30b.

Conclusion

In this post, we have discussed how Thomson Reuters was able to meet their LLM training requirements using Amazon SageMaker HyperPod. Using Amazon EKS support in SageMaker HyperPod, Thomson Reuters was able to scale up their capacity and easily run their training jobs, allowing them to unlock the benefits of LLMs in areas such as legal summarization and classification.

If your business operates in specialized or deep verticals with knowledge not generally available on the web, experimenting with model training may make sense. At the same time, you’ll need to weigh the costs associated with training and inference as well as keeping up with rapidly advancing LLM technology. Like Thomson Reuters, you might want to start with RAG solutions with off-the-shelf LLMs as a first step, then consider customization options from there. If you do decide that training LLMs makes sense, then you’ll need considerable computational power. Amazon SageMaker HyperPod helps you to provision and manage the infrastructure required. Read more about Amazon SageMaker HyperPod and Amazon EKS support in SageMaker HyperPod.


About the Authors

John Duprey is a Distinguished Engineer at Thomson Reuters Labs with over 25 years of experience. In his role, John drives innovative solutions to complex problems and champions engineering excellence and culture. Recently, he has contributed to Thomson Reuters’ generative AI initiatives, focusing on scalability, platform design, and SDK development.

Adam Raffe is a Principal Solutions Architect at AWS. With over 8 years of experience in cloud architecture, Adam helps large enterprise customers solve their business problems using AWS.

Vu San Ha Huynh is a Solutions Architect at AWS. He has a PhD in computer science and enjoys working on different innovative projects to help support large enterprise customers.

Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS Service Teams to enable the next generation of AI/ML workloads on AWS. Ankit’s experience includes product management expertise within the financial services industry for high-frequency/low-latency trading and business development for Amazon Alexa.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He specializes in large model training workloads helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Simone Zucchet is a Solutions Architect Manager at AWS. With over 6 years of experience as a Cloud Architect, Simone enjoys working on innovative projects that help transform the way organizations approach business problems. He helps support large enterprise customers at AWS and is part of the Machine Learning TFC. Outside of his professional life, he enjoys working on cars and photography.


GeForce NOW to Bring ‘Dead Rising Deluxe Remaster’ to the Cloud at Launch


Rise and shine — Capcom’s latest action-adventure game, Dead Rising Deluxe Remaster, heads to the cloud at launch next week.

It’s part of nine new titles joining the extensive GeForce NOW library.

‘Dead Rising’ Coming Soon


Dead Rising Deluxe Remaster returns with modern graphics. More than just a remaster, this Deluxe Remaster is a full graphical overhaul of the first game in the zombie-slaughtering action series Dead Rising. The remaster has also been fully voiced, supports auto-saves and has various other quality-of-life features.

One day, the peaceful town of Willamette, Colorado, found itself under quarantine by the U.S. Army. Frank West, a freelance journalist, smells a scoop and finds his way into the only shopping mall in town. Unfortunately, the mall has turned into a living hell, crawling with countless zombies. Help will arrive in 72 hours, so it’s up to him to find out the truth behind this incident before it’s too late.

Witness the unmatched mayhem and freedom when Dead Rising Deluxe Remaster launches on Thursday, Sept. 19. Stream it with a GeForce NOW Priority or Ultimate membership for longer gaming sessions and higher frame rates.

New Games to Drive You Wild


Test Drive Unlimited Solar Crown, a new open-world racing game from KT Racing and Nacon, is now available for members to stream. Explore a fully recreated Hong Kong Island while taking to the road behind the wheels of exceptional cars and living the ultimate life of luxury. Test drive and purchase cars directly from dealerships, customize them in workshops and display them in the Solar Hotel garage. Each car offers a unique driving experience.

Members can look for the following games available to stream in the cloud this week:

  • Warhammer 40,000: Space Marine 2 (New release on Steam, Sept. 9)
  • Test Drive Unlimited Solar Crown (New release on Steam, Sept. 12)
  • Dawn of Defiance (Steam)
  • Flintlock: The Siege of Dawn (Xbox, available on PC Game Pass)
  • Fort Solis (Epic Games Store)
  • King Arthur: Legion IX (Steam)
  • Squirrel With a Gun (Steam)
  • Tyranny – Gold Edition (Xbox, available on Microsoft Store)
  • XIII (Xbox, available on Microsoft Store)

What are you planning to play this weekend? Let us know on X or in the comments below.


Arm Joins the PyTorch Foundation as a Premier Member

The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Arm has joined as a premier member.

Arm designs a high-performance, power-efficient compute platform with unmatched scalability, supporting a vast ecosystem of developers deploying AI at the edge and in the cloud, ranging from the Arm instances offered by all major cloud service providers to smartphones, laptops, software-defined vehicles and more.

“Our continued investments in software are accelerating development and AI performance for over 20 million software developers, ensuring they can develop for Arm, on Arm,” said Alex Spinelli, VP Developer Technology at Arm. “PyTorch is a pivotal framework in advancing AI research and development. This membership demonstrates our strong commitment to open source – ensuring PyTorch just works on Arm and can leverage seamless acceleration for the most demanding AI models, now and in the future.”

Last year at the PyTorch Conference, Arm partnered with Apple, Meta and Qualcomm to release ExecuTorch, an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers.

“We’re thrilled to welcome Arm to the PyTorch Foundation. As we look to the future of AI and machine learning, the role of specialized silicon and edge devices becomes increasingly crucial. Arm’s expertise in these areas will be invaluable as we work to make PyTorch more efficient and accessible across a wider range of hardware,” said PyTorch Foundation Executive Director Matt White. “This collaboration underscores our commitment to fostering innovation and expanding PyTorch’s capabilities to meet the evolving needs of developers and researchers worldwide.”

As a premier member, Arm is granted one seat to the PyTorch Foundation Governing Board. The Board sets policy through our bylaws, mission and vision statements, describing the overarching scope of foundation initiatives, technical vision, and direction.

We’re happy to welcome Alex Spinelli, VP Developer Technology at Arm, to our board. Prior to Arm, Alex was VP of Product for Core Machine Learning at Google, where he led Google’s technology and infrastructure for building, training, and serving machine learning, including the TensorFlow stack.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page. Linux is a registered trademark of Linus Torvalds.
