Boost your content editing with Contentful and Amazon Bedrock

This post is co-written with Matt Middleton from Contentful.

Today, jointly with Contentful, we are announcing the launch of the AI Content Generator powered by Amazon Bedrock.

The AI Content Generator powered by Amazon Bedrock is an app available on the Contentful Marketplace that allows users to create, rewrite, summarize, and translate content using cutting-edge generative artificial intelligence (AI) models available through Amazon Bedrock in a simple and secure manner. This app helps content producers reduce their lead time to publishing content while enhancing the quality and consistency of the content produced.

Contentful is an intelligent composable content platform that allows the creation and management of content to build digital experiences. Contentful is an AWS customer and partner.

Amazon Bedrock is a fully managed service that offers quick access to a choice of industry-leading large language models and other foundation models from AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon, along with a broad set of capabilities to build generative AI applications, simplifying development while supporting privacy and security.

With this newly launched app, Contentful customers can take advantage of the range of models provided by Amazon Bedrock directly from within Contentful. They can pick the model that best suits their desired language style, creativity, speed, and budget.

Use case

Contentful customers use the platform to scale content across global experiences, brands, and channels. A frequent task a content editor may have is to rewrite existing content to make it fit for another channel, for example to make it fit a smaller screen by shortening it. This is a task that the AI Content Generator powered by Amazon Bedrock can now do automatically:

  1. First, open an existing content entry. The app is available in the sidebar.
  2. When you choose Rewrite, a dialog asks you to choose the fields for input and output, in this case Body and Body Short, respectively.
  3. You can describe the modifications that should be done to the existing content; in this case, we choose the pre-provided option Shorter.
  4. Choose Generate to invoke Amazon Bedrock to perform the desired modification.
  5. Finally, you can modify the generated text and then choose Apply to apply the text to the specified output field.

Getting started

To get started, you need to sign up for Contentful, which is free. Next, install the AI Content Generator powered by Amazon Bedrock app by visiting the Contentful Marketplace and choosing Install now.

The installation dialog asks you for an AWS Region where you want to use Amazon Bedrock, as well as an AWS access key ID and secret access key for authentication. To help you meet your organization’s security rules, we have published an AWS CloudFormation template that creates an AWS Identity and Access Management (IAM) user with the minimum privileges you need for this app.
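
If you prefer to see what the template provisions, the following boto3 sketch illustrates a comparable least-privilege setup: an IAM user whose inline policy only allows Amazon Bedrock model invocation. The user name and policy name are placeholders, not values taken from the template:

import json
import boto3

iam = boto3.client("iam")

# Placeholder names for illustration; the CloudFormation template defines its own resources
user_name = "contentful-bedrock-app"
policy_name = "BedrockInvokeOnly"

# Least-privilege inline policy: the app only needs to invoke Bedrock models
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_user(UserName=user_name)
iam.put_user_policy(
    UserName=user_name,
    PolicyName=policy_name,
    PolicyDocument=json.dumps(policy_document),
)

# The access key ID and secret access key go into the app installation dialog
access_key = iam.create_access_key(UserName=user_name)["AccessKey"]
print(access_key["AccessKeyId"])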

Using generative AI at scale can become expensive. To help prevent surprises in your AWS bill, we have published another CloudFormation template that deploys a budgeting application into your AWS account. You’re able to define a soft limit, which invokes an email notification, and a hard limit, which disables access to Amazon Bedrock entirely for the IAM user you created.

During app installation, you’re able to provide company-specific information such as branding guidelines, which will always be applied when interacting with the app. At the end of the installation dialog, make sure you assign the app to all content types where you may need it. This will enable the app in the sidebar of your content editing experience.

Conclusion

With the AI Content Generator powered by Amazon Bedrock, content teams can unlock powerful tools to save time, reduce feedback loops, and increase creativity. Contentful customers can use generative AI to generate content on demand, transform content to shift voice and tone, translate and localize content for global markets, and even make sure that content remains on brand. Generative AI also plays a critical role in eliminating the “blank page” problem, where a digital team spends more time figuring out where to start than actually creating great content.

Bringing Amazon Bedrock to Contentful means that digital teams can now use a range of leading large language models to unlock their creativity, create more efficiently, and reach their customers in the most impactful way.


About the Authors

Ulrich Hinze is a Solutions Architect at AWS. He partners with software companies to architect and implement cloud-based solutions on AWS. Before joining AWS, he worked for AWS customers and partners in software engineering, consulting, and architecture roles for 8+ years.

Aris Tsakpinis is a Specialist Solutions Architect for AI & Machine Learning with a special focus on natural language processing (NLP), large language models (LLMs), and generative AI. In his free time he is pursuing a PhD in ML Engineering at University of Regensburg, focusing on applied NLP in the science domain.

Matt Middleton is the Senior Product Partner Ecosystem Manager at Contentful. He runs the strategy and operations of Contentful’s Technology Partner Program. Matt’s background includes eCommerce, enterprise search, personalization, and marketing automation.

Unlock the potential of generative AI in industrial operations

In the evolving landscape of manufacturing, the transformative power of AI and machine learning (ML) is evident, driving a digital revolution that streamlines operations and boosts productivity. However, this progress introduces unique challenges for enterprises navigating data-driven solutions. Industrial facilities grapple with vast volumes of unstructured data, sourced from sensors, telemetry systems, and equipment dispersed across production lines. Real-time data is critical for applications like predictive maintenance and anomaly detection, yet developing custom ML models for each industrial use case with such time series data demands considerable time and resources from data scientists, hindering widespread adoption.

Generative AI using large pre-trained foundation models (FMs) such as Claude can rapidly generate a variety of content from conversational text to computer code based on simple text prompts, known as zero-shot prompting. This eliminates the need for data scientists to manually develop specific ML models for each use case, and therefore democratizes AI access, benefitting even small manufacturers. Workers gain productivity through AI-generated insights, engineers can proactively detect anomalies, supply chain managers optimize inventories, and plant leadership makes informed, data-driven decisions.

Nevertheless, standalone FMs face limitations in handling complex industrial data because of context size constraints (typically less than 200,000 tokens). To address this, you can use the FM’s ability to generate code in response to natural language queries (NLQs). Agents like PandasAI come into play, running this code on high-resolution time series data and handling errors using FMs. PandasAI is a Python library that adds generative AI capabilities to pandas, the popular data analysis and manipulation tool.

However, complex NLQs, such as time series data processing, multi-level aggregation, and pivot or joint table operations, may yield inconsistent Python script accuracy with a zero-shot prompt.

To enhance code generation accuracy, we propose dynamically constructing multi-shot prompts for NLQs. Multi-shot prompting provides additional context to the FM by showing it several examples of desired outputs for similar prompts, boosting accuracy and consistency. In this post, multi-shot prompts are retrieved from an embedding containing successful Python code run on a similar data type (for example, high-resolution time series data from Internet of Things devices). The dynamically constructed multi-shot prompt provides the most relevant context to the FM, and boosts the FM’s capability in advanced math calculation, time series data processing, and data acronym understanding. This improved response facilitates enterprise workers and operational teams in engaging with data, deriving insights without requiring extensive data science skills.

Beyond time series data analysis, FMs prove valuable in various industrial applications. Maintenance teams assess asset health, capture images for Amazon Rekognition-based functionality summaries, and perform anomaly root cause analysis using intelligent searches with Retrieval Augmented Generation (RAG). To simplify these workflows, AWS has introduced Amazon Bedrock, enabling you to build and scale generative AI applications with state-of-the-art pre-trained FMs like Claude v2. With Knowledge Bases for Amazon Bedrock, you can simplify the RAG development process to provide more accurate anomaly root cause analysis for plant workers. Our post showcases an intelligent assistant for industrial use cases powered by Amazon Bedrock, addressing NLQ challenges, generating part summaries from images, and enhancing FM responses for equipment diagnosis through the RAG approach.

Solution overview

The following diagram illustrates the solution architecture.

The workflow includes three distinct use cases:

Use case 1: NLQ with time series data

The workflow for NLQ with time series data consists of the following steps:

  1. We use a condition monitoring system with ML capabilities for anomaly detection, such as Amazon Monitron, to monitor industrial equipment health. Amazon Monitron is able to detect potential equipment failures from the equipment’s vibration and temperature measurements.
  2. We collect time series data by processing Amazon Monitron data through Amazon Kinesis Data Streams and Amazon Data Firehose, converting it into a tabular CSV format and saving it in an Amazon Simple Storage Service (Amazon S3) bucket.
  3. The end-user can start chatting with their time series data in Amazon S3 by sending a natural language query to the Streamlit app.
  4. The Streamlit app forwards user queries to the Amazon Bedrock Titan text embedding model to embed this query, and performs a similarity search within an Amazon OpenSearch Service index, which contains prior NLQs and example codes.
  5. After the similarity search, the top similar examples, including NLQ questions, data schema, and Python codes, are inserted in a custom prompt.
  6. PandasAI sends this custom prompt to the Amazon Bedrock Claude v2 model.
  7. The app uses the PandasAI agent to interact with the Amazon Bedrock Claude v2 model, generating Python code for Amazon Monitron data analysis and NLQ responses.
  8. After the Amazon Bedrock Claude v2 model returns the Python code, PandasAI runs the Python query on the Amazon Monitron data uploaded from the app, collecting code outputs and addressing any necessary retries for failed runs.
  9. The Streamlit app collects the response via PandasAI, and provides the output to users. If the output is satisfactory, the user can mark it as helpful, saving the NLQ and Claude-generated Python code in OpenSearch Service.

Use case 2: Summary generation of malfunctioning parts

Our summary generation use case consists of the following steps:

  1. After the user knows which industrial asset shows anomalous behavior, they can upload images of the malfunctioning part to identify if there is something physically wrong with this part according to its technical specification and operation condition.
  2. The user can use the Amazon Rekognition DetectText API to extract text data from these images.
  3. The extracted text data is included in the prompt for the Amazon Bedrock Claude v2 model, enabling the model to generate a 200-word summary of the malfunctioning part. The user can use this information to perform further inspection of the part.

Use case 3: Root cause diagnosis

Our root cause diagnosis use case consists of the following steps:

  1. The user obtains enterprise data in various document formats (PDF, TXT, and so on) related with malfunctioning assets, and uploads them to an S3 bucket.
  2. A knowledge base of these files is generated in Amazon Bedrock with a Titan text embeddings model and a default OpenSearch Service vector store.
  3. The user poses questions related to the root cause diagnosis for malfunctioning equipment. Answers are generated through the Amazon Bedrock knowledge base with a RAG approach.
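
As a minimal sketch of step 3, a RAG query against a knowledge base can be issued through the Bedrock agent runtime RetrieveAndGenerate API; the knowledge base ID and model ARN below are placeholders you would replace with your own values:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Placeholders: the ID of the knowledge base you create and the ARN of the model that should answer
KNOWLEDGE_BASE_ID = "<your-knowledge-base-id>"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2"

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "My actuator travels slow, what might be the issue?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KNOWLEDGE_BASE_ID,
            "modelArn": MODEL_ARN,
        },
    },
)

# The generated answer; citations back to the source documents are also returned
print(response["output"]["text"])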

Prerequisites

To follow along with this post, you should meet the following prerequisites:

Deploy the solution infrastructure

To set up your solution resources, complete the following steps:

  1. Deploy the AWS CloudFormation template opensearchsagemaker.yml, which creates an OpenSearch Service collection and index, a SageMaker notebook instance, and an S3 bucket. You can name this AWS CloudFormation stack genai-sagemaker.
  2. Open the SageMaker notebook instance in JupyterLab. You will find the following GitHub repo already downloaded on this instance: unlocking-the-potential-of-generative-ai-in-industrial-operations.
  3. Run the notebook from the following directory in this repository: unlocking-the-potential-of-generative-ai-in-industrial-operations/SagemakerNotebook/nlq-vector-rag-embedding.ipynb. This notebook will load the OpenSearch Service index using the SageMaker notebook to store key-value pairs from the existing 23 NLQ examples.
  4. Upload documents from the data folder assetpartdoc in the GitHub repository to the S3 bucket listed in the CloudFormation stack outputs.

Next, you create the knowledge base for the documents in Amazon S3.

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. For Knowledge base name, enter a name.
  4. For Runtime role, select Create and use a new service role.
  5. For Data source name, enter the name of your data source.
  6. For S3 URI, enter the S3 path of the bucket where you uploaded the root cause documents.
  7. Choose Next.
    The Titan embeddings model is automatically selected.
  8. Select Quick create a new vector store.
  9. Review your settings and create the knowledge base by choosing Create knowledge base.
  10. After the knowledge base is successfully created, choose Sync to sync the S3 bucket with the knowledge base.
  11. After you set up the knowledge base, you can test the RAG approach for root cause diagnosis by asking questions like “My actuator travels slow, what might be the issue?”

The next step is to deploy the app with the required library packages on either your PC or an EC2 instance (Ubuntu Server 22.04 LTS).

  1. Set up your AWS credentials with the AWS CLI on your local PC. For simplicity, you can use the same admin role you used to deploy the CloudFormation stack. If you’re using Amazon EC2, attach a suitable IAM role to the instance.
  2. Clone GitHub repo:
    git clone https://github.com/aws-samples/unlocking-the-potential-of-generative-ai-in-industrial-operations

  3. Change the directory to unlocking-the-potential-of-generative-ai-in-industrial-operations/src and run the setup.sh script in this folder to install the required packages, including LangChain and PandasAI:
    cd unlocking-the-potential-of-generative-ai-in-industrial-operations/src
    chmod +x ./setup.sh
    ./setup.sh   
  4. Run the Streamlit app with the following command:
    source monitron-genai/bin/activate
    python3 -m streamlit run app_bedrock.py <REPLACE WITH YOUR BEDROCK KNOWLEDGEBASE ARN>
    

Provide the OpenSearch Service collection ARN you created in Amazon Bedrock from the previous step.

Chat with your asset health assistant

After you complete the end-to-end deployment, you can access the app via localhost on port 8501, which opens a browser window with the web interface. If you deployed the app on an EC2 instance, allow port 8501 access via the security group inbound rule. You can navigate to different tabs for various use cases.

Explore use case 1

To explore the first use case, choose Data Insight and Chart. Begin by uploading your time series data. If you don’t have an existing time series data file to use, you can upload the following sample CSV file with anonymous Amazon Monitron project data. If you already have an Amazon Monitron project, refer to Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis to stream your Amazon Monitron data to Amazon S3 and use your data with this application.

When the upload is complete, enter a query to initiate a conversation with your data. The left sidebar offers a range of example questions for your convenience. The following screenshots illustrate the response and Python code generated by the FM when inputting a question such as “Tell me the unique number of sensors for each site shown as Warning or Alarm respectively?” (a hard-level question) or “For sensors shown temperature signal as NOT Healthy, can you calculate the time duration in days for each sensor shown abnormal vibration signal?” (a challenge-level question). The app will answer your question, and will also show the Python script of data analysis it performed to generate such results.

If you’re satisfied with the answer, you can mark it as Helpful, saving the NLQ and Claude-generated Python code to an OpenSearch Service index.

Explore use case 2

To explore the second use case, choose the Captured Image Summary tab in the Streamlit app. You can upload an image of your industrial asset, and the application will generate a 200-word summary of its technical specification and operation condition based on the image information. The following screenshot shows the summary generated from an image of a belt motor drive. To test this feature, if you lack a suitable image, you can use the following example image.

“Hydraulic elevator motor label” by Clarence Risher is licensed under CC BY-SA 2.0.

Explore use case 3

To explore the third use case, choose the Root cause diagnosis tab. Input a query related to your broken industrial asset, such as, “My actuator travels slow, what might be the issue?” As depicted in the following screenshot, the application delivers a response with the source document excerpt used to generate the answer.

Use case 1: Design details

In this section, we discuss the design details of the application workflow for the first use case.

Custom prompt building

The user’s natural language query comes at different difficulty levels: easy, hard, and challenge.

Straightforward questions may include the following requests:

  • Select unique values
  • Count total numbers
  • Sort values

For these questions, PandasAI can directly interact with the FM to generate Python scripts for processing.

Hard questions require basic aggregation operation or time series analysis, such as the following:

  • Select value first and group results hierarchically
  • Perform statistics after initial record selection
  • Timestamp count (for example, min and max)

For hard questions, a prompt template with detailed step-by-step instructions assists FMs in providing accurate responses.

Challenge-level questions need advanced math calculation and time series processing, such as the following:

  • Calculate anomaly duration for each sensor
  • Calculate the number of anomalous sensors for each site on a monthly basis
  • Compare sensor readings under normal operation and abnormal conditions

For these questions, you can use multi-shots in a custom prompt to enhance response accuracy. Such multi-shots show examples of advanced time series processing and math calculation, and will provide context for the FM to perform relevant inference on similar analysis. Dynamically inserting the most relevant examples from an NLQ question bank into the prompt can be a challenge. One solution is to construct embeddings from existing NLQ question samples and save these embeddings in a vector store like OpenSearch Service. When a question is sent to the Streamlit app, the question will be vectorized by BedrockEmbeddings. The top N most-relevant embeddings to that question are retrieved using opensearch_vector_search.similarity_search and inserted into the prompt template as a multi-shot prompt.

The following diagram illustrates this workflow.

The embedding layer is constructed using three key tools:

  • Embeddings model – We use Amazon Titan Embeddings available through Amazon Bedrock (amazon.titan-embed-text-v1) to generate numerical representations of textual documents.
  • Vector store – For our vector store, we use OpenSearch Service via the LangChain framework, streamlining the storage of embeddings generated from NLQ examples in this notebook.
  • Index – The OpenSearch Service index plays a pivotal role in comparing input embeddings to document embeddings and facilitating the retrieval of relevant documents. Because the Python example codes were saved as a JSON file, they were indexed in OpenSearch Service as vectors via an OpenSearchVectorSearch.from_texts API call.
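
A minimal sketch of wiring these three components together with LangChain might look as follows; the OpenSearch endpoint, credentials, index name, and example strings are placeholders, and import paths can differ between LangChain versions:

import boto3
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import OpenSearchVectorSearch

# Titan text embeddings served through Amazon Bedrock
bedrock_runtime = boto3.client("bedrock-runtime")
embeddings = BedrockEmbeddings(client=bedrock_runtime, model_id="amazon.titan-embed-text-v1")

# Each stored example combines the NLQ question, data schema, and Python code as one string
nlq_examples = [
    "question: Calculate anomaly duration for each sensor column_info: ... python_code: ...",
]

# Index the examples as vectors in OpenSearch Service
opensearch_vector_search = OpenSearchVectorSearch.from_texts(
    texts=nlq_examples,
    embedding=embeddings,
    opensearch_url="https://<your-opensearch-endpoint>:443",
    http_auth=("<user>", "<password>"),  # replace with your cluster or collection auth
    index_name="nlq-examples",
)

# At query time, retrieve the most similar examples to build the multi-shot prompt
similar_docs = opensearch_vector_search.similarity_search(
    "Calculate the time duration in days for each sensor showing an abnormal vibration signal", k=3
)
multi_shot_context = "\n\n".join(doc.page_content for doc in similar_docs)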

Continuous collection of human-audited examples via Streamlit

At the outset of app development, we began with only 23 saved examples in the OpenSearch Service index as embeddings. As the app goes live in the field, users start inputting their NLQs via the app. However, due to the limited examples available in the template, some NLQs may not find similar prompts. To continuously enrich these embeddings and offer more relevant user prompts, you can use the Streamlit app for gathering human-audited examples.

Within the app, the following function serves this purpose. When end-users find the output helpful and select Helpful, the application follows these steps:

  1. Use the callback method from PandasAI to collect the Python script.
  2. Reformat the Python script, input question, and CSV metadata into a string.
  3. Check whether this NLQ example already exists in the current OpenSearch Service index using opensearch_vector_search.similarity_search_with_score.
  4. If there’s no similar example, this NLQ is added to the OpenSearch Service index using opensearch_vector_search.add_texts.

In the event that a user selects Not Helpful, no action is taken. This iterative process makes sure that the system continually improves by incorporating user-contributed examples.

def addtext_opensearch(input_question, generated_chat_code, df_column_metadata,
                       opensearch_vector_search, similarity_threshold, kexamples, indexname):
    # Build the input question and generated code in the same format as the existing OpenSearch index
    reconstructed_json = {}
    reconstructed_json["question"] = input_question
    reconstructed_json["python_code"] = str(generated_chat_code)
    reconstructed_json["column_info"] = df_column_metadata
    json_str = ''
    for key, value in reconstructed_json.items():
        json_str += key + ':' + value
    reconstructed_raw_text = []
    reconstructed_raw_text.append(json_str)

    # Retrieve the kexamples most similar stored examples together with their similarity scores
    results = opensearch_vector_search.similarity_search_with_score(
        str(reconstructed_raw_text[0]), k=kexamples
    )
    # If the most similar stored example scores below the threshold, add the new example to the index
    if results[0][1] < similarity_threshold:
        response = opensearch_vector_search.add_texts(
            texts=reconstructed_raw_text, engine="faiss", index_name=indexname
        )
    else:
        response = "A similar embedding already exists; no action taken."

    return response

By incorporating human auditing, the quantity of examples in OpenSearch Service available for prompt embedding grows as the app gains usage. This expanded embedding dataset results in enhanced search accuracy over time. Specifically, for challenging NLQs, the FM’s response accuracy reaches approximately 90% when dynamically inserting similar examples to construct custom prompts for each NLQ question. This represents a notable 28% increase compared to scenarios without multi-shot prompts.

Use case 2: Design details

On the Streamlit app’s Captured Image Summary tab, you can directly upload an image file. This initiates the Amazon Rekognition API (detect_text API), extracting text from the image label detailing machine specifications. Subsequently, the extracted text data is sent to the Amazon Bedrock Claude model as the context of a prompt, resulting in a 200-word summary.

From a user experience perspective, enabling streaming functionality for a text summarization task is paramount, allowing users to read the FM-generated summary in smaller chunks rather than waiting for the entire output. Amazon Bedrock facilitates streaming via its API (bedrock_runtime.invoke_model_with_response_stream).
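
Taken together, the two calls look roughly like the following sketch; the image file name and prompt wording are illustrative, and the summary length is controlled through the prompt and max_tokens_to_sample:

import json
import boto3

rekognition = boto3.client("rekognition")
bedrock_runtime = boto3.client("bedrock-runtime")

# Extract the label text from the uploaded image (file name is a placeholder)
with open("motor_label.jpg", "rb") as f:
    detections = rekognition.detect_text(Image={"Bytes": f.read()})
label_text = " ".join(
    d["DetectedText"] for d in detections["TextDetections"] if d["Type"] == "LINE"
)

# Stream a roughly 200-word summary from Claude v2, printing chunks as they arrive
prompt = f"\n\nHuman: Summarize this equipment label in about 200 words:\n{label_text}\n\nAssistant:"
response = bedrock_runtime.invoke_model_with_response_stream(
    modelId="anthropic.claude-v2",
    body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 400}),
)
for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    print(chunk.get("completion", ""), end="", flush=True)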

Use case 3: Design details

In this scenario, we’ve developed a chatbot application focused on root cause analysis, employing the RAG approach. This chatbot draws from multiple documents related to bearing equipment to facilitate root cause analysis. This RAG-based root cause analysis chatbot uses knowledge bases for generating vector text representations, or embeddings. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources or manage data flows and RAG implementation details.

When you’re satisfied with the knowledge base response from Amazon Bedrock, you can integrate the root cause response from the knowledge base to the Streamlit app.

Clean up

To save costs, delete the resources you created in this post:

  1. Delete the knowledge base from Amazon Bedrock.
  2. Delete the OpenSearch Service index.
  3. Delete the genai-sagemaker CloudFormation stack.
  4. Stop the EC2 instance if you used an EC2 instance to run the Streamlit app.

Conclusion

Generative AI applications have already transformed various business processes, enhancing worker productivity and skill sets. However, the limitations of FMs in handling time series data analysis have hindered their full utilization by industrial clients. This constraint has impeded the application of generative AI to the predominant data type processed daily.

In this post, we introduced a generative AI Application solution designed to alleviate this challenge for industrial users. This application uses an open source agent, PandasAI, to strengthen an FM’s time series analysis capability. Rather than sending time series data directly to FMs, the app employs PandasAI to generate Python code for the analysis of unstructured time series data. To enhance the accuracy of Python code generation, a custom prompt generation workflow with human auditing has been implemented.

Empowered with insights into their asset health, industrial workers can fully harness the potential of generative AI across various use cases, including root cause diagnosis and part replacement planning. With Knowledge Bases for Amazon Bedrock, the RAG solution is straightforward for developers to build and manage.

The trajectory of enterprise data management and operations is unmistakably moving towards deeper integration with generative AI for comprehensive insights into operational health. This shift, spearheaded by Amazon Bedrock, is significantly amplified by the growing robustness and potential of LLMs like Anthropic Claude 3 on Amazon Bedrock to further elevate solutions. To learn more, consult the Amazon Bedrock documentation and get hands-on with the Amazon Bedrock workshop.


About the authors

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She is specialized in Generative AI, Applied Data Science and IoT architecture. Currently she is part of the Amazon Q team, and an active member/mentor in Machine Learning Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. She is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges.

Sudeesh Sasidharan is a Senior Solutions Architect at AWS, within the Energy team. Sudeesh loves experimenting with new technologies and building innovative solutions that solve complex business challenges. When he is not designing solutions or tinkering with the latest technologies, you can find him on the tennis court working on his backhand.

Neil Desai is a technology executive with over 20 years of experience in artificial intelligence (AI), data science, software engineering, and enterprise architecture. At AWS, he leads a team of Worldwide AI services specialist solutions architects who help customers build innovative Generative AI-powered solutions, share best practices with customers, and drive product roadmap. In his previous roles at Vestas, Honeywell, and Quest Diagnostics, Neil has held leadership roles in developing and launching innovative products and services that have helped companies improve their operations, reduce costs, and increase revenue. He is passionate about using technology to solve real-world problems and is a strategic thinker with a proven track record of success.

Enhance performance of generative language models with self-consistency prompting on Amazon Bedrock

Generative language models have proven remarkably skillful at solving logical and analytical natural language processing (NLP) tasks. Furthermore, the use of prompt engineering can notably enhance their performance. For example, chain-of-thought (CoT) is known to improve a model’s capacity for complex multi-step problems. To additionally boost accuracy on tasks that involve reasoning, a self-consistency prompting approach has been suggested, which replaces greedy with stochastic decoding during language generation.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies and Amazon via a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With the batch inference API, you can use Amazon Bedrock to run inference with foundation models in batches and get responses more efficiently. This post shows how to implement self-consistency prompting via batch inference on Amazon Bedrock to enhance model performance on arithmetic and multiple-choice reasoning tasks.

Overview of solution

Self-consistency prompting of language models relies on the generation of multiple responses that are aggregated into a final answer. In contrast to single-generation approaches like CoT, the self-consistency sample-and-marginalize procedure creates a range of model completions that lead to a more consistent solution. The generation of different responses for a given prompt is possible due to the use of a stochastic, rather than greedy, decoding strategy.

The following figure shows how self-consistency differs from greedy CoT in that it generates a diverse set of reasoning paths and aggregates them to produce the final answer.

Differences between self-consistency and CoT prompting.

Decoding strategies for text generation

Text generated by decoder-only language models unfolds word by word, with the subsequent token being predicted on the basis of the preceding context. For a given prompt, the model computes a probability distribution indicating the likelihood of each token to appear next in the sequence. Decoding involves translating these probability distributions into actual text. Text generation is mediated by a set of inference parameters that are often hyperparameters of the decoding method itself. One example is the temperature, which modulates the probability distribution of the next token and influences the randomness of the model’s output.

Greedy decoding is a deterministic decoding strategy that at each step selects the token with the highest probability. Although straightforward and efficient, the approach risks falling into repetitive patterns, because it disregards the broader probability space. Setting the temperature parameter to 0 at inference time essentially equates to implementing greedy decoding.

Sampling introduces stochasticity into the decoding process by randomly selecting each subsequent token based on the predicted probability distribution. This randomness results in greater output variability. Stochastic decoding proves more adept at capturing the diversity of potential outputs and often yields more imaginative responses. Higher temperature values introduce more fluctuations and increase the creativity of the model’s response.
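
To make the contrast concrete, here is a minimal sketch of invoking a model on Amazon Bedrock with different temperature values; the model ID matches the Cohere Command model used later in this post, while the prompt and parameter values are illustrative:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate(prompt, temperature):
    # temperature=0 approximates greedy decoding; higher values enable stochastic sampling
    body = json.dumps({"prompt": prompt, "temperature": temperature, "max_tokens": 512})
    response = bedrock_runtime.invoke_model(modelId="cohere.command-text-v14", body=body)
    return json.loads(response["body"].read())["generations"][0]["text"]

question = "Q: A robot picks 48 parts per hour. How many parts does it pick in 6 hours? Think step by step.\nA:"
greedy_answer = generate(question, temperature=0)      # deterministic, repeatable
sampled_answer = generate(question, temperature=0.7)   # varies from run to run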

Prompting techniques: CoT and self-consistency

The reasoning ability of language models can be augmented via prompt engineering. In particular, CoT has been shown to elicit reasoning in complex NLP tasks. One way to implement a zero-shot CoT is via prompt augmentation with the instruction to “think step by step.” Another is to expose the model to exemplars of intermediate reasoning steps in few-shot prompting fashion. Both scenarios typically use greedy decoding. CoT leads to significant performance gains compared to simple instruction prompting on arithmetic, commonsense, and symbolic reasoning tasks.

Self-consistency prompting is based on the assumption that introducing diversity in the reasoning process can be beneficial to help models converge on the correct answer. The technique uses stochastic decoding to achieve this goal in three steps:

  1. Prompt the language model with CoT exemplars to elicit reasoning.
  2. Replace greedy decoding with a sampling strategy to generate a diverse set of reasoning paths.
  3. Aggregate the results to find the most consistent answer in the response set.

Self-consistency is shown to outperform CoT prompting on popular arithmetic and commonsense reasoning benchmarks. A limitation of the approach is its larger computational cost.

This post shows how self-consistency prompting enhances performance of generative language models on two NLP reasoning tasks: arithmetic problem-solving and multiple-choice domain-specific question answering. We demonstrate the approach using batch inference on Amazon Bedrock:

  • We access the Amazon Bedrock Python SDK in JupyterLab on an Amazon SageMaker notebook instance.
  • For arithmetic reasoning, we prompt Cohere Command on the GSM8K dataset of grade school math problems.
  • For multiple-choice reasoning, we prompt AI21 Labs Jurassic-2 Mid on a small sample of questions from the AWS Certified Solutions Architect – Associate exam.

Prerequisites

This walkthrough assumes the following prerequisites:

Manage model access on Amazon Bedrock

The estimated cost to run the code shown in this post is $100, assuming you run self-consistency prompting one time with 30 reasoning paths using one value for the temperature-based sampling.

Dataset to probe arithmetic reasoning capabilities

GSM8K is a dataset of human-assembled grade school math problems featuring a high linguistic diversity. Each problem takes 2–8 steps to solve and requires performing a sequence of elementary calculations with basic arithmetic operations. This data is commonly used to benchmark the multi-step arithmetic reasoning capabilities of generative language models. The GSM8K train set comprises 7,473 records. The following is an example:

{"question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.n#### 72"}

Set up to run batch inference with Amazon Bedrock

Batch inference allows you to run multiple inference calls to Amazon Bedrock asynchronously and improve the performance of model inference on large datasets. The service is in preview as of this writing and only available through the API. Refer to Run batch inference to access batch inference APIs via custom SDKs.

After you have downloaded and unzipped the Python SDK in a SageMaker notebook instance, you can install it by running the following code in a Jupyter notebook cell:

# Install preview SDK packages
!pip install -q $(ls ./bedrock-python-sdk-reinvent/botocore-*.whl | head -1)
!pip install -q $(ls ./bedrock-python-sdk-reinvent/boto3-*.whl | head -1)

Format and upload input data to Amazon S3

Input data for batch inference needs to be prepared in JSONL format with recordId and modelInput keys. The latter should match the body field of the model to be invoked on Amazon Bedrock. In particular, some supported inference parameters for Cohere Command are temperature for randomness, max_tokens for output length, and num_generations to generate multiple responses, all of which are passed together with the prompt as modelInput:

data = [
    {
        "recordId": "1",
        "modelInput": {
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "num_generations": n,
        },
    },
    ...,
]

See Inference parameters for foundation models for more details, including other model providers.

Our experiments on arithmetic reasoning are performed in the few-shot setting without customizing or fine-tuning Cohere Command. We use the same set of eight few-shot exemplars from the chain-of-thought (Table 20) and self-consistency (Table 17) papers. Prompts are created by concatenating the exemplars with each question from the GSM8K train set.
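
The following is a minimal sketch of that prompt construction, assuming the GSM8K train split is loaded with the Hugging Face datasets library (the exemplar text is abbreviated here and should be copied verbatim from the referenced papers):

from datasets import load_dataset

# GSM8K train split with 7,473 records of {"question": ..., "answer": ...}
gsm8k_train = load_dataset("gsm8k", "main", split="train")

# Eight chain-of-thought exemplars (abbreviated); copy the full text from Table 20 / Table 17
exemplars = [
    "Q: There are 15 trees in the grove. ... A: ... The answer is 6.",
    # ... seven more exemplars ...
]
few_shot_prefix = "\n\n".join(exemplars)

# One prompt per GSM8K question: the exemplars followed by the new question
prompts = [f"{few_shot_prefix}\n\nQ: {record['question']}\nA:" for record in gsm8k_train]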

We set max_tokens to 512 and num_generations to 5, the maximum allowed by Cohere Command. For greedy decoding, we set temperature to 0 and for self-consistency, we run three experiments at temperatures 0.5, 0.7, and 1. Each setting yields different input data according to the respective temperature values. Data is formatted as JSONL and stored in Amazon S3.

# Set up S3 client
session = boto3.Session()
s3 = session.client("s3")

# Create S3 bucket with unique name to store input/output data
suffix = str(uuid.uuid4())[:8]
bucket = f"bedrock-self-consistency-{suffix}"
s3.create_bucket(
    Bucket=bucket, CreateBucketConfiguration={"LocationConstraint": session.region_name}
)

# Process data and output to new lines as JSONL
input_key = f"gsm8k/T{temperature}/input.jsonl"
s3_data = ""
for row in data:
    s3_data += json.dumps(row) + "\n"
s3.put_object(Body=s3_data, Bucket=bucket, Key=input_key)

Create and run batch inference jobs in Amazon Bedrock

Batch inference job creation requires an Amazon Bedrock client. We specify the S3 input and output paths and give each invocation job a unique name:

# Create Bedrock client							    
bedrock = boto3.client("bedrock")

# Input and output config						     
input_config = {"s3InputDataConfig": {"s3Uri": f"s3://{bucket}/{input_key}"}}
output_config = {"s3OutputDataConfig": {"s3Uri": f"s3://{bucket}/{output_key}"}}

# Create a unique job name
suffix = str(uuid.uuid4())[:8] 
job_name = f"command-batch-T{temperature}-{suffix}"

Jobs are created by passing the IAM role, model ID, job name, and input/output configuration as parameters to the Amazon Bedrock API:

response = bedrock.create_model_invocation_job(
    roleArn=f"arn:aws:iam::{account_id}:role/BedrockBatchInferenceRole",
    modelId="cohere.command-text-v14",
    jobName=job_name,
    inputDataConfig=input_config,
    outputDataConfig=output_config,
)
job_arn = response["jobArn"]

Listing, monitoring, and stopping batch inference jobs is supported by their respective API calls. On creation, jobs appear first as Submitted, then as InProgress, and finally as Stopped, Failed, or Completed.

# Get job details
job_details = bedrock.get_model_invocation_job(jobIdentifier=job_arn)

If the jobs complete successfully, the generated content can be retrieved from Amazon S3 using its unique output location.

# Get the output file key
s3_prefix = f"s3://{bucket}/"
output_path = job_details["outputDataConfig"]["s3OutputDataConfig"]["s3Uri"].replace(
    s3_prefix, ""
)
output_folder = job_details["jobArn"].split("/")[1]
output_file = (
    f'{job_details["inputDataConfig"]["s3InputDataConfig"]["s3Uri"].split("/")[-1]}.out'
)
result_key = f"{output_path}{output_folder}/{output_file}"

# Get output data
obj = s3.get_object(Bucket=bucket, Key=result_key)
content = obj["Body"].read().decode("utf-8").strip().split("\n")

# Show answer to the first question
print(json.loads(content[0])["modelOutput"]["generations"][0]["text"])

[Out]: 'Natalia sold 48 * 1/2 = 24 clips less in May. This means she sold 48 + 24 = 72 clips in April and May. The answer is 72.'

Self-consistency enhances model accuracy on arithmetic tasks

Self-consistency prompting of Cohere Command outperforms a greedy CoT baseline in terms of accuracy on the GSM8K dataset. For self-consistency, we sample 30 independent reasoning paths at three different temperatures, with topP and topK set to their default values. Final solutions are aggregated by choosing the most consistent occurrence via majority voting. In case of a tie, we randomly choose one of the majority responses. We compute accuracy and standard deviation values averaged over 100 runs.
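
The aggregation step can be sketched as follows; it assumes each completion ends with a sentence of the form “The answer is N.”, as in the few-shot exemplars, and the helper names are illustrative:

import random
import re
from collections import Counter

def extract_answer(completion):
    # Pull the final numeric answer from a completion ending in "The answer is N."
    matches = re.findall(r"The answer is \$?(-?[\d,.]+)", completion)
    return matches[-1].rstrip(".").replace(",", "") if matches else None

def majority_vote(completions):
    answers = [a for a in (extract_answer(c) for c in completions) if a is not None]
    if not answers:
        return None
    counts = Counter(answers)
    top = max(counts.values())
    # Break ties by choosing randomly among the most frequent answers
    return random.choice([answer for answer, n in counts.items() if n == top])

# Example: aggregate five sampled reasoning paths for one question
final_answer = majority_vote([
    "Natalia sold 48/2 = 24 clips in May, so 48 + 24 = 72. The answer is 72.",
    "48 + 24 = 72. The answer is 72.",
    "The answer is 68.",
    "She sold 72 clips in total. The answer is 72.",
    "The answer is 70.",
])
print(final_answer)  # 72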

The following figure shows the accuracy on the GSM8K dataset from Cohere Command prompted with greedy CoT (blue) and self-consistency at temperature values 0.5 (yellow), 0.7 (green), and 1.0 (orange) as a function of the number of sampled reasoning paths.

Accuracy of Cohere Command using self-consistency vs CoT prompting.

The preceding figure shows that self-consistency enhances arithmetic accuracy over greedy CoT when the number of sampled paths is as low as three. Performance increases consistently with further reasoning paths, confirming the importance of introducing diversity in the thought generation. Cohere Command solves the GSM8K question set with 51.7% accuracy when prompted with CoT vs. 68% with 30 self-consistent reasoning paths at T=1.0. All three surveyed temperature values yield similar results, with lower temperatures being comparatively more performant at fewer sampled paths.

Practical considerations on efficiency and cost

Self-consistency is limited by the increased response time and cost incurred when generating multiple outputs per prompt. As a practical illustration, batch inference for greedy generation with Cohere Command on 7,473 GSM8K records finished in less than 20 minutes. The job took 5.5 million tokens as input and generated 630,000 output tokens. At current Amazon Bedrock inference prices, the total cost incurred was around $9.50.

For self-consistency with Cohere Command, we use inference parameter num_generations to create multiple completions per prompt. As of this writing, Amazon Bedrock allows a maximum of five generations and three concurrent Submitted batch inference jobs. Jobs proceed to the InProgress status sequentially, therefore sampling more than five paths requires multiple invocations.

The following figure shows the runtimes for Cohere Command on the GSM8K dataset. Total runtime is shown on the x axis and runtime per sampled reasoning path on the y axis. Greedy generation runs in the shortest time but incurs a higher time cost per sampled path.

Runtimes for Cohere Command

Greedy generation completes in less than 20 minutes for the full GSM8K set and samples a unique reasoning path. Self-consistency with five samples requires about 50% longer to complete and costs around $14.50, but produces five paths (over 500%) in that time. Total runtime and cost increase step-wise with every extra five sampled paths. A cost-benefit analysis suggests that 1–2 batch inference jobs with 5–10 sampled paths is the recommended setting for practical implementation of self-consistency. This achieves enhanced model performance while keeping cost and latency at bay.

Self-consistency enhances model performance beyond arithmetic reasoning

A crucial question to prove the suitability of self-consistency prompting is whether the method succeeds across further NLP tasks and language models. As an extension to an Amazon-related use case, we perform a small-sized analysis on sample questions from the AWS Solutions Architect Associate Certification. This is a multiple-choice exam on AWS technology and services that requires domain knowledge and the ability to reason and decide among several options.

We prepare a dataset from SAA-C01 and SAA-C03 sample exam questions. From the 20 available questions, we use the first 4 as few-shot exemplars and prompt the model to answer the remaining 16. This time, we run inference with the AI21 Labs Jurassic-2 Mid model and generate a maximum of 10 reasoning paths at temperature 0.7. Results show that self-consistency enhances performance: although greedy CoT produces 11 correct answers, self-consistency succeeds on 2 more.

The following table shows the accuracy results for 5 and 10 sampled paths averaged over 100 runs.

                       Greedy decoding    T = 0.7
# sampled paths: 5     68.6               74.1 ± 0.7
# sampled paths: 10    68.6               78.9 ± 0.3

In the following, we present two exam questions that are incorrectly answered by greedy CoT while self-consistency succeeds, showing in each case the reasoning traces that led the model to produce correct or incorrect responses. Although not every sampled path generated by self-consistency is correct, the majority converges on the true answer as the number of sampled paths increases. We observe that 5–10 paths are typically enough to improve over the greedy results, with diminishing returns in terms of efficiency past those values.

Question 1

A web application allows customers to upload orders to an S3 bucket. The resulting Amazon S3 events trigger a Lambda function that inserts a message to an SQS queue. A single EC2 instance reads messages from the queue, processes them, and stores them in a DynamoDB table partitioned by unique order ID. Next month traffic is expected to increase by a factor of 10 and a Solutions Architect is reviewing the architecture for possible scaling problems.

Which component is MOST likely to need re-architecting to be able to scale to accommodate the new traffic?

A. Lambda function
B. SQS queue
C. EC2 instance
D. DynamoDB table

Correct answer: C – A single EC2 instance will not scale and is a single point of failure in the architecture. A much better solution would be to have EC2 instances in an Auto Scaling group across 2 Availability Zones read messages from the queue. The other responses are all managed services that can be configured to scale or will scale automatically.

Greedy decoding: Scaling the EC2 instance directly may not be possible, as it may not be possible to resize the instance. Scaling the Lambda function may not be sufficient, as Lambda functions have a 6 GB of memory allocation limit. Scaling the SQS queue may not be sufficient, as SQS has a 10,000 message limit. DynamoDB tables can be scaled by provisioning more capacity units. The correct answer is D.

Sampled path #1: Scaling the EC2 instance to 10 times its current size may not be feasible, either from a cost or performance perspective. It would require a larger instance, which can be expensive. A more cost-effective approach might be to use a swarm of smaller instances (e.g., Auto Scaling Group) to share the load. In this case, the EC2 instance would likely need to be re-architected. The correct answer is C.

Sampled path #2: Scaling a single EC2 instance is more difficult and problematic than scaling the other service components. Amazon SQS, Lambda, and DynamoDB can all scale automatically to cope with the increased load. The correct answer is C.

Question 2

An application running on AWS uses an Amazon Aurora Multi-AZ DB cluster deployment for its database. When evaluating performance metrics, a solutions architect discovered that the database reads are causing high I/O and adding latency to the write requests against the database.

What should the solutions architect do to separate the read requests from the write requests?

A. Enable read-through caching on the Aurora database.
B. Update the application to read from the Multi-AZ standby instance.
C. Create an Aurora replica and modify the application to use the appropriate endpoints.
D. Create a second Aurora database and link it to the primary database as a read replica.

Correct answer: C – Aurora Replicas provide a way to offload read traffic. Aurora Replicas share the same underlying storage as the main database, so lag time is generally very low. Aurora Replicas have their own endpoints, so the application will need to be configured to direct read traffic to the new endpoints.

Greedy decoding: Modifying the application to read from the Multi-AZ standby instance (option B) is the best approach. The standby instance is asynchronously replicated from the primary instance, so read operations will not interfere with write operations. Enabling read-through caching (option A) or creating an Aurora replica (option C) will not solve the I/O contention issue. Creating a second Aurora database and linking it as a read replica (option D) will not solve the I/O contention issue either. The correct answer is B.

Sampled path #1: Modifying the application to use the appropriate endpoints is not a practical solution as it would require changes to the application code. The correct answer is C: Create an Aurora replica and modify the application to use the appropriate endpoints.

Sampled path #2: (C) Create an Aurora replica and modify the application to use the appropriate endpoints. By configuring an Aurora Replica, you can separate read traffic from write traffic. The Aurora Replicas use different endpoint URLs, allowing you to direct read traffic to the replica instead of the primary database. The replica can process read requests in parallel with write requests to the primary database, reducing I/O and latency.

Clean up

Running batch inference in Amazon Bedrock is subject to charges according to the Amazon Bedrock Pricing. When you complete the walkthrough, delete your SageMaker notebook instance and remove all data from your S3 buckets to avoid incurring future charges.

Considerations

Although the demonstrated solution shows improved performance of language models when prompted with self-consistency, it’s important to note that the walkthrough is not production-ready. Before you deploy to production, you should adapt this proof of concept to your own implementation, keeping in mind the following requirements:

  • Access restriction to APIs and databases to prevent unauthorized usage.
  • Adherence to AWS security best practices regarding IAM role access and security groups.
  • Validation and sanitization of user input to prevent prompt injection attacks.
  • Monitoring and logging of triggered processes to enable testing and auditing.

Conclusion

This post shows that self-consistency prompting enhances performance of generative language models in complex NLP tasks that require arithmetic and multiple-choice logical skills. Self-consistency uses temperature-based stochastic decoding to generate various reasoning paths. This increases the ability of the model to elicit diverse and useful thoughts to arrive at correct answers.

With Amazon Bedrock batch inference, the language model Cohere Command is prompted to generate self-consistent answers to a set of arithmetic problems. Accuracy improves from 51.7% with greedy decoding to 68% with self-consistency sampling 30 reasoning paths at T=1.0. Sampling five paths already enhances accuracy by 7.5 percentage points. The approach is transferable to other language models and reasoning tasks, as demonstrated by results of the AI21 Labs Jurassic-2 Mid model on an AWS Certification exam. In a small-sized question set, self-consistency with five sampled paths increases accuracy by 5 percentage points over greedy CoT.

We encourage you to implement self-consistency prompting for enhanced performance in your own applications with generative language models. Learn more about Cohere Command and AI21 Labs Jurassic models available on Amazon Bedrock. For more information about batch inference, refer to Run batch inference.

Acknowledgements

The author thanks technical reviewers Amin Tajgardoon and Patrick McSweeney for helpful feedback.


About the Author

Lucía Santamaría is a Sr. Applied Scientist at Amazon’s ML University, where she’s focused on raising the level of ML competency across the company through hands-on education. Lucía has a PhD in astrophysics and is passionate about democratizing access to tech knowledge and tools.

Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices

NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances hosted by SageMaker.

NIM, part of the NVIDIA AI Enterprise software platform listed on AWS Marketplace, is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications, providing natural language processing (NLP) and understanding capabilities, whether you’re developing chatbots, summarizing documents, or implementing other NLP-powered applications. You can use pre-built NVIDIA containers to host popular LLMs that are optimized for specific NVIDIA GPUs for quick deployment or use NIM tools to create your own containers.

In this post, we provide a high-level introduction to NIM and show how you can use it with SageMaker.

An introduction to NVIDIA NIM

NIM provides optimized and pre-generated engines for a variety of popular models for inference. These microservices support a variety of LLMs, such as Llama 2 (7B, 13B, and 70B), Mistral-7B-Instruct, Mixtral-8x7B, NVIDIA Nemotron-3 22B Persona, and Code Llama 70B, out of the box using pre-built NVIDIA TensorRT engines tailored for specific NVIDIA GPUs for maximum performance and utilization. These models are curated with the optimal hyperparameters for model-hosting performance for deploying applications with ease.

If your model is not in NVIDIA’s set of curated models, NIM offers essential utilities such as the Model Repo Generator, which facilitates the creation of a TensorRT-LLM-accelerated engine and a NIM-format model directory through a straightforward YAML file. Furthermore, an integrated community backend of vLLM provides support for cutting-edge models and emerging features that may not have been seamlessly integrated into the TensorRT-LLM-optimized stack.

In addition to creating optimized LLMs for inference, NIM provides advanced hosting technologies such as optimized scheduling techniques like in-flight batching, which can break down the overall text generation process for an LLM into multiple iterations on the model. With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the NIM runtime immediately evicts finished sequences from the batch. The runtime then begins running new requests while other requests are still in flight, making the best use of your compute instances and GPUs.

Deploying NIM on SageMaker

NIM integrates with SageMaker, allowing you to host your LLMs with performance and cost optimization while benefiting from the capabilities of SageMaker. When you use NIM on SageMaker, you can use capabilities such as scaling out the number of instances to host your model, performing blue/green deployments, and evaluating workloads using shadow testing—all with best-in-class observability and monitoring with Amazon CloudWatch.
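
An in-depth deployment guide is forthcoming (see the end of this post), but at a high level NIM follows the standard SageMaker pattern of hosting a container image behind a real-time endpoint. The following sketch shows only that generic pattern; the image URI, environment variable, instance type, and endpoint name are placeholders rather than NIM-specific values:

import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()

# Placeholder image URI: use the NIM container from your NVIDIA AI Enterprise subscription
model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<nim-llm-image>:latest",
    role=role,
    env={"MODEL_NAME": "<model-name>"},  # hypothetical configuration, for illustration only
)

# Host the container on an NVIDIA GPU instance as a SageMaker real-time endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="nim-llm-endpoint",
)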

Conclusion

Using NIM to deploy optimized LLMs can be a great option for both performance and cost. It also helps make deploying LLMs effortless. In the future, NIM will also allow for Parameter-Efficient Fine-Tuning (PEFT) customization methods like LoRA and P-tuning. NIM also plans to expand its LLM support by adding Triton Inference Server, TensorRT-LLM, and vLLM backends.

We encourage you to learn more about NVIDIA microservices and how to deploy your LLMs using SageMaker and try out the benefits available to you. NIM is available as a paid offering as part of the NVIDIA AI Enterprise software subscription available on AWS Marketplace.

In the near future, we will post an in-depth guide for NIM on SageMaker.


About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Qing Lan is a Software Development Engineer at AWS. He has been working on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed Deep Learning Systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.

Harish Tummalacherla is a Software Engineer with the Deep Learning Performance team at SageMaker. He works on performance engineering for serving large language models efficiently on SageMaker. In his spare time, he enjoys running, cycling, and ski mountaineering.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.


Fine-tune Code Llama on Amazon SageMaker JumpStart

Today, we are excited to announce the capability to fine-tune Code Llama models by Meta using Amazon SageMaker JumpStart. The Code Llama family of large language models (LLMs) is a collection of pre-trained and fine-tuned code generation models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned Code Llama models provide better accuracy and explainability over the base Code Llama models, as evident in testing against the HumanEval and MBPP datasets. You can fine-tune and deploy Code Llama models with SageMaker JumpStart using the Amazon SageMaker Studio UI with a few clicks or using the SageMaker Python SDK. Fine-tuning of Llama models is based on the scripts provided in the llama-recipes GitHub repo from Meta, using PyTorch FSDP, PEFT/LoRA, and Int8 quantization techniques.

In this post, we walk through how to fine-tune Code Llama pre-trained models via SageMaker JumpStart through a one-click UI and SDK experience available in the following GitHub repository.

What is SageMaker JumpStart

With SageMaker JumpStart, machine learning (ML) practitioners can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

What is Code Llama

Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on code-specific datasets and sampling more data from that same dataset for longer. Code Llama features enhanced coding capabilities. It can generate code and natural language about code, from both code and natural language prompts (for example, “Write me a function that outputs the Fibonacci sequence”). You can also use it for code completion and debugging. It supports many of the most popular programming languages used today, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, Bash, and more.

Why fine-tune Code Llama models

Meta published Code Llama performance benchmarks on HumanEval and MBPP for common coding languages such as Python, Java, and JavaScript. The performance of Code Llama Python models on HumanEval varied across coding languages and tasks, ranging from 38% for the 7B Python model to 57% for the 70B Python model. In addition, Code Llama models fine-tuned on the SQL programming language have shown better results, as evident in SQL evaluation benchmarks. These published benchmarks highlight the potential benefits of fine-tuning Code Llama models, enabling better performance, customization, and adaptation to specific coding domains and tasks.

No-code fine-tuning via the SageMaker Studio UI

To start fine-tuning your Llama models using SageMaker Studio, complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.

You will find listings of over 350 models, ranging from open source to proprietary models.

  2. Search for Code Llama models.

If you don’t see Code Llama models, you can update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps. You can also find other model variants by choosing Explore all Code Generation Models or searching for Code Llama in the search box.

SageMaker JumpStart currently supports instruction fine-tuning for Code Llama models. The following screenshot shows the fine-tuning page for the Code Llama 2 70B model.

  3. For Training dataset location, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning.
  4. Set your deployment configuration, hyperparameters, and security settings for fine-tuning.
  5. Choose Train to start the fine-tuning job on a SageMaker ML instance.

We discuss the dataset format you need to prepare for instruction fine-tuning in the next section.

  6. After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart.

The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

Fine-tune via the SageMaker Python SDK

In this section, we demonstrate how to fine-tune Code Llama models using the SageMaker Python SDK on an instruction-formatted dataset. Specifically, the model is fine-tuned for a set of natural language processing (NLP) tasks described using instructions. This helps improve the model’s performance for unseen tasks with zero-shot prompts.

Complete the following steps to run your fine-tuning job. You can get the entire fine-tuning code from the GitHub repository.

First, let’s look at the dataset format required for the instruction fine-tuning. The training data should be formatted in a JSON lines (.jsonl) format, where each line is a dictionary representing a data sample. All training data must be in a single folder. However, it can be saved in multiple .jsonl files. The following is a sample in JSON lines format:

{
	'system_prompt': 'a chat',
	'question': 'Please focus on the efficiency of this problem and provide code in python:\nYou are given two strings `s` and `t` consisting of only lowercase English letters.\n\nReturn _the minimum number of characters that need to be appended to the end of_ `s` _so that_ `t` _becomes a **subsequence** of_ `s`.\n\nA **subsequence** is a string that can be derived from another string by deleting some or no characters without changing the order of the remaining characters.\n\n**Example 1:**\n\n**Input:** s = "coaching ", t = "coding "\n**Output:** 4\n**Explanation:** Append the characters "ding " to the end of s so that s = "coachingding ".\nNow, t is a subsequence of s ( "**co**aching**ding** ").\nIt can be shown that appending any 3 characters to the end of s will never make t a subsequence.\n\n**Example 2:**\n\n**Input:** s = "abcde ", t = "a "\n**Output:** 0\n**Explanation:** t is already a subsequence of s ( "**a**bcde ").\n\n**Example 3:**\n\n**Input:** s = "z ", t = "abcde "\n**Output:** 5\n**Explanation:** Append the characters "abcde " to the end of s so that s = "zabcde ".\nNow, t is a subsequence of s ( "z**abcde** ").\nIt can be shown that appending any 4 characters to the end of s will never make t a subsequence.\n\n**Constraints:**\n\n* `1 <= s.length, t.length <= 105`\n* `s` and `t` consist only of lowercase English letters.',
	'response': '### Explanation\n The algorithm maintains two pointers, `i` and `j`, to iterate through the given strings `s` and `t`. The pointer `i` iterates through the string `s` and the pointer `j` iterates through the string `t`. While iterating through `s`, if we find a character that matches the current character in `t`, we increment the `j` pointer. This is because the matched character can be a part of the subsequence we are looking for. The iteration stops when either `i` reaches the end of `s` or `j` reaches the end of `t`.\n\nAt the end, we subtract the index `j` from the length of `t`. This gives us the minimum number of characters that need to be appended to the end of `s` so that `t` becomes a subsequence of `s`.\n\n### Pseudocode:\n```\nMIN-ADDITION(s, t)\n s_len = LENGTH(s)\n t_len = LENGTH(t)\n i = 1\n j = 1\n WHILE i <= s_len AND j <= t_len\n IF s[i] == t[j]\n j = j + 1\n ENDIF\n i = i + 1\n ENDWHILE\n RETURN t_len - j + 1\nEND MIN-ADDITION\n```\n\n### Code\n```python\ndef min_addition(s: str, t: str) -> int:\n s_len = len(s)\n t_len = len(t)\n i, j = 0, 0\n while i < s_len and j < t_len:\n if s[i] == t[j]:\n j += 1\n i += 1\n return t_len - j\n```\n'
}

The training folder can contain a template.json file describing the input and output formats. The following is an example template:

{
    "prompt": "{system_prompt} ### Input: {question}",
    "completion": "{response}"
}

To match the template, each sample in the JSON lines files must include system_prompt, question, and response fields. In this demonstration, we use the Dolphin Coder dataset from Hugging Face.
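
To make the mapping concrete, the following small snippet (an illustration only, not JumpStart’s internal preprocessing code) shows how the template renders one sample into the prompt and completion strings used for training. The sample shown here is a simplified stand-in:

import json

# Load the template and render one sample; each placeholder in the template
# maps to a field with the same name in the sample.
with open("template.json") as f:
    template = json.load(f)

sample = {
    "system_prompt": "a chat",
    "question": "Write a Python function that returns the n-th Fibonacci number.",
    "response": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
}

prompt = template["prompt"].format(**sample)           # "{system_prompt} ### Input: {question}"
completion = template["completion"].format(**sample)   # "{response}"
print(prompt)
print(completion)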

After you prepare the dataset and upload it to the S3 bucket, you can start fine-tuning using the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id = "meta-textgeneration-llama-codellama-7b"
model_version = "*"
train_data_location = f"s3://{your_own_bucket_hosting_training_data}/"  # training data in S3 bucket

# Example hyperparameters; see the supported hyperparameters section later in this post
hyperparameters = {"instruction_tuned": "True", "epoch": "5"}

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    hyperparameters=hyperparameters,
    environment={
        "accept_eula": "false"
    },  # change `accept_eula` to `true` to accept the EULA and start training
)

estimator.fit({"training": train_data_location})

You can deploy the fine-tuned model directly from the estimator, as shown in the following code. For details, see the notebook in the GitHub repository.

finetuned_predictor = estimator.deploy()

Fine-tuning techniques

Language models such as Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly high CUDA memory. Furthermore, training these models can be very slow due to the size of the model. Therefore, for efficient fine-tuning, we use the following optimizations:

  • Low-Rank Adaptation (LoRA) – This is a type of parameter-efficient fine-tuning (PEFT) for efficient fine-tuning of large models. With this method, you freeze the whole model and only add a small set of adjustable parameters or layers to the model. For instance, instead of training all 7 billion parameters of Llama 2 7B, you can fine-tune less than 1% of the parameters. This significantly reduces the memory requirement because you only need to store gradients, optimizer states, and other training-related information for 1% of the parameters. It also reduces training time and cost. For more details on this method, refer to LoRA: Low-Rank Adaptation of Large Language Models. A minimal configuration sketch follows this list.
  • Int8 quantization – Even with optimizations such as LoRA, models such as Llama 70B are still too big to train. To decrease the memory footprint during training, you can use Int8 quantization. Quantization typically reduces the precision of the floating point data types. Although this decreases the memory required to store model weights, it can degrade performance due to loss of information. Int8 quantization uses only a quarter of the precision, but it doesn’t incur a degradation in performance because it doesn’t simply drop the bits; it rounds the data from one type to the other. To learn about Int8 quantization, refer to LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
  • Fully Sharded Data Parallel (FSDP) – This is a type of data-parallel training algorithm that shards the model’s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, computation of each microbatch is local to the GPU worker. It shards parameters more uniformly and achieves optimized performance via communication and computation overlapping during training.
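
As a point of reference, the following minimal sketch shows what a LoRA configuration with roughly these defaults looks like using the open source Hugging Face peft library. SageMaker JumpStart handles this for you behind the scenes; the snippet is only meant to make the lora_r, lora_alpha, and lora_dropout hyperparameters discussed later more tangible (the model ID and target modules are illustrative choices):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Mirrors the default LoRA hyperparameters listed later in this post
# (lora_r=8, lora_alpha=32, lora_dropout=0.05)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice for Llama-family models
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the total parameters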

The following table summarizes the details of each model with different settings.

Model | Default Setting | LoRA + FSDP | LoRA + No FSDP | Int8 Quantization + LoRA + No FSDP
Code Llama 2 7B | LoRA + FSDP | Yes | Yes | Yes
Code Llama 2 13B | LoRA + FSDP | Yes | Yes | Yes
Code Llama 2 34B | Int8 + LoRA + No FSDP | No | No | Yes
Code Llama 2 70B | Int8 + LoRA + No FSDP | No | No | Yes

Fine-tuning of Llama models is based on the scripts provided in the llama-recipes GitHub repo from Meta.

Supported hyperparameters for training

Code Llama 2 fine-tuning supports a number of hyperparameters, each of which can impact the memory requirement, training speed, and performance of the fine-tuned model:

  • epoch – The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default is 5.
  • learning_rate – The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default is 1e-4.
  • instruction_tuned – Whether to instruction-train the model or not. Must be True or False. Default is False.
  • per_device_train_batch_size – The batch size per GPU core/CPU for training. Must be a positive integer. Default is 4.
  • per_device_eval_batch_size – The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default is 1.
  • max_train_samples – For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of the training samples. Must be a positive integer or -1. Default is -1.
  • max_val_samples – For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of the validation samples. Must be a positive integer or -1. Default is -1.
  • max_input_length – Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default is -1.
  • validation_split_ratio – If the validation channel is none, the ratio of the train–validation split from the train data. Must be between 0 and 1. Default is 0.2.
  • train_data_split_seed – If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default is 0.
  • preprocessing_num_workers – The number of processes to use for preprocessing. If None, the main process is used for preprocessing. Default is None.
  • lora_r – LoRA R. Must be a positive integer. Default is 8.
  • lora_alpha – LoRA alpha. Must be a positive integer. Default is 32.
  • lora_dropout – LoRA dropout. Must be a positive float between 0 and 1. Default is 0.05.
  • int8_quantization – If True, the model is loaded with 8-bit precision for training. Default for 7B and 13B is False. Default for 70B is True.
  • enable_fsdp – If True, training uses FSDP. Default for 7B and 13B is True. Default for 70B is False. Note that int8_quantization is not supported with FSDP.

When choosing the hyperparameters, consider the following:

  • Setting int8_quantization=True decreases the memory requirement and leads to faster training.
  • Decreasing per_device_train_batch_size and max_input_length reduces the memory requirement and therefore can be run on smaller instances. However, setting very low values may increase the training time.
  • If you’re not using Int8 quantization (int8_quantization=False), use FSDP (enable_fsdp=True) for faster and more efficient training. An example hyperparameter configuration follows this list.
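
For example, a hyperparameter set that follows these recommendations might look like the following when used with the estimator shown earlier. This is only a sketch; values are passed as strings, and you should start from the model’s defaults and adjust for your dataset:

from sagemaker import hyperparameters as jumpstart_hyperparameters

# Start from the model's default hyperparameters, then apply the recommendations above
hyperparameters = jumpstart_hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
hyperparameters["instruction_tuned"] = "True"
hyperparameters["int8_quantization"] = "False"  # not using Int8 quantization...
hyperparameters["enable_fsdp"] = "True"         # ...so enable FSDP for faster training
hyperparameters["per_device_train_batch_size"] = "4"
hyperparameters["epoch"] = "3"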

Supported instance types for training

The following table summarizes the supported instance types for training different models.

Model | Default Instance Type | Supported Instance Types
Code Llama 2 7B | ml.g5.12xlarge | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge
Code Llama 2 13B | ml.g5.12xlarge | ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge
Code Llama 2 70B | ml.g5.48xlarge | ml.g5.48xlarge, ml.p4d.24xlarge

When choosing the instance type, consider the following:

  • G5 instances provide the most efficient training among the instance types supported. Therefore, if you have G5 instances available, you should use them (a sketch of overriding the default training instance type follows this list).
  • Training time largely depends on the number of GPUs and the amount of CUDA memory available. Therefore, training on instances with the same number of GPUs (for example, ml.g5.2xlarge and ml.g5.4xlarge) takes roughly the same time, so you can use the cheaper instance for training (ml.g5.2xlarge).
  • When using p3 instances, training will be done with 32-bit precision because bfloat16 is not supported on these instances. Therefore, the training job will consume double the amount of CUDA memory when training on p3 instances compared to g5 instances.
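
If the default instance doesn’t fit your budget or service quotas, you can override it when constructing the estimator, as in the following sketch (the value must be one of the supported instance types in the preceding table):

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    hyperparameters=hyperparameters,
    instance_type="ml.g5.24xlarge",  # override the default training instance type
    environment={"accept_eula": "true"},  # accept the EULA to start training
)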

To learn about the cost of training per instance, refer to Amazon EC2 G5 Instances.

Evaluation

Evaluation is an important step to assess the performance of fine-tuned models. We present both qualitative and quantitative evaluations to show the improvement of fine-tuned models over non-fine-tuned ones. In the qualitative evaluation, we show an example response from both fine-tuned and non-fine-tuned models. In the quantitative evaluation, we use HumanEval, a test suite developed by OpenAI to test a model’s ability to generate correct and accurate Python code. The HumanEval repository is under an MIT license. We fine-tuned the Python variants of all Code Llama models over different sizes (Code Llama Python 7B, 13B, 34B, and 70B) on the Dolphin Coder dataset, and present the evaluation results in the following sections.

Qualitative evaluation

With your fine-tuned model deployed, you can start using the endpoint to generate code. In the following example, we present responses from both the base and fine-tuned Code Llama 34B Python variants on a test sample from the Dolphin Coder dataset:

alpaca_format = """{system_prompt}

### Input:
{question}

### Response:\n"""

system = "You are a coding AI"
prompt = """Please explain the reasoning behind the following solution and 
provide code in python: Given a string `s`, return the longest palindromic 
substring in `s`. **Example 1:** **Input:** s = "babad " **Output:** "bab " 
**Explanation:** "aba " is also a valid answer. **Example 2:** 
**Input:** s = "cbbd " **Output:** "bb " 
**Constraints:** * `1 <= s.length <= 1000` * `s` consist of only digits 
and English letters."""

# The fine-tuned JumpStart endpoint expects an "inputs" string and a
# "parameters" dictionary with the generation settings
payload = {
    "inputs": alpaca_format.format(system_prompt=system, question=prompt),
    "parameters": {
        "max_new_tokens": 2048,
        "temperature": 0.2
    }
}

finetuned_response = finetuned_predictor.predict(payload)
finetuned_response["generated_text"]

The fine-tuned Code Llama model, in addition to providing the code for the preceding query, generates a detailed explanation of the approach and a pseudo code.

Code Llama 34B Python Non-Fine-Tuned Response:

### Code
def longestPalindrome(s):
    # Fill this in.
    longest = ""
    for i in range(len(s)):
        for j in range(len(s), i, -1):
            if len(longest) >= j-i:
                break
            elif s[i:j] == s[i:j][::-1]:
                longest = s[i:j]
                break
    return longest

print(longestPalindrome("abaxyzzyxf"))
# axyzzyx
###Reference:
- [https://www.geeksforgeeks.org/longest-palindrome-substring-set-1/]
(https://www.geeksforgeeks.org/longest-palindrome-substring-set-1/)

Code Llama 34B Python Fine-Tuned Response

### Explanation
The algorithm uses dynamic programming to find the longest palindromic substring. 
It initializes a 2D boolean table `dp` with the same dimensions as the input string `s`. 
The `dp[i][j]` value is set to `true` if the substring from index `i` to index `j` is a palindrome.

The algorithm then iterates through the string in a nested loop, starting from the first character to the last. 
It checks if the substring from index `i` to index `j` is a palindrome. 
If it is, it sets the `dp[i][j]` value to `true`. 
If the substring is of length 2 or 3, it sets the `dp[i][j]` value to `true` as well.

After the nested loop, the algorithm iterates through the `dp` table to find the longest palindromic substring. 
It keeps track of the maximum length and the starting and ending indices of the substring.

Finally, the algorithm returns the substring from the starting index to the ending index.

### Code
```python
def longestPalindrome(s: str) -> str:
    n = len(s)
    dp = [[False] * n for _ in range(n)]

    for i in range(n):
        dp[i][i] = True
        if i + 1 < n:
            dp[i][i + 1] = s[i] == s[i + 1]

    max_len = 1
    start = 0
    end = 0

    for i in range(n - 1):
        for j in range(i + 1, n):
            if j - i == 1 or dp[i + 1][j - 1]:
                dp[i][j] = s[i] == s[j]
                if j - i + 1 > max_len:
                    max_len = j - i + 1
                    start = i
                    end = j

    return s[start:end + 1]
```

Ground Truth

### Explanation 
1. Initialize `start` and `maxLength` for result substring. 
2. Iterate through the given string `s` using the index `i`. 
3. For each index `i`, create two pointers `l` and `r` starting at `i`. 
4. Check if there's a consecutive sequence of identical characters, increment the right pointer `r` until the end of the sequence is reached. 
5. Update the index `i` to the current value of `r`. 
6. Expand the pointers `l` and `r` outwards to find the longest palindromic substring, checking that characters on both sides are equal. 
7. If the current length of the substring is greater than `maxLength`, update `start` and `maxLength`. 
8. Return the longest palindromic substring using the `start` and `maxLength`. 

### Code 
```python 
def longest_palindromic_substring(s: str) -> str:
    n = len(s)
    if n == 0:
        return ""
    
    start, max_length = 0, 1
    for i in range(n):
        l, r = i, i
        while r < n - 1 and s[r] == s[r + 1]:
            r += 1
            i = r
        while l > 0 and r < n - 1 and s[l - 1] == s[r + 1]:
            l -= 1
            r += 1
        length = r - l + 1
        if length > max_length:
            start, max_length = l, length
    return s[start:start + max_length]
```

Interestingly, our fine-tuned version of Code Llama 34B Python provides a dynamic programming-based solution to the longest palindromic substring, which is different from the solution provided in the ground truth from the selected test example. Our fine-tuned model reasons about and explains the dynamic programming-based solution in detail. On the other hand, the non-fine-tuned model hallucinates potential outputs right after the print statement (shown in the non-fine-tuned response above) because the output axyzzyx is not the longest palindrome in the given string. In terms of time complexity, the dynamic programming solution is generally better than the non-fine-tuned model’s approach: it runs in O(n^2), where n is the length of the input string, whereas the non-fine-tuned solution checks each candidate substring by slicing and reversing it, giving a less efficient worst case of O(n^3).

This looks promising! Remember, we only fine-tuned the Code Llama Python variant with 10% of the Dolphin Coder dataset. There is a lot more to explore!

Despite the thorough explanation in the response, we still need to examine the correctness of the Python code provided in the solution. Next, we use the HumanEval evaluation framework to run integration tests on the generated responses from Code Llama to systematically examine their quality.

Quantitative evaluation with HumanEval

HumanEval is an evaluation harness for evaluating an LLM’s problem-solving capabilities on Python-based coding problems, as described in the paper Evaluating Large Language Models Trained on Code. Specifically, it consists of 164 original Python-based programming problems that assess a language model’s ability to generate code based on provided information like function signature, docstring, body, and unit tests.

For each Python-based programming question, we send it to a Code Llama model deployed on a SageMaker endpoint to get k responses. Next, we run each of the k responses against the integration tests in the HumanEval repository. If any of the k responses passes the integration tests, we count that test case as a success; otherwise, it’s counted as a failure. We then repeat the process over all questions and calculate the ratio of successful cases as the final evaluation score, named pass@k. Following standard practice, we set k to 1 in our evaluation, so we only generate one response per question and test whether it passes the integration tests.
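
For reference, when k is greater than 1, the HumanEval paper uses an unbiased estimator of pass@k rather than the raw ratio. The following small helper implements that estimator; with k=1 it reduces to the per-question pass/fail counting we use here:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator from the HumanEval paper:
    # n = samples generated per problem, c = samples that pass the tests
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1, c=1, k=1))  # 1.0: with k=1, a problem either passes or fails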

The following is sample code for using the HumanEval repository. You can access the dataset and generate a single response using a SageMaker endpoint. For details, see the notebook in the GitHub repository.

%pip install human_eval
from human_eval.evaluation import evaluate_functional_correctness
from human_eval.data import write_jsonl, read_problems
from tqdm import tqdm

problems = read_problems()

num_samples_per_task = 1  # value k: number of responses for each question
samples = [
    # generate_one_completion sends one prompt to the fine-tuned SageMaker endpoint (see the sketch below)
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

evaluate_functional_correctness('./samples.jsonl')
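
The generate_one_completion helper used above is defined in the notebook in the GitHub repository. A minimal sketch of what it does, assuming the fine-tuned predictor from earlier and the inference parameters quoted below, might look like this:

def generate_one_completion(prompt: str) -> str:
    # Send one HumanEval prompt to the fine-tuned SageMaker endpoint and
    # return the generated code completion
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 384, "temperature": 0.2},
    }
    response = finetuned_predictor.predict(payload)
    return response["generated_text"]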

The following table shows the improvements of the fine-tuned Code Llama Python models over the non-fine-tuned models across different model sizes. To ensure correctness, we also deploy the non-fine-tuned Code Llama models on SageMaker endpoints and run them through the HumanEval evaluation. The pass@1 numbers (the first row in the following table) match the numbers reported in the Code Llama research paper. The inference parameters are consistently set as "parameters": {"max_new_tokens": 384, "temperature": 0.2}.

As we can see from the results, all the fine-tuned Code Llama Python variants show a significant improvement over the non-fine-tuned models. In particular, Code Llama Python 70B outperforms the non-fine-tuned model by approximately 12 percentage points.

. | 7B Python | 13B Python | 34B | 34B Python | 70B Python
Pre-trained model performance (pass@1) | 38.4 | 43.3 | 48.8 | 53.7 | 57.3
Fine-tuned model performance (pass@1) | 45.12 | 45.12 | 59.1 | 61.5 | 69.5

Now you can try fine-tuning Code Llama models on your own dataset.

Clean up

If you decide that you no longer want to keep the SageMaker endpoint running, you can delete it using the AWS SDK for Python (Boto3), the AWS Command Line Interface (AWS CLI), or the SageMaker console. For more information, see Delete Endpoints and Resources. Additionally, you can shut down the SageMaker Studio resources that are no longer required.

Conclusion

In this post, we discussed fine-tuning Meta’s Code Llama 2 models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning technique, instance types, and supported hyperparameters. In addition, we outlined recommendations for optimized training based on various tests we carried out. As we can see from the evaluation results, fine-tuning improves the models’ code generation performance compared to their non-fine-tuned counterparts. As a next step, you can try fine-tuning these models on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.


About the Authors

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Vishaal Yalamanchali is a Startup Solutions Architect working with early-stage generative AI, robotics, and autonomous vehicle companies. Vishaal works with his customers to deliver cutting-edge ML solutions and is personally interested in reinforcement learning, LLM evaluation, and code generation. Prior to AWS, Vishaal was an undergraduate at UCI, focused on bioinformatics and intelligent systems.

Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion for designing, creating, and promoting human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker, and strives to drive businesses to new ways of working through innovation, incubation, and democratization.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.


Transform one-on-one customer interactions: Build speech-capable order processing agents with AWS and generative AI

In today’s landscape of one-on-one customer interactions for placing orders, the prevailing practice continues to rely on human attendants, even in settings like drive-thru coffee shops and fast-food establishments. This traditional approach poses several challenges: it heavily depends on manual processes, struggles to efficiently scale with increasing customer demands, introduces the potential for human errors, and operates within specific hours of availability. Additionally, in competitive markets, businesses adhering solely to manual processes might find it challenging to deliver efficient and competitive service. Despite technological advancements, the human-centric model remains deeply ingrained in order processing, leading to these limitations.

The prospect of utilizing technology for one-on-one order processing assistance has been available for some time. However, existing solutions can often fall into two categories: rule-based systems that demand substantial time and effort for setup and upkeep, or rigid systems that lack the flexibility required for human-like interactions with customers. As a result, businesses and organizations face challenges in swiftly and efficiently implementing such solutions. Fortunately, with the advent of generative AI and large language models (LLMs), it’s now possible to create automated systems that can handle natural language efficiently, and with an accelerated on-ramping timeline.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. In addition to Amazon Bedrock, you can use other AWS services like Amazon SageMaker JumpStart and Amazon Lex to create fully automated and easily adaptable generative AI order processing agents.

In this post, we show you how to build a speech-capable order processing agent using Amazon Lex, Amazon Bedrock, and AWS Lambda.

Solution overview

The following diagram illustrates our solution architecture.

The workflow consists of the following steps:

  1. A customer places the order using Amazon Lex.
  2. The Amazon Lex bot interprets the customer’s intents and triggers a DialogCodeHook.
  3. A Lambda function pulls the appropriate prompt template from the Lambda layer and formats model prompts by adding the customer input in the associated prompt template.
  4. The RequestValidation prompt verifies the order against the menu items and lets the customer know via Amazon Lex if they want to order something that isn’t on the menu, providing recommendations. The prompt also performs a preliminary validation for order completeness.
  5. The ObjectCreator prompt converts the natural language requests into a data structure (JSON format).
  6. The customer validator Lambda function verifies the required attributes for the order and confirms if all necessary information is present to process the order.
  7. A customer Lambda function takes the data structure as an input for processing the order and passes the order total back to the orchestrating Lambda function.
  8. The orchestrating Lambda function calls the Amazon Bedrock LLM endpoint to generate a final order summary including the order total from the customer database system (for example, Amazon DynamoDB).
  9. The order summary is communicated back to the customer via Amazon Lex. After the customer confirms the order, the order will be processed.

Prerequisites

This post assumes that you have an active AWS account and familiarity with the following concepts and services:

Also, in order to access Amazon Bedrock from the Lambda functions, you need to make sure the Lambda runtime has the following libraries:

  • boto3>=1.28.57
  • awscli>=1.29.57
  • botocore>=1.31.57

This can be done with a Lambda layer or by using a container image that includes the required libraries.

Furthermore, these libraries are required when calling the Amazon Bedrock API from Amazon SageMaker Studio. This can be done by running a cell with the following code:

%pip install --no-build-isolation --force-reinstall \
"boto3>=1.28.57" \
"awscli>=1.29.57" \
"botocore>=1.31.57"

Finally, you create the following policy and later attach it to any role accessing Amazon Bedrock:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "bedrock:*",
            "Resource": "*"
        }
    ]
}

Create a DynamoDB table

In our specific scenario, we’ve created a DynamoDB table as our customer database system, but you could also use Amazon Relational Database Service (Amazon RDS). Complete the following steps to provision your DynamoDB table (or customize the settings as needed for your use case):

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Choose Create table.

  3. For Table name, enter a name (for example, ItemDetails).
  4. For Partition key, enter a key (for this post, we use Item).
  5. For Sort key, enter a key (for this post, we use Size).
  6. Choose Create table.

Now you can load the data into the DynamoDB table. For this post, we use a CSV file. You can load the data to the DynamoDB table using Python code in a SageMaker notebook.

First, we need to set up a profile named dev.

  1. Open a new terminal in SageMaker Studio and run the following command:
aws configure --profile dev

This command will prompt you to enter your AWS access key ID, secret access key, default AWS Region, and output format.

  2. Return to the SageMaker notebook and write Python code to set up a connection to DynamoDB using the Boto3 library. This code snippet creates a session using a specific AWS profile named dev and then creates a DynamoDB resource using that session. The following is the code sample to load the data:
%pip install boto3
import boto3
import csv

# Create a session using a profile named 'dev'
session = boto3.Session(profile_name='dev')

# Create a DynamoDB resource using the session
dynamodb = session.resource('dynamodb')

# Specify your DynamoDB table name
table_name = 'your_table_name'
table = dynamodb.Table(table_name)

# Specify the path to your CSV file
csv_file_path = 'path/to/your/file.csv'

# Read CSV file and put items into DynamoDB
with open(csv_file_path, 'r', encoding='utf-8-sig') as csvfile:
    csvreader = csv.reader(csvfile)
    
    # Skip the header row
    next(csvreader, None)

    for row in csvreader:
        # Extract values from the CSV row
        item = {
            'Item': row[0],  # Adjust the index based on your CSV structure
            'Size': row[1],
            'Price': row[2]
        }
        
        # Put item into DynamoDB
        response = table.put_item(Item=item)
        
        print(f"Item added: {response}")
print(f"CSV data has been loaded into the DynamoDB table: {table_name}")

Alternatively, you can use NoSQL Workbench or other tools to quickly load the data to your DynamoDB table.

The following is a screenshot after the sample data is inserted into the table.

Create templates in a SageMaker notebook using the Amazon Bedrock invocation API

To create our prompt template for this use case, we use Amazon Bedrock. You can access Amazon Bedrock from the AWS Management Console and via API invocations. In our case, we access Amazon Bedrock via API from the convenience of a SageMaker Studio notebook to create not only our prompt template, but our complete API invocation code that we can later use on our Lambda function.

  1. On the SageMaker console, access an existing SageMaker Studio domain or create a new one to access Amazon Bedrock from a SageMaker notebook.

  2. After you create the SageMaker domain and user, choose the user and choose Launch, then Studio. This will open a JupyterLab environment.
  3. When the JupyterLab environment is ready, open a new notebook and begin importing the necessary libraries.

There are many FMs available on Amazon Bedrock. In this case, we use Claude V2, a powerful foundation model developed by Anthropic.

The order processing agent needs a few different templates. This can change depending on the use case, but we have designed a general workflow that can apply to multiple settings. For this use case, the Amazon Bedrock LLM template will accomplish the following:

  • Validate the customer intent
  • Validate the request
  • Create the order data structure
  • Pass a summary of the order to the customer
  4. To invoke the model, create a bedrock-runtime object from Boto3.

import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime')

# Model API request parameters
modelId = 'anthropic.claude-v2' # change this to use a different version from the model provider
accept = 'application/json'
contentType = 'application/json'

Let’s start by working on the intent validator prompt template. This is an iterative process, but thanks to Anthropic’s prompt engineering guide, you can quickly create a prompt that can accomplish the task.

  5. Create the first prompt template along with a utility function that will help prepare the body for the API invocations.

The following is the code for prompt_template_intent_validator.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the Conversation between Human and Assistant, you need to identify the intent that the human wants to accomplish and respond appropriately. The valid intents are: Greeting,Place Order, Complain, Speak to Someone. Always put your response to the Human within the Response tags. Also add an XML tag to your output identifying the human intent.\nHere are some examples:\n<example><Conversation> H: hi there.\n\nA: Hi, how can I help you today?\n\nH: Yes. I would like a medium mocha please</Conversation>\n\nA:<intent>Place Order</intent><Response>\nGot it.</Response></example>\n<example><Conversation> H: hello\n\nA: Hi, how can I help you today?\n\nH: my coffee does not taste well can you please re-make it?</Conversation>\n\nA:<intent>Complain</intent><Response>\nOh, I am sorry to hear that. Let me get someone to help you.</Response></example>\n<example><Conversation> H: hi\n\nA: Hi, how can I help you today?\n\nH: I would like to speak to someone else please</Conversation>\n\nA:<intent>Speak to Someone</intent><Response>\nSure, let me get someone to help you.</Response></example>\n<example><Conversation> H: howdy\n\nA: Hi, how can I help you today?\n\nH:can I get a large americano with sugar and 2 mochas with no whipped cream</Conversation>\n\nA:<intent>Place Order</intent><Response>\nSure thing! Please give me a moment.</Response></example>\n<example><Conversation> H: hi\n\n</Conversation>\n\nA:<intent>Greeting</intent><Response>\nHi there, how can I help you today?</Response></example>\n</instructions>\n\nPlease complete this request according to the instructions and examples provided above:<request><Conversation>REPLACEME</Conversation></request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 1, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:"]}"

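The utility function itself isn’t shown here; the following minimal sketch (the function and file names are illustrative) loads a serialized template like the one above, injects the conversation, and invokes Claude V2 on Amazon Bedrock:

def invoke_template(template_path, conversation_text):
    # The template file stores the request body as a JSON-serialized string
    with open(template_path, "r") as f:
        body = json.loads(f.read())
    # Inject the conversation into the prompt and call the model
    body = body.replace("REPLACEME", conversation_text)
    response = bedrock.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    return json.loads(response["body"].read())["completion"]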

  6. Save this template into a file in order to upload it to Amazon S3 and call it from the Lambda function when needed. Save the templates as JSON-serialized strings in a text file. The previous screenshot shows the code sample to accomplish this as well.
  7. Repeat the same steps with the other templates.

The following are some screenshots of the other templates and the results when calling Amazon Bedrock with some of them.

The following is the code for prompt_template_request_validator.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the context do the following steps: 1. verify that the items in the input are valid. If customer provided an invalid item, recommend replacing it with a valid one. 2. verify that the customer has provided all the information marked as required. If the customer missed a required information, ask the customer for that information. 3. When the order is complete, provide a summary of the order and ask for confirmation always using this phrase: 'is this correct?' 4. If the customer confirms the order, Do not ask for confirmation again, just say the phrase inside the brackets [Great, Give me a moment while I try to process your order]</instructions>\n<context>\nThe VALID MENU ITEMS are: [latte, frappe, mocha, espresso, cappuccino, romano, americano].\nThe VALID OPTIONS are: [splenda, stevia, raw sugar, honey, whipped cream, sugar, oat milk, soy milk, regular milk, skimmed milk, whole milk, 2 percent milk, almond milk].\nThe required information is: size. Size can be: small, medium, large.\nHere are some examples: <example>H: I would like a medium latte with 1 Splenda and a small romano with no sugar please.\n\nA: <Validation>:\nThe Human is ordering a medium latte with one splenda. Latte is a valid menu item and splenda is a valid option. The Human is also ordering a small romano with no sugar. Romano is a valid menu item.</Validation>\n<Response>\nOk, I got: \n\t-Medium Latte with 1 Splenda and.\n\t-Small Romano with no Sugar.\nIs this correct?</Response>\n\nH: yep.\n\nA:\n<Response>\nGreat, Give me a moment while I try to process your order</example>\n\n<example>H: I would like a cappuccino and a mocha please.\n\nA: <Validation>:\nThe Human is ordering a cappuccino and a mocha. Both are valid menu items. The Human did not provide the size for the cappuccino. The human did not provide the size for the mocha. I will ask the Human for the required missing information.</Validation>\n<Response>\nSure thing, but can you please let me know the size for the Cappuccino and the size for the Mocha? We have Small, Medium, or Large.</Response></example>\n\n<example>H: I would like a small cappuccino and a large lemonade please.\n\nA: <Validation>:\nThe Human is ordering a small cappuccino and a large lemonade. Cappuccino is a valid menu item. Lemonade is not a valid menu item. I will suggest the Human a replacement from our valid menu items.</Validation>\n<Response>\nSorry, we don't have Lemonades, would you like to order something else instead? Perhaps a Frappe or a Latte?</Response></example>\n\n<example>H: Can I get a medium frappuccino with sugar please?\n\nA: <Validation>:\n The Human is ordering a Frappuccino. Frappuccino is not a valid menu item. I will suggest a replacement from the valid menu items in my context.</Validation>\n<Response>\nI am so sorry, but Frappuccino is not in our menu, do you want a frappe or a cappuccino instead? perhaps something else?</Response></example>\n\n<example>H: I want two large americanos and a small latte please.\n\nA: <Validation>:\n The Human is ordering 2 Large Americanos, and a Small Latte. Americano is a valid menu item. 
Latte is a valid menu item.</Validation>\n<Response>\nOk, I got: \n\t-2 Large Americanos and.\n\t-Small Latte.\nIs this correct?</Response>\n\nH: looks correct, yes.\n\nA:\n<Response>\nGreat, Give me a moment while I try to process your order.</Response></example>\n\n</Context>\n\nPlease complete this request according to the instructions and examples provided above:<request>REPLACEME</request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 0.3, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:"]}"

The following is our response from Amazon Bedrock using this template.

The following is the code for prompt_template_object_creator.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the Conversation between Human and Assistant, you need to create a json object in Response with the appropriate attributes.\nHere are some examples:\n<example><Conversation> H: I want a latte.\n\nA:\nCan I have the size?\n\nH: Medium.\n\nA: So, a medium latte.\nIs this Correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"latte\",\"size\":\"medium\",\"addOns\":[]}}</Response></example>\n<example><Conversation> H: I want a large frappe and 2 small americanos with sugar.\n\nA: Okay, let me confirm:\n\n1 large frappe\n\n2 small americanos with sugar\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"frappe\",\"size\":\"large\",\"addOns\":[]},\"2\":{\"item\":\"americano\",\"size\":\"small\",\"addOns\":[\"sugar\"]},\"3\":{\"item\":\"americano\",\"size\":\"small\",\"addOns\":[\"sugar\"]}}</Response>\n</example>\n<example><Conversation> H: I want a medium americano.\n\nA: Okay, let me confirm:\n\n1 medium americano\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"americano\",\"size\":\"medium\",\"addOns\":[]}}</Response></example>\n<example><Conversation> H: I want a large latte with oatmilk.\n\nA: Okay, let me confirm:\n\nLarge latte with oatmilk\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"latte\",\"size\":\"large\",\"addOns\":[\"oatmilk\"]}}</Response></example>\n<example><Conversation> H: I want a small mocha with no whipped cream please.\n\nA: Okay, let me confirm:\n\nSmall mocha with no whipped cream\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"mocha\",\"size\":\"small\",\"addOns\":[\"no whipped cream\"]}}</Response>\n\n</example></instructions>\n\nPlease complete this request according to the instructions and examples provided above:<request><Conversation>REPLACEME</Conversation></request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 0.3, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:"]}"


The following is the code for prompt_template_order_summary.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the Conversation between Human and Assistant, you need to create a summary of the order with bullet points and include the order total.\nHere are some examples:\n<example><Conversation> H: I want a large frappe and 2 small americanos with sugar.\n\nA: Okay, let me confirm:\n\n1 large frappe\n\n2 small americanos with sugar\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>10.50</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\n1 large frappe\n\n2 small americanos with sugar.\nYour Order total is $10.50</Response></example>\n<example><Conversation> H: I want a medium americano.\n\nA: Okay, let me confirm:\n\n1 medium americano\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>3.50</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\n1 medium americano.\nYour Order total is $3.50</Response></example>\n<example><Conversation> H: I want a large latte with oat milk.\n\nA: Okay, let me confirm:\n\nLarge latte with oat milk\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>6.75</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\nLarge latte with oat milk.\nYour Order total is $6.75</Response></example>\n<example><Conversation> H: I want a small mocha with no whipped cream please.\n\nA: Okay, let me confirm:\n\nSmall mocha with no whipped cream\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>4.25</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\nSmall mocha with no whipped cream.\nYour Order total is $6.75</Response>\n\n</example>\n</instructions>\n\nPlease complete this request according to the instructions and examples provided above:<request><Conversation>REPLACEME</Conversation>\n\n<OrderTotal>REPLACETOTAL</OrderTotal></request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 0.3, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:", "[Conversation]"]}"


As you can see, we have used our prompt templates to validate menu items, identify missing required information, create a data structure, and summarize the order. The foundational models available on Amazon Bedrock are very powerful, so you could accomplish even more tasks via these templates.

You have completed engineering the prompts and saved the templates to text files. You can now begin creating the Amazon Lex bot and the associated Lambda functions.

Create a Lambda layer with the prompt templates

Complete the following steps to create your Lambda layer:

  1. In SageMaker Studio, create a new folder with a subfolder named python.
  2. Copy your prompt files to the python folder.

  3. You can add the ZIP library to your notebook instance by running the following command:
!conda install -y -c conda-forge zip

  4. Now, run the following command to create the ZIP file for uploading to the Lambda layer:
!zip -r prompt_templates_layer.zip prompt_templates_layer/.

  5. After you create the ZIP file, you can download the file. Go to Lambda and create a new layer by uploading the file directly or by uploading to Amazon S3 first.
  6. Then attach this new layer to the orchestration Lambda function.

Now your prompt template files are locally stored in your Lambda runtime environment. This will speed up the process during your bot runs.

Create a Lambda layer with the required libraries

Complete the following steps to create your Lambda layer with the required libraries:

  1. Open an AWS Cloud9 environment and create a folder with a subfolder called python.
  2. Open a terminal inside the python folder.
  3. Run the following commands from the terminal:
pip install "boto3>=1.28.57" -t .
pip install "awscli>=1.29.57" -t .
pip install "botocore>=1.31.57" -t .
  4. Run cd .. and position yourself inside your new folder where you also have the python subfolder.
  5. Run the following command:
zip -r lambda-layer.zip python
  6. After you create the ZIP file, you can download the file. Go to Lambda and create a new layer by uploading the file directly or by uploading to Amazon S3 first.
  7. Then attach this new layer to the orchestration Lambda function.

Create the bot in Amazon Lex v2

For this use case, we build an Amazon Lex bot that can provide an input/output interface for the architecture in order to call Amazon Bedrock using voice or text from any interface. Because the LLM will handle the conversation piece of this order processing agent, and Lambda will orchestrate the workflow, you can create a bot with three intents and no slots.

  1. On the Amazon Lex console, create a new bot with the method Create a blank bot.

Now you can add an intent with any appropriate initial utterance for the end-users to start the conversation with the bot. We use simple greetings and add an initial bot response so end-users can provide their requests. When creating the bot, make sure to use a Lambda code hook with the intents; this will trigger a Lambda function that will orchestrate the workflow between the customer, Amazon Lex, and the LLM.

  2. Add your first intent, which triggers the workflow and uses the intent validation prompt template to call Amazon Bedrock and identify what the customer is trying to accomplish. Add a few simple utterances for end-users to start the conversation.

You don’t need to use any slots or initial reading in any of the bot intents. In fact, you don’t need to add utterances to the second or third intents. That is because the LLM will guide Lambda throughout the process.

  3. Add a confirmation prompt. You can customize this message in the Lambda function later.

  4. Under Code hooks, select Use a Lambda function for initialization and validation.

  5. Create a second intent with no utterance and no initial response. This is the PlaceOrder intent.

When the LLM identifies that the customer is trying to place an order, the Lambda function will trigger this intent and validate the customer request against the menu, and make sure that no required information is missing. Remember that all of this is on the prompt templates, so you can adapt this workflow for any use case by changing the prompt templates.

  6. Don’t add any slots, but add a confirmation prompt and decline response.

  7. Select Use a Lambda function for initialization and validation.

  8. Create a third intent named ProcessOrder with no sample utterances and no slots.
  9. Add an initial response, a confirmation prompt, and a decline response.

After the LLM has validated the customer request, the Lambda function triggers the third and last intent to process the order. Here, Lambda will use the object creator template to generate the order JSON data structure to query the DynamoDB table, and then use the order summary template to summarize the whole order along with the total so Amazon Lex can pass it to the customer.

  10. Select Use a Lambda function for initialization and validation. This can use any Lambda function to process the order after the customer has given the final confirmation.

  11. After you create all three intents, go to the Visual builder for the ValidateIntent, add a go-to intent step, and connect the output of the positive confirmation to that step.
  12. After you add the go-to intent, edit it and choose the PlaceOrder intent as the intent name.

  13. Similarly, go to the Visual builder for the PlaceOrder intent and connect the output of the positive confirmation to the ProcessOrder go-to intent. No editing is required for the ProcessOrder intent.
  14. You now need to create the Lambda function that orchestrates Amazon Lex and calls the DynamoDB table, as detailed in the following section.

Create a Lambda function to orchestrate the Amazon Lex bot

You can now build the Lambda function that orchestrates the Amazon Lex bot and workflow. Complete the following steps:

  1. Create a Lambda function with the standard execution policy and let Lambda create a role for you.
  2. In the code window of your function, add a few utility functions that will help: format the prompts by adding the Lex context to the template, call the Amazon Bedrock LLM API, extract the desired text from the responses, and more. See the following code:
import json
import re
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

bedrock = boto3.client(service_name='bedrock-runtime')
def CreatingCustomPromptFromLambdaLayer(object_key,replace_items):
   
    folder_path = '/opt/order_processing_agent_prompt_templates/python/'
    try:
        file_path = folder_path + object_key
        with open(file_path, "r") as file1:
            raw_template = file1.read()
            # Modify the template with the custom input prompt
            #template['inputs'][0].insert(1, {"role": "user", "content": '### Input:n' + user_request})
            for key,value in replace_items.items():
                value = json.dumps(json.dumps(value).replace('"','')).replace('"','')
                raw_template = raw_template.replace(key,value)
            modified_prompt = raw_template

            return modified_prompt
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'An error occurred: {str(e)}'
        }
def CreatingCustomPrompt(object_key,replace_items):
    logger.debug('replace_items is: {}'.format(replace_items))
    #retrieve user request from intent_request
    #we first prompt the model with current order
    
    bucket_name = 'your-bucket-name'
    
    #object_key = 'prompt_template_order_processing.txt'
    try:
        s3 = boto3.client('s3')
        # Retrieve the existing template from S3
        response = s3.get_object(Bucket=bucket_name, Key=object_key)
        raw_template = response['Body'].read().decode('utf-8')
        raw_template = json.loads(raw_template)
        logger.debug('raw template is {}'.format(raw_template))
        #template_json = json.loads(raw_template)
        #logger.debug('template_json is {}'.format(template_json))
        #template = json.dumps(template_json)
        #logger.debug('template is {}'.format(template))

        # Modify the template with the custom input prompt
        #template['inputs'][0].insert(1, {"role": "user", "content": '### Input:n' + user_request})
        for key,value in replace_items.items():
            raw_template = raw_template.replace(key,value)
            logger.debug("Replacing: {} nwith: {}".format(key,value))
        modified_prompt = json.dumps(raw_template)
        logger.debug("Modified template: {}".format(modified_prompt))
        logger.debug("Modified template type is: {}".format(print(type(modified_prompt))))
        
        #modified_template_json = json.loads(modified_prompt)
        #logger.debug("Modified template json: {}".format(modified_template_json))
        
        return modified_prompt
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'An error occurred: {str(e)}'
        }
    
def validate_intent(intent_request):
    logger.debug('starting validate_intent: {}'.format(intent_request))
    #retrieve user request from intent_request
    user_request = 'Human: ' + intent_request['inputTranscript'].lower()
    #getting current context variable
    current_session_attributes =  intent_request['sessionState']['sessionAttributes']
    if len(current_session_attributes) > 0:
        full_context = current_session_attributes['fullContext'] + '\n\n' + user_request
        dialog_context = current_session_attributes['dialogContext'] + '\n\n' + user_request
    else:
        full_context = user_request
        dialog_context = user_request
    #Preparing validation prompt by adding context to prompt template
    object_key = 'prompt_template_intent_validator.txt'
    #replace_items = {"REPLACEME":full_context}
    #replace_items = {"REPLACEME":dialog_context}
    replace_items = {"REPLACEME":dialog_context}
    #validation_prompt = CreatingCustomPrompt(object_key,replace_items)
    validation_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)

    #Prompting model for request validation
    intent_validation_completion = prompt_bedrock(validation_prompt)
    intent_validation_completion = re.sub(r'["]','',intent_validation_completion)

    #extracting response from response completion and removing some special characters
    validation_response = extract_response(intent_validation_completion)
    validation_intent = extract_intent(intent_validation_completion)
    
    

    #business logic depending on intents
    if validation_intent == 'Place Order':
        return validate_request(intent_request)
    elif validation_intent in ['Complain','Speak to Someone']:
        ##adding session attributes to keep current context
        full_context = full_context + '\n\n' + intent_validation_completion
        dialog_context = dialog_context + '\n\nAssistant: ' + validation_response
        intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
        intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
        intent_request['sessionState']['sessionAttributes']['customerIntent'] = validation_intent
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close',validation_response)
    if validation_intent == 'Greeting':
        ##adding session attributes to keep current context
        full_context = full_context + '\n\n' + intent_validation_completion
        dialog_context = dialog_context + '\n\nAssistant: ' + validation_response
        intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
        intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
        intent_request['sessionState']['sessionAttributes']['customerIntent'] = validation_intent
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'InProgress','ConfirmIntent',validation_response)

def validate_request(intent_request):
    logger.debug('starting validate_request: {}'.format(intent_request))
    #retrieve user request from intent_request
    user_request = 'Human: ' + intent_request['inputTranscript'].lower()
    #getting current context variable
    current_session_attributes =  intent_request['sessionState']['sessionAttributes']
    if len(current_session_attributes) > 0:
        full_context = current_session_attributes['fullContext'] + '\n\n' + user_request
        dialog_context = current_session_attributes['dialogContext'] + '\n\n' + user_request
    else:
        full_context = user_request
        dialog_context = user_request
   
    #Preparing validation prompt by adding context to prompt template
    object_key = 'prompt_template_request_validator.txt'
    replace_items = {"REPLACEME":dialog_context}
    #validation_prompt = CreatingCustomPrompt(object_key,replace_items)
    validation_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)

    #Prompting model for request validation
    request_validation_completion = prompt_bedrock(validation_prompt)
    request_validation_completion = re.sub(r'["]','',request_validation_completion)

    #extracting response from response completion and removing some special characters
    validation_response = extract_response(request_validation_completion)

    ##adding session attributes to keep current context
    full_context = full_context + '\n\n' + request_validation_completion
    dialog_context = dialog_context + '\n\nAssistant: ' + validation_response
    intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
    intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
    
    return close(intent_request['sessionState']['sessionAttributes'],'PlaceOrder','InProgress','ConfirmIntent',validation_response)
    
def process_order(intent_request):
    logger.debug('starting process_order: {}'.format(intent_request))

     #retrieve user request from intent_request
    user_request = 'Human: ' + intent_request['inputTranscript'].lower()
    #getting current context variable
    current_session_attributes =  intent_request['sessionState']['sessionAttributes']
    if len(current_session_attributes) > 0:
        full_context = current_session_attributes['fullContext'] + '\n\n' + user_request
        dialog_context = current_session_attributes['dialogContext'] + '\n\n' + user_request
    else:
        full_context = user_request
        dialog_context = user_request
    #   Preparing object creator prompt by adding context to prompt template
    object_key = 'prompt_template_object_creator.txt'
    replace_items = {"REPLACEME":dialog_context}
    #object_creator_prompt = CreatingCustomPrompt(object_key,replace_items)
    object_creator_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)
    #Prompting model for object creation
    object_creation_completion = prompt_bedrock(object_creator_prompt)
    #extracting response from response completion
    object_creation_response = extract_response(object_creation_completion)
    inputParams = json.loads(object_creation_response)
    inputParams = json.dumps(json.dumps(inputParams))
    logger.debug('inputParams is: {}'.format(inputParams))
    client = boto3.client('lambda')
    response = client.invoke(FunctionName = 'arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-validator',InvocationType = 'RequestResponse',Payload = inputParams)
    responseFromChild = json.load(response['Payload'])
    validationResult = responseFromChild['statusCode']
    if validationResult == 205:
        order_validation_error = responseFromChild['validator_response']
        return close(intent_request['sessionState']['sessionAttributes'],'PlaceOrder','InProgress','ConfirmIntent',order_validation_error)
    #invokes Order Processing lambda to query DynamoDB table and returns order total
    response = client.invoke(FunctionName = 'arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-processing',InvocationType = 'RequestResponse',Payload = inputParams)
    responseFromChild = json.load(response['Payload'])
    orderTotal = responseFromChild['body']
    ###Prompting the model to summarize the order along with order total
    object_key = 'prompt_template_order_summary.txt'
    replace_items = {"REPLACEME":dialog_context,"REPLACETOTAL":orderTotal}
    #order_summary_prompt = CreatingCustomPrompt(object_key,replace_items)
    order_summary_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)
    order_summary_completion = prompt_bedrock(order_summary_prompt)
    #extracting response from response completion
    order_summary_response = extract_response(order_summary_completion)  
    order_summary_response = order_summary_response + '. Shall I finalize processing your order?'
    ##adding session attributes to keep current context
    full_context = full_context + '\n\n' + order_summary_completion
    dialog_context = dialog_context + '\n\nAssistant: ' + order_summary_response
    intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
    intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
    return close(intent_request['sessionState']['sessionAttributes'],'ProcessOrder','InProgress','ConfirmIntent',order_summary_response)
    

""" --- Main handler and Workflow functions --- """

def lambda_handler(event, context):
    """
    Route the incoming request based on intent.
    The JSON body of the request is provided in the event slot.
    """
    logger.debug('event is: {}'.format(event))

    return dispatch(event)

def dispatch(intent_request):
    """
    Called when the user specifies an intent for this bot. If intent is not valid then returns error name
    """
    logger.debug('intent_request is: {}'.format(intent_request))
    intent_name = intent_request['sessionState']['intent']['name']
    confirmation_state = intent_request['sessionState']['intent']['confirmationState']
    # Dispatch to your bot's intent handlers
    if intent_name == 'ValidateIntent' and confirmation_state == 'None':
        return validate_intent(intent_request)
    if intent_name == 'PlaceOrder' and confirmation_state == 'None':
        return validate_request(intent_request)
    elif intent_name == 'PlaceOrder' and confirmation_state == 'Confirmed':
        return process_order(intent_request)
    elif intent_name == 'PlaceOrder' and confirmation_state == 'Denied':
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Got it. Let me know if I can help you with something else.')
    elif intent_name == 'PlaceOrder' and confirmation_state not in ['Denied','Confirmed','None']:
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Sorry. I am having trouble completing the request. Let me get someone to help you.')
        logger.debug('exiting intent {} here'.format(intent_request['sessionState']['intent']['name']))
    elif intent_name == 'ProcessOrder' and confirmation_state == 'None':
        return validate_request(intent_request)
    elif intent_name == 'ProcessOrder' and confirmation_state == 'Confirmed':
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Perfect! Your order has been processed. Please proceed to payment.')
    elif intent_name == 'ProcessOrder' and confirmation_state == 'Denied':
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Got it. Let me know if I can help you with something else.')
    elif intent_name == 'ProcessOrder' and confirmation_state not in ['Denied','Confirmed','None']:
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Sorry. I am having trouble completing the request. Let me get someone to help you.')
        logger.debug('exiting intent {} here'.format(intent_request['sessionState']['intent']['name']))
    raise Exception('Intent with name ' + intent_name + ' not supported')
    
def prompt_bedrock(formatted_template):
    logger.debug('prompt bedrock input is: {}'.format(formatted_template))
    body = json.loads(formatted_template)

    modelId = 'anthropic.claude-v2' # change this to use a different version from the model provider
    accept = 'application/json'
    contentType = 'application/json'

    # invoke_model expects the request body as a JSON string
    response = bedrock.invoke_model(body=json.dumps(body), modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    response_completion = response_body.get('completion')
    logger.debug('response is: {}'.format(response_completion))

    #print_ww(response_body.get('completion'))
    #print(response_body.get('results')[0].get('outputText'))
    return response_completion

#function to extract text between the <Response> and </Response> tags within model completion
def extract_response(response_completion):
    
    if '<Response>' in response_completion:
        customer_response = response_completion.replace('<Response>','||').replace('</Response>','').split('||')[1]
        
        logger.debug('modified response is: {}'.format(response_completion))

        return customer_response
    else:
        
        logger.debug('modified response is: {}'.format(response_completion))

        return response_completion
        
#function to extract text between the <intent> and </intent> tags within model completion
def extract_intent(response_completion):
    if '<intent>' in response_completion:
        customer_intent = response_completion.replace('<intent>','||').replace('</intent>','||').split('||')[1]
        return customer_intent
    else:
        # no intent tag found; return the full completion so the caller can handle it
        return response_completion
        
def close(session_attributes, intent, fulfillment_state, action_type, message):
    #This function prepares the response in the appropriate format for Lex V2
    response = {
        "sessionState": {
            "sessionAttributes": session_attributes,
            "dialogAction": {
                "type": action_type
            },
            "intent": {
                "name": intent,
                "state": fulfillment_state
            },
        },
        "messages": [{
            "contentType": "PlainText",
            "content": message,
        }]
    }
    return response
  3. Attach the Lambda layer you created earlier to this function.
  4. Additionally, attach the layer that contains the prompt templates you created earlier.
  5. In the Lambda execution role, attach the policy to access Amazon Bedrock, which was created earlier.

The Lambda execution role should have the following permissions.
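As an illustrative sketch only (the exact policy for your environment may differ), the role needs at least permission to invoke the Claude model on Amazon Bedrock and the two helper Lambda functions. The following boto3 snippet attaches such an inline policy; the role name, policy name, and function ARNs are placeholders.

import json
import boto3

iam = boto3.client('iam')

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2"
        },
        {
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction"],
            "Resource": [
                "arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-validator",
                "arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-processing"
            ]
        }
    ]
}

iam.put_role_policy(
    RoleName='order-processing-orchestrator-role',   # placeholder role name
    PolicyName='bedrock-and-helper-invoke',          # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)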

Attach the Orchestration Lambda function to the Amazon Lex bot

  1. After you create the function in the previous section, return to the Amazon Lex console and navigate to your bot.
  2. Under Languages in the navigation pane, choose English.
  3. For Source, choose the orchestration Lambda function you created.
  4. For Lambda function version or alias, choose $LATEST.
  5. Choose Save.
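You can also make this association programmatically. The following sketch uses the Lex V2 model-building API via boto3 to attach the orchestration function to the alias for the en_US locale; the bot ID, alias ID, alias name, and function ARN are placeholders for your own values.

import boto3

lex = boto3.client('lexv2-models')

lex.update_bot_alias(
    botId='<botId>',                 # placeholder bot ID
    botAliasId='TSTALIASID',         # placeholder alias ID
    botAliasName='TestBotAlias',     # placeholder alias name
    botAliasLocaleSettings={
        'en_US': {
            'enabled': True,
            'codeHookSpecification': {
                'lambdaCodeHook': {
                    'lambdaARN': 'arn:aws:lambda:us-east-1:<AccountNumber>:function:order-processing-orchestrator',
                    'codeHookInterfaceVersion': '1.0',
                }
            }
        }
    },
)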

Create assisting Lambda functions

Complete the following steps to create additional Lambda functions:

  1. Create a Lambda function to query the DynamoDB table that you created earlier:
import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
# Initialize the DynamoDB client
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('your-table-name')

def calculate_grand_total(input_data):
    # Initialize the total price
    total_price = 0
    
    try:
        # Loop through each item in the input JSON
        for item_id, item_data in input_data.items():
            item_name = item_data['item'].lower()  # Convert item name to lowercase
            item_size = item_data['size'].lower()  # Convert item size to lowercase
            
            # Query the DynamoDB table for the item based on Item and Size
            response = table.get_item(
                Key={'Item': item_name,
                    'Size': item_size}
            )
            
            # Check if the item was found in the table
            if 'Item' in response:
                item = response['Item']
                price = float(item['Price'])
                total_price += price  # Add the item's price to the total
    
        return total_price
    except Exception as e:
        raise Exception('An error occurred: {}'.format(str(e)))

def lambda_handler(event, context):
    try:
       
        # Parse the input JSON from the Lambda event
        input_json = json.loads(event)

        # Calculate the grand total
        grand_total = calculate_grand_total(input_json)
    
        # Return the grand total in the response
        return {'statusCode': 200,'body': json.dumps(grand_total)}
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps('An error occurred: {}'.format(str(e)))
        }
  2. Navigate to the Configuration tab in the Lambda function and choose Permissions.
  3. Attach a resource-based policy statement allowing the order processing Lambda function to invoke this function.

  4. Navigate to the IAM execution role for this Lambda function and add a policy to access the DynamoDB table.

  5. Create another Lambda function to validate if all required attributes were passed from the customer. In the following example, we validate if the size attribute is captured for an order:
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

def lambda_handler(event, context):
    # Define customer orders from the input event
    customer_orders = json.loads(event)

    # Initialize a list to collect error messages
    order_errors = {}
    missing_size = []
    error_messages = []
    # Iterate through each order in customer_orders
    for order_id, order in customer_orders.items():
        if "size" not in order or order["size"] == "":
            missing_size.append(order['item'])
            order_errors['size'] = missing_size
    if order_errors:
        items_missing_size = order_errors['size']
        error_message = f"could you please provide the size for the following items: {', '.join(items_missing_size)}?"
        error_messages.append(error_message)

    # Prepare the response message
    if error_messages:
        response_message = "n".join(error_messages)
        return {
        'statusCode': 205,
        'validator_response': response_message
            }   
    else:
        response_message = "Order is validated successfully"
        return {
        'statusCode': 200,
        'validator_response': response_message
        }
  6. Navigate to the Configuration tab in the Lambda function and choose Permissions.
  7. Attach a resource-based policy statement allowing the order processing Lambda function to invoke this function.

Test the solution

Now we can test the solution with example orders that customers place via Amazon Lex.

For our first example, the customer asked for a frappuccino, which is not on the menu. The model validates the request with the help of the order validator template and suggests some recommendations based on the menu. After the customer confirms their order, they are notified of the order total and order summary. The order will be processed based on the customer’s final confirmation.

In our next example, the customer orders a large cappuccino and then modifies the size from large to medium. The model captures all the necessary changes and requests the customer to confirm the order. The model presents the order total and order summary, and processes the order based on the customer’s final confirmation.

For our final example, the customer placed an order for multiple items and the size is missing for a couple of items. The model and Lambda function will verify if all required attributes are present to process the order and then ask the customer to provide the missing information. After the customer provides the missing information (in this case, the size of the coffee), they’re shown the order total and order summary. The order will be processed based on the customer’s final confirmation.

LLM limitations

LLM outputs are stochastic by nature, which means the results can vary in format or even include untruthful content (hallucinations). Therefore, developers need to rely on robust error handling logic throughout their code in order to handle these scenarios and avoid a degraded end-user experience.
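For example, the object creator step expects the model to return JSON, but a completion can occasionally come back malformed or wrapped in extra prose. The following is a minimal sketch of the kind of defensive parsing you might add; the helper name and fallback message are illustrative, not part of the solution code.

import json
import re

def safe_parse_order_json(completion, fallback_message):
    """Try to extract and parse a JSON object from an LLM completion."""
    try:
        # Keep only the first {...} block in case the model added extra prose
        match = re.search(r'\{.*\}', completion, re.DOTALL)
        if match is None:
            raise ValueError('no JSON object found in completion')
        return json.loads(match.group(0)), None
    except ValueError as e:
        # Surface a friendly retry message instead of failing the session
        return None, '{} ({})'.format(fallback_message, e)

You could call this as order_data, error = safe_parse_order_json(object_creation_response, 'Sorry, could you repeat your order?') and return the error message to the customer when order_data is None.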

Clean up

If you no longer need this solution, you can delete the following resources:

  • Lambda functions
  • Amazon Lex bot
  • DynamoDB table
  • S3 bucket

Additionally, shut down the SageMaker Studio instance if the application is no longer required.

Cost assessment

For pricing information for the main services used by this solution, see the following:

Note that you can use Claude v2 on demand, without the need for provisioned throughput, so overall costs remain at a minimum. To further reduce costs, you can configure the DynamoDB table with the on-demand capacity setting.

Conclusion

This post demonstrated how to build a speech-enabled AI order processing agent using Amazon Lex, Amazon Bedrock, and other AWS services. We showed how prompt engineering with a powerful generative AI model like Claude can enable robust natural language understanding and conversation flows for order processing without the need for extensive training data.

The solution architecture uses serverless components like Lambda, Amazon S3, and DynamoDB to enable a flexible and scalable implementation. Storing the prompt templates in Amazon S3 allows you to customize the solution for different use cases.

Next steps could include expanding the agent’s capabilities to handle a wider range of customer requests and edge cases. The prompt templates provide a way to iteratively improve the agent’s skills. Additional customizations could involve integrating the order data with backend systems like inventory, CRM, or POS. Lastly, the agent could be made available across various customer touchpoints like mobile apps, drive-thru, kiosks, and more using the multi-channel capabilities of Amazon Lex.

To learn more, refer to the following related resources:


About the Authors

Moumita Dutta is a Partner Solution Architect at Amazon Web Services. In her role, she collaborates closely with partners to develop scalable and reusable assets that streamline cloud deployments and enhance operational efficiency. She is a member of AI/ML community and a Generative AI expert at AWS. In her leisure, she enjoys gardening and cycling.

Fernando Lammoglia is a Partner Solutions Architect at Amazon Web Services, working closely with AWS partners in spearheading the development and adoption of cutting-edge AI solutions across business units. A strategic leader with expertise in cloud architecture, generative AI, machine learning, and data analytics. He specializes in executing go-to-market strategies and delivering impactful AI solutions aligned with organizational goals. On his free time he loves to spend time with his family and travel to other countries.

Mitul Patel is a Senior Solution Architect at Amazon Web Services. In his role as a cloud technology enabler, he works with customers to understand their goals and challenges, and provides prescriptive guidance to achieve their objective with AWS offerings. He is a member of AI/ML community and a Generative AI ambassador at AWS. In his free time, he enjoys hiking and playing soccer.

Read More

Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker

Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker

This post is co-written with Chaoyang He, Al Nevarez and Salman Avestimehr from FedML.

Many organizations are implementing machine learning (ML) to enhance their business decision-making through automation and the use of large distributed datasets. With increased access to data, ML has the potential to provide unparalleled business insights and opportunities. However, the sharing of raw, non-sanitized sensitive information across different locations poses significant security and privacy risks, especially in regulated industries such as healthcare.

To address this issue, federated learning (FL) is a decentralized and collaborative ML training technique that offers data privacy while maintaining accuracy and fidelity. Unlike traditional ML training, FL training occurs within an isolated client location using an independent secure session. The client only shares its output model parameters with a centralized server, known as the training coordinator or aggregation server, and not the actual data used to train the model. This approach alleviates many data privacy concerns while enabling effective collaboration on model training.

Although FL is a step towards achieving better data privacy and security, it’s not a guaranteed solution. Insecure networks lacking access control and encryption can still expose sensitive information to attackers. Additionally, locally trained information can expose private data if reconstructed through an inference attack. To mitigate these risks, the FL model uses personalized training algorithms and effective masking and parameterization before sharing information with the training coordinator. Strong network controls at local and centralized locations can further reduce inference and exfiltration risks.

In this post, we share an FL approach using FedML, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to improve patient outcomes while addressing data privacy and security concerns.

The need for federated learning in healthcare

Healthcare relies heavily on distributed data sources to make accurate predictions and assessments about patient care. Limiting the available data sources to protect privacy negatively affects result accuracy and, ultimately, the quality of patient care. Therefore, ML creates challenges for AWS customers who need to ensure privacy and security across distributed entities without compromising patient outcomes.

Healthcare organizations must navigate strict compliance regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, while implementing FL solutions. Ensuring data privacy, security, and compliance becomes even more critical in healthcare, requiring robust encryption, access controls, auditing mechanisms, and secure communication protocols. Additionally, healthcare datasets often contain complex and heterogeneous data types, making data standardization and interoperability a challenge in FL settings.

Use case overview

The use case outlined in this post is of heart disease data in different organizations, on which an ML model will run classification algorithms to predict heart disease in the patient. Because this data is across organizations, we use federated learning to collate the findings.

The Heart Disease dataset from the University of California Irvine’s Machine Learning Repository is a widely used dataset for cardiovascular research and predictive modeling. It consists of 303 samples, each representing a patient, and contains a combination of clinical and demographic attributes, as well as the presence or absence of heart disease.

This multivariate dataset has 76 attributes in the patient information, out of which 14 attributes are most commonly used for developing and evaluating ML algorithms to predict the presence of heart disease based on the given attributes.

FedML framework

There is a wide selection of FL frameworks, but we decided to use the FedML framework for this use case because it is open source and supports several FL paradigms. FedML provides a popular open source library, MLOps platform, and application ecosystem for FL. These facilitate the development and deployment of FL solutions. It provides a comprehensive suite of tools, libraries, and algorithms that enable researchers and practitioners to implement and experiment with FL algorithms in a distributed environment. FedML addresses the challenges of data privacy, communication, and model aggregation in FL, offering a user-friendly interface and customizable components. With its focus on collaboration and knowledge sharing, FedML aims to accelerate the adoption of FL and drive innovation in this emerging field. The FedML framework is model agnostic, including recently added support for large language models (LLMs). For more information, refer to Releasing FedLLM: Build Your Own Large Language Models on Proprietary Data using the FedML Platform.

FedML Octopus

System hierarchy and heterogeneity is a key challenge in real-life FL use cases, where different data silos may have different infrastructure with CPU and GPUs. In such scenarios, you can use FedML Octopus.

FedML Octopus is the industrial-grade platform of cross-silo FL for cross-organization and cross-account training. Coupled with FedML MLOps, it enables developers or organizations to conduct open collaboration from anywhere at any scale in a secure manner. FedML Octopus runs a distributed training paradigm inside each data silo and uses synchronous or asynchronous trainings.

FedML MLOps

FedML MLOps enables local development of code that can later be deployed anywhere using FedML frameworks. Before initiating training, you must create a FedML account, as well as create and upload the server and client packages in FedML Octopus. For more details, refer to steps and Introducing FedML Octopus: scaling federated learning into production with simplified MLOps.

Solution overview

We deploy FedML into multiple EKS clusters integrated with SageMaker for experiment tracking. We use Amazon EKS Blueprints for Terraform to deploy the required infrastructure. EKS Blueprints helps compose complete EKS clusters that are fully bootstrapped with the operational software that is needed to deploy and operate workloads. With EKS Blueprints, the configuration for the desired state of EKS environment, such as the control plane, worker nodes, and Kubernetes add-ons, is described as an infrastructure as code (IaC) blueprint. After a blueprint is configured, it can be used to create consistent environments across multiple AWS accounts and Regions using continuous deployment automation.

The content shared in this post reflects real-life situations and experiences, but it’s important to note that the deployment of these situations in different locations may vary. Although we utilize a single AWS account with separate VPCs, it’s crucial to understand that individual circumstances and configurations may differ. Therefore, the information provided should be used as a general guide and may require adaptation based on specific requirements and local conditions.

The following diagram illustrates our solution architecture.

In addition to the tracking provided by FedML MLOps for each training run, we use Amazon SageMaker Experiments to track the performance of each client model and the centralized (aggregator) model.

SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare your ML experiments. By recording experiment details, parameters, and results, researchers can accurately reproduce and validate their work. It allows for effective comparison and analysis of different approaches, leading to informed decision-making. Additionally, tracking experiments facilitates iterative improvement by providing insights into the progression of models and enabling researchers to learn from previous iterations, ultimately accelerating the development of more effective solutions.

We send the following to SageMaker Experiments for each run:

  • Model evaluation metrics – Training loss and Area Under the Curve (AUC)
  • Hyperparameters – Epoch, learning rate, batch size, optimizer, and weight decay

Prerequisites

To follow along with this post, you should have the following prerequisites:

Deploy the solution

To begin, clone the repository hosting the sample code locally:

git clone git@ssh.gitlab.aws.dev:west-ml-sa/fl_fedml.ai.git

Then deploy the use case infrastructure using the following commands:

terraform init
terraform apply

The Terraform template may take 20–30 minutes to fully deploy. After it’s deployed, follow the steps in the next sections to run the FL application.

Create an MLOps deployment package

As a part of the FedML documentation, we need to create the client and server packages, which the MLOps platform will distribute to the server and clients to begin training.

To create these packages, run the following script found in the root directory:

. ./build_mlops_pkg.sh

This will create the respective packages in the following directory in the project’s root directory:

mlops/dist-packages

Upload the packages to the FedML MLOps platform

Complete the following steps to upload the packages:

  1. On the FedML UI, choose My Applications in the navigation pane.
  2. Choose New Application.
  3. Upload the client and server packages from your workstation.
  4. You can also adjust the hyperparameters or create new ones.

Trigger federated training

To run federated training, complete the following steps:

  1. On the FedML UI, choose Project List in the navigation pane.
  2. Choose Create a new project.
  3. Enter a group name and a project name, then choose OK.
  4. Choose the newly created project and choose Create new run to trigger a training run.
  5. Select the edge client devices and the central aggregator server for this training run.
  6. Choose the application that you created in the previous steps.
  7. Update any of the hyperparameters or use the default settings.
  8. Choose Start to start training.
  9. Choose the Training Status tab and wait for the training run to complete. You can also navigate to the tabs available.
  10. When training is complete, choose the System tab to see the training time durations on your edge servers and aggregation events.

View results and experiment details

When the training is complete, you can view the results using FedML and SageMaker.

On the FedML UI, on the Models tab, you can see the aggregator and client model. You can also download these models from the website.

You can also log in to Amazon SageMaker Studio and choose Experiments in the navigation pane.

The following screenshot shows the logged experiments.

Experiment tracking code

In this section, we explore the code that integrates SageMaker experiment tracking with the FL framework training.

In an editor of your choice, open the following folder to see the edits to the code to inject SageMaker experiment tracking code as a part of the training:

cd fl_fedml.ai/

For tracking the training, we create a SageMaker experiment with parameters and metrics logged using the log_parameter and log_metric command as outlined in the following code sample.

An entry in the config/fedml_config.yaml file declares the experiment prefix, which is referenced in the code to create unique experiment names: sm_experiment_name: "fed-heart-disease". You can update this to any value of your choice.

For example, see the following code for the heart_disease_trainer.py, which is used by each client to train the model on their own dataset:

# Add this code before the for loop on epochs
# We are passing the experiment prefix & client-rank from the config
# to the function to create a unique name
# (unique_name_from_base is provided by the SageMaker Python SDK)
from sagemaker.utils import unique_name_from_base

experiment_name = unique_name_from_base(args.sm_experiment_name + "-client-" + str(args.rank))
print(f"Sagemaker Experiment Name: {experiment_name}")

For each client run, the experiment details are tracked using the following code in heart_disease_trainer.py:

# create an experiment and start a new run
with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:
    run.log_parameters(
        {
            "Train Data Size": str(len(train_data.dataset)),
            "device": "cpu",
            "center": args.rank,
            "learning-rate": args.lr,
            "batch-size": args.batch_size,
            "client-optimizer": args.client_optimizer,
            "weight-decay": args.weight_decay,
        }
    )
    run.log_metric(name="Validation:AUC", value=epoch_auc)
    run.log_metric(name="Training:Loss", value=epoch_loss)

Similarly, you can use the code in heart_disease_aggregator.py to run a test on local data after updating the model weights. The details are logged after each communication run with the clients.

# create an experiment and start a new run
with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:
    run.log_parameters(
        {
            "Train Data Size": str(len(test_data_local_dict[i])),
            "device": "cpu",
            "round": i,
            "learning-rate": args.lr,
            "batch-size": args.batch_size,
            "client-optimizer": args.client_optimizer,
            "weight-decay": args.weight_decay,
        }
    )
    run.log_metric(name="Test:AUC", value=test_auc_metrics)
    run.log_metric(name="Test:Loss", value=test_loss_metrics)

Clean up

When you’re done with the solution, clean up the resources you used to ensure efficient resource utilization and cost management, and to avoid unnecessary expenses. Actively tidying up the environment, such as deleting unused instances, stopping unnecessary services, and removing temporary data, contributes to a clean and organized infrastructure. You can use the following code to clean up your resources:

terraform destroy -target=module.m_fedml_edge_server.module.eks_blueprints_kubernetes_addons -auto-approve
terraform destroy -target=module.m_fedml_edge_client_1.module.eks_blueprints_kubernetes_addons -auto-approve
terraform destroy -target=module.m_fedml_edge_client_2.module.eks_blueprints_kubernetes_addons -auto-approve

terraform destroy -target=module.m_fedml_edge_client_1.module.eks -auto-approve
terraform destroy -target=module.m_fedml_edge_client_2.module.eks -auto-approve
terraform destroy -target=module.m_fedml_edge_server.module.eks -auto-approve

terraform destroy

Summary

By using Amazon EKS as the infrastructure and FedML as the framework for FL, we are able to provide a scalable and managed environment for training and deploying shared models while respecting data privacy. With the decentralized nature of FL, organizations can collaborate securely, unlock the potential of distributed data, and improve ML models without compromising data privacy.

As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.


About the Authors

Randy DeFauwRandy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

Arnab Sinha is a Senior Solutions Architect for AWS, acting as Field CTO to help organizations design and build scalable solutions supporting business outcomes across data center migrations, digital transformation and application modernization, big data, and machine learning. He has supported customers across a variety of industries, including energy, retail, manufacturing, healthcare, and life sciences. Arnab holds all AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Arnab was a technology leader and previously held architect and engineering leadership roles.

Prachi Kulkarni is a Senior Solutions Architect at AWS. Her specialization is machine learning, and she is actively working on designing solutions using various AWS ML, big data, and analytics offerings. Prachi has experience in multiple domains, including healthcare, benefits, retail, and education, and has worked in a range of positions in product engineering and architecture, management, and customer success.

Tamer Sherif is a Principal Solutions Architect at AWS, with a diverse background in the technology and enterprise consulting services realm, spanning over 17 years as a Solutions Architect. With a focus on infrastructure, Tamer’s expertise covers a broad spectrum of industry verticals, including commercial, healthcare, automotive, public sector, manufacturing, oil and gas, media services, and more. His proficiency extends to various domains, such as cloud architecture, edge computing, networking, storage, virtualization, business productivity, and technical leadership.

Hans Nesbitt is a Senior Solutions Architect at AWS based out of Southern California. He works with customers across the western US to craft highly scalable, flexible, and resilient cloud architectures. In his spare time, he enjoys spending time with his family, cooking, and playing guitar.

Chaoyang He is Co-founder and CTO of FedML, Inc., a startup running for a community building open and collaborative AI from anywhere at any scale. His research focuses on distributed and federated machine learning algorithms, systems, and applications. He received his PhD in Computer Science from the University of Southern California.

Al Nevarez is Director of Product Management at FedML. Before FedML, he was a group product manager at Google, and a senior manager of data science at LinkedIn. He has several data product-related patents, and he studied engineering at Stanford University.

Salman Avestimehr is Co-founder and CEO of FedML. He has been a Dean’s Professor at USC, Director of the USC-Amazon Center on Trustworthy AI, and an Amazon Scholar in Alexa AI. He is an expert on federated and decentralized machine learning, information theory, security, and privacy. He is a Fellow of IEEE and received his PhD in EECS from UC Berkeley.

Samir Lad is an accomplished enterprise technologist with AWS who works closely with customers’ C-level executives. As a former C-suite executive who has driven transformations across multiple Fortune 100 companies, Samir shares his invaluable experiences to help his clients succeed in their own transformation journey.

Stephen Kraemer is a Board and CxO advisor and former executive at AWS. Stephen advocates culture and leadership as the foundations of success. He professes security and innovation the drivers of cloud transformation enabling highly competitive, data-driven organizations.

Read More

Enable data sharing through federated learning: A policy approach for chief digital officers

Enable data sharing through federated learning: A policy approach for chief digital officers

This is a guest blog post written by Nitin Kumar, a Lead Data Scientist at T and T Consulting Services, Inc.

In this post, we discuss the value and potential impact of federated learning in the healthcare field. This approach can help heart stroke patients, doctors, and researchers with faster diagnosis, enriched decision-making, and more informed, inclusive research work on stroke-related health issues, using a cloud-native approach with AWS services for lightweight lift and straightforward adoption.

Diagnosis challenges with heart strokes

Statistics from the Centers for Disease Control and Prevention (CDC) show that each year in the US, more than 795,000 people have a stroke, and nearly 1 in 4 of those strokes occur in people who have had a previous stroke. It is the number five cause of death according to the American Stroke Association and a leading cause of disability in the US. Therefore, it’s crucial to have prompt diagnosis and treatment to reduce brain damage and other complications in acute stroke patients.

CTs and MRIs are the gold standard in imaging technologies for classifying different sub-types of strokes and are crucial during preliminary assessment of patients, determining the root cause, and treatment. One critical challenge here, especially in the case of acute stroke, is the time of imaging diagnosis, which on average ranges from 30 minutes up to an hour and can be much longer depending on emergency department crowding.

Doctors and medical staff need quick and accurate image diagnosis to evaluate a patient’s condition and propose treatment options. In Dr. Werner Vogels’s own words at AWS re:Invent 2023, “every second that a person has a stroke counts.” Stroke victims can lose around 1.9 billion neurons every second they are not being treated.

Medical data restrictions

You can use machine learning (ML) to assist doctors and researchers in diagnosis tasks, thereby speeding up the process. However, the datasets needed to build the ML models and give reliable results are sitting in silos across different healthcare systems and organizations. This isolated legacy data has the potential for massive impact if combined. So why hasn’t it been used yet?

There are multiple challenges when working with medical domain datasets and building ML solutions, including patient privacy, security of personal data, and certain bureaucratic and policy restrictions. Additionally, research institutions have been tightening their data sharing practices. These obstacles also prevent international research teams from working together on diverse and rich datasets, which could save lives and prevent disabilities that can result from heart strokes, among other benefits.

Policies and regulations like the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and California Consumer Privacy Act (CCPA) put guardrails on sharing data from the medical domain, especially patient data. Additionally, the datasets at individual institutes, organizations, and hospitals are often too small, are unbalanced, or have biased distributions, leading to model generalization constraints.

Federated learning: An introduction

Federated learning (FL) is a decentralized form of ML—a dynamic engineering approach. In this decentralized ML approach, the ML model is shared between organizations for training on proprietary data subsets, unlike traditional centralized ML training, where the model generally trains on aggregated datasets. The data stays protected behind the organization’s firewalls or VPC, while the model with its metadata is shared.

In the training phase, a global FL model is disseminated and synchronized between unit organizations for training on individual datasets, and a local trained model is returned. The final global model is available to use to make predictions for everyone among the participants, and can also be used as a base for further training to build local custom models for participating organizations. It can further be extended to benefit other institutes. This approach can significantly reduce the cybersecurity requirements for data in transit by removing the need for data to transit outside of the organization’s boundaries at all.
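The following framework-agnostic sketch illustrates the aggregation step described above using federated averaging (FedAvg). The train_locally callable and the size-based client weighting are illustrative assumptions, not part of any specific FL framework; parameters are assumed to be lists of numeric arrays (for example, NumPy arrays).

def federated_average(client_params, client_sizes):
    """Weight each client's parameter arrays by its local dataset size."""
    total = sum(client_sizes)
    num_layers = len(client_params[0])
    return [
        sum(params[layer] * (n / total) for params, n in zip(client_params, client_sizes))
        for layer in range(num_layers)
    ]

def communication_round(global_params, clients, train_locally):
    # One round: the coordinator sends global_params out; each client trains
    # locally and returns only its updated parameters and sample count,
    # never the underlying patient records.
    client_params, client_sizes = [], []
    for client in clients:
        params, n_samples = train_locally(client, global_params)
        client_params.append(params)
        client_sizes.append(n_samples)
    return federated_average(client_params, client_sizes)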

The following diagram illustrates an example architecture.

In the following sections, we discuss how federated learning can help.

Federation learning to save the day (and save lives)

For good artificial intelligence (AI), you need good data.

Legacy systems, which are frequently found in the federal domain, pose significant data processing challenges before you can derive any intelligence from them or merge them with newer datasets. This is an obstacle to providing valuable intelligence to leaders, and it can lead to inaccurate decision-making because legacy data often represents a larger and more informative share of the available information than the newer, smaller datasets. You want to resolve this bottleneck effectively, without months (or, in many cases, years) of manual consolidation and integration effort, including cumbersome mapping processes, for legacy and newer datasets scattered across hospitals and institutes. Legacy data is valuable because it holds important contextual information needed for accurate decision-making and well-informed model training, leading to reliable AI in the real world. A longer data history also reveals long-term variations and patterns that would otherwise go undetected and lead to biased and ill-informed predictions.

Breaking down these data silos to unite the untapped potential of the scattered data can save and transform many lives. It can also accelerate the research related to secondary health issues arising from heart strokes. This solution can help you share insights from data isolated between institutes due to policy and other reasons, whether you are a hospital, a research institute, or other health data-focused organizations. It can enable informed decisions on research direction and diagnosis. Additionally, it results in a centralized repository of intelligence via a secure, private, and global knowledge base.

Federated learning has many benefits in general and specifically for medical data settings.

Security and Privacy features:

  • Keeps sensitive data away from the internet and still uses it for ML, and harnesses its intelligence with differential privacy
  • Enables you to build, train, and deploy unbiased and robust models across not just machines but also networks, without any data security hazards
  • Overcomes the hurdles with multiple vendors managing the data
  • Eliminates the need for cross-site data sharing and global governance
  • Preserves privacy with differential privacy and offers secure multi-party computation with local training

Performance Improvements:

  • Addresses the small sample size problem in the medical imaging space and costly labeling processes
  • Balances the distribution of the data
  • Enables you to incorporate most traditional ML and deep learning (DL) methods
  • Uses pooled image sets to help improve statistical power, overcoming the sample size limitation of individual institutions

Resilience Benefits:

  • If any one party decides to leave, it won’t hinder the training
  • A new hospital or institute can join at any time; it’s not reliant on any specific dataset with any node organization
  • There is no need for extensive data engineering pipelines for the legacy data scattered across widespread geographical locations

These features can help bring the walls down between institutions hosting isolated datasets on similar domains. The solution can become a force multiplier by harnessing the unified powers of distributed datasets and improving efficiency by radically transforming the scalability aspect without the heavy infrastructure lift. This approach helps ML reach its full potential, becoming proficient at the clinical level and not just research.

Federated learning has comparable performance to regular ML, as shown in the following experiment by NVIDIA Clara (on a Medical Model ARchive (MMAR) using the BraTS2018 dataset). Here, FL achieved segmentation performance comparable to training with centralized data: over 80% with approximately 600 epochs while training a multi-modal, multi-class brain tumor segmentation task.

Federated learning has been tested recently in a few medical sub-fields for use cases including patient similarity learning, patient representation learning, phenotyping, and predictive modeling.

Application blueprint: Federated learning makes it possible and straightforward

To get started with FL, you can choose from many high-quality datasets. For example, datasets with brain images include ABIDE (Autism Brain Imaging Data Exchange initiative), ADNI (Alzheimer’s Disease Neuroimaging Initiative), RSNA (Radiological Society of North America) Brain CT, BraTS (Multimodal Brain Tumor Image Segmentation Benchmark) updated regularly for the Brain Tumor Segmentation Challenge under UPenn (University of Pennsylvania), UK BioBank (covered in the following NIH paper), and IXI. Similarly for heart images, you can choose from several publicly available options, including ACDC (Automatic Cardiac Diagnosis Challenge), which is a cardiac MRI assessment dataset with full annotation mentioned by the National Library of Medicine in the following paper, and M&M (Multi-Center, Multi-Vendor, and Multi-Disease) Cardiac Segmentation Challenge mentioned in the following IEEE paper.

The following images show a probabilistic lesion overlap map for the primary lesions from the ATLAS R1.1 dataset. (Strokes are one of the most common causes of brain lesions according to Cleveland Clinic.)

For Electronic Health Records (EHR) data, a few datasets are available that follow the Fast Healthcare Interoperability Resources (FHIR) standard. This standard helps you build straightforward pilots by removing certain challenges with heterogeneous, non-normalized datasets, allowing for seamless and secure exchange, sharing, and integration of datasets. FHIR enables maximum interoperability. Dataset examples include MIMIC-IV (Medical Information Mart for Intensive Care). Other good-quality datasets that aren’t currently in FHIR format but can be easily converted include Centers for Medicare & Medicaid Services (CMS) Public Use Files (PUF) and the eICU Collaborative Research Database from MIT (Massachusetts Institute of Technology). There are also other resources becoming available that offer FHIR-based datasets.

The lifecycle for implementing FL can include the following steps: task initialization, selection, configuration, model training, client/server communication, scheduling and optimization, versioning, testing, deployment, and termination. There are many time-intensive steps that go into preparing medical imaging data for traditional ML, as described in the following paper. Domain knowledge might be needed in some scenarios to preprocess raw patient data, especially due to its sensitive and private nature. These can be consolidated and sometimes eliminated for FL, saving crucial time for training and providing faster results.

Implementation

FL tools and libraries have matured with widespread support, making it straightforward to adopt FL without a heavy overhead lift. There are many good resources and framework options available to get started; the most popular frameworks and tools in the FL domain include PySyft, FedML, Flower, OpenFL, FATE, TensorFlow Federated, and NVFlare. These projects provide beginner-friendly examples to get started quickly and build upon.
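
As an example of how lightweight these frameworks can be, the following is a minimal Flower client sketch in which each participating site trains a toy linear model on its local data; the server address is a placeholder, and the exact Flower API can vary slightly between versions:

```python
import numpy as np
import flwr as fl

class HospitalClient(fl.client.NumPyClient):
    """Minimal Flower client holding a toy linear model for one hospital's data."""

    def __init__(self, X, y):
        self.X, self.y = X, y
        self.w = np.zeros(X.shape[1])

    def get_parameters(self, config):
        return [self.w]

    def fit(self, parameters, config):
        self.w = parameters[0]
        for _ in range(5):  # a few local gradient steps on this site's data
            grad = self.X.T @ (self.X @ self.w - self.y) / len(self.y)
            self.w -= 0.01 * grad
        return [self.w], len(self.y), {}

    def evaluate(self, parameters, config):
        self.w = parameters[0]
        loss = float(np.mean((self.X @ self.w - self.y) ** 2))
        return loss, len(self.y), {}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
    # The server address is a placeholder for your central aggregation server.
    fl.client.start_numpy_client(server_address="<fl-server-host>:8080",
                                 client=HospitalClient(X, y))
```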

You can implement a cloud-native approach with Amazon SageMaker that works seamlessly with AWS VPC peering, keeping each node’s training in a private subnet within its respective VPC and enabling communication over private IPv4 addresses. Furthermore, model hosting on Amazon SageMaker JumpStart can help by exposing an endpoint API without sharing model weights.
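
For example, once the aggregated global model is hosted behind a SageMaker endpoint, participating sites can request predictions without ever receiving the model weights. The following sketch assumes a hypothetical endpoint name and JSON payload format, both of which depend on how the model container is configured:

```python
import json
import boto3

# Hypothetical endpoint name for the global model hosted by the coordinating institution.
ENDPOINT_NAME = "fl-global-model-endpoint"

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

payload = {"inputs": [[0.12, 0.48, 0.33]]}  # illustrative feature vector
response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
prediction = json.loads(response["Body"].read())
print(prediction)
```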

Amazon Elastic Compute Cloud (Amazon EC2) resources also remove the compute constraints that come with on-premises hardware. You can implement the FL clients and server on AWS with SageMaker notebooks and Amazon Simple Storage Service (Amazon S3), maintain regulated access to the data and model with AWS Identity and Access Management (IAM) roles, and use AWS Security Token Service (AWS STS) for client-side security. You can also build your own custom system for FL using Amazon EC2.
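
As a sketch of the client-side security pattern, a participating site could assume a narrowly scoped IAM role through AWS STS and use the resulting short-lived credentials to exchange model artifacts through Amazon S3; the role ARN, bucket, and object keys below are placeholders:

```python
import boto3

# Hypothetical role ARN scoped to the bucket prefix a participating site may access.
ROLE_ARN = "arn:aws:iam::<account-id>:role/fl-client-site-a"

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="fl-client-session",
    DurationSeconds=3600,  # short-lived credentials for one training round
)["Credentials"]

# Use the temporary credentials to exchange model artifacts through Amazon S3.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("local_weights.npz", "<model-exchange-bucket>", "site-a/round-7/weights.npz")
```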

For a detailed overview of implementing FL with the Flower framework on SageMaker, and a discussion of its difference from distributed training, refer to Machine learning with decentralized training data using federated learning on Amazon SageMaker.

The following figures illustrate the architecture of transfer learning in FL.

Addressing FL data challenges

Federated learning comes with its own data challenges, including privacy and security, but they are straightforward to address. First, you need to address the data heterogeneity problem that arises when medical imaging data is stored across different sites and participating organizations, known as a domain shift problem (also referred to as client shift in an FL system), as highlighted by Guan and Liu in the following paper. This heterogeneity can hinder convergence of the global model.
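
One widely used mitigation for client shift, which you could layer onto the local training loop, is FedProx: it adds a proximal term to each client's local objective so local models don't drift too far from the current global model. The following is a minimal sketch assuming a PyTorch model and a toy dataset, not a prescribed part of this solution:

```python
import torch

def local_train_fedprox(model, global_model, loader, mu=0.01, lr=1e-3, epochs=1):
    """Local training with a FedProx proximal term that penalizes drift
    from the current global weights (mitigates client/domain shift)."""
    global_params = [p.detach().clone() for p in global_model.parameters()]
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for X, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X), y)
            # Proximal term: (mu / 2) * ||w - w_global||^2
            prox = sum(((p - g) ** 2).sum() for p, g in zip(model.parameters(), global_params))
            (loss + 0.5 * mu * prox).backward()
            optimizer.step()
    return model

# Toy usage with a small linear classifier and random data.
model = torch.nn.Linear(4, 2)
global_model = torch.nn.Linear(4, 2)
global_model.load_state_dict(model.state_dict())
data = torch.utils.data.TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))
loader = torch.utils.data.DataLoader(data, batch_size=8)
local_train_fedprox(model, global_model, loader)
```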

Other components for consideration include ensuring data quality and uniformity at the source, incorporating expert knowledge into the learning process to inspire confidence in the system among medical professionals, and achieving model precision. For more information about some of the potential challenges you may face during implementation, refer to the following paper.

AWS helps you resolve these challenges with features like the flexible compute of Amazon EC2 and pre-built Docker images in SageMaker for straightforward deployment. You can resolve client-side problems such as unbalanced data and uneven computation resources across node organizations. You can address server-side learning problems, such as poisoning attacks from malicious parties, with Amazon Virtual Private Cloud (Amazon VPC), security groups, and other security controls that help prevent client corruption, and you can use AWS anomaly detection services to flag suspicious updates.
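
For the unbalanced-data problem specifically, one common server-side adjustment (shown here as an illustrative sketch, not a prescribed part of the solution) is to weight each site's update by its sample count during aggregation:

```python
import numpy as np

def weighted_fedavg(client_weights, client_sample_counts):
    """Aggregate client updates weighted by how much data each site contributed."""
    total = sum(client_sample_counts)
    stacked = np.stack(client_weights)                       # shape: (num_clients, num_params)
    coeffs = np.array(client_sample_counts, dtype=float) / total
    return np.tensordot(coeffs, stacked, axes=1)             # weighted average of updates

# Example: three hospitals with very different dataset sizes.
updates = [np.array([0.2, 0.4]), np.array([0.1, 0.3]), np.array([0.5, 0.9])]
counts = [5000, 800, 120]
print(weighted_fedavg(updates, counts))
```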

AWS also helps address real-world implementation challenges, such as integration work, compatibility issues with current or legacy hospital systems, and user adoption hurdles, by offering flexible, easy-to-use solutions that require minimal lift.

With AWS services, you can enable large-scale FL-based research and clinical implementation and deployment, which can consist of various sites across the world.

Recent policies on interoperability highlight the need for federated learning

Many laws recently passed by the government focus on data interoperability, bolstering the need for cross-organizational interoperability of data for intelligence. Frameworks such as TEFCA (Trusted Exchange Framework and Common Agreement) and the expanded USCDI (United States Core Data for Interoperability) underscore this need, which FL can help fulfill.

The proposed idea also contributes towards the CDC’s capture and distribution initiative CDC Moving Forward. The following quote from the GovCIO article Data Sharing and AI Top Federal Health Agency Priorities in 2024 also echoes a similar theme: “These capabilities can also support the public in an equitable way, meeting patients where they are and unlocking critical access to these services. Much of this work comes down to the data.”

This can help medical institutes and agencies around the country (and across the globe) that struggle with data silos. They can benefit from seamless and secure integration and data interoperability, making medical data usable for impactful ML-based predictions and pattern recognition. You can start with images, but the approach is applicable to all EHR data as well. The goal is to find the best approach for data stakeholders, with a cloud-native pipeline to normalize and standardize the data or use it directly for FL.

Let’s explore an example use case. Heart stroke imaging data and scans are scattered around the country and the world, sitting in isolated silos at institutes, universities, and hospitals, and separated by bureaucratic, geographical, and political boundaries. There is no single aggregated source and no easy way for medical professionals (non-programmers) to extract insights from the data. It’s also not feasible to train ML and DL models on this fragmented data, even though such models could help medical professionals make faster, more accurate decisions in critical moments, when heart scans can take hours to come in while a patient’s life hangs in the balance.

Other known use cases include POTS (Purchasing Online Tracking System) at NIH (National Institutes of Health) and cybersecurity for scattered and tiered intelligence solution needs at COMCOMs/MAJCOMs locations around the globe.

Conclusion

Federated learning holds great promise for legacy healthcare data analytics and intelligence. It’s straightforward to implement a cloud-native solution with AWS services, and FL is especially helpful for medical organizations with legacy data and technical challenges. FL can have a potential impact on the entire treatment cycle, and now even more so with the focus on data interoperability from large federal organizations and government leaders.

This solution can help you avoid reinventing the wheel and use the latest technology to take a leap from legacy systems and be at the forefront in this ever-evolving world of AI. You can also become a leader for best practices and an efficient approach to data interoperability within and across agencies and institutes in the health domain and beyond. If you are an institute or agency with data silos scattered around the country, you can benefit from this seamless and secure integration.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. It is each customer’s responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter into an AWS Business Associate Addendum (BAA) and follow its configuration requirements.


About the Author

Nitin Kumar (MS, CMU) is a Lead Data Scientist at T and T Consulting Services, Inc. He has extensive experience with R&D prototyping, health informatics, public sector data, and data interoperability. He applies his knowledge of cutting-edge research methods to the federal sector to deliver innovative technical papers, POCs, and MVPs. He has worked with multiple federal agencies to advance their data and AI goals. Nitin’s other focus areas include natural language processing (NLP), data pipelines, and generative AI.
