Unlock the potential of generative AI in industrial operations
In the evolving landscape of manufacturing, the transformative power of AI and machine learning (ML) is evident, driving a digital revolution that streamlines operations and boosts productivity. However, this progress introduces unique challenges for enterprises navigating data-driven solutions. Industrial facilities grapple with vast volumes of unstructured data, sourced from sensors, telemetry systems, and equipment dispersed across production lines. Real-time data is critical for applications like predictive maintenance and anomaly detection, yet developing custom ML models for each industrial use case with such time series data demands considerable time and resources from data scientists, hindering widespread adoption.
Generative AI using large pre-trained foundation models (FMs) such as Claude can rapidly generate a variety of content from conversational text to computer code based on simple text prompts, known as zero-shot prompting. This eliminates the need for data scientists to manually develop specific ML models for each use case, and therefore democratizes AI access, benefitting even small manufacturers. Workers gain productivity through AI-generated insights, engineers can proactively detect anomalies, supply chain managers optimize inventories, and plant leadership makes informed, data-driven decisions.
Nevertheless, standalone FMs face limitations in handling complex industrial data because of context size constraints (typically less than 200,000 tokens). To address this, you can use the FM's ability to generate code in response to natural language queries (NLQs). Agents like PandasAI then run this code on high-resolution time series data and handle errors using FMs. PandasAI is a Python library that adds generative AI capabilities to pandas, the popular data analysis and manipulation tool.
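To make the pattern concrete, the following minimal sketch shows how a PandasAI agent can be pointed at a CSV of sensor measurements and queried in natural language. It is a sketch only: the Bedrock connector helper and file name are placeholders, and the exact class names depend on your PandasAI version.

```python
import pandas as pd
from pandasai import SmartDataframe  # PandasAI wrapper around a pandas DataFrame

# Placeholder: construct an LLM connector backed by the Amazon Bedrock Claude v2 model.
# The connector class to use depends on your PandasAI version (for example, a Bedrock
# or LangChain wrapper), so this helper is hypothetical.
llm = get_bedrock_claude_llm()

df = pd.read_csv("monitron_measurements.csv")  # hypothetical sensor data export
sdf = SmartDataframe(df, config={"llm": llm})

# PandasAI asks the FM to write pandas code for the question, runs that code locally
# on the full time series, and retries on errors, so the raw data never has to fit
# into the model's context window.
print(sdf.chat("How many unique sensors are currently in the Alarm state?"))
```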
However, complex NLQs, such as time series data processing, multi-level aggregation, and pivot or joint table operations, may yield inconsistent Python script accuracy with a zero-shot prompt.
To enhance code generation accuracy, we propose dynamically constructing multi-shot prompts for NLQs. Multi-shot prompting provides additional context to the FM by showing it several examples of desired outputs for similar prompts, boosting accuracy and consistency. In this post, multi-shot prompts are retrieved from an embedding containing successful Python code run on a similar data type (for example, high-resolution time series data from Internet of Things devices). The dynamically constructed multi-shot prompt provides the most relevant context to the FM, and boosts the FM's capability in advanced math calculation, time series data processing, and data acronym understanding. These improved responses help enterprise workers and operational teams engage with data and derive insights without requiring extensive data science skills.
Beyond time series data analysis, FMs prove valuable in various industrial applications. Maintenance teams assess asset health, capture images for Amazon Rekognition-based functionality summaries, and perform anomaly root cause analysis using intelligent searches with Retrieval Augmented Generation (RAG). To simplify these workflows, AWS has introduced Amazon Bedrock, enabling you to build and scale generative AI applications with state-of-the-art pre-trained FMs like Claude v2. With Knowledge Bases for Amazon Bedrock, you can simplify the RAG development process to provide more accurate anomaly root cause analysis for plant workers. Our post showcases an intelligent assistant for industrial use cases powered by Amazon Bedrock, addressing NLQ challenges, generating part summaries from images, and enhancing FM responses for equipment diagnosis through the RAG approach.
Solution overview
The following diagram illustrates the solution architecture.
The workflow includes three distinct use cases:
Use case 1: NLQ with time series data
The workflow for NLQ with time series data consists of the following steps:
- We use a condition monitoring system with ML capabilities for anomaly detection, such as Amazon Monitron, to monitor industrial equipment health. Amazon Monitron is able to detect potential equipment failures from the equipment’s vibration and temperature measurements.
- We collect time series data by processing Amazon Monitron data through Amazon Kinesis Data Streams and Amazon Data Firehose, converting it into a tabular CSV format and saving it in an Amazon Simple Storage Service (Amazon S3) bucket.
- The end-user can start chatting with their time series data in Amazon S3 by sending a natural language query to the Streamlit app.
- The Streamlit app forwards user queries to the Amazon Bedrock Titan text embedding model to embed this query, and performs a similarity search within an Amazon OpenSearch Service index, which contains prior NLQs and example codes.
- After the similarity search, the top similar examples, including NLQ questions, data schema, and Python codes, are inserted in a custom prompt.
- PandasAI sends this custom prompt to the Amazon Bedrock Claude v2 model.
- The app uses the PandasAI agent to interact with the Amazon Bedrock Claude v2 model, generating Python code for Amazon Monitron data analysis and NLQ responses.
- After the Amazon Bedrock Claude v2 model returns the Python code, PandasAI runs the Python query on the Amazon Monitron data uploaded from the app, collecting code outputs and addressing any necessary retries for failed runs.
- The Streamlit app collects the response via PandasAI, and provides the output to users. If the output is satisfactory, the user can mark it as helpful, saving the NLQ and Claude-generated Python code in OpenSearch Service.
Use case 2: Summary generation of malfunctioning parts
Our summary generation use case consists of the following steps:
- After the user knows which industrial asset shows anomalous behavior, they can upload images of the malfunctioning part to identify if there is something physically wrong with this part according to its technical specification and operation condition.
- The user can use the Amazon Rekognition DetectText API to extract text data from these images, as sketched in the code after this list.
- The extracted text data is included in the prompt for the Amazon Bedrock Claude v2 model, enabling the model to generate a 200-word summary of the malfunctioning part. The user can use this information to perform further inspection of the part.
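As a rough illustration of the text extraction step, the following sketch calls the Amazon Rekognition DetectText API with boto3; the image file name is a placeholder, and the extracted lines would then be passed to the Claude v2 prompt described above.

```python
import boto3

rekognition = boto3.client("rekognition")

# Hypothetical image of the malfunctioning part captured by the user
with open("motor_nameplate.jpg", "rb") as f:
    image_bytes = f.read()

# DetectText returns LINE and WORD detections; keep the LINE-level text and join it
# into a single string to include as context in the summarization prompt.
response = rekognition.detect_text(Image={"Bytes": image_bytes})
lines = [d["DetectedText"] for d in response["TextDetections"] if d["Type"] == "LINE"]
extracted_text = "\n".join(lines)
print(extracted_text)
```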
Use case 3: Root cause diagnosis
Our root cause diagnosis use case consists of the following steps:
- The user obtains enterprise data in various document formats (PDF, TXT, and so on) related to malfunctioning assets, and uploads them to an S3 bucket.
- A knowledge base of these files is generated in Amazon Bedrock with a Titan text embeddings model and a default OpenSearch Service vector store.
- The user poses questions related to the root cause diagnosis for malfunctioning equipment. Answers are generated through the Amazon Bedrock knowledge base with a RAG approach.
Prerequisites
To follow along with this post, you should meet the following prerequisites:
- You need an AWS account with an AWS Identity and Access Management (IAM) role with admin permissions to manage resources created as part of the solution. For details, refer to Step 1: Create your AWS account.
- For this tutorial, you need a bash terminal with Python 3.9 or higher installed on Linux, Mac, or Windows Subsystem for Linux, and an AWS account. We also recommend using an Amazon Elastic Compute Cloud (Amazon EC2) instance (Ubuntu Server 22.04 LTS).
- Install or update the AWS Command Line Interface (AWS CLI) on either your PC or EC2 instance.
- Request access to the Amazon Bedrock model.
Deploy the solution infrastructure
To set up your solution resources, complete the following steps:
- Deploy the AWS CloudFormation template opensearchsagemaker.yml, which creates an OpenSearch Service collection and index, Amazon SageMaker notebook instance, and S3 bucket. You can name this AWS CloudFormation stack `genai-sagemaker`.
- Open the SageMaker notebook instance in JupyterLab. You will find the following GitHub repo already downloaded on this instance: unlocking-the-potential-of-generative-ai-in-industrial-operations.
- Run the notebook from the following directory in this repository: unlocking-the-potential-of-generative-ai-in-industrial-operations/SagemakerNotebook/nlq-vector-rag-embedding.ipynb. This notebook will load the OpenSearch Service index using the SageMaker notebook to store key-value pairs from the existing 23 NLQ examples.
- Upload documents from the data folder assetpartdoc in the GitHub repository to the S3 bucket listed in the CloudFormation stack outputs.
Next, you create the knowledge base for the documents in Amazon S3.
- On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
- Choose Create knowledge base.
- For Knowledge base name, enter a name.
- For Runtime role, select Create and use a new service role.
- For Data source name, enter the name of your data source.
- For S3 URI, enter the S3 path of the bucket where you uploaded the root cause documents.
- Choose Next.
The Titan embeddings model is automatically selected.
- Select Quick create a new vector store.
- Review your settings and create the knowledge base by choosing Create knowledge base.
- After the knowledge base is successfully created, choose Sync to sync the S3 bucket with the knowledge base.
- After you set up the knowledge base, you can test the RAG approach for root cause diagnosis by asking questions like “My actuator travels slow, what might be the issue?”
The next step is to deploy the app with the required library packages on either your PC or an EC2 instance (Ubuntu Server 22.04 LTS).
- Set up your AWS credentials with the AWS CLI on your local PC. For simplicity, you can use the same admin role you used to deploy the CloudFormation stack. If you’re using Amazon EC2, attach a suitable IAM role to the instance.
- Clone the GitHub repo.
- Change the directory to `unlocking-the-potential-of-generative-ai-in-industrial-operations/src` and run the `setup.sh` script in this folder to install the required packages, including LangChain and PandasAI:

```bash
cd unlocking-the-potential-of-generative-ai-in-industrial-operations/src
chmod +x ./setup.sh
./setup.sh
```
- Run the Streamlit app with the following command:
```bash
source monitron-genai/bin/activate
python3 -m streamlit run app_bedrock.py <REPLACE WITH YOUR BEDROCK KNOWLEDGEBASE ARN>
```
Provide the OpenSearch Service collection ARN you created in Amazon Bedrock from the previous step.
Chat with your asset health assistant
After you complete the end-to-end deployment, you can access the app via localhost on port 8501, which opens a browser window with the web interface. If you deployed the app on an EC2 instance, allow port 8501 access via the security group inbound rule. You can navigate to different tabs for various use cases.
Explore use case 1
To explore the first use case, choose Data Insight and Chart. Begin by uploading your time series data. If you don’t have an existing time series data file to use, you can upload the following sample CSV file with anonymous Amazon Monitron project data. If you already have an Amazon Monitron project, refer to Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis to stream your Amazon Monitron data to Amazon S3 and use your data with this application.
When the upload is complete, enter a query to initiate a conversation with your data. The left sidebar offers a range of example questions for your convenience. The following screenshots illustrate the response and Python code generated by the FM when inputting a question such as “Tell me the unique number of sensors for each site shown as Warning or Alarm respectively?” (a hard-level question) or “For sensors shown temperature signal as NOT Healthy, can you calculate the time duration in days for each sensor shown abnormal vibration signal?” (a challenge-level question). The app will answer your question, and will also show the Python script of data analysis it performed to generate such results.
If you’re satisfied with the answer, you can mark it as Helpful, saving the NLQ and Claude-generated Python code to an OpenSearch Service index.
Explore use case 2
To explore the second use case, choose the Captured Image Summary tab in the Streamlit app. You can upload an image of your industrial asset, and the application will generate a 200-word summary of its technical specification and operation condition based on the image information. The following screenshot shows the summary generated from an image of a belt motor drive. To test this feature, if you lack a suitable image, you can use the following example image.
“Hydraulic elevator motor label” by Clarence Risher is licensed under CC BY-SA 2.0.
Explore use case 3
To explore the third use case, choose the Root cause diagnosis tab. Input a query related to your broken industrial asset, such as, “My actuator travels slow, what might be the issue?” As depicted in the following screenshot, the application delivers a response with the source document excerpt used to generate the answer.
Use case 1: Design details
In this section, we discuss the design details of the application workflow for the first use case.
Custom prompt building
The user's natural language query comes with different difficulty levels: easy, hard, and challenge.
Straightforward questions may include the following requests:
- Select unique values
- Count total numbers
- Sort values
For these questions, PandasAI can directly interact with the FM to generate Python scripts for processing.
Hard questions require basic aggregation operation or time series analysis, such as the following:
- Select value first and group results hierarchically
- Perform statistics after initial record selection
- Timestamp count (for example, min and max)
For hard questions, a prompt template with detailed step-by-step instructions assists FMs in providing accurate responses.
Challenge-level questions need advanced math calculation and time series processing, such as the following:
- Calculate anomaly duration for each sensor
- Calculate the number of anomalous sensors for each site on a monthly basis
- Compare sensor readings under normal operation and abnormal conditions
For these questions, you can use multi-shots in a custom prompt to enhance response accuracy. Such multi-shots show examples of advanced time series processing and math calculation, and will provide context for the FM to perform relevant inference on similar analysis. Dynamically inserting the most relevant examples from an NLQ question bank into the prompt can be a challenge. One solution is to construct embeddings from existing NLQ question samples and save these embeddings in a vector store like OpenSearch Service. When a question is sent to the Streamlit app, the question will be vectorized by BedrockEmbeddings. The top N most-relevant embeddings to that question are retrieved using opensearch_vector_search.similarity_search and inserted into the prompt template as a multi-shot prompt.
The following diagram illustrates this workflow.
The embedding layer is constructed using three key tools:
- Embeddings model – We use Amazon Titan Embeddings available through Amazon Bedrock (amazon.titan-embed-text-v1) to generate numerical representations of textual documents.
- Vector store – For our vector store, we use OpenSearch Service via the LangChain framework, streamlining the storage of embeddings generated from NLQ examples in this notebook.
- Index – The OpenSearch Service index plays a pivotal role in comparing input embeddings to document embeddings and facilitating the retrieval of relevant documents. Because the Python example codes were saved as a JSON file, they were indexed in OpenSearch Service as vectors via an `OpenSearchVectorSearch.from_texts` API call (see the sketch after this list).
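The following sketch shows how these three pieces could fit together with LangChain; the import paths vary across LangChain versions, and the OpenSearch endpoint, index name, and example text are placeholders.

```python
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import OpenSearchVectorSearch

# Titan text embeddings served through Amazon Bedrock
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

# Index the stored NLQ examples (question, data schema, and Python code serialized
# into strings). The endpoint and index name below are placeholders.
example_texts = ["question: ... column_info: ... python_code: ..."]
vector_store = OpenSearchVectorSearch.from_texts(
    texts=example_texts,
    embedding=embeddings,
    opensearch_url="https://<your-collection-endpoint>",
    index_name="nlq-examples",
)

# At query time, retrieve the most relevant stored examples for the multi-shot prompt
similar_examples = vector_store.similarity_search(
    "Count the unique sensors in Alarm for each site", k=3
)
```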
Continuous collection of human-audited examples via Streamlit
At the outset of app development, we began with only 23 saved examples in the OpenSearch Service index as embeddings. As the app goes live in the field, users start inputting their NLQs via the app. However, due to the limited examples available in the template, some NLQs may not find similar prompts. To continuously enrich these embeddings and offer more relevant user prompts, you can use the Streamlit app for gathering human-audited examples.
Within the app, the following function serves this purpose. When end-users find the output helpful and select Helpful, the application follows these steps:
- Use the callback method from PandasAI to collect the Python script.
- Reformat the Python script, input question, and CSV metadata into a string.
- Check whether this NLQ example already exists in the current OpenSearch Service index using opensearch_vector_search.similarity_search_with_score.
- If there’s no similar example, this NLQ is added to the OpenSearch Service index using opensearch_vector_search.add_texts.
In the event that a user selects Not Helpful, no action is taken. This iterative process makes sure that the system continually improves by incorporating user-contributed examples.
```python
def addtext_opensearch(input_question, generated_chat_code, df_column_metadata, opensearch_vector_search, similarity_threshold, kexamples, indexname):
    ####### Build the input question and generated code in the same format as the existing OpenSearch index ##########
    reconstructed_json = {}
    reconstructed_json["question"] = input_question
    reconstructed_json["python_code"] = str(generated_chat_code)
    reconstructed_json["column_info"] = df_column_metadata
    json_str = ''
    for key, value in reconstructed_json.items():
        json_str += key + ':' + value
    reconstructed_raw_text = []
    reconstructed_raw_text.append(json_str)
    # Retrieve the k most similar stored examples together with their similarity scores
    results = opensearch_vector_search.similarity_search_with_score(str(reconstructed_raw_text[0]), k=kexamples)
    if results[0][1] < similarity_threshold:  # No sufficiently similar embedding exists, so add this example
        response = opensearch_vector_search.add_texts(texts=reconstructed_raw_text, engine="faiss", index_name=indexname)
    else:
        response = "A similar embedding already exists; no action taken."
    return response
```
By incorporating human auditing, the quantity of examples in OpenSearch Service available for prompt embedding grows as the app gains usage. This expanded embedding dataset results in enhanced search accuracy over time. Specifically, for challenging NLQs, the FM’s response accuracy reaches approximately 90% when dynamically inserting similar examples to construct custom prompts for each NLQ question. This represents a notable 28% increase compared to scenarios without multi-shot prompts.
Use case 2: Design details
On the Streamlit app’s Captured Image Summary tab, you can directly upload an image file. This initiates the Amazon Rekognition API (detect_text API), extracting text from the image label detailing machine specifications. Subsequently, the extracted text data is sent to the Amazon Bedrock Claude model as the context of a prompt, resulting in a 200-word summary.
From a user experience perspective, enabling streaming functionality for a text summarization task is paramount, allowing users to read the FM-generated summary in smaller chunks rather than waiting for the entire output. Amazon Bedrock facilitates streaming via its API (bedrock_runtime.invoke_model_with_response_stream).
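A minimal sketch of that streaming call with boto3 follows, assuming the Claude v2 text-completion request format; the prompt content, token limit, and extracted text are illustrative.

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

extracted_text = "..."  # text returned by the Amazon Rekognition detect_text call

body = json.dumps({
    "prompt": f"\n\nHuman: Summarize this part in about 200 words based on its label:\n{extracted_text}\n\nAssistant:",
    "max_tokens_to_sample": 500,
})

response = bedrock_runtime.invoke_model_with_response_stream(
    modelId="anthropic.claude-v2",
    body=body,
)

# Print the summary chunk by chunk instead of waiting for the full completion
for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    print(chunk.get("completion", ""), end="", flush=True)
```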
Use case 3: Design details
In this scenario, we’ve developed a chatbot application focused on root cause analysis, employing the RAG approach. This chatbot draws from multiple documents related to bearing equipment to facilitate root cause analysis. This RAG-based root cause analysis chatbot uses knowledge bases for generating vector text representations, or embeddings. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources or manage data flows and RAG implementation details.
When you’re satisfied with the knowledge base response from Amazon Bedrock, you can integrate the root cause response from the knowledge base to the Streamlit app.
Clean up
To save costs, delete the resources you created in this post:
- Delete the knowledge base from Amazon Bedrock.
- Delete the OpenSearch Service index.
- Delete the genai-sagemaker CloudFormation stack.
- Stop the EC2 instance if you used an EC2 instance to run the Streamlit app.
Conclusion
Generative AI applications have already transformed various business processes, enhancing worker productivity and skill sets. However, the limitations of FMs in handling time series data analysis have hindered their full utilization by industrial clients. This constraint has impeded the application of generative AI to the predominant data type processed daily.
In this post, we introduced a generative AI application designed to alleviate this challenge for industrial users. This application uses an open source agent, PandasAI, to strengthen an FM's time series analysis capability. Rather than sending time series data directly to FMs, the app employs PandasAI to generate Python code for the analysis of unstructured time series data. To enhance the accuracy of Python code generation, a custom prompt generation workflow with human auditing has been implemented.
Empowered with insights into their asset health, industrial workers can fully harness the potential of generative AI across various use cases, including root cause diagnosis and part replacement planning. With Knowledge Bases for Amazon Bedrock, the RAG solution is straightforward for developers to build and manage.
The trajectory of enterprise data management and operations is unmistakably moving towards deeper integration with generative AI for comprehensive insights into operational health. This shift, spearheaded by Amazon Bedrock, is significantly amplified by the growing robustness and potential of LLMs like Claude 3 on Amazon Bedrock to further elevate solutions. To learn more, consult the Amazon Bedrock documentation, and get hands-on with the Amazon Bedrock workshop.
About the authors
Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She specializes in generative AI, applied data science, and IoT architecture. Currently she is part of the Amazon Q team and an active member and mentor in the Machine Learning Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. She is particularly passionate about leveraging large language models for advanced data analytics and exploring practical applications that address real-world challenges.
Sudeesh Sasidharan is a Senior Solutions Architect at AWS, within the Energy team. Sudeesh loves experimenting with new technologies and building innovative solutions that solve complex business challenges. When he is not designing solutions or tinkering with the latest technologies, you can find him on the tennis court working on his backhand.
Neil Desai is a technology executive with over 20 years of experience in artificial intelligence (AI), data science, software engineering, and enterprise architecture. At AWS, he leads a team of Worldwide AI services specialist solutions architects who help customers build innovative Generative AI-powered solutions, share best practices with customers, and drive product roadmap. In his previous roles at Vestas, Honeywell, and Quest Diagnostics, Neil has held leadership roles in developing and launching innovative products and services that have helped companies improve their operations, reduce costs, and increase revenue. He is passionate about using technology to solve real-world problems and is a strategic thinker with a proven track record of success.
Enhance performance of generative language models with self-consistency prompting on Amazon Bedrock
Generative language models have proven remarkably skillful at solving logical and analytical natural language processing (NLP) tasks. Furthermore, the use of prompt engineering can notably enhance their performance. For example, chain-of-thought (CoT) is known to improve a model’s capacity for complex multi-step problems. To additionally boost accuracy on tasks that involve reasoning, a self-consistency prompting approach has been suggested, which replaces greedy with stochastic decoding during language generation.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies and Amazon via a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With the batch inference API, you can use Amazon Bedrock to run inference with foundation models in batches and get responses more efficiently. This post shows how to implement self-consistency prompting via batch inference on Amazon Bedrock to enhance model performance on arithmetic and multiple-choice reasoning tasks.
Overview of solution
Self-consistency prompting of language models relies on the generation of multiple responses that are aggregated into a final answer. In contrast to single-generation approaches like CoT, the self-consistency sample-and-marginalize procedure creates a range of model completions that lead to a more consistent solution. The generation of different responses for a given prompt is possible due to the use of a stochastic, rather than greedy, decoding strategy.
The following figure shows how self-consistency differs from greedy CoT in that it generates a diverse set of reasoning paths and aggregates them to produce the final answer.
Decoding strategies for text generation
Text generated by decoder-only language models unfolds word by word, with the subsequent token being predicted on the basis of the preceding context. For a given prompt, the model computes a probability distribution indicating the likelihood of each token to appear next in the sequence. Decoding involves translating these probability distributions into actual text. Text generation is mediated by a set of inference parameters that are often hyperparameters of the decoding method itself. One example is the temperature, which modulates the probability distribution of the next token and influences the randomness of the model’s output.
Greedy decoding is a deterministic decoding strategy that at each step selects the token with the highest probability. Although straightforward and efficient, the approach risks falling into repetitive patterns, because it disregards the broader probability space. Setting the temperature parameter to 0 at inference time essentially equates to implementing greedy decoding.
Sampling introduces stochasticity into the decoding process by randomly selecting each subsequent token based on the predicted probability distribution. This randomness results in greater output variability. Stochastic decoding proves more adept at capturing the diversity of potential outputs and often yields more imaginative responses. Higher temperature values introduce more fluctuations and increase the creativity of the model’s response.
Prompting techniques: CoT and self-consistency
The reasoning ability of language models can be augmented via prompt engineering. In particular, CoT has been shown to elicit reasoning in complex NLP tasks. One way to implement a zero-shot CoT is via prompt augmentation with the instruction to “think step by step.” Another is to expose the model to exemplars of intermediate reasoning steps in few-shot prompting fashion. Both scenarios typically use greedy decoding. CoT leads to significant performance gains compared to simple instruction prompting on arithmetic, commonsense, and symbolic reasoning tasks.
Self-consistency prompting is based on the assumption that introducing diversity in the reasoning process can be beneficial to help models converge on the correct answer. The technique uses stochastic decoding to achieve this goal in three steps:
- Prompt the language model with CoT exemplars to elicit reasoning.
- Replace greedy decoding with a sampling strategy to generate a diverse set of reasoning paths.
- Aggregate the results to find the most consistent answer in the response set.
Self-consistency is shown to outperform CoT prompting on popular arithmetic and commonsense reasoning benchmarks. A limitation of the approach is its larger computational cost.
This post shows how self-consistency prompting enhances performance of generative language models on two NLP reasoning tasks: arithmetic problem-solving and multiple-choice domain-specific question answering. We demonstrate the approach using batch inference on Amazon Bedrock:
- We access the Amazon Bedrock Python SDK in JupyterLab on an Amazon SageMaker notebook instance.
- For arithmetic reasoning, we prompt Cohere Command on the GSM8K dataset of grade school math problems.
- For multiple-choice reasoning, we prompt AI21 Labs Jurassic-2 Mid on a small sample of questions from the AWS Certified Solutions Architect – Associate exam.
Prerequisites
This walkthrough assumes the following prerequisites:
- An AWS account with a ml.t3.medium notebook instance hosted in SageMaker.
- An AWS Identity and Access Management (IAM) SageMaker execution role with the AmazonBedrockFullAccess and `iam:PassRole` policies attached to run Jupyter inside the SageMaker notebook instance.
- An IAM `BedrockBatchInferenceRole` role for batch inference with Amazon Bedrock, with Amazon Simple Storage Service (Amazon S3) access and `sts:AssumeRole` trust policies. For more information, refer to Set up permissions for batch inference.
- Access to models hosted on Amazon Bedrock. Choose Manage model access on the Amazon Bedrock console and choose among the list of available options. We use Cohere Command and AI21 Labs Jurassic-2 Mid for this demo.
The estimated cost to run the code shown in this post is $100, assuming you run self-consistency prompting one time with 30 reasoning paths using one value for the temperature-based sampling.
Dataset to probe arithmetic reasoning capabilities
GSM8K is a dataset of human-assembled grade school math problems featuring a high linguistic diversity. Each problem takes 2–8 steps to solve and requires performing a sequence of elementary calculations with basic arithmetic operations. This data is commonly used to benchmark the multi-step arithmetic reasoning capabilities of generative language models. The GSM8K train set comprises 7,473 records. The following is an example:
{"question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.n#### 72"}
Set up to run batch inference with Amazon Bedrock
Batch inference allows you to run multiple inference calls to Amazon Bedrock asynchronously and improve the performance of model inference on large datasets. The service is in preview as of this writing and only available through the API. Refer to Run batch inference to access batch inference APIs via custom SDKs.
After you have downloaded and unzipped the Python SDK in a SageMaker notebook instance, you can install it by running the following code in a Jupyter notebook cell:
Format and upload input data to Amazon S3
Input data for batch inference needs to be prepared in JSONL format with `recordId` and `modelInput` keys. The latter should match the body field of the model to be invoked on Amazon Bedrock. In particular, some supported inference parameters for Cohere Command are `temperature` for randomness, `max_tokens` for output length, and `num_generations` to generate multiple responses, all of which are passed together with the `prompt` as `modelInput`:
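The record-building code isn't reproduced here; the following hedged sketch illustrates what one such JSONL record could look like for Cohere Command, with a placeholder record ID, exemplar string, and file name.

```python
import json

few_shot_exemplars = "..."  # the eight CoT exemplars described later in this section
question = "Natalia sold clips to 48 of her friends in April, ..."  # a GSM8K question

# One batch inference record: recordId identifies the prompt, and modelInput mirrors
# the Cohere Command request body.
record = {
    "recordId": "gsm8k-0001",  # illustrative identifier
    "modelInput": {
        "prompt": few_shot_exemplars + "\n\nQuestion: " + question + "\nAnswer:",
        "temperature": 0.7,
        "max_tokens": 512,
        "num_generations": 5,
    },
}

with open("gsm8k_batch_input.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```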
See Inference parameters for foundation models for more details, including other model providers.
Our experiments on arithmetic reasoning are performed in the few-shot setting without customizing or fine-tuning Cohere Command. We use the same set of eight few-shot exemplars from the chain-of-thought (Table 20) and self-consistency (Table 17) papers. Prompts are created by concatenating the exemplars with each question from the GSM8K train set.
We set `max_tokens` to 512 and `num_generations` to 5, the maximum allowed by Cohere Command. For greedy decoding, we set `temperature` to 0; for self-consistency, we run three experiments at temperatures 0.5, 0.7, and 1. Each setting yields different input data according to the respective temperature values. Data is formatted as JSONL and stored in Amazon S3.
Create and run batch inference jobs in Amazon Bedrock
Batch inference job creation requires an Amazon Bedrock client. We specify the S3 input and output paths and give each invocation job a unique name:
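The original snippet isn't reproduced here; a minimal sketch of this setup, with placeholder bucket names, might look like the following.

```python
import boto3
from datetime import datetime

bedrock = boto3.client("bedrock")  # control-plane client used for batch inference jobs

# Placeholder S3 locations and a unique job name; substitute your own bucket and prefixes
input_s3_uri = "s3://<your-bucket>/batch-input/gsm8k_batch_input.jsonl"
output_s3_uri = "s3://<your-bucket>/batch-output/"
job_name = f"gsm8k-self-consistency-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
```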
Jobs are created by passing the IAM role, model ID, job name, and input/output configuration as parameters to the Amazon Bedrock API:
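Continuing the previous sketch, a hedged version of the job creation call follows; the role ARN is a placeholder, and the model ID should match the Cohere Command entry available in your Region.

```python
response = bedrock.create_model_invocation_job(
    roleArn="arn:aws:iam::<account-id>:role/BedrockBatchInferenceRole",  # placeholder ARN
    modelId="cohere.command-text-v14",
    jobName=job_name,
    inputDataConfig={"s3InputDataConfig": {"s3Uri": input_s3_uri}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
)
job_arn = response["jobArn"]
```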
Listing, monitoring, and stopping batch inference jobs is supported by their respective API calls. On creation, jobs appear first as `Submitted`, then as `InProgress`, and finally as `Stopped`, `Failed`, or `Completed`.
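For example, continuing the sketches above, a simple way to wait for a terminal status is to poll the job; this is a sketch, and you may prefer a longer polling interval or an event-driven approach.

```python
import time
import boto3

bedrock = boto3.client("bedrock")
# job_arn comes from the create_model_invocation_job response shown earlier

while True:
    status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
print(f"Batch inference job finished with status: {status}")
```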
If the jobs complete successfully, the generated content can be retrieved from Amazon S3 using their unique output locations.
[Out]: 'Natalia sold 48 * 1/2 = 24 clips less in May. This means she sold 48 + 24 = 72 clips in April and May. The answer is 72.'
Self-consistency enhances model accuracy on arithmetic tasks
Self-consistency prompting of Cohere Command outperforms a greedy CoT baseline in terms of accuracy on the GSM8K dataset. For self-consistency, we sample 30 independent reasoning paths at three different temperatures, with `topP` and `topK` set to their default values. Final solutions are aggregated by choosing the most consistent occurrence via majority voting. In case of a tie, we randomly choose one of the majority responses. We compute accuracy and standard deviation values averaged over 100 runs.
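The aggregation step itself is straightforward; the following sketch shows one way to implement the majority vote with random tie-breaking (parsing the final answer out of each completion is omitted).

```python
import random
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer across sampled reasoning paths,
    breaking ties uniformly at random."""
    counts = Counter(answers)
    top_count = max(counts.values())
    tied = [answer for answer, count in counts.items() if count == top_count]
    return random.choice(tied)

# Example: final numeric answers parsed from five sampled reasoning paths
print(majority_vote(["72", "72", "68", "72", "70"]))  # -> "72"
```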
The following figure shows the accuracy on the GSM8K dataset from Cohere Command prompted with greedy CoT (blue) and self-consistency at temperature values 0.5 (yellow), 0.7 (green), and 1.0 (orange) as a function of the number of sampled reasoning paths.
The preceding figure shows that self-consistency enhances arithmetic accuracy over greedy CoT when the number of sampled paths is as low as three. Performance increases consistently with additional reasoning paths, confirming the importance of introducing diversity in the thought generation. Cohere Command solves the GSM8K question set with 51.7% accuracy when prompted with CoT vs. 68% with 30 self-consistent reasoning paths at T=1.0. All three surveyed temperature values yield similar results, with lower temperatures being comparatively more performant with fewer sampled paths.
Practical considerations on efficiency and cost
Self-consistency is limited by the increased response time and cost incurred when generating multiple outputs per prompt. As a practical illustration, batch inference for greedy generation with Cohere Command on 7,473 GSM8K records finished in less than 20 minutes. The job took 5.5 million tokens as input and generated 630,000 output tokens. At current Amazon Bedrock inference prices, the total cost incurred was around $9.50.
For self-consistency with Cohere Command, we use the inference parameter `num_generations` to create multiple completions per prompt. As of this writing, Amazon Bedrock allows a maximum of five generations and three concurrent `Submitted` batch inference jobs. Jobs proceed to the `InProgress` status sequentially, so sampling more than five paths requires multiple invocations.
The following figure shows the runtimes for Cohere Command on the GSM8K dataset. Total runtime is shown on the x axis and runtime per sampled reasoning path on the y axis. Greedy generation runs in the shortest time but incurs a higher time cost per sampled path.
Greedy generation completes in less than 20 minutes for the full GSM8K set and samples a unique reasoning path. Self-consistency with five samples requires about 50% longer to complete and costs around $14.50, but produces five paths instead of one in that time. Total runtime and cost increase step-wise with every extra five sampled paths. A cost-benefit analysis suggests that 1–2 batch inference jobs with 5–10 sampled paths is the recommended setting for practical implementation of self-consistency. This achieves enhanced model performance while keeping cost and latency at bay.
Self-consistency enhances model performance beyond arithmetic reasoning
A crucial question to prove the suitability of self-consistency prompting is whether the method succeeds across further NLP tasks and language models. As an extension to an Amazon-related use case, we perform a small-sized analysis on sample questions from the AWS Solutions Architect Associate Certification. This is a multiple-choice exam on AWS technology and services that requires domain knowledge and the ability to reason and decide among several options.
We prepare a dataset from SAA-C01 and SAA-C03 sample exam questions. From the 20 available questions, we use the first 4 as few-shot exemplars and prompt the model to answer the remaining 16. This time, we run inference with the AI21 Labs Jurassic-2 Mid model and generate a maximum of 10 reasoning paths at temperature 0.7. Results show that self-consistency enhances performance: although greedy CoT produces 11 correct answers, self-consistency succeeds on 2 more.
The following table shows the accuracy results for 5 and 10 sampled paths averaged over 100 runs.
| Accuracy (%) | Greedy decoding | T = 0.7 |
| --- | --- | --- |
| # sampled paths: 5 | 68.6 | 74.1 ± 0.7 |
| # sampled paths: 10 | 68.6 | 78.9 ± 0.3 |
In the following table, we present two exam questions that are incorrectly answered by greedy CoT while self-consistency succeeds, highlighting in each case the correct (green) or incorrect (red) reasoning traces that led the model to produce correct or incorrect responses. Although not every sampled path generated by self-consistency is correct, the majority converges on the true answer as the number of sampled paths increases. We observe that 5–10 paths are typically enough to improve over the greedy results, with diminishing returns in terms of efficiency past those values.
| Question | A web application allows customers to upload orders to an S3 bucket. The resulting Amazon S3 events trigger a Lambda function that inserts a message to an SQS queue. A single EC2 instance reads messages from the queue, processes them, and stores them in a DynamoDB table partitioned by unique order ID. Next month traffic is expected to increase by a factor of 10 and a Solutions Architect is reviewing the architecture for possible scaling problems. Which component is MOST likely to need re-architecting to be able to scale to accommodate the new traffic? A. Lambda function | An application running on AWS uses an Amazon Aurora Multi-AZ DB cluster deployment for its database. When evaluating performance metrics, a solutions architect discovered that the database reads are causing high I/O and adding latency to the write requests against the database. What should the solutions architect do to separate the read requests from the write requests? A. Enable read-through caching on the Aurora database. |
| --- | --- | --- |
| Correct answer | C – A single EC2 instance will not scale and is a single point of failure in the architecture. A much better solution would be to have EC2 instances in an Auto Scaling group across 2 availability zones read messages from the queue. The other responses are all managed services that can be configured to scale or will scale automatically. | C – Aurora Replicas provide a way to offload read traffic. Aurora Replicas share the same underlying storage as the main database, so lag time is generally very low. Aurora Replicas have their own endpoints, so the application will need to be configured to direct read traffic to the new endpoints. |
| Greedy decoding | Scaling the EC2 instance directly may not be possible, as it may not be possible to resize the instance. Scaling the Lambda function may not be sufficient, as Lambda functions have a 6 GB of memory allocation limit. Scaling the SQS queue may not be sufficient, as SQS has a 10,000 message limit. DynamoDB tables can be scaled by provisioning more capacity units. The correct answer is D. | Modifying the application to read from the Multi-AZ standby instance (option B) is the best approach. The standby instance is asynchronously replicated from the primary instance, so read operations will not interfere with write operations. Enabling read-through caching (option A) or creating an Aurora replica (option C) will not solve the I/O contention issue. Creating a second Aurora database and linking it as a read replica (option D) will not solve the I/O contention issue either. The correct answer is B. |
| Sampled path #1 | Scaling the EC2 instance to 10 times its current size may not be feasible, either from a cost or performance perspective. It would require a larger instance, which can be expensive. A more cost-effective approach might be to use a swarm of smaller instances (e.g., Auto Scaling Group) to share the load. In this case, the EC2 instance would likely need to be re-architected. The correct answer is C. | Modifying the application to use the appropriate endpoints is not a practical solution as it would require changes to the application code. The correct answer is C: Create an Aurora replica and modify the application to use the appropriate endpoints. |
| Sampled path #2 | Scaling a single EC2 instance is more difficult and problematic than scaling the other service components. Amazon SQS, Lambda, and DynamoDB can all scale automatically to cope with the increased load. The correct answer is C. | (C) Create an Aurora replica and modify the application to use the appropriate endpoints. By configuring an Aurora Replica, you can separate read traffic from write traffic. The Aurora Replicas use different endpoint URLs, allowing you to direct read traffic to the replica instead of the primary database. The replica can process read requests in parallel with write requests to the primary database, reducing I/O and latency. |
Clean up
Running batch inference in Amazon Bedrock is subject to charges according to the Amazon Bedrock Pricing. When you complete the walkthrough, delete your SageMaker notebook instance and remove all data from your S3 buckets to avoid incurring future charges.
Considerations
Although the demonstrated solution shows improved performance of language models when prompted with self-consistency, it’s important to note that the walkthrough is not production-ready. Before you deploy to production, you should adapt this proof of concept to your own implementation, keeping in mind the following requirements:
- Access restriction to APIs and databases to prevent unauthorized usage.
- Adherence to AWS security best practices regarding IAM role access and security groups.
- Validation and sanitization of user input to prevent prompt injection attacks.
- Monitoring and logging of triggered processes to enable testing and auditing.
Conclusion
This post shows that self-consistency prompting enhances performance of generative language models in complex NLP tasks that require arithmetic and multiple-choice logical skills. Self-consistency uses temperature-based stochastic decoding to generate various reasoning paths. This increases the ability of the model to elicit diverse and useful thoughts to arrive at correct answers.
With Amazon Bedrock batch inference, the language model Cohere Command is prompted to generate self-consistent answers to a set of arithmetic problems. Accuracy improves from 51.7% with greedy decoding to 68% with self-consistency sampling 30 reasoning paths at T=1.0. Sampling five paths already enhances accuracy by 7.5 percentage points. The approach is transferable to other language models and reasoning tasks, as demonstrated by results of the AI21 Labs Jurassic-2 Mid model on an AWS Certification exam. In a small-sized question set, self-consistency with five sampled paths increases accuracy by 5 percentage points over greedy CoT.
We encourage you to implement self-consistency prompting for enhanced performance in your own applications with generative language models. Learn more about Cohere Command and AI21 Labs Jurassic models available on Amazon Bedrock. For more information about batch inference, refer to Run batch inference.
Acknowledgements
The author thanks technical reviewers Amin Tajgardoon and Patrick McSweeney for helpful feedback.
About the Author
Lucía Santamaría is a Sr. Applied Scientist at Amazon’s ML University, where she’s focused on raising the level of ML competency across the company through hands-on education. Lucía has a PhD in astrophysics and is passionate about democratizing access to tech knowledge and tools.
Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices
NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances hosted by SageMaker.
NIM, part of the NVIDIA AI Enterprise software platform listed on AWS marketplace, is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications, providing natural language processing (NLP) and understanding capabilities, whether you’re developing chatbots, summarizing documents, or implementing other NLP-powered applications. You can use pre-built NVIDIA containers to host popular LLMs that are optimized for specific NVIDIA GPUs for quick deployment or use NIM tools to create your own containers.
In this post, we provide a high-level introduction to NIM and show how you can use it with SageMaker.
An introduction to NVIDIA NIM
NIM provides optimized and pre-generated engines for a variety of popular models for inference. These microservices support a variety of LLMs, such as Llama 2 (7B, 13B, and 70B), Mistral-7B-Instruct, Mixtral-8x7B, NVIDIA Nemotron-3 22B Persona, and Code Llama 70B, out of the box using pre-built NVIDIA TensorRT engines tailored for specific NVIDIA GPUs for maximum performance and utilization. These models are curated with the optimal hyperparameters for model-hosting performance for deploying applications with ease.
If your model is not in NVIDIA’s set of curated models, NIM offers essential utilities such as the Model Repo Generator, which facilitates the creation of a TensorRT-LLM-accelerated engine and a NIM-format model directory through a straightforward YAML file. Furthermore, an integrated community backend of vLLM provides support for cutting-edge models and emerging features that may not have been seamlessly integrated into the TensorRT-LLM-optimized stack.
In addition to creating optimized LLMs for inference, NIM provides advanced hosting technologies such as optimized scheduling techniques like in-flight batching, which can break down the overall text generation process for an LLM into multiple iterations on the model. With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the NIM runtime immediately evicts finished sequences from the batch. The runtime then begins running new requests while other requests are still in flight, making the best use of your compute instances and GPUs.
Deploying NIM on SageMaker
NIM integrates with SageMaker, allowing you to host your LLMs with performance and cost optimization while benefiting from the capabilities of SageMaker. When you use NIM on SageMaker, you can use capabilities such as scaling out the number of instances to host your model, performing blue/green deployments, and evaluating workloads using shadow testing—all with best-in-class observability and monitoring with Amazon CloudWatch.
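At a high level, a NIM microservice is hosted on SageMaker like any other container-based model. The following sketch uses the SageMaker Python SDK; the container image URI, environment variable, instance type, and endpoint name are placeholders rather than a definitive NIM configuration, and the in-depth deployment guide mentioned later covers the real details.

```python
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()

# Placeholder values: the NIM container image comes from your NVIDIA AI Enterprise
# subscription on AWS Marketplace, and the required environment variables depend on
# the specific microservice you deploy.
nim_model = Model(
    image_uri="<nim-container-image-uri>",
    role=role,
    env={"NIM_MODEL_NAME": "<model-name>"},  # hypothetical environment variable
)

predictor = nim_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # an NVIDIA GPU instance; size it for your model
    endpoint_name="nim-llm-endpoint",
)
```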
Conclusion
Using NIM to deploy optimized LLMs can be a great option for both performance and cost. It also helps make deploying LLMs effortless. In the future, NIM will also allow for Parameter-Efficient Fine-Tuning (PEFT) customization methods like LoRA and P-tuning. NIM also plans to expand its LLM support through the Triton Inference Server, TensorRT-LLM, and vLLM backends.
We encourage you to learn more about NVIDIA microservices and how to deploy your LLMs using SageMaker and try out the benefits available to you. NIM is available as a paid offering as part of the NVIDIA AI Enterprise software subscription available on AWS Marketplace.
In the near future, we will post an in-depth guide for NIM on SageMaker.
About the authors
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and a high performance logging system. Qing's team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.
Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed Deep Learning Systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.
Harish Tummalacherla is Software Engineer with Deep Learning Performance team at SageMaker. He works on performance engineering for serving large language models efficiently on SageMaker. In his spare time, he enjoys running, cycling and ski mountaineering.
Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.
Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.
Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.
Adapting language model architectures for time series forecasting
Tokenizing time series data and treating it like a language enables a model whose zero-shot performance matches or exceeds that of purpose-built models.
Fine-tune Code Llama on Amazon SageMaker JumpStart
Today, we are excited to announce the capability to fine-tune Code Llama models by Meta using Amazon SageMaker JumpStart. The Code Llama family of large language models (LLMs) is a collection of pre-trained and fine-tuned code generation models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned Code Llama models provide better accuracy and explainability over the base Code Llama models, as evident in testing against the HumanEval and MBPP datasets. You can fine-tune and deploy Code Llama models with SageMaker JumpStart using the Amazon SageMaker Studio UI with a few clicks or using the SageMaker Python SDK. Fine-tuning of Llama models is based on the scripts provided in the llama-recipes GitHub repo from Meta using PyTorch FSDP, PEFT/LoRA, and Int8 quantization techniques.
In this post, we walk through how to fine-tune Code Llama pre-trained models via SageMaker JumpStart through a one-click UI and SDK experience available in the following GitHub repository.
What is SageMaker JumpStart
With SageMaker JumpStart, machine learning (ML) practitioners can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.
What is Code Llama
Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets and sampling more data from that same dataset for longer. Code Llama features enhanced coding capabilities. It can generate code and natural language about code, from both code and natural language prompts (for example, “Write me a function that outputs the Fibonacci sequence”). You can also use it for code completion and debugging. It supports many of the most popular programming languages used today, including Python, C++, Java, PHP, Typescript (JavaScript), C#, Bash, and more.
Why fine-tune Code Llama models
Meta published Code Llama performance benchmarks on HumanEval and MBPP for common coding languages such as Python, Java, and JavaScript. The performance of Code Llama Python models on HumanEval demonstrated varying performance across different coding languages and tasks, ranging from 38% on the 7B Python model to 57% on the 70B Python model. In addition, fine-tuned Code Llama models on the SQL programming language have shown better results, as evident in SQL evaluation benchmarks. These published benchmarks highlight the potential benefits of fine-tuning Code Llama models, enabling better performance, customization, and adaptation to specific coding domains and tasks.
No-code fine-tuning via the SageMaker Studio UI
To start fine-tuning your Llama models using SageMaker Studio, complete the following steps:
- On the SageMaker Studio console, choose JumpStart in the navigation pane.
You will find listings of over 350 models, including both open source and proprietary models.
- Search for Code Llama models.
If you don’t see Code Llama models, you can update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps. You can also find other model variants by choosing Explore all Code Generation Models or searching for Code Llama in the search box.
SageMaker JumpStart currently supports instruction fine-tuning for Code Llama models. The following screenshot shows the fine-tuning page for the Code Llama 2 70B model.
- For Training dataset location, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning.
- Set your deployment configuration, hyperparameters, and security settings for fine-tuning.
- Choose Train to start the fine-tuning job on a SageMaker ML instance.
We discuss the dataset format you need to prepare for instruction fine-tuning in the next section.
- After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart.
The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.
Fine-tune via the SageMaker Python SDK
In this section, we demonstrate how to fine-tune Code Llama models using the SageMaker Python SDK on an instruction-formatted dataset. Specifically, the model is fine-tuned for a set of natural language processing (NLP) tasks described using instructions. This helps improve the model's performance for unseen tasks with zero-shot prompts.
Complete the following steps to run your fine-tuning job. You can get the entire fine-tuning code from the GitHub repository.
First, let’s look at the dataset format required for the instruction fine-tuning. The training data should be formatted in a JSON lines (.jsonl) format, where each line is a dictionary representing a data sample. All training data must be in a single folder. However, it can be saved in multiple .jsonl files. The following is a sample in JSON lines format:
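For instance, a record with the `system_prompt`, `question`, and `response` fields described later in this section might look like the following (the field values here are purely illustrative):

```json
{"system_prompt": "You are a helpful coding assistant.", "question": "Write a Python function that returns the nth Fibonacci number.", "response": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"}
```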
The training folder can contain a `template.json` file describing the input and output formats. The following is an example template:
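A template along these lines maps the dataset fields into a prompt and a completion; the exact prompt wording below is an assumption, so adapt it to your data:

```json
{
    "prompt": "{system_prompt}\n\n### Instruction:\n{question}\n\n### Response:\n",
    "completion": "{response}"
}
```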
To match the template, each sample in the JSON lines files must include `system_prompt`, `question`, and `response` fields. In this demonstration, we use the Dolphin Coder dataset from Hugging Face.
After you prepare the dataset and upload it to the S3 bucket, you can start fine-tuning using the following code:
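The following is a minimal sketch using the `JumpStartEstimator` from the SageMaker Python SDK; the model ID, instance type, and hyperparameter values are assumptions to adapt for your run (see the notebook in the GitHub repository for the complete code):

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Model ID is an assumption; pick the Code Llama variant you want to fine-tune
model_id = "meta-textgeneration-llama-codellama-7b"

# S3 prefix that holds the .jsonl files and template.json
train_data_location = "s3://<your-bucket>/dolphin-coder/train/"

estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},  # accept the model EULA before training
    instance_type="ml.g5.12xlarge",       # see the supported instance types later in this post
    hyperparameters={
        "instruction_tuned": "True",
        "epoch": "5",
        "max_input_length": "1024",
    },
)

estimator.fit({"training": train_data_location})
```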
You can deploy the fine-tuned model directly from the estimator, as shown in the following code. For details, see the notebook in the GitHub repository.
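A minimal sketch follows; the payload format depends on the serving container, so treat the request shape and parameters as assumptions to verify against the notebook:

```python
# Deploy the fine-tuned model to a SageMaker real-time endpoint
finetuned_predictor = estimator.deploy()

# Invoke the endpoint (adjust the payload based on the example notebook)
payload = {
    "inputs": "Write me a function that outputs the Fibonacci sequence.",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
}
print(finetuned_predictor.predict(payload))
```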
Fine-tuning techniques
Language models such as Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly high CUDA memory. Furthermore, training these models can be very slow due to the size of the model. Therefore, for efficient fine-tuning, we use the following optimizations:
- Low-Rank Adaptation (LoRA) – This is a type of parameter-efficient fine-tuning (PEFT) for efficient fine-tuning of large models. With this method, you freeze the whole model and only add a small set of adjustable parameters or layers into the model. For instance, instead of training all 7 billion parameters for Llama 2 7B, you can fine-tune less than 1% of the parameters. This significantly reduces the memory requirement because you only need to store gradients, optimizer states, and other training-related information for 1% of the parameters. Furthermore, this reduces both training time and cost. For more details on this method, refer to LoRA: Low-Rank Adaptation of Large Language Models.
- Int8 quantization – Even with optimizations such as LoRA, models such as Llama 70B are still too big to train. To decrease the memory footprint during training, you can use Int8 quantization. Quantization typically reduces the precision of floating point data types. Although this decreases the memory required to store model weights, it can degrade performance due to loss of information. However, Int8 quantization uses only a quarter of the precision without incurring performance degradation, because it doesn’t simply drop the bits; it rounds the data from one type to another. To learn about Int8 quantization, refer to LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
- Fully Sharded Data Parallel (FSDP) – This is a type of data-parallel training algorithm that shards the model’s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, computation of each microbatch is local to the GPU worker. It shards parameters more uniformly and achieves optimized performance via communication and computation overlapping during training.
The following table summarizes the details of each model with different settings.
| Model | Default Setting | LoRA + FSDP | LoRA + No FSDP | Int8 Quantization + LoRA + No FSDP |
|---|---|---|---|---|
| Code Llama 2 7B | LoRA + FSDP | Yes | Yes | Yes |
| Code Llama 2 13B | LoRA + FSDP | Yes | Yes | Yes |
| Code Llama 2 34B | Int8 + LoRA + No FSDP | No | No | Yes |
| Code Llama 2 70B | Int8 + LoRA + No FSDP | No | No | Yes |
Fine-tuning of Llama models is based on scripts provided by the following GitHub repo.
Supported hyperparameters for training
Code Llama 2 fine-tuning supports a number of hyperparameters, each of which can impact the memory requirement, training speed, and performance of the fine-tuned model:
- epoch – The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default is 5.
- learning_rate – The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default is 1e-4.
- instruction_tuned – Whether to instruction-train the model or not. Must be `True` or `False`. Default is `False`.
- per_device_train_batch_size – The batch size per GPU core/CPU for training. Must be a positive integer. Default is 4.
- per_device_eval_batch_size – The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default is 1.
- max_train_samples – For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of the training samples. Must be a positive integer or -1. Default is -1.
- max_val_samples – For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of the validation samples. Must be a positive integer or -1. Default is -1.
- max_input_length – Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, `max_input_length` is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, `max_input_length` is set to the minimum of the provided value and the `model_max_length` defined by the tokenizer. Must be a positive integer or -1. Default is -1.
- validation_split_ratio – If the validation channel is `none`, the ratio of the train-validation split from the train data. Must be between 0–1. Default is 0.2.
- train_data_split_seed – If validation data is not present, this fixes the random splitting of the input training data into training and validation data used by the algorithm. Must be an integer. Default is 0.
- preprocessing_num_workers – The number of processes to use for preprocessing. If `None`, the main process is used for preprocessing. Default is `None`.
- lora_r – LoRA R. Must be a positive integer. Default is 8.
- lora_alpha – LoRA Alpha. Must be a positive integer. Default is 32.
- lora_dropout – LoRA Dropout. Must be a positive float between 0 and 1. Default is 0.05.
- int8_quantization – If `True`, the model is loaded with 8-bit precision for training. Default for 7B and 13B is `False`. Default for 70B is `True`.
- enable_fsdp – If `True`, training uses FSDP. Default for 7B and 13B is `True`. Default for 70B is `False`. Note that `int8_quantization` is not supported with FSDP.
When choosing the hyperparameters, consider the following:
- Setting `int8_quantization=True` decreases the memory requirement and leads to faster training.
- Decreasing `per_device_train_batch_size` and `max_input_length` reduces the memory requirement and therefore allows training on smaller instances. However, setting very low values may increase the training time.
- If you’re not using Int8 quantization (`int8_quantization=False`), use FSDP (`enable_fsdp=True`) for faster and more efficient training.
Supported instance types for training
The following table summarizes the supported instance types for training different models.
| Model | Default Instance Type | Supported Instance Types |
|---|---|---|
| Code Llama 2 7B | ml.g5.12xlarge | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge |
| Code Llama 2 13B | ml.g5.12xlarge | ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge |
| Code Llama 2 70B | ml.g5.48xlarge | ml.g5.48xlarge, ml.p4d.24xlarge |
When choosing the instance type, consider the following:
- G5 instances provide the most efficient training among the instance types supported. Therefore, if you have G5 instances available, you should use them.
- Training time largely depends on the number of GPUs and the CUDA memory available. Therefore, training on instances with the same number of GPUs (for example, ml.g5.2xlarge and ml.g5.4xlarge) takes roughly the same time, so you can use the cheaper instance for training (ml.g5.2xlarge).
- When using p3 instances, training will be done with 32-bit precision because bfloat16 is not supported on these instances. Therefore, the training job will consume double the amount of CUDA memory when training on p3 instances compared to g5 instances.
To learn about the cost of training per instance, refer to Amazon EC2 G5 Instances.
Evaluation
Evaluation is an important step to assess the performance of fine-tuned models. We present both qualitative and quantitative evaluations to show the improvement of fine-tuned models over non-fine-tuned ones. In the qualitative evaluation, we show an example response from both the fine-tuned and non-fine-tuned models. In the quantitative evaluation, we use HumanEval, a test suite developed by OpenAI for generating Python code to test the ability to produce correct and accurate results. The HumanEval repository is under the MIT license. We fine-tuned the Python variants of all Code Llama model sizes (Code Llama Python 7B, 13B, 34B, and 70B) on the Dolphin Coder dataset, and present the evaluation results in the following sections.
Qualitative evaluation
With your fine-tuned model deployed, you can start using the endpoint to generate code. In the following example, we present responses from both the base and fine-tuned Code Llama 34B Python variants on a test sample from the Dolphin Coder dataset:
The fine-tuned Code Llama model, in addition to providing the code for the preceding query, generates a detailed explanation of the approach and a pseudo code.
Code Llama 34B Python Non-Fine-Tuned Response:
Code Llama 34B Python Fine-Tuned Response
Ground Truth
Interestingly, our fine-tuned version of Code Llama 34B Python provides a dynamic programming-based solution to the longest palindromic substring, which is different from the solution provided in the ground truth from the selected test example. Our fine-tuned model reasons about and explains the dynamic programming-based solution in detail. On the other hand, the non-fine-tuned model hallucinates potential outputs right after the `print` statement (shown in the left cell) because the output `axyzzyx` is not the longest palindrome in the given string. In terms of time complexity, the dynamic programming solution is generally better than the initial approach. The dynamic programming solution has a time complexity of O(n^2), where n is the length of the input string. This is more efficient than the initial solution from the non-fine-tuned model, which also had a quadratic time complexity of O(n^2) but with a less optimized approach.
This looks promising! Remember, we only fine-tuned the Code Llama Python variant with 10% of the Dolphin Coder dataset. There is a lot more to explore!
Despite the thorough explanation in the response, we still need to examine the correctness of the Python code provided in the solution. Next, we use an evaluation framework called HumanEval to run integration tests on the generated responses from Code Llama to systematically examine their quality.
Quantitative evaluation with HumanEval
HumanEval is an evaluation harness for evaluating an LLM’s problem-solving capabilities on Python-based coding problems, as described in the paper Evaluating Large Language Models Trained on Code. Specifically, it consists of 164 original Python-based programming problems that assess a language model’s ability to generate code based on provided information like function signature, docstring, body, and unit tests.
For each Python-based programming question, we send it to a Code Llama model deployed on a SageMaker endpoint to get k responses. Next, we run each of the k responses on the integration tests in the HumanEval repository. If any of the k responses passes the integration tests, we count that test case as successful; otherwise, it is counted as failed. Then we repeat the process to calculate the ratio of successful cases as the final evaluation score, named `pass@k`. Following standard practice, we set k to 1 in our evaluation, to only generate one response per question and test whether it passes the integration test.
The following is sample code showing how to use the HumanEval repository. You can access the dataset and generate a single response using a SageMaker endpoint. For details, see the notebook in the GitHub repository.
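The sketch below generates one response per HumanEval problem from a deployed endpoint and writes the samples file that the HumanEval integration tests consume; the endpoint name and the response parsing are assumptions to adapt to your deployment:

```python
from human_eval.data import read_problems, write_jsonl
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Endpoint name is an assumption; use the endpoint created when you deployed the model
predictor = Predictor(
    endpoint_name="<your-code-llama-endpoint>",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

problems = read_problems()  # the 164 HumanEval problems
samples = []
for task_id, problem in problems.items():
    payload = {
        "inputs": problem["prompt"],
        "parameters": {"max_new_tokens": 384, "temperature": 0.2},
    }
    response = predictor.predict(payload)
    # The exact response shape depends on the serving container; adjust the parsing as needed
    completion = response[0]["generated_text"] if isinstance(response, list) else response["generated_text"]
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Then run the integration tests, for example:
#   evaluate_functional_correctness samples.jsonl
```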
The following table shows the improvements of the fine-tuned Code Llama Python models over the non-fine-tuned models across different model sizes. To ensure correctness, we also deploy the non-fine-tuned Code Llama models on SageMaker endpoints and run them through HumanEval evaluations. The pass@1 numbers (the first row in the following table) match the numbers reported in the Code Llama research paper. The inference parameters are consistently set as `"parameters": {"max_new_tokens": 384, "temperature": 0.2}`.
As we can see from the results, all the fine-tuned Code Llama Python variants show significant improvement over the non-fine-tuned models. In particular, Code Llama Python 70B outperforms the non-fine-tuned model by approximately 12%.
| | 7B Python | 13B Python | 34B | 34B Python | 70B Python |
|---|---|---|---|---|---|
| Pre-trained model performance (pass@1) | 38.4 | 43.3 | 48.8 | 53.7 | 57.3 |
| Fine-tuned model performance (pass@1) | 45.12 | 45.12 | 59.1 | 61.5 | 69.5 |
Now you can try fine-tuning Code Llama models on your own dataset.
Clean up
If you decide that you no longer want to keep the SageMaker endpoint running, you can delete it using AWS SDK for Python (Boto3), AWS Command Line Interface (AWS CLI), or SageMaker console. For more information, see Delete Endpoints and Resources. Additionally, you can shut down the SageMaker Studio resources that are no longer required.
Conclusion
In this post, we discussed fine-tuning Meta’s Code Llama 2 models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning techniques, instance types, and supported hyperparameters. In addition, we outlined recommendations for optimized training based on various tests we carried out. As the evaluation results show, fine-tuning the Code Llama Python variants on the Dolphin Coder dataset improves code generation accuracy compared to the non-fine-tuned models. As a next step, you can try fine-tuning these models on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.
About the Authors
Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.
Vishaal Yalamanchali is a Startup Solutions Architect working with early-stage generative AI, robotics, and autonomous vehicle companies. Vishaal works with his customers to deliver cutting-edge ML solutions and is personally interested in reinforcement learning, LLM evaluation, and code generation. Prior to AWS, Vishaal was an undergraduate at UCI, focused on bioinformatics and intelligent systems.
Meenakshisundaram Thandavarayan works for AWS as an AI/ ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker, and strives to drive businesses to new ways of working through innovation, incubation, and democratization.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Transform one-on-one customer interactions: Build speech-capable order processing agents with AWS and generative AI
In today’s landscape of one-on-one customer interactions for placing orders, the prevailing practice continues to rely on human attendants, even in settings like drive-thru coffee shops and fast-food establishments. This traditional approach poses several challenges: it heavily depends on manual processes, struggles to efficiently scale with increasing customer demands, introduces the potential for human errors, and operates within specific hours of availability. Additionally, in competitive markets, businesses adhering solely to manual processes might find it challenging to deliver efficient and competitive service. Despite technological advancements, the human-centric model remains deeply ingrained in order processing, leading to these limitations.
The prospect of utilizing technology for one-on-one order processing assistance has been available for some time. However, existing solutions can often fall into two categories: rule-based systems that demand substantial time and effort for setup and upkeep, or rigid systems that lack the flexibility required for human-like interactions with customers. As a result, businesses and organizations face challenges in swiftly and efficiently implementing such solutions. Fortunately, with the advent of generative AI and large language models (LLMs), it’s now possible to create automated systems that can handle natural language efficiently, and with an accelerated on-ramping timeline.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. In addition to Amazon Bedrock, you can use other AWS services like Amazon SageMaker JumpStart and Amazon Lex to create fully automated and easily adaptable generative AI order processing agents.
In this post, we show you how to build a speech-capable order processing agent using Amazon Lex, Amazon Bedrock, and AWS Lambda.
Solution overview
The following diagram illustrates our solution architecture.
The workflow consists of the following steps:
- A customer places the order using Amazon Lex.
- The Amazon Lex bot interprets the customer’s intents and triggers a `DialogCodeHook`.
- A Lambda function pulls the appropriate prompt template from the Lambda layer and formats model prompts by adding the customer input to the associated prompt template.
- The `RequestValidation` prompt verifies the order against the menu items and, via Amazon Lex, lets the customer know if they want to order something that isn’t part of the menu, providing recommendations. The prompt also performs a preliminary validation for order completeness.
- The `ObjectCreator` prompt converts the natural language requests into a data structure (JSON format).
- A customer Lambda function takes the data structure as an input for processing the order and passes the order total back to the orchestrating Lambda function.
- The orchestrating Lambda function calls the Amazon Bedrock LLM endpoint to generate a final order summary including the order total from the customer database system (for example, Amazon DynamoDB).
- The order summary is communicated back to the customer via Amazon Lex. After the customer confirms the order, the order will be processed.
Prerequisites
This post assumes that you have an active AWS account and familiarity with the following concepts and services:
- Generative AI
- Amazon Bedrock
- Anthropic Claude V2
- Amazon DynamoDB
- AWS Lambda
- Amazon Lex
- Amazon Simple Storage Service (Amazon S3)
Also, in order to access Amazon Bedrock from the Lambda functions, you need to make sure the Lambda runtime has the following libraries:
- boto3>=1.28.57
- awscli>=1.29.57
- botocore>=1.31.57
This can be done with a Lambda layer or by packaging the Lambda function as a container image that includes the required libraries.
Furthermore, these libraries are required when calling the Amazon Bedrock API from Amazon SageMaker Studio. This can be done by running a cell with the following code:
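For example, you can run a cell like the following (a minimal sketch; pin the versions your environment requires):

```python
# Upgrade the AWS SDK in the notebook kernel so it includes Amazon Bedrock support
%pip install --upgrade "boto3>=1.28.57" "awscli>=1.29.57" "botocore>=1.31.57"
```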
Finally, you create the following policy and later attach it to any role accessing Amazon Bedrock:
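A policy along these lines grants the required invoke permissions; the statement ID is illustrative, and you should scope the Resource down to specific model ARNs where possible:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBedrockInvoke",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "*"
        }
    ]
}
```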
Create a DynamoDB table
In our specific scenario, we’ve created a DynamoDB table as our customer database system, but you could also use Amazon Relational Database Service (Amazon RDS). Complete the following steps to provision your DynamoDB table (or customize the settings as needed for your use case):
- On the DynamoDB console, choose Tables in the navigation pane.
- Choose Create table.
- For Table name, enter a name (for example, `ItemDetails`).
- For Partition key, enter a key (for this post, we use `Item`).
- For Sort key, enter a key (for this post, we use `Size`).
- Choose Create table.
Now you can load the data into the DynamoDB table. For this post, we use a CSV file. You can load the data to the DynamoDB table using Python code in a SageMaker notebook.
First, we need to set up a profile named dev.
- Open a new terminal in SageMaker Studio and run the following command:
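The standard AWS CLI command to create the profile is:

```bash
aws configure --profile dev
```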
This command will prompt you to enter your AWS access key ID, secret access key, default AWS Region, and output format.
- Return to the SageMaker notebook and write a Python code to set up a connection to DynamoDB using the Boto3 library in Python. This code snippet creates a session using a specific AWS profile named dev and then creates a DynamoDB client using that session. The following is the code sample to load the data:
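A minimal sketch using the higher-level DynamoDB resource interface follows; the CSV file name and column names are assumptions to adapt to your data:

```python
import csv
import boto3

# Create a session with the "dev" profile configured in the previous step
session = boto3.Session(profile_name="dev")
dynamodb = session.resource("dynamodb")
table = dynamodb.Table("ItemDetails")  # the table created earlier

with open("menu_items.csv", newline="") as f:
    reader = csv.DictReader(f)
    with table.batch_writer() as batch:
        for row in reader:
            batch.put_item(
                Item={
                    "Item": row["Item"],   # partition key
                    "Size": row["Size"],   # sort key
                    "Price": row["Price"],
                }
            )
```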
Alternatively, you can use NoSQL Workbench or other tools to quickly load the data to your DynamoDB table.
The following is a screenshot after the sample data is inserted into the table.
Create templates in a SageMaker notebook using the Amazon Bedrock invocation API
To create our prompt template for this use case, we use Amazon Bedrock. You can access Amazon Bedrock from the AWS Management Console and via API invocations. In our case, we access Amazon Bedrock via API from the convenience of a SageMaker Studio notebook to create not only our prompt template, but our complete API invocation code that we can later use on our Lambda function.
- On the SageMaker console, access an existing SageMaker Studio domain or create a new one to access Amazon Bedrock from a SageMaker notebook.
- After you create the SageMaker domain and user, choose the user, choose Launch, and then choose Studio. This will open a JupyterLab environment.
- When the JupyterLab environment is ready, open a new notebook and begin importing the necessary libraries.
There are many FMs available via the Amazon Bedrock Python SDK. In this case, we use Claude V2, a powerful foundation model developed by Anthropic.
The order processing agent needs a few different templates. This can change depending on the use case, but we have designed a general workflow that can apply to multiple settings. For this use case, the Amazon Bedrock LLM template will accomplish the following:
- Validate the customer intent
- Validate the request
- Create the order data structure
- Pass a summary of the order to the customer
- To invoke the model, create a bedrock-runtime object from Boto3.
Let’s start by working on the intent validator prompt template. This is an iterative process, but thanks to Anthropic’s prompt engineering guide, you can quickly create a prompt that can accomplish the task.
- Create the first prompt template along with a utility function that will help prepare the body for the API invocations.
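A minimal sketch of such a utility for Claude v2 on Amazon Bedrock follows; the inference parameter values are assumptions, and the usage line is only a connectivity check:

```python
import json
import boto3

# Amazon Bedrock runtime client for model invocations
bedrock_runtime = boto3.client("bedrock-runtime")

def get_body(prompt, max_tokens=500, temperature=0.0):
    """Prepare the request body for a Claude v2 invocation."""
    return json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": temperature,
    })

def invoke_claude(prompt):
    """Invoke Claude v2 on Amazon Bedrock and return the completion text."""
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",
        body=get_body(prompt),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]

# Example: quick check that the Bedrock invocation works end to end
print(invoke_claude("Summarize in one sentence what an order processing agent does."))
```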
The following is the code for prompt_template_intent_validator.txt:
- Save this template into a file in order to upload to Amazon S3 and call from the Lambda function when needed. Save the templates as JSON serialized strings in a text file. The previous screenshot shows the code sample to accomplish this as well.
- Repeat the same steps with the other templates.
The following are some screenshots of the other templates and the results when calling Amazon Bedrock with some of them.
The following is the code for prompt_template_request_validator.txt:
The following is our response from Amazon Bedrock using this template.
The following is the code for prompt_template_object_creator.txt:
The following is the code for prompt_template_order_summary.txt:
As you can see, we have used our prompt templates to validate menu items, identify missing required information, create a data structure, and summarize the order. The foundational models available on Amazon Bedrock are very powerful, so you could accomplish even more tasks via these templates.
You have completed engineering the prompts and saved the templates to text files. You can now begin creating the Amazon Lex bot and the associated Lambda functions.
Create a Lambda layer with the prompt templates
Complete the following steps to create your Lambda layer:
- In SageMaker Studio, create a new folder with a subfolder named `python`.
- Copy your prompt files to the `python` folder.
- You can add the ZIP library to your notebook instance by running the following command.
- Now, run the following command to create the ZIP file for uploading to the Lambda layer.
- After you create the ZIP file, you can download the file. Go to Lambda, create a new layer by uploading the file directly or by uploading to Amazon S3 first.
- Then attach this new layer to the orchestration Lambda function.
Now your prompt template files are locally stored in your Lambda runtime environment. This will speed up the process during your bot runs.
Create a Lambda layer with the required libraries
Complete the following steps to create your Lambda layer with the required libraries:
- Open an AWS Cloud9 instance environment and create a folder with a subfolder called `python`.
- Open a terminal inside the `python` folder.
- Run the pip install commands from the terminal (see the consolidated sketch after these steps).
- Run `cd ..` to position yourself inside your new folder, where you also have the `python` subfolder.
- Run the zip command shown in the sketch after these steps.
- After you create the ZIP file, you can download the file. Go to Lambda and create a new layer by uploading the file directly or by uploading it to Amazon S3 first.
- Then attach this new layer to the orchestration Lambda function.
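The following consolidates the commands referenced in the preceding steps; the layer file name is illustrative:

```bash
# Inside the python/ folder: install the Bedrock-capable SDK versions into the layer folder
pip install "boto3>=1.28.57" "awscli>=1.29.57" "botocore>=1.31.57" -t .

# From the parent folder that contains python/: package the layer
cd ..
zip -r bedrock-sdk-layer.zip python
```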
Create the bot in Amazon Lex v2
For this use case, we build an Amazon Lex bot that can provide an input/output interface for the architecture in order to call Amazon Bedrock using voice or text from any interface. Because the LLM will handle the conversation piece of this order processing agent, and Lambda will orchestrate the workflow, you can create a bot with three intents and no slots.
- On the Amazon Lex console, create a new bot with the method Create a blank bot.
Now you can add an intent with any appropriate initial utterance for the end-users to start the conversation with the bot. We use simple greetings and add an initial bot response so end-users can provide their requests. When creating the bot, make sure to use a Lambda code hook with the intents; this will trigger a Lambda function that will orchestrate the workflow between the customer, Amazon Lex, and the LLM.
- Add your first intent, which triggers the workflow and uses the intent validation prompt template to call Amazon Bedrock and identify what the customer is trying to accomplish. Add a few simple utterances for end-users to start the conversation.
You don’t need to use any slots or initial reading in any of the bot intents. In fact, you don’t need to add utterances to the second or third intents. That is because the LLM will guide Lambda throughout the process.
- Add a confirmation prompt. You can customize this message in the Lambda function later.
- Under Code hooks, select Use a Lambda function for initialization and validation.
- Create a second intent with no utterance and no initial response. This is the `PlaceOrder` intent.
When the LLM identifies that the customer is trying to place an order, the Lambda function will trigger this intent and validate the customer request against the menu, and make sure that no required information is missing. Remember that all of this is on the prompt templates, so you can adapt this workflow for any use case by changing the prompt templates.
- Don’t add any slots, but add a confirmation prompt and decline response.
- Select Use a Lambda function for initialization and validation.
- Create a third intent named `ProcessOrder` with no sample utterances and no slots.
- Add an initial response, a confirmation prompt, and a decline response.
After the LLM has validated the customer request, the Lambda function triggers the third and last intent to process the order. Here, Lambda will use the object creator template to generate the order JSON data structure to query the DynamoDB table, and then use the order summary template to summarize the whole order along with the total so Amazon Lex can pass it to the customer.
- Select Use a Lambda function for initialization and validation. This can use any Lambda function to process the order after the customer has given the final confirmation.
- After you create all three intents, go to the Visual builder for the `ValidateIntent`, add a go-to intent step, and connect the output of the positive confirmation to that step.
- After you add the go-to intent, edit it and choose the PlaceOrder intent as the intent name.
- Similarly, go to the Visual builder for the `PlaceOrder` intent and connect the output of the positive confirmation to the `ProcessOrder` go-to intent. No editing is required for the `ProcessOrder` intent.
- You now need to create the Lambda function that orchestrates Amazon Lex and calls the DynamoDB table, as detailed in the following section.
Create a Lambda function to orchestrate the Amazon Lex bot
You can now build the Lambda function that orchestrates the Amazon Lex bot and workflow. Complete the following steps:
- Create a Lambda function with the standard execution policy and let Lambda create a role for you.
- In the code window of your function, add a few utility functions that will help: format the prompts by adding the Lex context to the template, call the Amazon Bedrock LLM API, extract the desired text from the responses, and more. See the sketch after these steps.
- Attach the Lambda layer you created earlier to this function.
- Additionally, attach the layer with the prompt templates you created.
- In the Lambda execution role, attach the policy to access Amazon Bedrock, which was created earlier.
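The following is a minimal sketch of the kinds of utilities referenced in the steps above; the layer path, the placeholder token in the templates, and the inference parameters are assumptions:

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def load_template(name):
    # Files packaged in the Lambda layer's python/ folder are mounted under /opt/python
    with open(f"/opt/python/{name}") as f:
        return f.read()

def format_prompt(template_name, customer_input):
    # The {customer_input} placeholder token is an assumption; match it to your templates
    return load_template(template_name).replace("{customer_input}", customer_input)

def call_claude(prompt, max_tokens=500, temperature=0.0):
    # Invoke Claude v2 on Amazon Bedrock and return the completion text
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": temperature,
    })
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]

def lex_message(message, dialog_action_type="ElicitIntent"):
    # Shape of an Amazon Lex V2 Lambda response that sends a message back to the customer
    return {
        "sessionState": {"dialogAction": {"type": dialog_action_type}},
        "messages": [{"contentType": "PlainText", "content": message}],
    }
```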
The Lambda execution role should have the following permissions.
Attach the Orchestration Lambda function to the Amazon Lex bot
- After you create the function in the previous section, return to the Amazon Lex console and navigate to your bot.
- Under Languages in the navigation pane, choose English.
- For Source, choose your order processing bot.
- For Lambda function version or alias, choose $LATEST.
- Choose Save.
Create assisting Lambda functions
Complete the following steps to create additional Lambda functions:
- Create a Lambda function to query the DynamoDB table that you created earlier (a minimal handler sketch follows these steps).
- Navigate to the Configuration tab in the Lambda function and choose Permissions.
- Attach a resource-based policy statement allowing the order processing Lambda function to invoke this function.
- Navigate to the IAM execution role for this Lambda function and add a policy to access the DynamoDB table.
- Create another Lambda function to validate whether all required attributes were passed from the customer. In the following example (see the sketch after these steps), we validate whether the size attribute is captured for an order.
- Navigate to the Configuration tab in the Lambda function and choose Permissions.
- Attach a resource-based policy statement allowing the order processing Lambda function to invoke this function.
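The following is a minimal sketch of the two assisting functions referenced in the steps above; the event shapes and attribute names are assumptions based on the ItemDetails table defined earlier:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ItemDetails")  # the table created earlier

def get_item_details(event, context):
    """Look up a menu item by Item (partition key) and Size (sort key)."""
    response = table.get_item(Key={"Item": event["Item"], "Size": event["Size"]})
    return response.get("Item", {})

def validate_order(event, context):
    """Check that every line item in the order carries a size attribute."""
    missing = [item["Item"] for item in event.get("order", []) if not item.get("Size")]
    if missing:
        return {"valid": False, "missing_size_for": missing}
    return {"valid": True}
```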
Test the solution
Now we can test the solution with example orders that customers place via Amazon Lex.
For our first example, the customer asked for a frappuccino, which is not on the menu. The model validates the request with the help of the order validator template and suggests some recommendations based on the menu. After the customer confirms their order, they are notified of the order total and order summary. The order will be processed based on the customer’s final confirmation.
In our next example, the customer orders a large cappuccino and then modifies the size from large to medium. The model captures all necessary changes and requests that the customer confirm the order. The model presents the order total and order summary, and processes the order based on the customer’s final confirmation.
For our final example, the customer placed an order for multiple items and the size is missing for a couple of items. The model and Lambda function will verify if all required attributes are present to process the order and then ask the customer to provide the missing information. After the customer provides the missing information (in this case, the size of the coffee), they’re shown the order total and order summary. The order will be processed based on the customer’s final confirmation.
LLM limitations
LLM outputs are stochastic by nature, which means that the results from our LLM can vary in format, or even in the form of untruthful content (hallucinations). Therefore, developers need to rely on a good error handling logic throughout their code in order to handle these scenarios and avoid a degraded end-user experience.
Clean up
If you no longer need this solution, you can delete the following resources:
- Lambda functions
- Amazon Lex bot
- DynamoDB table
- S3 bucket
Additionally, shut down the SageMaker Studio instance if the application is no longer required.
Cost assessment
For pricing information for the main services used by this solution, see the following:
- Amazon Bedrock Pricing
- Amazon DynamoDB Pricing
- AWS Lambda Pricing
- Amazon Lex Pricing
- Amazon S3 Pricing
Note that you can use Claude v2 without the need for provisioning, so overall costs remain at a minimum. To further reduce costs, you can configure the DynamoDB table with the on-demand setting.
Conclusion
This post demonstrated how to build a speech-enabled AI order processing agent using Amazon Lex, Amazon Bedrock, and other AWS services. We showed how prompt engineering with a powerful generative AI model like Claude can enable robust natural language understanding and conversation flows for order processing without the need for extensive training data.
The solution architecture uses serverless components like Lambda, Amazon S3, and DynamoDB to enable a flexible and scalable implementation. Storing the prompt templates in Amazon S3 allows you to customize the solution for different use cases.
Next steps could include expanding the agent’s capabilities to handle a wider range of customer requests and edge cases. The prompt templates provide a way to iteratively improve the agent’s skills. Additional customizations could involve integrating the order data with backend systems like inventory, CRM, or POS. Lastly, the agent could be made available across various customer touchpoints like mobile apps, drive-thru, kiosks, and more using the multi-channel capabilities of Amazon Lex.
To learn more, refer to the following related resources:
- Deploying and managing multi-channel bots:
- Prompt engineering for Claude and other models:
- Serverless architectural patterns for scalable AI assistants:
About the Authors
Moumita Dutta is a Partner Solution Architect at Amazon Web Services. In her role, she collaborates closely with partners to develop scalable and reusable assets that streamline cloud deployments and enhance operational efficiency. She is a member of AI/ML community and a Generative AI expert at AWS. In her leisure, she enjoys gardening and cycling.
Fernando Lammoglia is a Partner Solutions Architect at Amazon Web Services, working closely with AWS partners in spearheading the development and adoption of cutting-edge AI solutions across business units. A strategic leader with expertise in cloud architecture, generative AI, machine learning, and data analytics. He specializes in executing go-to-market strategies and delivering impactful AI solutions aligned with organizational goals. On his free time he loves to spend time with his family and travel to other countries.
Mitul Patel is a Senior Solution Architect at Amazon Web Services. In his role as a cloud technology enabler, he works with customers to understand their goals and challenges, and provides prescriptive guidance to achieve their objective with AWS offerings. He is a member of AI/ML community and a Generative AI ambassador at AWS. In his free time, he enjoys hiking and playing soccer.
Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker
This post is co-written with Chaoyang He, Al Nevarez and Salman Avestimehr from FedML.
Many organizations are implementing machine learning (ML) to enhance their business decision-making through automation and the use of large distributed datasets. With increased access to data, ML has the potential to provide unparalleled business insights and opportunities. However, the sharing of raw, non-sanitized sensitive information across different locations poses significant security and privacy risks, especially in regulated industries such as healthcare.
To address this issue, federated learning (FL) is a decentralized and collaborative ML training technique that offers data privacy while maintaining accuracy and fidelity. Unlike traditional ML training, FL training occurs within an isolated client location using an independent secure session. The client only shares its output model parameters with a centralized server, known as the training coordinator or aggregation server, and not the actual data used to train the model. This approach alleviates many data privacy concerns while enabling effective collaboration on model training.
Although FL is a step towards achieving better data privacy and security, it’s not a guaranteed solution. Insecure networks lacking access control and encryption can still expose sensitive information to attackers. Additionally, locally trained information can expose private data if reconstructed through an inference attack. To mitigate these risks, the FL model uses personalized training algorithms and effective masking and parameterization before sharing information with the training coordinator. Strong network controls at local and centralized locations can further reduce inference and exfiltration risks.
In this post, we share an FL approach using FedML, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to improve patient outcomes while addressing data privacy and security concerns.
The need for federated learning in healthcare
Healthcare relies heavily on distributed data sources to make accurate predictions and assessments about patient care. Limiting the available data sources to protect privacy negatively affects result accuracy and, ultimately, the quality of patient care. Therefore, ML creates challenges for AWS customers who need to ensure privacy and security across distributed entities without compromising patient outcomes.
Healthcare organizations must navigate strict compliance regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, while implementing FL solutions. Ensuring data privacy, security, and compliance becomes even more critical in healthcare, requiring robust encryption, access controls, auditing mechanisms, and secure communication protocols. Additionally, healthcare datasets often contain complex and heterogeneous data types, making data standardization and interoperability a challenge in FL settings.
Use case overview
The use case outlined in this post concerns heart disease data held by different organizations, on which an ML model runs classification algorithms to predict heart disease in patients. Because this data is spread across organizations, we use federated learning to collate the findings.
The Heart Disease dataset from the University of California Irvine’s Machine Learning Repository is a widely used dataset for cardiovascular research and predictive modeling. It consists of 303 samples, each representing a patient, and contains a combination of clinical and demographic attributes, as well as the presence or absence of heart disease.
This multivariate dataset has 76 attributes in the patient information, out of which 14 attributes are most commonly used for developing and evaluating ML algorithms to predict the presence of heart disease based on the given attributes.
FedML framework
There is a wide selection of FL frameworks, but we decided to use the FedML framework for this use case because it is open source and supports several FL paradigms. FedML provides a popular open source library, MLOps platform, and application ecosystem for FL. These facilitate the development and deployment of FL solutions. It provides a comprehensive suite of tools, libraries, and algorithms that enable researchers and practitioners to implement and experiment with FL algorithms in a distributed environment. FedML addresses the challenges of data privacy, communication, and model aggregation in FL, offering a user-friendly interface and customizable components. With its focus on collaboration and knowledge sharing, FedML aims to accelerate the adoption of FL and drive innovation in this emerging field. The FedML framework is model agnostic, including recently added support for large language models (LLMs). For more information, refer to Releasing FedLLM: Build Your Own Large Language Models on Proprietary Data using the FedML Platform.
FedML Octopus
System hierarchy and heterogeneity is a key challenge in real-life FL use cases, where different data silos may have different infrastructure with CPU and GPUs. In such scenarios, you can use FedML Octopus.
FedML Octopus is the industrial-grade platform for cross-silo FL for cross-organization and cross-account training. Coupled with FedML MLOps, it enables developers or organizations to conduct open collaboration from anywhere at any scale in a secure manner. FedML Octopus runs a distributed training paradigm inside each data silo and uses synchronous or asynchronous training.
FedML MLOps
FedML MLOps enables local development of code that can later be deployed anywhere using FedML frameworks. Before initiating training, you must create a FedML account, as well as create and upload the server and client packages in FedML Octopus. For more details, refer to steps and Introducing FedML Octopus: scaling federated learning into production with simplified MLOps.
Solution overview
We deploy FedML into multiple EKS clusters integrated with SageMaker for experiment tracking. We use Amazon EKS Blueprints for Terraform to deploy the required infrastructure. EKS Blueprints helps compose complete EKS clusters that are fully bootstrapped with the operational software that is needed to deploy and operate workloads. With EKS Blueprints, the configuration for the desired state of EKS environment, such as the control plane, worker nodes, and Kubernetes add-ons, is described as an infrastructure as code (IaC) blueprint. After a blueprint is configured, it can be used to create consistent environments across multiple AWS accounts and Regions using continuous deployment automation.
The content shared in this post reflects real-life situations and experiences, but it’s important to note that the deployment of these situations in different locations may vary. Although we utilize a single AWS account with separate VPCs, it’s crucial to understand that individual circumstances and configurations may differ. Therefore, the information provided should be used as a general guide and may require adaptation based on specific requirements and local conditions.
The following diagram illustrates our solution architecture.
In addition to the tracking provided by FedML MLOps for each training run, we use Amazon SageMaker Experiments to track the performance of each client model and the centralized (aggregator) model.
SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare your ML experiments. By recording experiment details, parameters, and results, researchers can accurately reproduce and validate their work. It allows for effective comparison and analysis of different approaches, leading to informed decision-making. Additionally, tracking experiments facilitates iterative improvement by providing insights into the progression of models and enabling researchers to learn from previous iterations, ultimately accelerating the development of more effective solutions.
We send the following to SageMaker Experiments for each run:
- Model evaluation metrics – Training loss and Area Under the Curve (AUC)
- Hyperparameters – Epoch, learning rate, batch size, optimizer, and weight decay
Prerequisites
To follow along with this post, you should have the following prerequisites:
- An AWS account
- Local access to the AWS Command Line Interface (AWS CLI) or usage of AWS CloudShell
- Terraform
- kubectl
- A FedML account ID
Deploy the solution
To begin, clone the repository hosting the sample code locally:
Then deploy the use case infrastructure using the following commands:
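A typical flow looks like the following; the repository URL and folder names are placeholders to replace with the values for the sample code:

```bash
# Clone the sample repository (URL and folder names are placeholders)
git clone <repository-url>
cd <repository-folder>/terraform

# Deploy the infrastructure
terraform init
terraform plan
terraform apply --auto-approve
```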
The Terraform template may take 20–30 minutes to fully deploy. After it’s deployed, follow the steps in the next sections to run the FL application.
Create an MLOps deployment package
As a part of the FedML documentation, we need to create the client and server packages, which the MLOps platform will distribute to the server and clients to begin training.
To create these packages, run the following script found in the root directory:
This will create the respective packages in the following directory in the project’s root directory:
Upload the packages to the FedML MLOps platform
Complete the following steps to upload the packages:
- On the FedML UI, choose My Applications in the navigation pane.
- Choose New Application.
- Upload the client and server packages from your workstation.
- You can also adjust the hyperparameters or create new ones.
Trigger federated training
To run federated training, complete the following steps:
- On the FedML UI, choose Project List in the navigation pane.
- Choose Create a new project.
- Enter a group name and a project name, then choose OK.
- Choose the newly created project and choose Create new run to trigger a training run.
- Select the edge client devices and the central aggregator server for this training run.
- Choose the application that you created in the previous steps.
- Update any of the hyperparameters or use the default settings.
- Choose Start to start training.
- Choose the Training Status tab and wait for the training run to complete. You can also navigate to the tabs available.
- When training is complete, choose the System tab to see the training time durations on your edge servers and aggregation events.
View results and experiment details
When the training is complete, you can view the results using FedML and SageMaker.
On the FedML UI, on the Models tab, you can see the aggregator and client model. You can also download these models from the website.
You can also log in to Amazon SageMaker Studio and choose Experiments in the navigation pane.
The following screenshot shows the logged experiments.
Experiment tracking code
In this section, we explore the code that integrates SageMaker experiment tracking with the FL framework training.
In an editor of your choice, open the following folder to see the edits to the code to inject SageMaker experiment tracking code as a part of the training:
For tracking the training, we create a SageMaker experiment, with parameters and metrics logged using the `log_parameter` and `log_metric` commands, as outlined in the following code sample.
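A minimal sketch using the SageMaker Experiments Run API follows; the run name, parameter values, and metric values are illustrative:

```python
from sagemaker.experiments.run import Run

# Experiment name comes from sm_experiment_name in config/fedml_config.yaml;
# the run name, parameters, and metric values below are illustrative
with Run(experiment_name="fed-heart-disease", run_name="client-1") as run:
    run.log_parameter("epoch", 10)
    run.log_parameter("learning_rate", 0.01)
    run.log_parameter("batch_size", 32)

    # Log training loss and AUC for each communication round
    for round_idx, (loss, auc) in enumerate([(0.68, 0.71), (0.55, 0.78)]):
        run.log_metric(name="training-loss", value=loss, step=round_idx)
        run.log_metric(name="auc", value=auc, step=round_idx)
```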
An entry in the `config/fedml_config.yaml` file declares the experiment prefix, which is referenced in the code to create unique experiment names: `sm_experiment_name: "fed-heart-disease"`. You can update this to any value of your choice.
For example, see the following code for `heart_disease_trainer.py`, which is used by each client to train the model on their own dataset:
For each client run, the experiment details are tracked using the following code in heart_disease_trainer.py:
Similarly, you can use the code in `heart_disease_aggregator.py` to run a test on local data after updating the model weights. The details are logged after each communication run with the clients.
Clean up
When you’re done with the solution, make sure to clean up the resources you used to ensure efficient resource utilization and cost management, and to avoid unnecessary expenses and resource wastage. Actively tidying up the environment, such as deleting unused instances, stopping unnecessary services, and removing temporary data, contributes to a clean and organized infrastructure. You can use the following code to clean up your resources:
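Assuming the infrastructure was deployed with the Terraform template, run the following from the same Terraform folder used for deployment:

```bash
terraform destroy --auto-approve
```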
Summary
By using Amazon EKS as the infrastructure and FedML as the framework for FL, we are able to provide a scalable and managed environment for training and deploying shared models while respecting data privacy. With the decentralized nature of FL, organizations can collaborate securely, unlock the potential of distributed data, and improve ML models without compromising data privacy.
As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.
About the Authors
Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.
Arnab Sinha is a Senior Solutions Architect for AWS, acting as Field CTO to help organizations design and build scalable solutions supporting business outcomes across data center migrations, digital transformation and application modernization, big data, and machine learning. He has supported customers across a variety of industries, including energy, retail, manufacturing, healthcare, and life sciences. Arnab holds all AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Arnab was a technology leader and previously held architect and engineering leadership roles.
Prachi Kulkarni is a Senior Solutions Architect at AWS. Her specialization is machine learning, and she is actively working on designing solutions using various AWS ML, big data, and analytics offerings. Prachi has experience in multiple domains, including healthcare, benefits, retail, and education, and has worked in a range of positions in product engineering and architecture, management, and customer success.
Tamer Sherif is a Principal Solutions Architect at AWS, with a diverse background in the technology and enterprise consulting services realm, spanning over 17 years as a Solutions Architect. With a focus on infrastructure, Tamer’s expertise covers a broad spectrum of industry verticals, including commercial, healthcare, automotive, public sector, manufacturing, oil and gas, media services, and more. His proficiency extends to various domains, such as cloud architecture, edge computing, networking, storage, virtualization, business productivity, and technical leadership.
Hans Nesbitt is a Senior Solutions Architect at AWS based out of Southern California. He works with customers across the western US to craft highly scalable, flexible, and resilient cloud architectures. In his spare time, he enjoys spending time with his family, cooking, and playing guitar.
Chaoyang He is Co-founder and CTO of FedML, Inc., a startup running for a community building open and collaborative AI from anywhere at any scale. His research focuses on distributed and federated machine learning algorithms, systems, and applications. He received his PhD in Computer Science from the University of Southern California.
Al Nevarez is Director of Product Management at FedML. Before FedML, he was a group product manager at Google, and a senior manager of data science at LinkedIn. He has several data product-related patents, and he studied engineering at Stanford University.
Salman Avestimehr is Co-founder and CEO of FedML. He has been a Dean’s Professor at USC, Director of the USC-Amazon Center on Trustworthy AI, and an Amazon Scholar in Alexa AI. He is an expert on federated and decentralized machine learning, information theory, security, and privacy. He is a Fellow of IEEE and received his PhD in EECS from UC Berkeley.
Samir Lad is an accomplished enterprise technologist with AWS who works closely with customers’ C-level executives. As a former C-suite executive who has driven transformations across multiple Fortune 100 companies, Samir shares his invaluable experiences to help his clients succeed in their own transformation journey.
Stephen Kraemer is a Board and CxO advisor and former executive at AWS. Stephen advocates culture and leadership as the foundations of success. He professes security and innovation as the drivers of cloud transformation, enabling highly competitive, data-driven organizations.
Enable data sharing through federated learning: A policy approach for chief digital officers
This is a guest blog post written by Nitin Kumar, a Lead Data Scientist at T and T Consulting Services, Inc.
In this post, we discuss the value and potential impact of federated learning in the healthcare field. This approach can help heart stroke patients, doctors, and researchers with faster diagnosis, enriched decision-making, and more informed, inclusive research work on stroke-related health issues, using a cloud-native approach with AWS services for lightweight lift and straightforward adoption.
Diagnosis challenges with heart strokes
Statistics from the Centers for Disease Control and Prevention (CDC) show that each year in the US, more than 795,000 people have a stroke, and about 25% of these are recurrent attacks. Stroke is the number five cause of death according to the American Stroke Association and a leading cause of disability in the US. Therefore, it’s crucial to have prompt diagnosis and treatment to reduce brain damage and other complications in acute stroke patients.
CTs and MRIs are the gold standard in imaging technologies for classifying different sub-types of strokes and are crucial during preliminary assessment of patients, determining the root cause, and treatment. One critical challenge here, especially in the case of acute stroke, is the time of imaging diagnosis, which on average ranges from 30 minutes up to an hour and can be much longer depending on emergency department crowding.
Doctors and medical staff need quick and accurate image diagnosis to evaluate a patient’s condition and propose treatment options. In Dr. Werner Vogels’s own words at AWS re:Invent 2023, “every second that a person has a stroke counts.” Stroke victims can lose around 1.9 billion neurons every second they are not being treated.
Medical data restrictions
You can use machine learning (ML) to assist doctors and researchers in diagnosis tasks, thereby speeding up the process. However, the datasets needed to build the ML models and give reliable results are sitting in silos across different healthcare systems and organizations. This isolated legacy data has the potential for massive impact if cumulated. So why hasn’t it been used yet?
There are multiple challenges when working with medical domain datasets and building ML solutions, including patient privacy, security of personal data, and certain bureaucratic and policy restrictions. Additionally, research institutions have been tightening their data sharing practices. These obstacles also prevent international research teams from working together on diverse and rich datasets, which could save lives and prevent disabilities that can result from heart strokes, among other benefits.
Policies and regulations like the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and California Consumer Privacy Act (CCPA) put guardrails on sharing data from the medical domain, especially patient data. Additionally, the datasets at individual institutes, organizations, and hospitals are often too small, are unbalanced, or have biased distributions, leading to model generalization constraints.
Federated learning: An introduction
Federated learning (FL) is a decentralized form of ML. In this approach, the ML model is shared between organizations and trained on each organization’s proprietary data subset, unlike traditional centralized ML training, where the model generally trains on aggregated datasets. The data stays protected behind the organization’s firewalls or VPC, while the model and its metadata are shared.
In the training phase, a global FL model is disseminated and synchronized among the participating organizations, each of which trains it on its own dataset and returns a locally trained model. The final global model is available for all participants to use for predictions, and it can also serve as a base for further training to build local custom models for participating organizations. It can further be extended to benefit other institutes. This approach can significantly reduce cybersecurity requirements for data in transit by removing the need for data to leave the organization’s boundaries at all.
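To make the mechanics concrete, the following is a minimal sketch of federated averaging (FedAvg) on a toy linear model with two hypothetical hospitals. The data, learning rate, and round counts are illustrative only; a real deployment would train a DL model through one of the FL frameworks discussed later in this post.

```python
import numpy as np

def local_update(global_w, X, y, epochs=5, lr=0.05):
    """Client-side step (runs inside each hospital's own environment):
    refine the global linear model on local data and return only the
    updated weights and the local sample count, never the raw data."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # squared-error gradient
        w -= lr * grad
    return w, len(y)

def fedavg_round(global_w, clients):
    """Server-side step: average the returned weights, weighted by each
    client's dataset size (the core of federated averaging)."""
    updates = [local_update(global_w, X, y) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(n / total * w for w, n in updates)

# Toy example: two hypothetical hospitals hold disjoint local datasets
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (120, 80):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n)))

global_w = np.zeros(2)
for _ in range(20):  # 20 federated rounds
    global_w = fedavg_round(global_w, clients)
print(global_w)  # converges toward [2.0, -1.0] without pooling the data
```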
The following diagram illustrates an example architecture.
In the following sections, we discuss how federated learning can help.
Federated learning to save the day (and save lives)
For good artificial intelligence (AI), you need good data.
Legacy systems, which are frequently found in the federal domain, pose significant data processing challenges before you can derive any intelligence from them or merge them with newer datasets. This is an obstacle to providing valuable intelligence to leaders, and it can lead to inaccurate decision-making, because the legacy data often represents a larger and more valuable share of the available information than the newer, smaller datasets. You want to resolve this bottleneck effectively, without months, or in many cases years, of manual consolidation and integration effort (including cumbersome mapping processes) across the legacy and newer datasets sitting in hospitals and institutes.
The legacy data is quite valuable because it holds important contextual information needed for accurate decision-making and well-informed model training, leading to reliable AI in the real world. The long time span of the data reveals long-term variations and patterns that would otherwise go undetected and lead to biased and ill-informed predictions.
Breaking down these data silos to unite the untapped potential of the scattered data can save and transform many lives. It can also accelerate the research related to secondary health issues arising from heart strokes. This solution can help you share insights from data isolated between institutes due to policy and other reasons, whether you are a hospital, a research institute, or other health data-focused organizations. It can enable informed decisions on research direction and diagnosis. Additionally, it results in a centralized repository of intelligence via a secure, private, and global knowledge base.
Federated learning has many benefits in general and specifically for medical data settings.
Security and privacy features:
- Keeps sensitive data away from the internet and still uses it for ML, and harnesses its intelligence with differential privacy
- Enables you to build, train, and deploy unbiased and robust models across not just machines but also networks, without any data security hazards
- Overcomes the hurdles with multiple vendors managing the data
- Eliminates the need for cross-site data sharing and global governance
- Preserves privacy with differential privacy and offers secure multi-party computation with local training (see the sketch after these lists)
Performance improvements:
- Addresses the small sample size problem in the medical imaging space and costly labeling processes
- Balances the distribution of the data
- Enables you to incorporate most traditional ML and deep learning (DL) methods
- Uses pooled image sets to help improve statistical power, overcoming the sample size limitation of individual institutions
Resilience benefits:
- If any one party decides to leave, it won’t hinder the training
- A new hospital or institute can join at any time; it’s not reliant on any specific dataset with any node organization
- There is no need for extensive data engineering pipelines for the legacy data scattered across widespread geographical locations
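As a rough illustration of the differential privacy point above, each client can clip its model update and add calibrated noise before returning it, so the server never receives an exact function of any single site’s data. This is only a sketch with assumed clipping and noise values; calibrating the noise to a formal privacy budget should be done with a dedicated DP library.

```python
import numpy as np

def privatize_update(delta, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip a client's model update (local weights minus global weights) to a
    maximum L2 norm and add Gaussian noise, the basic mechanism behind
    DP-SGD-style federated training. Parameter values here are illustrative."""
    rng = rng or np.random.default_rng()
    delta = np.asarray(delta, dtype=float)
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=delta.shape)
    return clipped + noise

# Each hospital would call privatize_update on its update before sending it
# back to the aggregation server.
```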
These features can help bring the walls down between institutions hosting isolated datasets on similar domains. The solution can become a force multiplier by harnessing the combined power of distributed datasets and by scaling efficiently without a heavy infrastructure lift. This approach helps ML reach its full potential, becoming proficient at the clinical level and not just in research.
Federated learning has performance comparable to regular ML, as shown in the following experiment by NVIDIA Clara (using a Medical Model ARchive (MMAR) with the BraTS 2018 dataset). Here, FL achieved segmentation performance comparable to training with centralized data: over 80% with approximately 600 epochs while training a multi-modal, multi-class brain tumor segmentation task.
Federated learning has been tested recently in a few medical sub-fields for use cases including patient similarity learning, patient representation learning, phenotyping, and predictive modeling.
Application blueprint: Federated learning makes it possible and straightforward
To get started with FL, you can choose from many high-quality datasets. For example, datasets with brain images include ABIDE (Autism Brain Imaging Data Exchange initiative), ADNI (Alzheimer’s Disease Neuroimaging Initiative), RSNA (Radiological Society of North America) Brain CT, BraTS (Multimodal Brain Tumor Image Segmentation Benchmark) updated regularly for the Brain Tumor Segmentation Challenge under UPenn (University of Pennsylvania), UK BioBank (covered in the following NIH paper), and IXI. Similarly for heart images, you can choose from several publicly available options, including ACDC (Automatic Cardiac Diagnosis Challenge), which is a cardiac MRI assessment dataset with full annotation mentioned by the National Library of Medicine in the following paper, and M&M (Multi-Center, Multi-Vendor, and Multi-Disease) Cardiac Segmentation Challenge mentioned in the following IEEE paper.
The following images show a probabilistic lesion overlap map for the primary lesions from the ATLAS R1.1 dataset. (Strokes are one of the most common causes of brain lesions according to Cleveland Clinic.)
For Electronic Health Records (EHR) data, a few datasets are available that follow the Fast Healthcare Interoperability Resources (FHIR) standard. This standard helps you build straightforward pilots by removing certain challenges with heterogeneous, non-normalized datasets, allowing for seamless and secure exchange, sharing, and integration of datasets. FHIR enables maximum interoperability. Dataset examples include MIMIC-IV (Medical Information Mart for Intensive Care). Other good-quality datasets that aren’t currently in FHIR but can be easily converted include Centers for Medicare & Medicaid Services (CMS) Public Use Files (PUF) and the eICU Collaborative Research Database from MIT (Massachusetts Institute of Technology). Other resources offering FHIR-based datasets are also becoming available.
The lifecycle for implementing FL can include the following steps: task initialization, selection, configuration, model training, client/server communication, scheduling and optimization, versioning, testing, deployment, and termination. There are many time-intensive steps that go into preparing medical imaging data for traditional ML, as described in the following paper. Domain knowledge might be needed in some scenarios to preprocess raw patient data, especially due to its sensitive and private nature. These can be consolidated and sometimes eliminated for FL, saving crucial time for training and providing faster results.
Implementation
FL tools and libraries have grown with widespread support, making it straightforward to use FL without a heavy overhead lift. There are a lot of good resources and framework options available to get started. You can refer to the following extensive list of the most popular frameworks and tools in the FL domain, including PySyft, FedML, Flower, OpenFL, FATE, TensorFlow Federated, and NVFlare. It provides a beginner’s list of projects to get started quickly and build upon.
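For example, a participating site could wrap its local training code in a Flower client along the lines of the following sketch. The model object and its get_weights, set_weights, train, and test methods are hypothetical stand-ins for your own training code, and the exact Flower entry points (for example, start_numpy_client versus newer APIs) vary by version, so treat this as a sketch rather than a drop-in script.

```python
import flwr as fl

class StrokeClient(fl.client.NumPyClient):
    """Runs inside each hospital's private environment; only weights and
    metrics cross the network, never patient data."""

    def __init__(self, model, X, y):
        self.model, self.X, self.y = model, X, y

    def get_parameters(self, config):
        return self.model.get_weights()          # hypothetical model API

    def fit(self, parameters, config):
        self.model.set_weights(parameters)
        self.model.train(self.X, self.y)         # local training step
        return self.model.get_weights(), len(self.X), {}

    def evaluate(self, parameters, config):
        self.model.set_weights(parameters)
        loss, acc = self.model.test(self.X, self.y)
        return loss, len(self.X), {"accuracy": acc}

# Client side (one process per participating site):
# fl.client.start_numpy_client(server_address="fl-server:8080",
#                              client=StrokeClient(model, X, y))

# Server side (aggregation only, no patient data):
# fl.server.start_server(server_address="0.0.0.0:8080",
#                        config=fl.server.ServerConfig(num_rounds=5),
#                        strategy=fl.server.strategy.FedAvg())
```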
You can implement a cloud-native approach with Amazon SageMaker that seamlessly works with AWS VPC peering, keeping each node’s training in a private subnet in their respective VPC and enabling communication via private IPv4 addresses. Furthermore, model hosting on Amazon SageMaker JumpStart can help by exposing the endpoint API without sharing model weights.
Using Amazon Elastic Compute Cloud (Amazon EC2) resources also removes the heavy compute challenges that come with on-premises hardware. You can implement the FL clients and servers on AWS with SageMaker notebooks and Amazon Simple Storage Service (Amazon S3), maintain regulated access to the data and model with AWS Identity and Access Management (IAM) roles, and use AWS Security Token Service (AWS STS) for client-side security. You can also build your own custom system for FL using Amazon EC2.
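As a sketch of what this can look like with the SageMaker Python SDK, the snippet below launches one site’s local training job in a private subnet using short-lived AWS STS credentials. Every ARN, subnet ID, security group ID, container image, and S3 path is a placeholder for illustration.

```python
import boto3
import sagemaker
from sagemaker.estimator import Estimator

# Obtain short-lived credentials for the client site via AWS STS
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::111122223333:role/fl-client-role",   # placeholder
    RoleSessionName="fl-client-hospital-a",
)["Credentials"]

boto_session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
sm_session = sagemaker.Session(boto_session=boto_session)

# Run the local training container inside the site's own private subnet
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/fl-client:latest",
    role="arn:aws:iam::111122223333:role/fl-sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    subnets=["subnet-0abc1234example"],         # private subnet in the site's VPC
    security_group_ids=["sg-0def5678example"],  # restricts traffic to the FL server
    output_path="s3://hospital-a-fl-artifacts/output/",
    sagemaker_session=sm_session,
)
estimator.fit({"training": "s3://hospital-a-fl-data/train/"})
```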
For a detailed overview of implementing FL with the Flower framework on SageMaker, and a discussion of its difference from distributed training, refer to Machine learning with decentralized training data using federated learning on Amazon SageMaker.
The following figures illustrate the architecture of transfer learning in FL.
Addressing FL data challenges
Federated learning comes with its own data challenges, including privacy and security, but they are straightforward to address. First, you need to address the data heterogeneity problem with medical imaging data arising from data being stored across different sites and participating organizations, known as a domain shift problem (also referred to as client shift in an FL system), as highlighted by Guan and Liu in the following paper. This can lead to a difference in convergence of the global model.
Other components for consideration include ensuring data quality and uniformity at the source, incorporating expert knowledge into the learning process to inspire confidence in the system among medical professionals, and achieving model precision. For more information about some of the potential challenges you may face during implementation, refer to the following paper.
AWS helps you resolve these challenges with features like the flexible compute of Amazon EC2 and pre-built Docker images in SageMaker for straightforward deployment. You can resolve client-side problems like unbalanced data and uneven computation resources across node organizations. You can address server-side learning problems, like poisoning attacks from malicious parties, with Amazon Virtual Private Cloud (Amazon VPC), security groups, and other security controls that prevent client corruption, and by using AWS anomaly detection services.
AWS also helps address real-world implementation challenges, which can include integration hurdles, compatibility issues with current or legacy hospital systems, and user adoption, by offering flexible, easy-to-use solutions that require minimal lift.
With AWS services, you can enable large-scale FL-based research and clinical implementation and deployment, which can consist of various sites across the world.
Recent policies on interoperability highlight the need for federated learning
Many laws recently passed by the government focus on data interoperability, including frameworks like TEFCA (Trusted Exchange Framework and Common Agreement) and the expanded USCDI (United States Core Data for Interoperability), bolstering the need for cross-organizational interoperability of data for intelligence, a need that FL can help fulfill.
The proposed idea also contributes towards the CDC’s capture and distribution initiative CDC Moving Forward. The following quote from the GovCIO article Data Sharing and AI Top Federal Health Agency Priorities in 2024 also echoes a similar theme: “These capabilities can also support the public in an equitable way, meeting patients where they are and unlocking critical access to these services. Much of this work comes down to the data.”
This can help medical institutes and agencies around the country (and across the globe) with data silos. They can benefit from seamless and secure integration and data interoperability, making medical data usable for impactful ML-based predictions and pattern recognition. You can start with images, but the approach is applicable to all EHR as well. The goal is to find the best approach for data stakeholders, with a cloud-native pipeline to normalize and standardize the data or directly use it for FL.
Let’s explore an example use case. Heart stroke imaging data and scans are scattered around the country and the world, sitting in isolated silos in institutes, universities, and hospitals, and separated by bureaucratic, geographical, and political boundaries. There is no single aggregated source and no easy way for medical professionals (non-programmers) to extract insights from it. At the same time, it’s not feasible to train ML and DL models on this scattered data, even though such models could help medical professionals make faster, more accurate decisions in critical times, when heart scans can take hours to come in while the patient’s life hangs in the balance.
Other known use cases include POTS (Purchasing Online Tracking System) at NIH (National Institutes of Health) and cybersecurity for scattered and tiered intelligence solution needs at COMCOMs/MAJCOMs locations around the globe.
Conclusion
Federated learning holds great promise for legacy healthcare data analytics and intelligence. It’s straightforward to implement a cloud-native solution with AWS services, and FL is especially helpful for medical organizations with legacy data and technical challenges. FL can have a potential impact on the entire treatment cycle, and now even more so with the focus on data interoperability from large federal organizations and government leaders.
This solution can help you avoid reinventing the wheel and use the latest technology to take a leap from legacy systems and be at the forefront in this ever-evolving world of AI. You can also become a leader for best practices and an efficient approach to data interoperability within and across agencies and institutes in the health domain and beyond. If you are an institute or agency with data silos scattered around the country, you can benefit from this seamless and secure integration.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. It is each customer’s responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter into an AWS Business Associate Addendum (BAA) and follow its configuration requirements.
About the Author
Nitin Kumar (MS, CMU) is a Lead Data Scientist at T and T Consulting Services, Inc. He has extensive experience with R&D prototyping, health informatics, public sector data, and data interoperability. He applies his knowledge of cutting-edge research methods to the federal sector to deliver innovative technical papers, POCs, and MVPs. He has worked with multiple federal agencies to advance their data and AI goals. Nitin’s other focus areas include natural language processing (NLP), data pipelines, and generative AI.
Amazon and Max Planck Society announce recipients of gift awards
The awards support four research projects exploring the intersection of fashion and AI.Read More