Parameta accelerates client email resolution with Amazon Bedrock Flows

This blog post is co-written with Siokhan Kouassi and Martin Gregory at Parameta. 

When financial industry professionals need reliable over-the-counter (OTC) data solutions and advanced analytics, they can turn to Parameta Solutions, the data powerhouse behind TP ICAP. With a focus on data-led solutions, Parameta Solutions makes sure that these professionals have the insights they need to make informed decisions. Managing thousands of client service requests efficiently while maintaining accuracy is crucial for Parameta’s reputation as a trusted data provider. Through a simple yet effective application of Amazon Bedrock Flows, Parameta transformed their client service operations from a manual, time-consuming process into a streamlined workflow in just two weeks.

Parameta empowers clients with comprehensive industry insights, from price discovery to risk management, and pre- to post-trade analytics. Their services are fundamental to clients navigating the complexities of OTC transactions and workflow effectively. Accurate and timely responses to technical support queries are essential for maintaining service quality.

However, Parameta’s support team faced a common challenge in the financial services industry: managing an increasing volume of email-based client requests efficiently. The traditional process involved multiple manual steps—reading emails, understanding technical issues, gathering relevant data, determining the correct routing path, and verifying information in databases. This labor-intensive approach not only consumed valuable time, but also introduced risks of human error that could potentially impact client trust.

Recognizing the need for modernization, Parameta sought a solution that could maintain their high standards of service while significantly reducing resolution times. The answer lay in using generative AI through Amazon Bedrock Flows, enabling them to build an automated, intelligent request handling system that would transform their client service operations. Amazon Bedrock Flows provide a powerful, low-code solution for creating complex generative AI workflows with an intuitive visual interface and with a set of APIs in the Amazon Bedrock SDK. By seamlessly integrating foundation models (FMs), prompts, agents, and knowledge bases, organizations can rapidly develop flexible, efficient AI-driven processes tailored to their specific business needs.

In this post, we show you how Parameta used Amazon Bedrock Flows to transform their manual client email processing into an automated, intelligent workflow that reduced resolution times from weeks to days while maintaining high accuracy and operational control.

Client email triage

For Parameta, every client email represents a critical touchpoint that demands both speed and accuracy. The challenge of email triage extends beyond simple categorization—it requires understanding technical queries, extracting precise information, and providing contextually appropriate responses.

The email triage workflow involves multiple critical steps:

  • Accurately classifying incoming technical support emails
  • Extracting relevant entities like data products or time periods
  • Validating if all required information is present for the query type
  • Consulting internal knowledge bases and databases for context
  • Generating either complete responses or specific requests for additional information

The manual handling of this process led to time-consuming back-and-forth communications, the risk of overlooking critical details, and inconsistent response quality. With that in mind, Parameta identified this as an opportunity to develop an intelligent system that could automate this entire workflow while maintaining their high standard of accuracy and professionalism.

Path to the solution

When evaluating solutions for email triage automation, Parameta considered several approaches, each with its own pros and cons, but not all of them fit their needs.

Traditional NLP pipelines and ML classification models

Traditional natural language processing pipelines struggle with email complexity due to their reliance on rigid rules and poor handling of language variations, making them impractical for dynamic client communications. The inconsistency in email structures and terminology, which varies significantly between clients, further complicates their effectiveness. These systems depend on predefined patterns, which are difficult to maintain and adapt when faced with such diverse inputs, leading to inefficiencies and brittleness in handling real-world communication scenarios. Machine learning (ML) classification models offer improved categorization, but introduce complexity by requiring separate, specialized models for classification, entity extraction, and response generation, each with its own training data and contextual limitations.

Deterministic LLM-based workflows

Parameta’s solution demanded more than just raw large language model (LLM) capabilities—it required a structured approach while maintaining operational control. Amazon Bedrock Flows provided this critical balance through the following capabilities:

  • Orchestrated prompt chaining – Multiple specialized prompts work together in a deterministic sequence, each optimized for specific tasks like classification, entity extraction, or response generation.
  • Multi-conditional workflows – Support for complex business logic with the ability to branch flows based on validation results or extracted information completeness.
  • Version management – Simple switching between different prompt versions while maintaining workflow integrity, enabling rapid iteration without disrupting the production pipeline.
  • Component integration – Seamless incorporation of other generative AI capabilities like Amazon Bedrock Agents or Amazon Bedrock Knowledge Bases, creating a comprehensive solution.
  • Experimentation framework – The ability to test and compare different prompt variations while maintaining version control. This is crucial for optimizing the email triage process.
  • Rapid iteration and tight feedback loop – The system allows for quick testing of new prompts and immediate feedback, facilitating continuous improvement and adaptation.

This structured approach to generative AI through Amazon Bedrock Flows enabled Parameta to build a reliable, production-grade email triage system that maintains both flexibility and control.

Solution overview

Parameta’s solution demonstrates how Amazon Bedrock Flows can transform complex email processing into a structured, intelligent workflow. The architecture comprises three key components, as shown in the following diagram: orchestration, structured data extraction, and intelligent response generation.

Orchestration

Amazon Bedrock Flows serves as the central orchestrator, managing the entire email processing pipeline. When a client email arrives through Microsoft Teams, the workflow proceeds through the following stages:

  • The workflow is initiated through Amazon API Gateway, which passes the email to an AWS Lambda function that extracts the email text and stores it in Amazon Simple Storage Service (Amazon S3) (a minimal integration sketch follows this list).
  • Amazon Bedrock Flows coordinates the sequence of operations, starting with the email from Amazon S3.
  • Version management streamlines controlled testing of prompt variations.
  • Built-in conditional logic handles different processing paths.
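
The post doesn’t include the integration code itself. As a rough sketch (the flow identifiers, bucket name, and node names are placeholder assumptions, not Parameta’s implementation), the Lambda function behind API Gateway could store the email in Amazon S3 and hand the text to the flow through the InvokeFlow API of the bedrock-agent-runtime client:

import json
import uuid
import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-agent-runtime")

BUCKET = "example-email-bucket"       # placeholder bucket
FLOW_ID = "EXAMPLEFLOWID"             # placeholder flow identifier
FLOW_ALIAS_ID = "EXAMPLEALIASID"      # placeholder flow alias identifier

def handler(event, context):
    # API Gateway proxies the Teams payload; extract the raw email text.
    email_text = json.loads(event["body"])["email_text"]

    # Persist the email in S3 so the flow (and later audits) can reference it.
    key = f"incoming/{uuid.uuid4()}.txt"
    s3.put_object(Bucket=BUCKET, Key=key, Body=email_text.encode("utf-8"))

    # Invoke the Amazon Bedrock flow with the email text as the flow input document.
    response = bedrock.invoke_flow(
        flowIdentifier=FLOW_ID,
        flowAliasIdentifier=FLOW_ALIAS_ID,
        inputs=[{
            "nodeName": "FlowInputNode",
            "nodeOutputName": "document",
            "content": {"document": email_text},
        }],
    )

    # The response is an event stream; collect the flow's output document(s).
    outputs = []
    for chunk in response["responseStream"]:
        if "flowOutputEvent" in chunk:
            outputs.append(chunk["flowOutputEvent"]["content"]["document"])

    return {"statusCode": 200, "body": json.dumps({"result": outputs})}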

Structured data extraction

A sequence of specialized prompts within the flow handles the critical task of information processing:

  • The classification prompt identifies the type of technical inquiry
  • The entity extraction prompt discovers key data points
  • The validation prompt verifies completeness of required information

These prompts work in concert to transform unstructured emails into actionable data, with each prompt optimized for its specific task.
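
Parameta’s actual prompts aren’t published. As an illustrative sketch only, a classification prompt node in such a flow might be templated along these lines, with the email text injected through an input variable:

# Illustration only: a classification prompt template; {{email}} is filled in
# from the flow input at run time.
CLASSIFICATION_PROMPT = """
You are a support triage assistant for a financial market data provider.
Classify the client email below into exactly one of these categories:
price_verification_request, data_access_issue, product_question, other.

Email:
{{email}}

Respond with JSON of the form {"type": "<category>"} and nothing else.
"""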

Intelligent response generation

The final stage uses advanced AI capabilities for response creation:

  • An Amazon Bedrock agent synthesizes information from multiple sources, such as the internal knowledge base and the pricing database
  • Response generation adapts based on validation results:
    • Specific information requests for incomplete queries
    • Comprehensive solutions for complete inquiries
  • Delivery back to clients using Microsoft Teams

The following diagram illustrates the flow for the email triaging system.

This structured approach allows Parameta to maintain consistent, high-quality responses while significantly reducing processing time for client inquiries.

Solution walkthrough

Let’s walk through how Parameta’s email triage system processes a typical client inquiry. We start with the following sample client email:

Dear Support Team,

Could you please verify the closing price for the Dollar ATM swaption (USD_2Y_1Y) as of March 15, 2024? We need this for our end-of-day reconciliation.

Best regards,

John Smith

Portfolio Manager, ABC Investments

The classification prompt classifies this as a price verification request based on the content and intent. It uses the email as the input, and the output is type: price_verification_request.

The entity extraction prompt uses the preceding email, and provides the following output:

{
  "product_type": "Interest Rate Option",
  "ticker": "USD_2Y_1Y",
  "date_requested": "2024-03-15",
  "data_source": "ICAP",
  "request_type": "closing_price"
}

The workflow then performs validation using Amazon Bedrock Flows. This requires the following checks for price verification:

  • Product identifier (USD_2Y_1Y present)
  • Date (March 15, 2024 present)
  • Price type (closing specified)

When all required entities are found, the workflow proceeds to the Amazon Bedrock agent.
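
For illustration only (the production check runs inside the flow’s validation prompt and condition nodes, not in application code), the completeness rule for a price verification request can be expressed as a small Python check over the extracted entities:

# Illustration: the completeness rule enforced by the flow's validation step.
REQUIRED_ENTITIES = ["ticker", "date_requested", "request_type"]

def is_complete(entities: dict) -> bool:
    # True only when every required entity was extracted from the email.
    return all(entities.get(key) for key in REQUIRED_ENTITIES)

extracted = {
    "product_type": "Interest Rate Option",
    "ticker": "USD_2Y_1Y",
    "date_requested": "2024-03-15",
    "data_source": "ICAP",
    "request_type": "closing_price",
}
assert is_complete(extracted)  # all checks pass, so the flow calls the agent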

The agent submits the following query to the knowledge base: “Product specifications and market context for Interest Rate option USD_2Y_1Y.”

The agent then runs the following query against the Snowflake pricing data (through Amazon Athena):

SELECT closing_price, settlement_date, currency, market_status
FROM otc_derivatives.daily_prices
WHERE instrument_id = 'USD_2Y_1Y'
  AND trade_date = '2024-03-15'
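
The post doesn’t show how the agent’s action group executes this query. One hedged way to do it (the database and result bucket names are placeholders) is through the Athena API with boto3:

import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT closing_price, settlement_date, currency, market_status
FROM otc_derivatives.daily_prices
WHERE instrument_id = 'USD_2Y_1Y'
  AND trade_date = '2024-03-15'
"""

# Start the query; results land in a placeholder S3 location.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "otc_derivatives"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query completes, then read the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]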

The generated response is as follows:

Dear John,

Regarding your price verification request for USD_2Y_1Y (USD ATM Swaption Implied Volatility Mid 2Y vs 1Y) as of March 15, 2024:
The closing price has been verified in our records. The market was active on the requested date with normal trading volumes.

Please note this information is provided under your existing data service agreement.

If you need any further clarification, please don’t hesitate to ask.

Best regards,

Parameta Support

Benefits

Parameta quickly transitioned from implementation to achieving impactful results, thanks to the substantial benefits provided by Amazon Bedrock Flows across various areas:

  • Operational efficiency
    • Development teams accelerated prompt optimization by quickly testing different variations for email classification and entity extraction
    • Time-to-insight reduced from weeks to days through rapid prompt iteration and immediate feedback on performance
    • Quick adjustments to validation rules without rebuilding the entire workflow
  • Team collaboration
    • Modification of prompts through a simplified interface without deep AWS knowledge
    • Support teams gained the ability to understand and adjust the response process
    • Cross-functional teams collaborated on prompt improvements using familiar interfaces
  • Model transparency
    • Clear visibility into why emails were classified into specific categories
    • Understanding of entity extraction decisions helped refine prompts for better accuracy
    • Ability to trace decisions through the workflow enhanced trust in automated responses
  • Observability and governance
    • Comprehensive observability provided stakeholders with a holistic view of the end-to-end process
    • Built-in controls provided appropriate oversight of the automated system, aligning with governance and compliance requirements
    • Transparent workflows enabled stakeholders to monitor, audit, and refine the system effectively, providing accountability and reliability

These benefits directly translated to Parameta’s business objectives: faster response times to client queries, more accurate classifications, and improved ability to maintain and enhance the system across teams. The structured yet flexible nature of Amazon Bedrock Flows enabled Parameta to achieve these gains while maintaining control over their critical client communications.

Key takeaways and best practices

When implementing Amazon Bedrock Flows, consider these essential learnings:

  • Prompt design principles
    • Design modular prompts that handle specific tasks for better maintainability of the system
    • Keep prompts focused and concise to optimize token usage
    • Include clear input and output specifications for better maintainability and robustness
    • Diversify model selection for different tasks within the flow:
      • Use lighter models for simple classifications
      • Reserve advanced models for complex reasoning
      • Create resilience through model redundancy
  • Flow architecture
    • Start with a clear validation strategy early in the flow
    • Include error handling in prompt design
    • Consider breaking complex flows into smaller, manageable segments
  • Version management
    • Implement proper continuous integration and continuous delivery (CI/CD) pipelines for flow deployment
    • Establish approval workflows for flow changes
    • Document flow changes and their impact including metrics
  • Testing and implementation
    • Create comprehensive test cases covering a diverse set of scenarios
    • Validate flow behavior with sample datasets
    • Constantly monitor flow performance and token usage in production
    • Start with smaller workflows and scale gradually
  • Cost optimization
    • Review and optimize prompt lengths regularly
    • Monitor token usage patterns
    • Balance between model capability and cost when selecting models

These practices, derived from real-world implementation experience, can help you deploy Amazon Bedrock Flows successfully while maintaining efficiency and reliability.

Testimonials

“As the CIO of our company, I am thoroughly impressed by how rapidly our team was able to leverage Amazon Bedrock Flows to create an innovative solution to a complex business problem. The low barrier to entry of Amazon Bedrock Flows allowed our team to quickly get up to speed and start delivering results. This tool is democratizing generative AI, making it easier for everyone in the business to get hands-on with Amazon Bedrock, regardless of their technical skill level. I can see this tool being incredibly useful across multiple parts of our business, enabling seamless integration and efficient problem-solving.”

– Roland Anderson, CIO at Parameta Solutions

“As someone with a tech background, using Amazon Bedrock Flows for the first time was a great experience. I found it incredibly intuitive and user-friendly. The ability to refine prompts based on feedback made the process seamless and efficient. What impressed me the most was how quickly I could get started without needing to invest time in creating code or setting up infrastructure. The power of generative AI applied to business problems is truly transformative, and Amazon Bedrock has made it accessible for tech professionals like myself to drive innovation and solve complex challenges with ease.”

– Martin Gregory, Market Data Support Engineer, Team Lead at Parameta Solutions

Conclusion

In this post, we showed how Parameta uses Amazon Bedrock Flows to build an intelligent client email processing workflow that reduces resolution times from days to minutes while maintaining high accuracy and control. As organizations increasingly adopt generative AI, Amazon Bedrock Flows offers a balanced approach, combining the flexibility of LLMs with the structure and control that enterprises require.

For more information, refer to Build an end-to-end generative AI workflow with Amazon Bedrock Flows. For code samples, see Run Amazon Bedrock Flows code samples. Visit the Amazon Bedrock console to start building your first flow, and explore our AWS Blog for more customer success stories and implementation patterns.


About the Authors

Siokhan Kouassi is a Data Scientist at Parameta Solutions with expertise in statistical machine learning, deep learning, and generative AI. His work is focused on the implementation of efficient ETL data analytics pipelines, and solving business problems via automation, experimenting and innovating using AWS services with a code-first approach using AWS CDK.

Martin Gregory is a Senior Market Data Technician at Parameta Solutions with over 25 years of experience. He has recently played a key role in transitioning Market Data systems to the cloud, leveraging his deep expertise to deliver seamless, efficient, and innovative solutions for clients.

Talha Chattha is a Senior Generative AI Specialist Solutions Architect at AWS, based in Stockholm. With more than 10 years of experience working with AI, Talha now helps establish practices to ease the path to production for generative AI workloads. Talha is an expert in Amazon Bedrock and supports customers across EMEA. He is passionate about meta-agents, scalable on-demand inference, advanced RAG solutions, and optimized prompt engineering with LLMs. When not shaping the future of AI, he explores the scenic European landscapes and delicious cuisines.

Jumana Nagaria is a Prototyping Architect at AWS, based in London. She builds innovative prototypes with customers to solve their business challenges. She is passionate about cloud computing and believes in giving back to the community by inspiring women to join tech and encouraging young girls to explore STEM fields. Outside of work, Jumana enjoys travelling, reading, painting, and spending quality time with friends and family.

Hin Yee Liu is a prototype Engagement Manager at AWS, based in London. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training.

Efficiently build and tune custom log anomaly detection models with Amazon SageMaker

In this post, we walk you through the process of building an automated mechanism using Amazon SageMaker to process your log data, run training iterations over it to obtain the best-performing anomaly detection model, and register it with the Amazon SageMaker Model Registry for your customers to use.

Log-based anomaly detection involves identifying anomalous data points in log datasets for discovering execution anomalies, as well as suspicious activities. It usually comprises parsing log data into vectors or machine-understandable tokens, which you can then use to train custom machine learning (ML) algorithms for determining anomalies.

You can adjust the inputs or hyperparameters for an ML algorithm to obtain a combination that yields the best-performing model. This process is called hyperparameter tuning and is an essential part of machine learning. Choosing appropriate hyperparameter values is crucial for success, and it’s usually performed iteratively by experts, which can be time-consuming. Added to this are the general data-related processes such as loading data from appropriate sources, parsing and processing them with custom logic, storing the parsed data back to storage, and loading them again for training custom models. Moreover, these tasks need to be done repetitively for each combination of hyperparameters, which doesn’t scale well with increasing data and new supplementary steps. You can use Amazon SageMaker Pipelines to automate all these steps into a single execution flow. In this post, we demonstrate how to set up this entire workflow.

Solution overview

Contemporary log anomaly detection techniques such as Drain-based detection [1] or DeepLog [2] follow the same general approach: perform custom processing on the logs, train anomaly detection models with custom training code, and obtain the best-performing model with an optimal set of hyperparameters. To build an anomaly detection system using such techniques, you need to write custom scripts for processing as well as for training. SageMaker supports developing these scripts by extending built-in algorithm containers or by building your own custom containers. Moreover, you can combine these steps as a series of interconnected stages using SageMaker Pipelines. The following figure shows an example architecture:

The workflow consists of the following steps:

  1. The log training data is initially stored in an Amazon Simple Storage Service (Amazon S3) bucket, from where it’s picked up by the SageMaker processing step of the SageMaker pipeline.
  2. After the pipeline is started, the processing step loads the Amazon S3 data into SageMaker containers and runs custom processing scripts that parse and process the logs before uploading them to a specified Amazon S3 destination. This processing could be either decentralized with a single script running on one or more instances, or it could be run in parallel over multiple instances using a distributed framework like Apache Spark. We discuss both approaches in this post.
  3. After processing, the data is automatically picked up by the SageMaker tuning step, where multiple training iterations with unique hyperparameter combinations are run for the custom training script.
  4. Finally, the SageMaker model step creates a SageMaker model using the best-trained model obtained from the tuning step and registers it to the SageMaker Model Registry for consumers to use. These consumers, for example, could be testers, who use models trained on different datasets by different pipelines to compare their effectiveness and generality, before deploying them to a public endpoint.

We walk through implementing the solution with the following high-level steps:

  1. Perform custom data processing, using either a decentralized or distributed approach.
  2. Write custom SageMaker training scripts that automatically tune the resulting models with a range of hyperparameters.
  3. Select the best-tuned model, create a custom SageMaker model from it, and register it to the SageMaker Model Registry.
  4. Combine all the steps in a SageMaker pipeline and run it.

Prerequisites

You should have the following prerequisites:

Process the data

To start, upload the log dataset to an S3 bucket in your AWS account. You can use the AWS Command Line Interface (AWS CLI) using Amazon S3 commands, or use the AWS Management Console. To process the data, you use a SageMaker processing step as the first stage in your SageMaker pipeline. This step spins up a SageMaker container and runs a script that you provide for custom processing. There are two ways to do this: decentralized or distributed processing. SageMaker provides Processor classes for both approaches. You can choose either approach for your custom processing depending on your use case.

Decentralized processing with ScriptProcessor

In the decentralized approach, a single custom script runs on one or more standalone instances and processes the input data. The SageMaker Python SDK provides the ScriptProcessor class, which you can use to run your custom processing script in a SageMaker processing step. For small datasets, a single instance can usually suffice for performing data processing. Increasing the number of instances is recommended if your dataset is large and can be split into multiple independent components, which can all be processed separately (this can be done using the ShardedByS3Key parameter, which we discuss shortly).

If you have custom dependencies (which can often be the case during R&D processes), you can extend an existing container and customize it with your dependencies before providing it to the ScriptProcessor class. For example, if you’re using the Drain technique, you need the logparser Python library for log parsing, in which case you write a simple Dockerfile that installs it along with the usual Python ML libraries:

FROM python:3.7-slim-buster
RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3 logparser3 boto3
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

You can use a Python SageMaker notebook instance in your AWS account to create such a Dockerfile and save it to an appropriate folder, such as docker. To build a container using this Dockerfile, enter the following code into a main driver program in a Jupyter notebook on your notebook instance:

import boto3
from sagemaker import get_execution_role

region = boto3.session.Session().region_name
role = get_execution_role()
account_id = boto3.client("sts").get_caller_identity().get("Account")
ecr_repository = "sagemaker-processing-my-container"
tag = ":latest"

uri_suffix = "amazonaws.com"
if region in ["cn-north-1", "cn-northwest-1"]:
    uri_suffix = "amazonaws.com.cn"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

This code creates an Amazon Elastic Container Registry (Amazon ECR) repository where your custom container image will be stored (the repository will be created if it’s not already present). The container image is then built, tagged with the repository name (and :latest), and pushed to the ECR repository.

The next step is writing your actual processing script. For more information on writing a processing script using ScriptProcessor, refer to Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation. The following are a few key points to remember:

  • A SageMaker processing step loads the data from an input location (Amazon S3 or local developer workspace) to an input path specified by you under the /opt/ml/processing directory of your container. It then runs your script in the container and uploads the output data from your specified path under /opt/ml/processing to an Amazon S3 destination you’ve specified.
  • Customer log datasets can sometimes consist of multiple subsets without any inter-dependencies amongst them. For these cases, you can parallelize your processing by making your processing script run over multiple instances in a single processing step, with each instance processing one of these independent subsets. It’s a good practice to keep the script’s logic self-contained so that each run on every instance processes its own subset independently of the others, avoiding duplicated work. A minimal skeleton of such a script follows this list.
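
As a hedged illustration of that container contract (the input and output paths match the ProcessingInput and ProcessingOutput shown in the next snippet; the parsing itself is a placeholder, not a real log parser), a minimal preprocessing.py could look like this:

import os
import pandas as pd

INPUT_DIR = "/opt/ml/processing/input"    # where SageMaker mounts the S3 input
OUTPUT_DIR = "/opt/ml/processing/train"   # uploaded to the configured S3 output

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    frames = []
    for file_name in os.listdir(INPUT_DIR):
        # Placeholder parsing step: real code would run a log parser such as
        # Drain to turn raw log lines into structured feature vectors.
        with open(os.path.join(INPUT_DIR, file_name)) as f:
            frames.append(pd.DataFrame({"raw_line": f.read().splitlines()}))
    pd.concat(frames, ignore_index=True).to_csv(
        os.path.join(OUTPUT_DIR, "processed_logs.csv"), index=False
    )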

When your script is ready, you can instantiate the SageMaker ScriptProcessor class for running it on your custom container (created in the previous step) by adding the following code to your driver program:

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    ScriptProcessor,
)
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

script_processor = ScriptProcessor(
    command=["python3"],
    image_uri=processing_repository_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

script_processor_run_args = script_processor.run(
    code="preprocessing.py",
    inputs=[
        ProcessingInput(
            source="s3://amzn-s3-demo-bucket-pca-detect/processing_input/",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="training", source="/opt/ml/processing/train")
    ],
)

step_processing = ProcessingStep(
    name="PreprocessStep",
    step_args=script_processor_run_args,
)

In the preceding code, a ScriptProcessor class is being instantiated to run the python3 command for running your custom Python script. You provide the following information:

  • You provide the ECR URI of your custom container image and pass a SageMaker PipelineSession to the class. When you specify the PipelineSession, the ScriptProcessor doesn’t actually begin execution when you call its run() method; rather, it defers execution until the SageMaker pipeline as a whole is invoked.
  • In the run() method, you specify the preprocessing script along with the appropriate ProcessingInput and ProcessingOutput objects. These specify where the data will be mounted in your custom container from Amazon S3, and where it will later be uploaded to Amazon S3 from your container’s output folder. The output channel is named training, and the final Amazon S3 output location will be s3://<amzn-s3-demo-bucket-pca-detect>/<job-name>/output/<output-name>.

You can also control how the input dataset is distributed across instances through the s3_data_distribution_type parameter of ProcessingInput, which can be either ShardedByS3Key or FullyReplicated, depending on whether you’re splitting your S3 dataset across multiple ScriptProcessor instances or sending a full copy to each. You specify the number of instances in the instance_count parameter of your ScriptProcessor class, as shown in the following sketch.
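
For example (a minimal sketch reusing the bucket and container path from the earlier snippet), a sharded input can be declared like this:

from sagemaker.processing import ProcessingInput

# Split the objects under the S3 prefix across the processing instances
# instead of copying the full dataset to every instance.
sharded_input = ProcessingInput(
    source="s3://amzn-s3-demo-bucket-pca-detect/processing_input/",
    destination="/opt/ml/processing/input",
    s3_data_distribution_type="ShardedByS3Key",
)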

Once instantiated, you can pass the ScriptProcessor class as an argument to the SageMaker processing step along with an appropriate name.

Distributed processing with PySparkProcessor

An alternative to the decentralized processing is distributed processing. Distributed processing is particularly effective when you need to process large amounts of log data. Apache Spark is a popular engine for distributed data processing. It uses in-memory caching and optimized query execution for fast analytic queries against datasets of all sizes. SageMaker provides the PySparkProcessor class within the SageMaker Python SDK for running Spark jobs. For an example of performing distributed processing with PySparkProcessor on SageMaker processing, see Distributed Data Processing using Apache Spark and SageMaker Processing. The following are a few key points to note:

  • To install custom dependencies in your Spark container, you can either build a custom container image (similar to the decentralized processing example) or use the subprocess Python module to install them using pip at runtime. For example, to run the anomaly detection technique on Spark, you need an argformat module, which you can install along with other dependencies as follows:
import subprocess
subprocess.run(["pip3", "install", "scipy", "scikit-learn", "logparser3"])
  • Spark transformations are powerful operations to process your data, and Spark actions are the operations that actually perform the requested transformations on your data. The collect() method is a Spark action that brings all the data from worker nodes to the main driver node. It’s a good practice to use it in conjunction with filter functions so you don’t run into memory issues when working with large log datasets.
  • You should also try to partition your input data based on the total number of cores you plan to have in your SageMaker cluster. The official Spark recommendation is to have approximately 2–3 times the number of partitions as the total number of cores in your cluster (see the sketch following this list).
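
As a minimal illustration (the instance and core counts mirror the PySparkProcessor configuration shown later, and the S3 prefix is the same placeholder bucket used throughout), the repartitioning could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-preprocessing").getOrCreate()

# Assumption: 3 x ml.m5.xlarge instances with 4 vCPUs each = 12 cores in total.
total_cores = 3 * 4

# Aim for roughly 2-3 partitions per core, per the Spark guidance above.
df = spark.read.text("s3://amzn-s3-demo-bucket-pca-detect/processing_input/")
df = df.repartition(total_cores * 3)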

When your Spark processing script is ready, you can instantiate the SageMaker PySparkProcessor class for running it by adding the following lines to your driver program:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

spark_processor = PySparkProcessor(
    base_job_name="hdfs-spark-job",
    framework_version="3.1",
    role=role,
    sagemaker_session=pipeline_session,
    instance_count=3,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=6000,
)

spark_processor_run_args = spark_processor.run(
    submit_app="./sagemaker_spark_processing.py",
    spark_event_logs_s3_uri="s3://amzn-s3-demo-bucket-pca-detect/logs/spark_event_logs",
    logs=True,
)

step_processing = ProcessingStep(
    name="SparkPreprocessStep",
    step_args=spark_processor_run_args,
)

The preceding code instantiates a PySparkProcessor instance with three nodes in the SageMaker cluster with Spark v3.1 installed in them. You submit your Spark processing code to it along with the Amazon S3 location where your event logs would be uploaded. These logs can be useful for debugging.

In the run() method invocation, you don’t need to specify your inputs and outputs, which can be the case if these are fixed Amazon S3 destinations already known to your processing code. Otherwise, you can specify them using the ProcessingInput and ProcessingOutput parameters just like in the decentralized example.

Post-instantiation, the PySparkProcessor class is passed to a SageMaker processing step with an appropriate name. Its execution won’t be triggered until the pipeline is created.

Train and tune the model

Now that your processing steps are complete, you can proceed to the model training step. The training algorithm could either be a classical anomaly detection model like Drain-based detection or a neural-network based model like DeepLog. Every model takes in certain hyperparameters that influence how the model is trained. To obtain the best-performing model, the model is usually executed and validated multiple times over a wide range of hyperparameters. This can be a time-consuming manual process and can instead be automated using SageMaker hyperparameter tuning jobs. Tuning jobs perform hyperparameter optimization by running your training script with a specified range of hyperparameter values and obtaining the best model based on the metrics you specify. You can predefine these metrics if you use built-in SageMaker algorithms or define them for your custom training algorithm.

You first need to write your training script for your anomaly detection model. Keep the following in mind:

  • SageMaker makes artifacts available to your container under the /opt/ml container directory. You should use this when fetching your artifacts. For more details on the SageMaker container structure, see SageMaker AI Toolkits Containers Structure.
  • To use a tuning job, make sure your code doesn’t hardcode hyperparameter values but instead reads them from the /opt/ml/input/config/hyperparameters.json file in your container, where SageMaker places them.
  • When using a custom training script, you also need to add a custom training metric to your script that can be used by the tuning job to find the best model. For this, print your desired metrics in your training script using a logger or print function. For example, you could print out custom_metric_value: 91, which indicates that your custom metric’s value is 91. We demonstrate later in this post how SageMaker can be informed about this metric. A skeleton of such a script follows this list.
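
A hedged skeleton of such a training script (the data loading, model fitting, and evaluation logic are placeholders, not a working anomaly detector) could look like the following:

import json
import pickle

HYPERPARAMS_PATH = "/opt/ml/input/config/hyperparameters.json"
MODEL_DIR = "/opt/ml/model"  # anything written here is uploaded to the Estimator's output_path

if __name__ == "__main__":
    # SageMaker passes hyperparameters as strings; cast them as needed.
    with open(HYPERPARAMS_PATH) as f:
        hyperparameters = json.load(f)
    max_components = int(hyperparameters.get("max_components", "10"))

    # Placeholder: load the processed logs, fit the anomaly detection model,
    # and evaluate it on a validation split.
    model = {"max_components": max_components}
    validation_score = 91  # stand-in for a real evaluation metric

    # Print the metric so the tuning job can scrape it from the CloudWatch logs.
    print(f"custom_metric_value: {validation_score}")

    with open(f"{MODEL_DIR}/model.pkl", "wb") as f:
        pickle.dump(model, f)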

When your training script is ready, you can use it inside a SageMaker container. SageMaker provides a wide range of built-in algorithm containers that you can use to run your training code. However, there might be cases when you need to build your own training containers, for example when you need custom libraries installed or when you plan to use a new algorithm not built into SageMaker. In such a case, you can build your own container in two ways: by extending an existing SageMaker container with your dependencies, or by building a fully custom image, as in the decentralized processing example earlier.

After you create your training container image, you need to define the hyperparameter ranges for your tuning job. For example, if you’re using a custom adaptation of the PCA algorithm (like in Drain-based detection), you add the following lines to your driver program:

from sagemaker.tuner import IntegerParameter

hyperparameter_ranges = {
    "max_components": IntegerParameter(1, 30, scaling_type="Auto")
}

The preceding code indicates that your hyperparameter max_components is an integer and it ranges from 1–30. The auto scaling type indicates that SageMaker will choose the best scale for hyperparameter changes. For more details on other scaling options, see Hyperparameter scaling types.

Then you can use the following code to fully configure your training and tuning steps in the driver program:

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner
from sagemaker.workflow.steps import TuningStep

estimator = Estimator(
    image_uri=training_image_uri,
    role=role,
    base_job_name="new_training_job",
    sagemaker_session=pipeline_session,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://amzn-s3-demo-bucket-pca-detect/models/",
    metric_definitions=[
        {"Name": "custom_metric", "Regex": "custom_metric_value: ([0-9\.]+)"}
    ],
)

parameter_tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="custom_metric",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {"Name": "custom_metric", "Regex": "custom_metric_value: ([0-9\.]+)"}
    ],
    max_jobs=30,
    max_parallel_jobs=5,
    strategy="Bayesian",
    objective_type="Maximize",
    early_stopping_type="Auto",
)

hpo_args = parameter_tuner.fit(
    inputs={
        "training": TrainingInput(
            s3_data=step_processing.properties.ProcessingOutputConfig.Outputs["training"].S3Output.S3Uri,
            s3_data_type="S3Prefix",
            distribution="FullyReplicated",
        )
    }
)

step_tuning = TuningStep(
    name="AnomalyDetectionTuning",
    step_args=hpo_args,
)

In the preceding code, a SageMaker Estimator instance is created using your custom training image’s ECR URI. SageMaker Estimators help in training your models and orchestrating their training lifecycles. The Estimator is provided with a suitable role and the PipelineSession is designated as its SageMaker session.

You provide the location where your trained model should be stored to the Estimator and supply it with custom metric definitions that you created. For the example metric custom_metric_value: 91, the definition to the Estimator includes its name along with its regex. The regex informs SageMaker how to pick up the metric’s values from training logs in Amazon CloudWatch. The tuning job uses these values to find the best-performing model. You also specify where the output model should be uploaded in the output_path parameter.

You then use this Estimator to instantiate your HyperparameterTuner. Its parameters include the total and maximum parallel number of training jobs, search strategy (for more details on strategies, see Understand the hyperparameter tuning strategies available in Amazon SageMaker AI), and whether you want to use early stopping. Early stopping can be set to Auto so that SageMaker automatically stops model training when it doesn’t see improvements in your custom logged metric.

After the HyperparameterTuner is instantiated, you can call its fit() method. In its input parameter, you specify the output Amazon S3 URI from the processing step as the input location for obtaining training data in your tuning step. This way, you don’t need to specify the Amazon S3 URI yourself and it’s passed between steps implicitly. You can then specify the s3_data_type (S3Prefix) and distribution depending on whether you’re using multiple instances or not.

Once instantiated, the HyperparameterTuner is passed to the tuning step, where it becomes part of your SageMaker pipeline. The training configuration is now complete!

Register the model

You can now choose the best model from the tuning step to create a SageMaker model and publish it to the SageMaker Model Registry. You can use the following driver program code:

import sagemaker
from sagemaker import PipelineModel
from sagemaker.workflow.model_step import ModelStep

best_model = sagemaker.model.Model(
    image_uri=training_image_uri,
    model_data=step_tuning.get_top_model_s3_uri(
        top_k=0,
        s3_bucket="amzn-s3-demo-bucket-pca-detect",
        prefix="models",
    ),
)

pipeline_model = PipelineModel(
    models=[best_model],
    role=role,
    sagemaker_session=pipeline_session,
)

register_model_step_args = pipeline_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="PCAAnomalyDetection",
)

step_model_registration = ModelStep(
    name="NewRegistry",
    step_args=register_model_step_args,
)

The code instantiates a SageMaker model using the Amazon S3 URI of the best model obtained from the tuning step. The top_k attribute of the get_top_model_s3_uri() method indicates that you’re interested in only obtaining the best-trained model.

After the model is instantiated, you can use it to create a SageMaker PipelineModel so that your pipeline can work directly with your model. You then call the register() method of PipelineModel to register your model to the SageMaker Model Registry. In the register() call, you specify the name of the new model package group where your model will be registered and specify its input and output request and response prediction types.

Finally, a SageMaker ModelStep is invoked with the instantiated PipelineModel to carry out the model registration process.

Create and run a pipeline

You’ve now reached the final step where all your steps will be tied together in a SageMaker pipeline. Add the following code to your driver program to complete your pipeline creation steps:

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="Anomaly-Detection-Pipeline",
    steps=[
        step_processing,
        step_tuning,
        step_model_registration,
    ],
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role)
pipeline.start()

This code instantiates the SageMaker Pipeline construct and provides it with all the steps defined until now—processing, tuning, and registering the model. It’s provided with a role and then invoked with the start() method.

The pipeline invocation could be on-demand using code (using pipeline.start() as shown earlier) or it could be event-driven using Amazon EventBridge rules. For example, you can create an EventBridge rule that triggers when new training data is uploaded to your S3 buckets and specify your SageMaker pipeline as the target for this rule. This makes sure that when new data is uploaded to your training bucket, your SageMaker pipeline is automatically invoked. For more details on SageMaker and EventBridge integration, refer to Schedule Pipeline Runs.
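
The post doesn’t include the rule definition; a hedged sketch with boto3 (the account ID, ARNs, and rule name are placeholders, and the bucket is assumed to have EventBridge notifications enabled) could look like this:

import json
import boto3

events = boto3.client("events")

# Trigger when new objects land in the training data bucket.
events.put_rule(
    Name="new-training-data-rule",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["amzn-s3-demo-bucket-pca-detect"]}},
    }),
    State="ENABLED",
)

# Point the rule at the SageMaker pipeline; the role must allow
# sagemaker:StartPipelineExecution on the pipeline.
events.put_targets(
    Rule="new-training-data-rule",
    Targets=[{
        "Id": "anomaly-detection-pipeline",
        "Arn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/Anomaly-Detection-Pipeline",
        "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeSageMakerPipelineRole",
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
)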

On invocation, your SageMaker pipeline runs your custom processing script in the processing step and uploads the processed data to your specified Amazon S3 destination. It then starts a tuning job with your custom training code and iteratively trains multiple models with your supplied hyperparameters and selects the best model based on your custom provided metric. The following screenshot shows that it selected the best model when tuning was complete:

Finally, the best model is selected and a model package resource is created with it in your model registry. Your customers can use it to deploy your model:

You have now completed all the steps in processing, training, tuning, and registering your custom anomaly detection model automatically with the aid of a SageMaker pipeline that was initiated using your driver program.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the SageMaker notebook instance used for this post.
  2. Delete the model package resource that was created using the best-tuned model.
  3. Delete any Amazon S3 data that was used for this post.

Conclusion

In this post, we demonstrated the building, training, tuning, and registering of an anomaly detection system with custom processing code, custom training code, and custom training metrics. We ran these steps automatically with the aid of a SageMaker pipeline, which was run by invoking a single main driver program. We also discussed the different ways of processing our data, and how it could be done using the various constructs and tools that SageMaker provides in a user-friendly and straightforward manner.

Try this approach for building your own custom anomaly detection model, and share your feedback in the comments.

References

[1] https://ieeexplore.ieee.org/document/8029742

[2] https://dl.acm.org/doi/pdf/10.1145/3133956.3134015


About the Author

Nitesh Sehwani is an SDE with the EC2 Threat Detection team where he’s involved in building large-scale systems that provide security to our customers. In his free time, he reads about art history and enjoys listening to mystery thrillers.

Optimizing costs of generative AI applications on AWS

The report The economic potential of generative AI: The next productivity frontier, published by McKinsey & Company, estimates that generative AI could add an equivalent of $2.6 trillion to $4.4 trillion in value to the global economy. The largest value will be added across four areas: customer operations, marketing and sales, software engineering, and R&D.

The potential for such large business value is galvanizing tens of thousands of enterprises to build their generative AI applications in AWS. However, many product managers and enterprise architect leaders want a better understanding of the costs, cost-optimization levers, and sensitivity analysis.

This post addresses these cost considerations so you can optimize your generative AI costs in AWS.

The post assumes a basic familiarity with foundation models (FMs) and large language models (LLMs), tokens, vector embeddings, and vector databases in AWS. With Retrieval Augmented Generation (RAG) being one of the most common frameworks used in generative AI solutions, the post explains costs in the context of a RAG solution and the respective optimization pillars on Amazon Bedrock.

In Part 2 of this series, we will cover how to estimate business value and the influencing factors.

Cost and performance optimization pillars

Designing performant and cost-effective generative AI applications is essential for realizing the full potential of this transformative technology and driving widespread adoption within your organization.

Forecasting and managing costs and performance in generative AI applications is driven by the following optimization pillars:

  • Model selection, choice, and customization – We define these as follows:
    • Model selection – This process involves identifying the optimal model that meets a wide variety of use cases, followed by model validation, where you benchmark against high-quality datasets and prompts to identify successful model contenders.
    • Model choice – This refers to the choice of an appropriate model because different models have varying pricing and performance attributes.
    • Model customization – This refers to choosing the appropriate techniques to customize the FMs with training data to optimize the performance and cost-effectiveness according to business-specific use cases.
  • Token usage – Analyzing token usage consists of the following:
    • Token count – The cost of using a generative AI model depends on the number of tokens processed. This can directly impact the cost of an operation.
    • Token limits – Understanding token limits and what drives token count, and putting guardrails in place to limit token count can help you optimize token costs and performance.
    • Token caching – Caching at the application layer or LLM layer for commonly asked user questions can help reduce the token count and improve performance.
  • Inference pricing plan and usage patterns – We consider two pricing options:
    • On-Demand – Ideal for most models, with charges based on the number of input/output tokens, with no guaranteed token throughput.
    • Provisioned Throughput – Ideal for workloads demanding guaranteed throughput, but with relatively higher costs.
  • Miscellaneous factors – Additional factors can include:
    • Security guardrails – Applying content filters for personally identifiable information (PII), harmful content, and undesirable topics, and detecting hallucinations, improves the safety of your generative AI application. These filters run and scale independently of LLMs and have costs that are directly proportional to the number of filters and the tokens examined.
    • Vector database – The vector database is a critical component of most generative AI applications. As the amount of data usage in your generative AI application grows, vector database costs can also grow.
    • Chunking strategy – Chunking strategies such as fixed size chunking, hierarchical chunking, or semantic chunking can influence the accuracy and costs of your generative AI application.

Let’s dive deeper to examine these factors and associated cost-optimization tips.

Retrieval Augmented Generation

RAG helps an LLM answer questions specific to your corporate data, even though the LLM was never trained on your data.

As illustrated in the following diagram, the generative AI application reads your trusted corporate data sources, chunks the data, generates vector embeddings, and stores the embeddings in a vector database. The vectors and data stored in a vector database are often called a knowledge base.

The generative AI application uses the vector embeddings to search and retrieve chunks of data that are most relevant to the user’s question and augment the question to generate the LLM response. The following diagram illustrates this workflow.

The workflow consists of the following steps:

  1. A user asks a question using the generative AI application.
  2. A request to generate embeddings is sent to the LLM.
  3. The LLM returns embeddings to the application.
  4. These embeddings are searched against vector embeddings stored in a vector database (knowledge base).
  5. The application receives context relevant to the user question from the knowledge base.
  6. The application sends the user question and the context to the LLM.
  7. The LLM uses the context to generate an accurate and grounded response.
  8. The application sends the final response back to the user.

Amazon Bedrock is a fully managed service providing access to high-performing FMs from leading AI providers through a unified API. It offers a wide range of LLMs to choose from.

In the preceding workflow, the generative AI application invokes Amazon Bedrock APIs to send text to an LLM like Amazon Titan Embeddings V2 to generate text embeddings, and to send prompts to an LLM like Anthropic’s Claude Haiku or Meta Llama to generate a response.
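
To make those two calls concrete, the following is a minimal sketch (the model IDs are current Amazon Bedrock identifiers, the prompt wiring is an assumption, and the vector database search is elided):

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

question = "What are the performance and cost optimization strategies for Amazon DynamoDB?"

# 1. Generate an embedding for the user question with Amazon Titan Text Embeddings V2.
embed_response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": question}),
)
question_embedding = json.loads(embed_response["body"].read())["embedding"]

# 2. Search the vector database with question_embedding (elided) and collect
#    the most relevant chunks as context.
context_chunks = ["<chunk 1>", "<chunk 2>", "<chunk 3>"]

# 3. Send the question plus retrieved context to Anthropic's Claude 3 Haiku.
answer = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": f"Context:\n{''.join(context_chunks)}\n\nQuestion: {question}"}],
    }],
)
print(answer["output"]["message"]["content"][0]["text"])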

The generated text embeddings are stored in a vector database such as Amazon OpenSearch Service, Amazon Relational Database Service (Amazon RDS), Amazon Aurora, or Amazon MemoryDB.

A generative AI application such as a virtual assistant or support chatbot might need to carry a conversation with users. A multi-turn conversation requires the application to store a per-user question-answer history and send it to the LLM for additional context. This question-answer history can be stored in a database such as Amazon DynamoDB.

The generative AI application could also use Amazon Bedrock Guardrails to detect off-topic questions, ground responses to the knowledge base, detect and redact PII information, and detect and block hate or violence-related questions and answers.

Now that we have a good understanding of the various components in a RAG-based generative AI application, let’s explore how these factors influence costs while running your application in AWS using RAG.

Directional costs for small, medium, large, and extra large scenarios

Consider an organization that wants to help their customers with a virtual assistant that can answer their questions any time with a high degree of accuracy, performance, consistency, and safety. The performance and cost of the generative AI application depends directly on a few major factors in the environment, such as the velocity of questions per minute, the volume of questions per day (considering peak and off-peak), the amount of knowledge base data, and the LLM that is used.

Although this post explains the factors that influence costs, it can be useful to know the directional costs, based on some assumptions, to get a relative understanding of various cost components for a few scenarios such as small, medium, large, and extra large environments.

The following table is a snapshot of directional costs for four different scenarios with varying volume of user questions per month and knowledge base data.

. | Small | Medium | Large | Extra Large
Inputs | . | . | . | .
Total questions per month | 500,000 | 2,000,000 | 5,000,000 | 7,020,000
Knowledge base data size in GB (actual text size on documents) | 5 | 25 | 50 | 100
Annual costs (directional) | . | . | . | .
Amazon Bedrock On-Demand costs using Anthropic’s Claude 3 Haiku | $5,785 | $23,149 | $57,725 | $81,027
Amazon OpenSearch Service provisioned cluster costs | $6,396 | $13,520 | $20,701 | $39,640
Amazon Bedrock Titan Text Embedding v2 costs | $396 | $5,826 | $7,320 | $13,585
Total annual costs (directional) | $12,577 | $42,495 | $85,746 | $134,252
Unit cost per 1,000 questions (directional) | $2.10 | $1.80 | $1.40 | $1.60

These costs are based on assumptions. Costs will vary if assumptions change. Cost estimates will vary for each customer. The data in this post should not be used as a quote and does not guarantee the cost for actual use of AWS services. The costs, limits, and models can change over time.

For the sake of brevity, we use the following assumptions:

  • Amazon Bedrock On-Demand pricing model
  • Anthropic’s Claude 3 Haiku LLM
  • AWS Region us-east-1
  • Token assumptions for each user question:
    • Total input tokens to LLM = 2,571
    • Output tokens from LLM = 149
    • Average of four characters per token
    • Total tokens = 2,720
  • There are other cost components such as DynamoDB to store question-answer history, Amazon Simple Storage Service (Amazon S3) to store data, and AWS Lambda or Amazon Elastic Container Service (Amazon ECS) to invoke Amazon Bedrock APIs. However, these costs are not as significant as the cost components mentioned in the table.

We refer to this table in the remainder of this post. In the next few sections, we cover Amazon Bedrock costs and the key factors influencing them, vector embedding costs, vector database costs, and Amazon Bedrock Guardrails costs. In the final section, we cover how chunking strategies influence some of these cost components.

Amazon Bedrock costs

Amazon Bedrock has two pricing models: On-Demand (used in the preceding example scenario) and Provisioned Throughput.

With the On-Demand model, an LLM has a maximum requests (questions) per minute (RPM) and tokens per minute (TPM) limit. The RPM and TPM are typically different for each LLM. For more information, see Quotas for Amazon Bedrock.

In the extra large use case, with 7 million questions per month, assuming 10 hours per day and 22 business days per month, it translates to 532 questions per minute (532 RPM). This is well below the maximum limit of 1,000 RPM for Anthropic’s Claude 3 Haiku.

With 2,720 average tokens per question and 532 requests per minute, the TPM is 2,720 x 532 = 1,447,040, which is well below the maximum limit of 2,000,000 TPM for Anthropic’s Claude 3 Haiku.
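
The arithmetic behind those numbers, as a quick check:

# Extra large scenario: 7,020,000 questions per month,
# 22 business days per month, 10 hours of traffic per day.
questions_per_month = 7_020_000
minutes_per_month = 22 * 10 * 60                        # 13,200 minutes

rpm = round(questions_per_month / minutes_per_month)    # ~532 requests per minute
tpm = rpm * 2_720                                       # 1,447,040 tokens per minute

print(rpm, tpm)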

However, assume that the user questions grow by 50%. The RPM, TPM, or both might then cross these thresholds. In cases where the generative AI application needs to cross the On-Demand RPM and TPM thresholds, you should consider the Amazon Bedrock Provisioned Throughput model.

With Amazon Bedrock Provisioned Throughput, cost is charged on a per-model-unit basis. Model units are dedicated for the duration you commit to, such as an hourly, 1-month, or 6-month commitment.

Each model unit offers a certain capacity of maximum tokens per minute. Therefore, the number of model units (and the costs) are determined by the input and output TPM.

With Amazon Bedrock Provisioned Throughput, you incur charges per model unit whether you use it or not. Therefore, the Provisioned Throughput model is relatively more expensive than the On-Demand model.

Consider the following cost-optimization tips:

  • Start with the On-Demand model and test for your performance and latency with your choice of LLM. This will deliver the lowest costs.
  • If On-Demand can’t satisfy the desired volume of RPM or TPM, start with Provisioned Throughput with a 1-month subscription during your generative AI application beta period. However, for steady state production, consider a 6-month subscription to lower the Provisioned Throughput costs.
  • If there are shorter peak hours and longer off-peak hours, consider using a Provisioned Throughput hourly model during the peak hours and On-Demand during the off-peak hours. This can minimize your Provisioned Throughput costs.

Factors influencing costs

In this section, we discuss various factors that can influence costs.

Number of questions

With the On-Demand model, cost grows as the number of questions grows, as can be seen in the following figure of annual costs (based on the table discussed earlier).

Input tokens

The main sources of input tokens to the LLM are the system prompt, user prompt, context from the vector database (knowledge base), and context from QnA history, as illustrated in the following figure.

As the size of each component grows, the number of input tokens to the LLM grows, and so does the cost.

Generally, user prompts are relatively small. For example, in the user prompt “What are the performance and cost optimization strategies for Amazon DynamoDB?”, assuming four characters per token, there are approximately 20 tokens.

System prompts can be large (and therefore the costs are higher), especially for multi-shot prompts where multiple examples are provided to get LLM responses with better tone and style. If each example in the system prompt uses 100 tokens and there are three examples, that’s 300 tokens, which is considerably larger than a typical user prompt.

Context from the knowledge base tends to be the largest. For example, when the documents are chunked and text embeddings are generated for each chunk, assume that the chunk size is 2,000 characters. Assume that the generative AI application sends three chunks relevant to the user prompt to the LLM. This is 6,000 characters. Assuming four characters per token, this translates to 1,500 tokens. This is much higher compared to a typical user prompt or system prompt.

Context from QnA history can also be high. Assume an average of 20 tokens in the user prompt and 100 tokens in LLM response. Assume that the generative AI application sends a history of three question-answer pairs along with each question. This translates to (20 tokens per question + 100 tokens per response) x 3 question-answer pairs = 360 tokens.
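Putting these components together, a rough per-question input-token budget can be sketched as follows, using the illustrative sizes above.

# Rough input-token budget per question, using the illustrative sizes discussed above
CHARS_PER_TOKEN = 4

system_prompt_tokens = 3 * 100                        # three multi-shot examples of ~100 tokens each
user_prompt_tokens = 20                               # short user question
kb_context_tokens = (3 * 2_000) // CHARS_PER_TOKEN    # three 2,000-character chunks -> 1,500 tokens
history_tokens = 3 * (20 + 100)                       # three prior question-answer pairs

total_input_tokens = (system_prompt_tokens + user_prompt_tokens
                      + kb_context_tokens + history_tokens)
print(total_input_tokens)   # 2,180 -- in the same range as the 2,571 input tokens assumed earlier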

Consider the following cost-optimization tips:

  • Limit the number of characters per user prompt
  • Test the accuracy of responses with various numbers of chunks and chunk sizes from the vector database before finalizing their values
  • For generative AI applications that need to carry a conversation with a user, test with two, three, four, or five pairs of QnA history and then pick the optimal value

Output tokens

The response from the LLM will depend on the user prompt. In general, the pricing for output tokens is three to five times higher than the pricing for input tokens.

Consider the following cost-optimization tips:

  • Because the output tokens are expensive, consider specifying the maximum response size in your system prompt
  • If some users belong to a group or department that requires higher token limits on the user prompt or LLM response, consider using multiple system prompts in such a way that the generative AI application picks the right system prompt depending on the user
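To make the input/output price ratio concrete, the following sketch computes a monthly inference cost from the token assumptions earlier in this post, assuming 2 million questions per month (one of the example volumes in this post). The per-token prices are illustrative placeholders, not current Amazon Bedrock pricing.

# Illustrative monthly LLM inference cost (prices are assumptions; check current Bedrock pricing)
INPUT_PRICE_PER_1K = 0.00025    # assumed On-Demand input token price, USD
OUTPUT_PRICE_PER_1K = 0.00125   # assumed On-Demand output token price (5x input in this example)

questions_per_month = 2_000_000
input_tokens_per_q = 2_571
output_tokens_per_q = 149

input_cost = questions_per_month * input_tokens_per_q / 1_000 * INPUT_PRICE_PER_1K
output_cost = questions_per_month * output_tokens_per_q / 1_000 * OUTPUT_PRICE_PER_1K
print(f"Input: ${input_cost:,.0f}/month, Output: ${output_cost:,.0f}/month")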

Vector embedding costs

As explained previously, in a RAG application, the data is chunked, and text embeddings are generated and stored in a vector database (knowledge base). The text embeddings are generated by invoking the Amazon Bedrock API with an LLM, such as Amazon Titan Text Embeddings V2. This is independent of the Amazon Bedrock model you choose for inferencing, such as Anthropic’s Claude Haiku or other LLMs.

The pricing to generate text embeddings is based on the number of input tokens. The greater the data, the greater the input tokens, and therefore the higher the costs.

For example, with 25 GB of data, assuming four characters per token, input tokens total 6,711 million. With the Amazon Bedrock On-Demand costs for Amazon Titan Text Embeddings V2 as $0.02 per million tokens, the cost of generating embeddings is $134.22.

However, On-Demand has an RPM limit of 2,000 for Amazon Titan Text Embeddings V2. With 2,000 RPM, it will take 112 hours to embed 25 GB of data. Because this is a one-time job of embedding data, this might be acceptable in most scenarios.

For a monthly change rate plus new data of 5% (1.25 GB per month), the time required is about 6 hours.
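The following sketch reproduces this directional estimate, assuming one embedding request per 2,000-character chunk (the chunk size used later in this post); the chunk-per-request assumption is ours, not stated above.

# Directional estimate of one-time embedding cost and time (assumptions follow this post)
CHARS_PER_TOKEN = 4
CHUNK_SIZE_CHARS = 2_000          # assumed chunk size, one embedding request per chunk
EMBED_PRICE_PER_M_TOKENS = 0.02   # Amazon Titan Text Embeddings V2 On-Demand, per the example above
EMBED_RPM_LIMIT = 2_000           # On-Demand RPM limit cited above

data_bytes = 25 * 1024**3                               # 25 GB of text
tokens = data_bytes / CHARS_PER_TOKEN                   # ~6,711 million tokens
cost = tokens / 1_000_000 * EMBED_PRICE_PER_M_TOKENS    # ~$134
chunks = data_bytes / CHUNK_SIZE_CHARS
hours = chunks / (EMBED_RPM_LIMIT * 60)                 # ~112 hours
print(f"~{tokens/1e6:,.0f}M tokens, ~${cost:,.2f}, ~{hours:,.0f} hours")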

In rare situations where the actual text data runs into terabytes, Provisioned Throughput will be needed to generate text embeddings. For example, to generate text embeddings for 500 GB of data in 3, 6, or 9 days, the one-time Provisioned Throughput cost would be approximately $60,000, $33,000, or $24,000, respectively.

Typically, the actual text inside a file is 5–10 times smaller than the file size reported by Amazon S3 or a file system. Therefore, when the total size of the files you need to vectorize is 100 GB, there is a high probability that the actual text inside those files is only 2–20 GB.

One way to estimate the text size inside files is with the following steps:

  1. Pick 5–10 sample representations of the files.
  2. Open the files, copy the content, and enter it into a Word document.
  3. Use the word count feature to identify the text size.
  4. Calculate the ratio of this size with the file system reported size.
  5. Apply this ratio to the total file system to get a directional estimate of actual text size inside all the files.
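The following minimal sketch applies this ratio-based estimate; the sample sizes are made-up values for illustration only.

# Hypothetical example: extrapolate actual text size from a few sampled files
sample_file_size_mb = 48.0      # size reported by Amazon S3 / file system for the sampled files
sample_text_size_mb = 6.0       # text size measured via word count for the same samples
total_file_size_gb = 100.0      # total size of all files to vectorize

ratio = sample_text_size_mb / sample_file_size_mb        # 0.125 in this made-up example
estimated_text_gb = total_file_size_gb * ratio           # ~12.5 GB of actual text
print(f"Estimated text to embed: ~{estimated_text_gb:.1f} GB")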

Vector database costs

AWS offers many vector databases, such as OpenSearch Service, Aurora, Amazon RDS, and MemoryDB. As explained earlier in this post, the vector database plays a critical role in grounding responses to your enterprise data whose vector embeddings are stored in a vector database.

The following are some of the factors that influence the costs of vector database. For the sake of brevity, we consider an OpenSearch Service provisioned cluster as the vector database.

  • Amount of data to be used as the knowledge base – Costs are directly proportional to data size. More data means more vectors. More vectors mean more indexes in a vector database, which in turn requires more memory and therefore higher costs. For best performance, it’s recommended to size the vector database so that all the vectors are stored in memory.
  • Index compression – Vector embeddings can be indexed by HNSW or IVF algorithms. The index can also be compressed. Although compressing the indexes can reduce the memory requirements and costs, it might lose accuracy. Therefore, consider doing extensive testing for accuracy before deciding to use compression variants of HNSW or IVF. For example, for a large text data size of 100 GB, assuming 2,000 bytes of chunk size, 15% overlap, vector dimension count of 512, no upfront Reserved Instance for 3 years, and HNSW algorithm, the approximate costs are $37,000 per year. The corresponding costs with compression using hnsw-fp16 and hnsw-pq are $21,000 and $10,000 per year, respectively.
  • Reserved Instances – Cost is inversely proportional to the number of years you reserve the cluster instance that stores the vector database. For example, in the preceding scenario, an On-Demand instance would cost approximately $75,000 per year, a no upfront 1-year Reserved Instance would cost $52,000 per year, and a no upfront 3-year Reserved Instance would cost $37,000 per year.

Other factors, such as the number of retrievals from the vector database that you pass as context to the LLM, can influence input tokens and therefore costs. But in general, the preceding factors are the most important cost drivers.
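As a directional illustration of how data size drives vector memory (and therefore instance sizing and cost), the following sketch estimates the k-NN memory footprint for the 100 GB example above. The HNSW parameter m and the per-vector sizing heuristic are assumptions based on commonly cited OpenSearch k-NN guidance, not figures from this post.

# Directional memory sizing for an OpenSearch Service k-NN index (assumptions from the example above)
data_bytes = 100 * 1024**3      # 100 GB of text
chunk_bytes = 2_000             # chunk size
overlap = 0.15                  # 15% overlap
dimension = 512                 # embedding dimension
m = 16                          # assumed HNSW max connections (default-like value)

num_vectors = data_bytes / chunk_bytes * (1 + overlap)           # ~62M vectors
# Commonly cited OpenSearch k-NN HNSW sizing heuristic: ~1.1 * (4*d + 8*m) bytes per vector
memory_gb = num_vectors * 1.1 * (4 * dimension + 8 * m) / 1024**3
print(f"~{num_vectors/1e6:.0f}M vectors, ~{memory_gb:.0f} GB of k-NN memory")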

Amazon Bedrock Guardrails

Let’s assume your generative AI virtual assistant is supposed to answer questions related to your products for your customers on your website. How will you avoid users asking off-topic questions such as science, religion, geography, politics, or puzzles? How do you avoid responding to user questions on hate, violence, or race? And how can you detect and redact PII in both questions and responses?

The Amazon Bedrock ApplyGuardrail API can help you solve these problems. Guardrails offer multiple policies such as content filters, denied topics, contextual grounding checks, and sensitive information filters (PII). You can selectively apply these filters to all or a specific portion of data such as user prompt, system prompt, knowledge base context, and LLM responses.

Applying all filters to all data will increase costs. Therefore, you should evaluate carefully which filter you want to apply on what portion of data. For example, if you want PII to be detected or redacted from the LLM response, for 2 million questions per month, approximate costs (based on output tokens mentioned earlier in this post) would be $200 per month. In addition, if your security team wants to detect or redact PII for user questions as well, the total Amazon Bedrock Guardrails costs will be $400 per month.
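As a hedged illustration of selectively applying a guardrail to only the LLM response, the following boto3 sketch uses the ApplyGuardrail API; the guardrail ID and version are placeholders, and the example response text is made up.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
llm_response = "Sure, your account is registered to jane.doe@example.com."   # made-up example output

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",   # placeholder
    guardrailVersion="1",                      # placeholder
    source="OUTPUT",                           # apply only to the LLM response to limit guardrail costs
    content=[{"text": {"text": llm_response}}],
)

if response["action"] == "GUARDRAIL_INTERVENED":
    final_text = response["outputs"][0]["text"]   # for example, a PII-redacted response
else:
    final_text = llm_response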

Chunking strategies

As explained earlier in how RAG works, your data is chunked, embeddings are generated for those chunks, and the chunks and embeddings are stored in a vector database. These chunks of data are retrieved later and passed as context along with user questions to the LLM to generate a grounded and relevant response.

The following are different chunking strategies, each of which can influence costs:

  • Standard chunking – In this case, you can specify default chunking, which is approximately 300 tokens, or fixed-size chunking, where you specify the token size (for example, 300 tokens) for each chunk. Larger chunks will increase input tokens and therefore costs.
  • Hierarchical chunking – This strategy is useful when you want to chunk data at smaller sizes (for example, 300 tokens) but send larger pieces of chunks (for example, 1,500 tokens) to the LLM so the LLM has a bigger context to work with while generating responses. Although this can improve accuracy in some cases, this can also increase the costs because of larger chunks of data being sent to the LLM.
  • Semantic chunking – This strategy is useful when you want chunking based on semantic meaning instead of just the token. In this case, a vector embedding is generated for one or three sentences. A sliding window is used to consider the next sentence and embeddings are calculated again to identify whether the next sentence is semantically similar or not. The process continues until you reach an upper limit of tokens (for example, 300 tokens) or you find a sentence that isn’t semantically similar. This boundary defines a chunk. The input token costs to the LLM will be similar to standard chunking (based on a maximum token size) but the accuracy might be better because of chunks having sentences that are semantically similar. However, this will increase the costs of generating vector embeddings because embeddings are generated for each sentence, and then for each chunk. But at the same time, these are one-time costs (and for new or changed data), which might be worth it if the accuracy is comparatively better for your data.
  • Advanced parsing – This is an optional pre-step to your chunking strategy. This is used to identify chunk boundaries, which is especially useful when you have documents with a lot of complex data such as tables, images, and text. Therefore, the costs will be the input and output token costs for the entire data that you want to use for vector embeddings. These costs will be high. Consider using advanced parsing only for those files that have a lot of tables and images.

The following table is a relative cost comparison for various chunking strategies.

Chunking strategy            Standard    Semantic    Hierarchical
Relative inference costs     Low         Medium      High
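To illustrate the relative inference-cost impact, the following minimal sketch compares per-question context tokens for the standard and hierarchical examples mentioned above; the chunk counts and sizes are the illustrative values from this section.

# Relative per-question input-token impact of chunk size (illustrative values from this section)
chunks_per_query = 3

for strategy, chunk_tokens in [("standard (300-token chunks)", 300),
                               ("hierarchical (1,500-token parent chunks)", 1_500)]:
    context_tokens = chunks_per_query * chunk_tokens
    print(f"{strategy}: ~{context_tokens:,} context tokens per question")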

Conclusion

In this post, we discussed various factors that could impact costs for your generative AI application. This is a rapidly evolving space, and costs for the components we mentioned could change in the future. Consider the costs in this post as a snapshot in time that is based on assumptions and is directionally accurate. If you have any questions, reach out to your AWS account team.

In Part 2, we discuss how to calculate business value and the factors that impact business value.


About the Authors

Vinnie Saini is a Senior Generative AI Specialist Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. With a background in machine learning, she has over 15 years of experience designing and building transformational cloud-based solutions for customers across industries. Her focus has primarily been scaling AI/ML-based solutions for unparalleled business impact, customized to business needs.

Chandra Reddy is a Senior Manager of a Solutions Architect team at Amazon Web Services (AWS) in Austin, Texas. He and his team help enterprise customers in North America with their AI/ML and generative AI use cases on AWS. He has more than 20 years of experience in software engineering, product management, product marketing, business development, and solution architecture.

Read More

PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

Training large language models (LLMs) has become a significant expense for businesses. For many use cases, companies are looking to use LLM foundation models (FMs) with their domain-specific data. However, companies are discovering that performing full fine tuning for these models with their data isn’t cost-effective. To reduce costs while continuing to use the power of AI, many companies have shifted to fine tuning LLMs on their domain-specific data using Parameter-Efficient Fine Tuning (PEFT). PEFT is a set of techniques designed to adapt pre-trained LLMs to specific tasks while minimizing the number of parameters that need to be updated. Techniques such as Low-Rank Adaptation (LoRA) and Weighted-Decomposed Low-Rank Adaptation (DoRA) significantly reduce the number of trainable parameters, resulting in lower costs for fine tuning.

In addition to cost, performing fine tuning for LLMs at scale presents significant technical challenges. The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking, and distributed computing. Manually managing such complexity can often be counter-productive and take valuable resources away from your business’s AI development. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.

In this blog post, we showcase how you can perform efficient supervised fine tuning for a Meta Llama 3 model using PEFT on AWS Trainium with SageMaker HyperPod. We use HuggingFace’s Optimum-Neuron software development kit (SDK) to apply LoRA to fine-tuning jobs, and use SageMaker HyperPod as the primary compute cluster to perform distributed training on Trainium. Using LoRA supervised fine-tuning for Meta Llama 3 models, you can further reduce your cost to fine tune models by up to 50% and reduce the training time by 70%.

Solution overview

SageMaker HyperPod is designed to help reduce the time required to train generative AI FMs by providing a purpose-built infrastructure for distributed training at scale. When using SageMaker HyperPod for training, SageMaker will actively monitor the cluster’s health, automatically replacing faulty nodes and resuming model training from checkpoints. The clusters come pre-configured with SageMaker distributed training libraries that enable you to split your training data and model across thousands of compute nodes, allowing data to be processed in parallel while fully utilizing the cluster’s compute and network infrastructure. You can also customize your distributed training. The architecture diagram that follows provides a high level overview of these various components:

  • Compute cluster: This contains a head node that orchestrates computation across a cluster of worker nodes. Because the head node is only facilitating the training, it’s typically a much smaller instance. In this post, we use Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances for the worker nodes and a single Amazon EC2 C5 instance for the head node.
  • Shared Volume: FSx for Lustre is used as the shared storage volume across nodes to maximize data throughput. It’s mounted at /fsx on the head and compute nodes.
  • External storage: Amazon Simple Storage Service (Amazon S3) is used to store the cluster’s lifecycle scripts, configuration files, datasets, and checkpoints.
  • Scheduler: SLURM is used as the job scheduler for the cluster.

Amazon SageMaker HyperPod Architecture

Trainium chips are purpose-built for deep learning training of 100 billion and larger parameter models. Model training on Trainium is supported by the AWS Neuron SDK, which provides compiler, runtime, and profiling tools that unlock high-performance and cost-effective deep learning acceleration. To learn more about Trainium chips and the Neuron SDK, see Welcome to AWS Neuron.

To integrate Trainium chips with existing models and tools provided through the transformers package, Hugging Face’s Optimum-Neuron package functions as an interface with Neuron. With Optimum-Neuron, users can apply techniques such as LoRA to their fine-tuning jobs, streamlining the process of adapting LLMs for specific tasks while capitalizing on the performance gains provided by the AWS infrastructure.

Traditional fine tuning involves modifying all the parameters of a model, which can be computationally expensive and memory intensive. PEFT approaches such as LoRA focus on introducing a smaller set of trainable parameters, often in the form of low-rank matrices that adjust the model’s behavior while keeping most of its parameters frozen. The advantage of LoRA lies in its ability to maintain the performance of the base model while significantly lowering the computational burden and resource requirements. The Neuron 2.20 release supports model training with LoRA on Trainium.

In the next section, we’ll walk through the code in three steps for PEFT on Trainium with HyperPod:

  1. Setting up and deploying a HyperPod cluster for distributed training.
  2. Fine tuning a Meta Llama 3 8B model on a Trainium instance with the Dolly 15k dataset.
  3. Model weights consolidation and inference.

Amazon SageMaker HyperPod cluster setup

In this first section, you will begin setting up your Amazon SageMaker HyperPod compute environment for fine tuning.

Prerequisites

The following are the prerequisites for configuring and deploying a SageMaker HyperPod cluster for fine tuning:

Step 1: Infrastructure setup

After completing the prerequisites, deploy an AWS CloudFormation stack that contains the necessary infrastructure components for distributed training through SageMaker HyperPod. The default Region specified in the template is us-west-2, but you can modify that. You will also need to specify the Availability Zone where your subnets will be deployed. The template configures your environment with an Amazon Virtual Private Cloud (Amazon VPC) and corresponding public and private subnets for network isolation. It establishes additional components inside your VPC, including an S3 bucket for lifecycle scripts and an FSx for Lustre file system shared across the head and compute nodes of the HyperPod cluster.

Step 2: Cluster configuration

Configure and deploy the HyperPod cluster. Begin by defining your infrastructure’s environment variables through the create_config script. This script uses the AWS CLI to extract infrastructure component variables from your CloudFormation stack including Region, resource IDs, and Amazon Resource Name (ARN).

# Set region
export AWS_REGION=us-west-2

# Fetch create_config script
curl 'https://static.us-east-1.prod.workshops.aws/public/05a78a77-24f9-4f29-867c-64c9687646e1/static/scripts/create_config.sh' --output create_config.sh

# Set environment variables
bash create_config.sh
source env_vars

After setting your environment variables, download the lifecycle scripts required for bootstrapping the compute nodes on your SageMaker HyperPod cluster and define its configuration settings before uploading the scripts to your S3 bucket.

# Download Lifecycle scripts
git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/

# upload scripts to s3
aws s3 cp --recursive awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src

After uploading the Lifecycle scripts to Amazon S3, create your cluster and file system configurations. See the Create Cluster section of the SageMaker HyperPod workshop to create these files. After generating the cluster-config.json and provisioning_parameters.json configuration files, validate them and upload the FSx for Lustre configuration file to Amazon S3.

# validate and check config for known issues
curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/validate-config.py
python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning_parameters.json

# Upload FSx configuration to S3
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

Step 3: Cluster deployment

Now that the cluster’s configuration is defined, you can create the cluster.

aws sagemaker create-cluster \
--cli-input-json file://cluster-config.json \
--region $AWS_REGION

You should be able to see your cluster by navigating to SageMaker HyperPod in the AWS Management Console, where a cluster named ml-cluster will be listed. After a few minutes, its status should change from Creating to InService.

SageMaker Console

If you select your cluster, you will be able to see the details of your compute cluster including the head and worker nodes.

SageMaker Console

After installing the Systems Manager Session Manager plugin, you can ssh into your cluster’s head node using the easy-ssh script to begin training.

# Modify permissions and ssh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

# Switch to ubuntu user
sudo su - ubuntu

# Change directory
cd /fsx

Now that your cluster is running and accessible through ssh, you can begin uploading the model training scripts to the shared file system through either curl or the AWS CLI. For more instructions on setting up your cluster, see the SageMaker HyperPod workshop.

Fine tuning

Now that your SageMaker HyperPod cluster is deployed, you can start preparing to execute your fine tuning job.

Data preparation

The foundation of successful language model fine tuning lies in properly structured and prepared training data. This implementation focuses on instruction-tuned datasets, which form the backbone of modern language model adaptation. These datasets work together to create meaningful training examples through three essential components:

  • Instructions that guide the model’s task.
  • Optional context that provides background information.
  • Responses that represent the desired output.

Training begins by loading your dataset and formatting your dataset examples with this structure. Loading your dataset can be accomplished through the Hugging Face datasets library, which provides a straightforward interface for accessing and managing training data. Hugging Face also provides this format function for the databricks-dolly-15k dataset. Note that the format function needs to be embedded in your train.py file (as shown in the following sample). It’s referenced by the NeuronSFTTrainer to format your dataset during fine tuning.

# Load dataset
dataset = load_dataset(args.dataset, split="train")

def format_dolly(examples):
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = f"### Instruction\n{examples['instruction'][i]}"
        context = f"### Context\n{examples['context'][i]}" if examples["context"][i] else None
        response = f"### Answer\n{examples['response'][i]}"
        prompt = "\n\n".join([part for part in [instruction, context, response] if part is not None])
        output_text.append(prompt)
    return output_text

The formatting function employs delimiter tokens ("###") to create clear boundaries between different components of each training example. This separation is important because it helps the model distinguish between different parts of the input during training. The function handles cases where context might be missing, making sure that the final format remains consistent regardless of whether all components are present. Double newlines between sections provide additional structural clarity that helps the model recognize the natural breaks in the input.

Tokenization

After formatting your dataset, the next step is tokenization—the process of converting your text data into a numerical format that your model can understand. Tokenization serves as the bridge between your human-readable text and the mathematical operations that drive your model’s understanding of language. To begin, you use Hugging Face’s AutoTokenizer to load your model’s tokenizer.

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token

The AutoTokenizer class automatically selects the appropriate tokenizer for your model, loading not just the vocabulary, but also the rules and special tokens that match your training configuration. The assignment of the padding token to match the end-of-sequence token is particularly important for causal language modeling, because it ensures consistent handling of your variable-length sequences.

The tokenization process itself operates in several stages. First, it breaks down your input text into tokens based on its vocabulary. These tokens are then converted to numerical IDs that your model can process. During this process, your tokenizer also handles special tokens that mark the beginning and end of sequences, in addition to padding tokens that make sure that the sequences in your batch have the same length.

When working with tokenizers, your sequence length management becomes a critical consideration. Your maximum sequence length must balance between preserving enough information for your model to understand the context and staying within your model’s architectural limitations. Too short, and you risk losing important context; too long, and you might exceed memory constraints or introduce unnecessary computational overhead.
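As a hedged illustration, and assuming the tokenizer and format_dolly function defined earlier in this post, tokenizing one formatted example with a bounded sequence length might look like the following; the sample record is made up.

# Tokenize one formatted training example with a bounded sequence length
sample = format_dolly({"instruction": ["Who are the Smiths?"],
                       "context": ["The Smiths were an English rock band formed in Manchester in 1982."],
                       "response": ["The Smiths were an English rock band."]})[0]

encoded = tokenizer(
    sample,
    max_length=1024,        # matches the --max_seq_length used later in this post
    truncation=True,        # drop tokens beyond the limit rather than failing
    padding="max_length",   # pad shorter sequences so batch shapes stay uniform
    return_tensors="pt",
)
print(encoded["input_ids"].shape)   # (1, 1024)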

Model compilation and fine tuning

For this solution, you created a SageMaker HyperPod cluster with the controller node and one worker node. The worker node contains one ml.trn1.32xlarge instance which has 32 Neuron cores. You can conduct distributed fine tuning using all 32 Neuron cores within the worker node.

Step 1: Environment setup

You first need to install the required Python packages for fine tuning. The following is the bash script for the Python environment setup. Note that the solution uses the most recently released Neuron SDK. From the HOME directory, create a file environment.sh (touch environment.sh) with the following code and run it with sbatch ./environment.sh. You might need to modify the permissions of the shell scripts throughout this post before running them, using the command chmod +x environment.sh.

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/environment.out

sudo apt install -y python3.8-venv git
python3.8 -m venv $HOME/peft_ft/env_llama3_8B_peft
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate
pip install -U pip

python3 -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
python3 -m pip install torch-neuronx==2.1.2.2.3.0 neuronx-cc==2.15.128.0 neuronx_distributed==0.9.0 torchvision
python3 -m pip install datasets transformers peft huggingface_hub trl PyYAML
python3 -m pip install git+https://github.com/huggingface/optimum-neuron.git

With your environment created, switch to your fine-tuning directory before proceeding to the next step: cd $HOME/peft_ft.

Step 2: Download the base Llama 3 8B model and tokenizer from Hugging Face

Download the base Meta Llama 3 8B model and the corresponding tokenizer from Hugging Face. You will need to first request access for the model from Meta on Hugging Face and then use your Hugging Face access token to download the model. The following is the Python code for the get_model.py script to download the model and tokenizer. Create this file with touch get_model.py and copy the following code to this file before moving on to the next step.

import os
import argparse
from transformers import AutoTokenizer, LlamaForCausalLM

def download_model_and_tokenizer(model_id: str, model_output_path: str, tokenizer_output_path: str, huggingface_token: str = None) -> None:
    huggingface_token = os.environ.get("HUGGINGFACE_TOKEN", None)
    model = LlamaForCausalLM.from_pretrained(model_id, token=huggingface_token)
    model.save_pretrained(model_output_path)
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)
    tokenizer.save_pretrained(tokenizer_output_path)
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True, help="Hugging Face Model id")
    parser.add_argument("--model_output_path", type=str, required=True, help="Path to save model/weights file")
    parser.add_argument("--tokenizer_output_path", type=str, required=True, help="Path to save tokenizer file")
    args, _ = parser.parse_known_args()
    download_model_and_tokenizer(model_id=args.model_id, model_output_path=args.model_output_path, tokenizer_output_path=args.tokenizer_output_path)  

Next, create the bash script get_model.sh (touch get_model.sh) with the code that follows and run it with the command sbatch ./get_model.sh. This will trigger the get_model.py script to download the model and tokenizer using Slurm. Because you’re using the Llama 3 8B model, Hugging Face requires you to authenticate with an access token prior to download. Be sure to add your access token to get_model.sh before running the script.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/get_model.out

export OMP_NUM_THREADS=1
export HUGGINGFACE_TOKEN="<YOUR TOKEN HERE>"
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 $HOME/peft_ft/get_model.py \
--model_id meta-llama/Meta-Llama-3-8B-Instruct \
--model_output_path $HOME/peft_ft/model_artifacts/llama3-8B \
--tokenizer_output_path $HOME/peft_ft/tokenizer/llama3-8B

Step 3: Pre-compile model

Training deep learning models on Trainium requires model compilation. To do that, use the neuron_parallel_compile CLI utility, which extracts graphs from a trial run of your script and performs parallel pre-compilation of the computation graphs. Note that the scripts for model pre-compilation are identical to those for the actual training, except for max_steps. This is because pre-compilation doesn’t require the completion of the entire training cycle; it only needs approximately 10 training steps to extract the graphs. Before compiling the model, create the training script train.py (touch train.py), which is used for both the pre-compilation and model fine tuning steps. Add the following code after creating the file, along with the format function previously mentioned.

import os
import torch
import argparse
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer
from optimum.neuron.distributed import lazy_load_for_parallelism
import torch_xla.core.xla_model as xm

# add format_dolly function here

def training_function(args):
    dataset = load_dataset(args.dataset, split="train")    
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
    tokenizer.pad_token = tokenizer.eos_token
    with lazy_load_for_parallelism(tensor_parallel_size=args.tp_size):
        model = AutoModelForCausalLM.from_pretrained(
            args.model_path, 
            low_cpu_mem_usage=True, 
            torch_dtype=torch.bfloat16 if args.bf16 else torch.float32
        )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
        
    training_args = NeuronSFTConfig(
        output_dir=args.model_checkpoint_path,
        overwrite_output_dir=True,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_steps=args.warmup_steps,
        bf16=args.bf16,
        tensor_parallel_size=args.tp_size,
        pipeline_parallel_size=args.pp_size,
        save_steps=args.checkpoint_frequency,
        logging_steps=100,
        max_steps=args.max_steps,
        )

    trainer = NeuronSFTTrainer(
        args=training_args,
        model=model,
        peft_config=lora_config,
        tokenizer=tokenizer,
        train_dataset=dataset,
        formatting_func=format_dolly,
    )

    trainer.train()
    trainer.save_model(args.model_final_path)
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str)
    parser.add_argument("--tokenizer_path", type=str)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--train_batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--weight_decay", type=float)
    parser.add_argument("--bf16", type=bool)
    parser.add_argument("--tp_size", type=int)
    parser.add_argument("--pp_size", type=int)
    parser.add_argument("--gradient_accumulation_steps", type=int)
    parser.add_argument("--warmup_steps", type=int)
    parser.add_argument("--early_stopping_patience", type=int)
    parser.add_argument("--checkpoint_frequency", type=int)
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--max_steps", type=int)
    parser.add_argument("--max_seq_length", type=int)
    parser.add_argument("--model_checkpoint_path", type=str)
    parser.add_argument("--model_final_path", type=str)
    args = parser.parse_args()
    training_function(args)

After creating the training file, use the following code to create the compile.sh script, which will trigger finetune-llama3-8B.sh to compile the Llama 3 8B model using the neuron_parallel_compile command. You can run this with the sbatch compile.sh command.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/compile.out

source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

export NEURON_EXTRACT_GRAPHS_ONLY=1
srun bash ${HOME}/peft_ft/finetune-llama3-8B.sh

The following is the finetune-llama3-8B.sh script, which lists the hyperparameters for your model fine tuning. The script uses tensor parallelism with a degree of 8. With 32 NeuronCores in the ml.trn1.32xlarge instance, this yields data parallelism of degree 4. Note that the script also sets XLA_USE_BF16=1 to map both torch.float and torch.double tensors to bfloat16 tensors, which reduces the memory footprint and improves performance. The script then sets gradient_accumulation_steps to 3 to get a larger effective batch size for gradient updates.

#!/bin/bash
export XLA_USE_BF16=1
GPUS_PER_NODE=32
if [ $NEURON_EXTRACT_GRAPHS_ONLY -gt 0 ]; then
    MAX_STEPS=10
    MAYBE_COMPILE="neuron_parallel_compile"
else
    MAX_STEPS=-1
fi

declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_JOB_NUM_NODES
)
export TRAIN_SCRIPT=${HOME}/peft_ft/train.py

declare -a TRAINING_ARGS=(
    --bf16 True 
    --checkpoint_frequency 400 
    --dataset "databricks/databricks-dolly-15k" 
    --max_steps $MAX_STEPS 
    --max_seq_length 1024 
    --epochs 1 
    --gradient_accumulation_steps 3 
    --learning_rate 2e-05 
    --model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" 
    --tokenizer_path "/fsx/ubuntu/peft_ft/tokenizer/llama3-8B" 
    --model_checkpoint_path "/fsx/ubuntu/peft_ft/model_checkpoints" 
    --model_final_path "/fsx/ubuntu/peft_ft/model_checkpoints/final" 
    --tp_size 8 
    --pp_size 1 
    --train_batch_size 1 
    --warmup_steps 100 
    --weight_decay 0.01 
)
$MAYBE_COMPILE torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

Step 4: Model fine tuning

After model compilation is complete, you can start the model fine tuning by reusing the compile.sh script. To do this, prevent the neuron_parallel_compile utility from being used by setting export NEURON_EXTRACT_GRAPHS_ONLY=-1 in compile.sh, and then re-run the script to start fine tuning your model. You might need to delete the model_consolidation directory created during the previous model compilation step before you start your fine-tuning job.

Model consolidation

When working with distributed machine learning workflows, you’ll often need to manage and merge model weights efficiently. Let’s explore two essential processes that you’ll frequently encounter: checkpoint consolidation and weight merging when performing LoRA fine tuning.

Checkpoint consolidation

During distributed training, your model checkpoints are typically split across multiple devices according to the model parallelism configuration that you provide. To bring these pieces back together, you’ll use a consolidation process. Your consolidation function handles three primary tasks. First, it combines distributed checkpoints into a unified model. Then, it manages memory efficiently by processing tensors in chunks. Finally, it creates sharded outputs with an index file for quick access.

LoRA weight merging

When you’re working with LoRA, you need to merge these adapters with your base model. The merging process is straightforward but requires careful attention to detail. Start by loading your base model and LoRA configuration. Then transform the LoRA weight names to match your base model’s structure. The process concludes by merging the adapters and saving the final model in a sharded format.

To put these tools into practice, you can use the following scripts after your fine-tuning job has finished. First, create the Python file consolidation.py (touch consolidation.py) and the shell file consolidation.sh (touch consolidation.sh) using the following code.

import argparse
import json
from pathlib import Path
from huggingface_hub import split_torch_state_dict_into_shards
from safetensors.torch import save_file
from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints
import torch

def custom_consolidate_to_unified_checkpoint(checkpoint_dir: Path, output_dir: Path, save_format: str = "safetensors"):
    output_dir.mkdir(parents=True, exist_ok=True)
    state_dict = consolidate_model_parallel_checkpoints(checkpoint_dir)
    for key, value in state_dict.items():
        if isinstance(value, torch.Tensor):
            state_dict[key] = value.contiguous()

    split_result = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")
    # Save shards
    for shard_file, shard_tensors in split_result.filename_to_tensors.items():
        shard_dict = {name: state_dict[name] for name in shard_tensors}
        shard_path = output_dir / shard_file
        if save_format == "safetensors":
            save_file(shard_dict, shard_path, metadata={"format": "pt"})
        else:
            torch.save(shard_dict, shard_path)

    index = {
        "metadata": split_result.metadata,
        "weight_map": split_result.tensor_to_filename
    }
    
    index_file = "model.safetensors.index.json" if save_format == "safetensors" else "pytorch_model.bin.index.json"
    with open(output_dir / index_file, "w") as f:
        json.dump(index, f, indent=2)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", type=str, required=True)
    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument("--save_format", type=str, choices=["safetensors", "pytorch"])
    args = parser.parse_args()
    output_dir = Path(args.output_dir)
    checkpoint_dir = Path(args.input_dir) / "adapter_shards"
    custom_consolidate_to_unified_checkpoint(
        checkpoint_dir=checkpoint_dir,
        output_dir=output_dir,
        save_format=args.save_format
    )

This code consolidates the sharded checkpoint files generated during training into a single LoRA adapter in safetensors format. After saving the file, you can invoke this script to trigger the model checkpoint consolidation job. The input directory you provide points to your fine-tuned model’s sharded checkpoints, and the output directory is where the consolidated LoRA adapter safetensors file is written. You trigger this with sbatch consolidation.sh.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/consolidation.py" 
--input_dir "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251" 
--output_dir "$HOME/peft_ft/model_checkpoints/adapter_shards_consolidation"
--save_format "safetensors"

After consolidation is complete, you need to merge the LoRA adapter weights from the consolidated files with the base model’s weights. Begin by creating a new Python file merge_lora.py (touch merge_lora.py) and a shell file merge_lora.sh using the following code.

import json
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM
import torch
import argparse
from safetensors import safe_open


def merge_lora_weights(args):
    base_model = AutoModelForCausalLM.from_pretrained(args.base_model_path)
    with open(args.adapter_config_path, "r") as f:
        config_dict = json.load(f)
    peft_config = LoraConfig(**config_dict)
    model = PeftModel(base_model, peft_config)
    
    lora_weights_tensors = {}
    with safe_open(args.lora_safetensors_path, framework="pt", device='cpu') as f:
        for k in f.keys():
            lora_weights_tensors[k] = f.get_tensor(k)
            
    for layer_name in list(lora_weights_tensors):
        if 'layer' in layer_name and 'lora' in layer_name:
            new_layer_name = layer_name.replace('weight', 'default.weight')
            lora_weights_tensors[new_layer_name] = lora_weights_tensors[layer_name].clone()
            del lora_weights_tensors[layer_name]
        else:
            del lora_weights_tensors[layer_name]

    updated_state_dict = model.state_dict().copy()
    for layer, weights in lora_weights_tensors.items():
        updated_state_dict[layer] = weights
    model.load_state_dict(updated_state_dict)    
    merged_model = model.merge_and_unload()    
    merged_model.save_pretrained(args.final_model_path, safe_serialization=True, max_shard_size="5GB")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--final_model_path", type=str)
    parser.add_argument("--adapter_config_path", type=str)
    parser.add_argument("--base_model_path", type=str)
    parser.add_argument("--lora_safetensors_path", type=str)
    args = parser.parse_args()
    merge_lora_weights(args)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=/fsx/ubuntu/peft_ft/logs/8b/lora_weights.log

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/merge_lora.py" 
    --final_model_path "/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output" 
    --adapter_config_path "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251/adapter_config.json"
    --base_model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" 
    --lora_safetensors_path "/fsx/ubuntu/peft_ft/model_checkpoints/adapter_shards_consolidation/model.safetensors" 

Trigger the run with sbatch merge_lora.sh to merge the model weights. Here, the base_model_path parameter is the local directory where you previously downloaded the model from Hugging Face in step 2 of “Model compilation and fine tuning.” The adapter_config_path parameter points to the LoRA adapter configuration file generated during fine tuning, and the lora_safetensors_path parameter is the path to the model.safetensors file output by the LoRA consolidation in the previous step.

Inference

After consolidation and merging, the safetensors files containing the updated model weights are saved to your final_model_path output directory. Using these updated weights, you can load the trained model and generate predictions in the context of the Dolly dataset. To check that the fine-tuned model understands the databricks-dolly-15k dataset it was fine tuned on, select a question from the dataset for validation, as shown in the following figure.

Hugging Face databricks-dolly-15k dataset card

Using Hugging Face’s LlamaForCausalLM class you can load your newly fine-tuned model, and generate a prediction for the question, “Who are the Smiths?” (shown in the following figure):

Model inference generation
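For reference, a minimal inference sketch along these lines might look like the following; the paths assume the output locations used earlier in this post, the prompt mirrors the format_dolly template, and the generation parameters are illustrative.

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Paths assume the merged model and tokenizer locations used earlier in this post
model_path = "/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output"
tokenizer_path = "/fsx/ubuntu/peft_ft/tokenizer/llama3-8B"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

# Prompt follows the "### Instruction / ### Answer" template used during fine tuning
prompt = "### Instruction\nWho are the Smiths?\n\n### Answer\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))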

Comparing the generated answer to the ground truth context and response from the training dataset, it’s clear that the fine-tuned Meta Llama 3 model now understands this data and can give coherent responses to posed questions.

Results

Technique    Trainable parameters    Samples processed per second    Training time (minutes)
FPFT         7,570,591,744           2.083                           90
PEFT         6,815,744               3.554                           53

To benchmark the fine-tuned model’s performance with LoRA on a single ml.trn1.32xlarge instance, we compared it to full parameter fine tuning (FPFT) of the model over three training epochs. Measuring training samples processed per second showed a 70% increase in throughput and a corresponding reduction in training time for the LoRA fine-tuned model. As a result, the on-demand hours required to fine tune the model on the Dolly 15k dataset for three epochs were roughly halved compared to FPFT, resulting in a 50% reduction in training costs.

Clean up

To clean up the resources provisioned for this post, first delete the SageMaker HyperPod cluster. This can be done either through the AWS CLI or in the SageMaker console.

aws sagemaker delete-cluster --cluster-name ml-cluster

After the cluster is deleted, delete the CloudFormation template to delete the remaining provisioned resources.

aws cloudformation delete-stack --stack-name sagemaker-hyperpod

Conclusion

In this post, we showed you how to set up a SageMaker HyperPod compute cluster for training. Then we showed you how to perform multi-node distributed fine tuning with Trainium for a Meta Llama 3 model using LoRA. Finally, we showed you how to consolidate model weights across a distributed training environment to generate coherent predictions for the newly fine-tuned model.


About the Authors

Georgios Ioannides is a Deep Learning Architect with the AWS Generative AI Innovation Center. Before AWS, Georgios worked in startups, where he specialized in signal processing, deep learning, and multi-modal and cross-modal machine learning systems for speech, vision, and text applications. He holds Master’s degrees from Imperial College London and Carnegie Mellon University.

Bingchen Liu is a Machine Learning Engineer with the AWS Generative AI Innovation Center. Before AWS, he worked as a lead MLE in ADP focusing on RAG applications, vector database, model development, and serving. He holds a Master’s degree in Computer Science from Columbia University and a PhD in Statistics from Southern Methodist University.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a PhD in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Jeremy Roghair is a Machine Learning Engineer with the AWS Generative AI Innovation Center, where he focuses on developing generative AI solutions for distributed training workloads and model hosting for customers. Prior to joining AWS, Jeremy worked as a Data Scientist in the finance/insurance industry and earned a Master’s degree in Computer Science with research in reinforcement learning from Iowa State University.

Read More

Using transcription confidence scores to improve slot filling in Amazon Lex

Using transcription confidence scores to improve slot filling in Amazon Lex

When building voice-enabled chatbots with Amazon Lex, one of the biggest challenges is accurately capturing user speech input for slot values. For example, when a user needs to provide their account number or confirmation code, speech recognition accuracy becomes crucial. This is where transcription confidence scores come in to help ensure reliable slot filling.

What Are Transcription Confidence Scores?

Transcription confidence scores indicate how confident Amazon Lex is in converting speech to text for slot values. These scores range from 0.0 to 1.0 and are separate from intent and entity recognition confidence scores. For each spoken slot value, Amazon Lex provides a confidence score that you can use to:

  • Validate if a spoken slot value was correctly understood
  • Decide whether to ask for confirmation or re-prompt
  • Branch conversation flows based on recognition confidence

Here are some ways to leverage confidence scores for better slot handling:

  1. Progressive Confirmation (see the sketch after this list)
    • High confidence (>0.9): Accept the slot value and continue
    • Medium confidence (0.6-0.9): Ask user to confirm (“Did you say 12345?”)
    • Low confidence (<0.6): Re-prompt for the slot value
  2. Adaptive re-prompting
    • Customize re-prompt messages based on confidence level
    • Provide more specific guidance for low confidence inputs
    • Offer alternative input methods when needed
  3. Branching Logic
    • Route to human agent if multiple low confidence attempts
    • Skip confirmation for consistently high confidence inputs
    • Adjust validation rules based on confidence thresholds
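This post implements the branching with conditions in the Amazon Lex Visual conversation builder (shown later); for teams that prefer a Lambda dialog code hook, the same progressive confirmation pattern might look roughly like the following sketch. The event field names (transcriptions, transcriptionConfidence), the BookingRef slot name, and the thresholds are assumptions to verify against your bot and the Lex V2 Lambda input event format.

# Rough sketch of a Lex V2 dialog code hook applying the thresholds above.
HIGH, LOW = 0.9, 0.6

def lambda_handler(event, context):
    transcriptions = event.get("transcriptions", [])
    confidence = transcriptions[0].get("transcriptionConfidence", 1.0) if transcriptions else 1.0
    intent = event["sessionState"]["intent"]

    if confidence >= HIGH:
        dialog_action = {"type": "Delegate"}                                  # accept and continue
    elif confidence >= LOW:
        dialog_action = {"type": "ConfirmIntent"}                             # "Did you say ...?"
    else:
        dialog_action = {"type": "ElicitSlot", "slotToElicit": "BookingRef"}  # re-prompt for the slot

    return {"sessionState": {"dialogAction": dialog_action, "intent": intent}}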

This post includes an AWS CloudFormation template that demonstrates these patterns, which you can deploy in your AWS account.

Using confidence scores strategically helps create more robust slot filling experiences that:

  • Reduce errors in capturing critical information
  • Improve containment rates for self-service
  • Provide better handling of challenging audio conditions
  • Enable smarter conversation flows

By implementing these patterns, you can significantly improve the accuracy and reliability of slot filling in your Lex voice bots while maintaining a natural conversation flow.

Solution overview

The following diagram illustrates the overall architecture for the solution.

architecture diagram

We use an Amazon Lex bot integrated with an Amazon Connect contact flow to deliver the conversational experience.

Amazon Lex asks the caller, “In a few words, how can I help you today?” and then reviews the transcription confidence score of the slot values to influence the confirmation routine.

Interaction model

The following are different customer experiences a caller may have depending on what they say and the transcription confidence scores returned by Amazon Lex:

  • Conversation 1:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking.
Virtual Agent:
That’s great, please say your 5 character booking reference, you will find it at the top of the information pack we sent.
Customer:
1 9 A K B (Recognized with low transcription confidence: 0.4)
Virtual Agent:
Ok, let’s try again. What is your booking reference?
Customer:
1 9 A A B (More clearly pronounced: recognized with 0.8)
Virtual Agent:
Your booking 1 9 A A B is currently being progressed.

  • Conversation 2:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking 1 9 A Z Y.
Virtual Agent:
Your booking 1 9 A Z Y is currently being progressed.

  • Conversation 3:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking.
Virtual Agent:
That’s great, please say your 5 character booking reference, you will find it at the top of the information pack we sent.
Customer:
1 9 A Z Y (recognized with a transcription confidence of 0.6)
Virtual Agent:
Did you say 1 9 A Z Y? Please say yes or no.
Customer:
Yes
Virtual Agent:
Your booking 1 9 A Z Y is currently being progressed.

In the example conversations, the IVR requests the booking reference from the customer. Once received, the transcription confidence score is evaluated by enabling conditional branching in Amazon Lex based on speech confidence scores. These conditions check the value against specific thresholds. If the transcription confidence score exceeds the high threshold (for example, greater than 0.7), the conversation progresses to the next state. If the score falls in the medium confidence range (for example, between 0.4–0.7), the user is asked to confirm the interpreted input. Finally, if the score falls below a minimum threshold (for example, lower than 0.4), the user is prompted to retry and provide the information again. This approach optimizes the conversation flow based on the quality of the input captured and prevents erroneous or redundant slot capturing, leading to an improved user experience while increasing the self-service containment rates.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

Additionally, you need an Amazon Connect instance—you use the instance Amazon Resource Name (ARN) in a later step.

Deploy the Amazon Lex bot and Amazon Connect flow

To create the sample bot and configure the runtime phrase hints, perform the following steps. For this example, we create an Amazon Lex bot called disambiguation-bot, one intent (CheckBooking), and one slot type (BookingRef).

  1. Sign in to your AWS account, then choose Launch Stack to deploy the CloudFormation template:

stack launch button

  1. For Stack Name, enter a name, for example contact-center-transcription-confidence-scores.
  2. Choose Next.
  3. Provide the following parameters:
    1. For BotName, enter disambiguation-bot.
    2. For ConnectInstanceARN, enter the ARN of your Amazon Connect instance.
    3. For ContactFlowName, enter a name for your Amazon Connect contact flow (for example, lex-check-booking-sample-flow).
    4. For LogGroupName, enter the name of the Amazon CloudWatch log group where the conversation logs are stored.
  4. Choose Next.

CFN stack parameters

  1. Leave all remaining settings as default and choose Next.
  2. Select I acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Submit.

CFN acknowledge

  1. Wait for the CloudFormation stack to successfully deploy.
  2. On the Amazon Connect console, assign the contact flow to an Amazon Connect claimed number.

Configure the transcript confidence score logic

After you create your intent (CheckBooking), you can use the Visual conversation builder to configure your transcription confidence score logic.

The following figure is an example of how we add logic to the intent. Highlighted in red is the branch condition where we use the transcription confidence score to dynamically change the customer experience and improve accuracy.

Lex Visual Builder

If you choose the node, you’re presented with the following configuration options, which is where you can configure the branch condition.

Lex condition

Test the solution

To test the solution, we examine a conversation with words that might not be clearly understood.

  1. Assign the Amazon Lex bot to an Amazon Connect workflow.
  2. Make a call.

Amazon Connect will ask, “Thank you for calling Acme Travel. In a few words, what is the reason for your call today?”

  1. Respond “I want to check my booking.”
  2. When asked for the booking reference, speak any two numbers followed by three letters (for example, “1 9 A Z Y”).

This test checks the confidence score and will either say “your booking 1 9 A Z Y is currently being progressed” or it will ask you to confirm “1 9 A Z Y”.

Limitations

Audio transcription confidence scores are available only in the English (GB) (en_GB) and English (US) (en_US) languages. Confidence scores are supported only for 8 kHz audio input. Transcription confidence scores aren’t provided for audio input from the test window on the Amazon Lex V2 console because it uses 16 kHz audio input.

Clean up

To remove the infrastructure created by the CloudFormation template, open the AWS CloudFormation console and delete the stack. This will remove the services and configuration installed as part of this deployment process.

Conclusion

Optimizing the user experience is at the forefront of any Amazon Lex conversational designer’s priority list, and so is capturing information accurately. This new feature gives designers choices around confirmation routines, driving a more natural dialog between the customer and the bot. Although confirming each input can slow down the user experience and cause frustration, failing to confirm when transcription confidence is low risks accuracy. These improvements enable you to create a more natural and performant experience.

For more information about how to build effective conversations on Amazon Lex with intent confidence scores, see Build more effective conversations on Amazon Lex with confidence scores and increased accuracy.


About the Authors

Alex Buckhurst is a Senior Amazon Connect consultant at Amazon Web Services with a focus on innovation and building customer-centric designs. In his downtime, Alex enjoys playing squash, perfecting his BBQ skills, and cherishing moments with his family.

Kai Loreck is a Senior professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.

Neel Kapadia is a Senior Software Engineer at AWS where he works on designing and building scalable AI/ML services using Large Language Models and Natural Language Processing. He has been with Amazon for over 5 years and has worked on Amazon Lex and Amazon Bedrock. In his spare time, he enjoys cooking, reading, and traveling.

Anand Jumnani is a DevOps Consultant at Amazon Web Services based in United Kingdom. Outside of work, he is passionate about club cricket and enjoys spending quality time with family and friends.


Improving Retrieval Augmented Generation accuracy with GraphRAG

Improving Retrieval Augmented Generation accuracy with GraphRAG

Customers need better accuracy to take generative AI applications into production. In a world where decisions are increasingly data-driven, the integrity and reliability of information are paramount. To address this, customers often begin by enhancing generative AI accuracy through vector-based retrieval systems and the Retrieval Augmented Generation (RAG) architectural pattern, which integrates dense embeddings to ground AI outputs in relevant context. When even greater precision and contextual fidelity are required, the solution evolves to graph-enhanced RAG (GraphRAG), where graph structures provide enhanced reasoning and relationship modeling capabilities.

Lettria, an AWS Partner, demonstrated that integrating graph-based structures into RAG workflows improves answer precision by up to 35% compared to vector-only retrieval methods. This enhancement is achieved by using the graph’s ability to model complex relationships and dependencies between data points, providing a more nuanced and contextually accurate foundation for generative AI outputs.

In this post, we explore why GraphRAG is more comprehensive and explainable than vector RAG alone, and how you can use this approach using AWS services and Lettria.

How graphs make RAG more accurate

In this section, we discuss the ways in which graphs make RAG more accurate.

Capturing complex human queries with graphs

Human questions are inherently complex, often requiring the connection of multiple pieces of information. Traditional data representations struggle to accommodate this complexity without losing context. Graphs, however, are designed to mirror the way humans naturally think and ask questions. They represent data in a machine-readable format that preserves the rich relationships between entities.

By modeling data as a graph, you capture more of the context and intent. This means your RAG application can access and interpret data in a way that aligns closely with human thought processes. The result is a more accurate and relevant answer to complex queries.

Avoiding loss of context in data representation

When you rely solely on vector similarity for information retrieval, you miss out on the nuanced relationships that exist within the data. Translating natural language into vectors reduces the richness of the information, potentially leading to less accurate answers. Also, end-user queries are not always semantically aligned with the useful information in the provided documents, so vector search can exclude key data points needed to build an accurate answer.

Graphs maintain the natural structure of the data, allowing for a more precise mapping between questions and answers. They enable the RAG system to understand and navigate the intricate connections within the data, leading to improved accuracy.

Lettria demonstrated improvement on correctness of answers from 50% with traditional RAG to more than 80% using GraphRAG within a hybrid approach. The testing covered datasets from finance (Amazon financial reports), healthcare (scientific studies on COVID-19 vaccines), industry (technical specifications for aeronautical construction materials), and law (European Union directives on environmental regulations).

Proving that graphs are more accurate

To substantiate the accuracy improvements of graph-enhanced RAG, Lettria conducted a series of benchmarks comparing their GraphRAG solution—a hybrid RAG using both vector and graph stores—with a baseline vector-only RAG reference.

Lettria’s hybrid methodology to RAG

Lettria’s hybrid approach to question answering combines the best of vector similarity and graph searches to optimize performance of RAG applications on complex documents. By integrating these two retrieval systems, Lettria uses both structured precision and semantic flexibility in handling intricate queries.

GraphRAG specializes in using fine-grained, contextual data, ideal for answering questions that require explicit connections between entities. In contrast, vector RAG excels at retrieving semantically relevant information, offering broader contextual insights. This dual system is further reinforced by a fallback mechanism: when one system struggles to provide relevant data, the other compensates. For example, GraphRAG pinpoints explicit relationships when available, whereas vector RAG fills in relational gaps or enhances context when structure is missing.
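To make the fallback idea concrete, the following is a minimal Python sketch of a hybrid retriever. The graph_search and vector_search helpers are hypothetical stand-ins for your graph and vector stores, and the combination logic is illustrative rather than a description of Lettria’s implementation.

```python
from typing import List

def graph_search(question: str) -> List[str]:
    """Hypothetical helper: translate the question into a graph query and return matching passages."""
    return []  # replace with a real call to your graph store (for example, openCypher on Neptune)

def vector_search(question: str) -> List[str]:
    """Hypothetical helper: run a dense-embedding similarity search and return matching passages."""
    return []  # replace with a real call to your vector store

def hybrid_retrieve(question: str) -> List[str]:
    """Prefer explicit graph relationships; fall back to vector results when structure is missing."""
    graph_hits = graph_search(question)
    vector_hits = vector_search(question)

    if graph_hits and vector_hits:
        # Both retrievers contributed: combine structured precision with semantic breadth.
        return graph_hits + vector_hits
    # Fallback mechanism: whichever retriever returned results compensates for the other.
    return graph_hits or vector_hits
```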

The benchmarking process

To demonstrate the value of this hybrid method, Lettria conducted extensive benchmarks across datasets from various industries. Using their solution, they compared GraphRAG’s hybrid pipeline against a leading open source RAG package, Verba by Weaviate, a baseline RAG reference reliant solely on vector stores. The datasets included Amazon financial reports, scientific texts on COVID-19 vaccines, technical specifications from aeronautics, and European environmental directives—providing a diverse and representative test bed.

The evaluation tackled real-world complexity by focusing on six distinct question types, including fact-based, multi-hop, numerical, tabular, temporal, and multi-constraint queries. The questions ranged from simple fact-finding, like identifying vaccine formulas, to multi-layered reasoning tasks, such as comparing revenue figures across different timeframes. An example multi-hop query in finance is “Compare the oldest booked Amazon revenue to the most recent.”

Lettria’s in-house team manually assessed the answers with a detailed evaluation grid, categorizing results as correct, partially correct (acceptable or not), or incorrect. This process measured how the hybrid GraphRAG approach outperformed the baseline, particularly in handling multi-dimensional queries that required combining structured relationships with semantic breadth. By using the strengths of both vector and graph-based retrieval, Lettria’s system demonstrated its ability to navigate the nuanced demands of diverse industries with precision and flexibility.

The benchmarking results

The results were significant and compelling. GraphRAG achieved 80% correct answers, compared to 50.83% with traditional RAG. When including acceptable answers, GraphRAG’s accuracy rose to nearly 90%, whereas the vector approach reached 67.5%.

The following graph shows the results for vector RAG and GraphRAG.

In the industry sector, dealing with complex technical specifications, GraphRAG provided 90.63% correct answers, almost doubling vector RAG’s 46.88%. These figures highlight how GraphRAG offers substantial advantages over the vector-only approach, particularly for clients focused on structuring complex data.

GraphRAG’s overall reliability and superior handling of intricate queries allow customers to make more informed decisions with confidence. By delivering up to 35% more accurate answers, it significantly boosts efficiency and reduces the time spent sifting through unstructured data. These compelling results demonstrate that incorporating graphs into the RAG workflow not only enhances accuracy, but is essential for tackling the complexity of real-world questions.

Using AWS and Lettria for enhanced RAG applications

In this section, we discuss how you can use AWS and Lettria for enhanced RAG applications.

AWS: A robust foundation for generative AI

AWS offers a comprehensive suite of tools and services to build and deploy generative AI applications. With AWS, you have access to scalable infrastructure and advanced services like Amazon Neptune, a fully managed graph database service. Neptune allows you to efficiently model and navigate complex relationships within your data, making it an ideal choice for implementing graph-based RAG systems.

Implementing GraphRAG from scratch usually requires a process similar to the following diagram.

The process can be broken down as follows:

  1. Based on the domain definition, a large language model (LLM) identifies the entities and relationships contained in the unstructured data, which are then stored in a graph database such as Neptune.
  2. At query time, the user’s intent is turned into an efficient graph query, based on the domain definition, to retrieve the relevant entities and relationships.
  3. Results are then used to augment the prompt and generate a more accurate response compared to standard vector-based RAG.

Implementing such a process requires teams to develop specific skills in areas such as graph modeling, graph queries, prompt engineering, and LLM workflow maintenance. AWS released an open source GraphRAG Toolkit to make it simpler for customers who want to build and customize their GraphRAG workflows. Expect to iterate on the extraction process and graph lookups in order to improve accuracy.
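As a rough illustration of steps 1 and 2, the following Python sketch uses Amazon Bedrock for entity and relationship extraction and a placeholder run_opencypher helper for graph access. The model ID, prompt, and example query are assumptions for illustration; this is not the GraphRAG Toolkit or Lettria’s implementation.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative model ID; use any Amazon Bedrock model available in your account and Region.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def extract_entities_and_relationships(document: str) -> str:
    """Step 1: ask an LLM to extract entities and relationships from unstructured text."""
    prompt = (
        "Extract the entities and relationships from the following text as JSON "
        "with 'entities' and 'relationships' lists:\n\n" + document
    )
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def run_opencypher(query: str):
    """Placeholder: execute an openCypher query against your Neptune graph and return the results."""
    raise NotImplementedError("Connect this to your Neptune database or Neptune Analytics graph")

def answer_with_graph_context(question: str) -> str:
    """Steps 2 and 3: retrieve related entities from the graph, then augment the prompt."""
    # Illustrative query: pull a company node and its directly related entities.
    context = run_opencypher(
        "MATCH (c:Company {name: 'Amazon'})-[r]->(e) RETURN c.name, type(r), e.name LIMIT 25"
    )
    augmented_prompt = f"Use this graph context to answer:\n{json.dumps(context)}\n\nQuestion: {question}"
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": augmented_prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```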

Managed GraphRAG implementations

There are two solutions for managed GraphRAG with AWS: Lettria’s solution, soon available on AWS Marketplace, and Amazon Bedrock integrated GraphRAG support with Neptune. Lettria provides an accessible way to integrate GraphRAG into your applications. By combining Lettria’s expertise in natural language processing (NLP) and graph technology with the scalable and managed AWS infrastructure, you can develop RAG solutions that deliver more accurate and reliable results.

The following are key benefits of Lettria on AWS:

  • Simple integration – Lettria’s solution simplifies the ingestion and processing of complex datasets
  • Improved accuracy – You can achieve up to 35% better performance in question-answering tasks
  • Scalability – You can use scalable AWS services to handle growing data volumes and user demands
  • Flexibility – The hybrid approach combines the strengths of vector and graph representations

In addition to Lettria’s solution, Amazon Bedrock introduced managed GraphRAG support on December 4, 2024, integrating directly with Neptune. GraphRAG with Neptune is built into Amazon Bedrock Knowledge Bases, offering an integrated experience with no additional setup or additional charges beyond the underlying services. GraphRAG is available in AWS Regions where Amazon Bedrock Knowledge Bases and Amazon Neptune Analytics are both available (see the current list of supported Regions). To learn more, see Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases.
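As a quick illustration of the managed path, the following sketch queries an Amazon Bedrock knowledge base (which can be backed by a Neptune graph when GraphRAG is enabled) using the RetrieveAndGenerate API. The knowledge base ID and model ARN are placeholders for your own resources.

```python
import boto3

client = boto3.client("bedrock-agent-runtime")

# Placeholders: use your knowledge base ID and a model ARN supported in your Region.
KNOWLEDGE_BASE_ID = "XXXXXXXXXX"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"

response = client.retrieve_and_generate(
    input={"text": "Compare the oldest booked Amazon revenue to the most recent."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KNOWLEDGE_BASE_ID,
            "modelArn": MODEL_ARN,
        },
    },
)

print(response["output"]["text"])
```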

Conclusion

Data accuracy is a critical concern for enterprises adopting generative AI applications. By incorporating graphs into your RAG workflow, you can significantly enhance the accuracy of your systems. Graphs provide a richer, more nuanced representation of data, capturing the complexity of human queries and preserving context.

GraphRAG is a key option to consider for organizations seeking to unlock the full potential of their data. With the combined power of AWS and Lettria, you can build advanced RAG applications that help meet the demanding needs of today’s data-driven enterprises and achieve up to 35% improvement in accuracy.

Explore how you can implement GraphRAG on AWS in your generative AI application:


About the Authors

Denise Gosnell is a Principal Product Manager for Amazon Neptune, focusing on generative AI infrastructure and graph data applications that enable scalable, cutting-edge solutions across industry verticals.

Vivien de Saint Pern is a Startup Solutions Architect working with AI/ML startups in France, focusing on generative AI workloads.
