Amazon AWS – Page 11

Evaluate healthcare generative AI applications using LLM-as-a-judge on AWS

February 27, 2025

by Dr. Adewale Akinfaderin Amazon AWS

In our previous blog posts, we explored various techniques such as fine-tuning large language models (LLMs), prompt engineering, and Retrieval Augmented Generation (RAG) using Amazon Bedrock to generate impressions from the findings section in radiology reports using generative AI. Part 1 focused on model fine-tuning. Part 2 introduced RAG, which combines LLMs with external knowledge bases to reduce hallucinations and improve accuracy in medical applications. Through real-time retrieval of relevant medical information, RAG systems can provide more reliable and contextually appropriate responses, making them particularly valuable for healthcare applications where precision is crucial. In both previous posts, we used traditional metrics like ROUGE scores for performance evaluation. This metric is suitable for evaluating general summarization tasks, but can’t effectively assess whether a RAG system successfully integrates retrieved medical knowledge or maintains clinical accuracy.

In Part 3, we’re introducing an approach to evaluate healthcare RAG applications using LLM-as-a-judge with Amazon Bedrock. This innovative evaluation framework addresses the unique challenges of medical RAG systems, where both the accuracy of retrieved medical knowledge and the quality of generated medical content must align with stringent standards such as clear and concise communication, clinical accuracy, and grammatical accuracy. By using the latest models from Amazon and the newly released RAG evaluation feature for Amazon Bedrock Knowledge Bases, we can now comprehensively assess how well these systems retrieve and use medical information to generate accurate, contextually appropriate responses.

This advancement in evaluation methodology is particularly crucial as healthcare RAG applications become more prevalent in clinical settings. The LLM-as-a-judge approach provides a more nuanced evaluation framework that considers both the quality of information retrieval and the clinical accuracy of generated content, aligning with the rigorous standards required in healthcare.

In this post, we demonstrate how to implement this evaluation framework using Amazon Bedrock, compare the performance of different generator models, including Anthropic’s Claude and Amazon Nova on Amazon Bedrock, and showcase how to use the new RAG evaluation feature to optimize knowledge base parameters and assess retrieval quality. This approach not only establishes new benchmarks for medical RAG evaluation, but also provides practitioners with practical tools to build more reliable and accurate healthcare AI applications that can be trusted in clinical settings.

Overview of the solution

The solution uses Amazon Bedrock Knowledge Bases evaluation capabilities to assess and optimize RAG applications specifically for radiology findings and impressions. Let’s examine the key components of this architecture in the following figure, following the data flow from left to right.

The workflow consists of the following phases:

Data preparation – Our evaluation process begins with a prompt dataset containing paired radiology findings and impressions. This clinical data undergoes a transformation process where it’s converted into a structured JSONL format, which is essential for compatibility with the knowledge base evaluation system. After it’s prepared, this formatted data is securely uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, providing accessibility and data security throughout the evaluation process.
Evaluation processing – At the heart of our solution lies an Amazon Bedrock Knowledge Bases evaluation job. This component processes the prepared data while seamlessly integrating with Amazon Bedrock Knowledge Bases. This integration is crucial because it enables the system to create specialized medical RAG capabilities specifically tailored for radiology findings and impressions, making sure that the evaluation considers both medical context and accuracy.
Analysis – The final stage empowers healthcare data scientists with detailed analytical capabilities. Through an advanced automated report generation system, professionals can access detailed analysis of performance metrics of the summarization task for impression generation. This comprehensive reporting system enables thorough assessment of both retrieval quality and generation accuracy, providing valuable insights for system optimization and quality assurance.

This architecture provides a systematic and thorough approach to evaluating medical RAG applications, providing both accuracy and reliability in healthcare contexts where precision and dependability are paramount.

Dataset and background

The MIMIC Chest X-ray (MIMIC-CXR) database v2.0.0 is a large, publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. We used the MIMIC CXR dataset consisting of 91,544 reports, which can be accessed through a data use agreement. This requires user registration and the completion of a credentialing process.

During routine clinical care, clinicians trained in interpreting imaging studies (radiologists) will summarize their findings for a particular study in a free-text note. The reports were de-identified using a rule-based approach to remove protected health information. Because we used only the radiology report text data, we downloaded just one compressed report file (mimic-cxr-reports.zip) from the MIMIC-CXR website. For evaluation, 1,000 of the total 2,000 reports from a subset of MIMIC-CXR dataset were used. This is referred to as the dev1 dataset. Another set of 1,000 of the total 2,000 radiology reports (referred to as dev2) from the chest X-ray collection from the Indiana University hospital network were also used.

RAG with Amazon Bedrock Knowledge Bases

Amazon Bedrock Knowledge Bases helps take advantage of RAG, a popular technique that involves drawing information from a data store to augment the responses generated by LLMs. We used Amazon Bedrock Knowledge Bases to generate impressions from the findings section of the radiology reports by enriching the query with context that is received from querying the knowledge base. The knowledge base is set up to contain findings and corresponding impression sections of 91,544 MIMIC-CXR radiology reports as {prompt, completion} pairs.

LLM-as-a-judge and quality metrics

LLM-as-a-judge represents an innovative approach to evaluating AI-generated medical content by using LLMs as automated evaluators. This method is particularly valuable in healthcare applications where traditional metrics might fail to capture the nuanced requirements of medical accuracy and clinical relevance. By using specialized prompts and evaluation criteria, LLM-as-a-judge can assess multiple dimensions of generated medical content, providing a more comprehensive evaluation framework that aligns with healthcare professionals’ standards.

Our evaluation framework encompasses five critical metrics, each designed to assess specific aspects of the generated medical content:

Correctness – Evaluated on a 3-point Likert scale, this metric measures the factual accuracy of generated responses by comparing them against ground truth responses. In the medical context, this makes sure that the clinical interpretations and findings align with the source material and accepted medical knowledge.
Completeness – Using a 5-point Likert scale, this metric assesses whether the generated response comprehensively addresses the prompt holistically while considering the ground truth response. It makes sure that critical medical findings or interpretations are not omitted from the response.
Helpfulness – Measured on a 7-point Likert scale, this metric evaluates the practical utility of the response in clinical contexts, considering factors such as clarity, relevance, and actionability of the medical information provided.
Logical coherence – Assessed on a 5-point Likert scale, this metric examines the response for logical gaps, inconsistencies, or contradictions, making sure that medical reasoning flows naturally and maintains clinical validity throughout the response.
Faithfulness – Scored on a 5-point Likert scale, this metric specifically evaluates whether the response contains information not found in or quickly inferred from the prompt, helping identify potential hallucinations or fabricated medical information that could be dangerous in clinical settings.

These metrics are normalized in the final output and job report card, providing standardized scores that enable consistent comparison across different models and evaluation scenarios. This comprehensive evaluation framework not only helps maintain the reliability and accuracy of medical RAG systems, but also provides detailed insights for continuous improvement and optimization. For details about the metric and evaluation prompts, see Evaluator prompts used in a knowledge base evaluation job.

Prerequisites

Before proceeding with the evaluation setup, make sure you have the following:

An active AWS account with appropriate permissions
Amazon Bedrock model access enabled in your preferred AWS Region
An S3 bucket with CORS enabled for storing evaluation data
An Amazon Bedrock knowledge base
An AWS Identity and Access Management (IAM) role with necessary permissions for Amazon S3 and Amazon Bedrock

The solution code can be found at the following GitHub repo.

Make sure that your knowledge base is fully synced and ready before initiating an evaluation job.

Convert the test dataset into JSONL for RAG evaluation

In preparation for evaluating our RAG system’s performance on radiology reports, we implemented a data transformation pipeline to convert our test dataset into the required JSONL format. The following code shows the format of the original dev1 and dev2 datasets:

{
    "prompt": "value of prompt key",
    "completion": "value of completion key"
}
Output Format

{
    "conversationTurns": [{
        "referenceResponses": [{
            "content": [{
                "text": "value from completion key"
            }]
        }],
        "prompt": {
            "content": [{
                "text": "value from prompt key"
            }]
        }
    }]
}

Drawing from Wilcox’s seminal paper The Written Radiology Report, we carefully structured our prompt to include comprehensive guidelines for generating high-quality impressions:

import json
import random
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')

# S3 bucket name
bucket_name = "<BUCKET_NAME>"

# Function to transform a single record
def transform_record(record):
    return {
        "conversationTurns": [
            {
                "referenceResponses": [
                    {
                        "content": [
                            {
                                "text": record["completion"]
                            }
                        ]
                    }
                ],
                "prompt": {
                    "content": [
                        {
                            "text": """You're given a radiology report findings to generate a concise radiology impression from it.

A Radiology Impression is the radiologist's final concise interpretation and conclusion of medical imaging findings, typically appearing at the end of a radiology report.
n Follow these guidelines when writing the impression:
n- Use clear, understandable language avoiding obscure terms.
n- Number each impression.
n- Order impressions by importance.
n- Keep impressions concise and shorter than the findings section.
n- Write for the intended reader's understanding.n
Findings: n""" + record["prompt"]
                        }
                    ]
                }
            }
        ]
    }

The script processes individual records, restructuring them to include conversation turns with both the original radiology findings and their corresponding impressions, making sure each report maintains the professional standards outlined in the literature. To maintain a manageable dataset size set used by this feature, we randomly sampled 1,000 records from the original dev1 and dev2 datasets, using a fixed random seed for reproducibility:

# Read from input file and write to output file
def convert_file(input_file_path, output_file_path, sample_size=1000):
    # First, read all records into a list
    records = []
    with open(input_file_path, 'r', encoding='utf-8') as input_file:
        for line in input_file:
            records.append(json.loads(line.strip()))
    
    # Randomly sample 1000 records
    random.seed(42)  # Set the seed first
    sampled_records = random.sample(records, sample_size)
    
    # Write the sampled and transformed records to the output file
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        for record in sampled_records:
            transformed_record = transform_record(record)
            output_file.write(json.dumps(transformed_record) + 'n')
            
# Usage
input_file_path = '<INPUT_FILE_NAME>.jsonl'  # Replace with your input file path
output_file_path = '<OUTPUT_FILE_NAME>.jsonl'  # Replace with your desired output file path
convert_file(input_file_path, output_file_path)

# File paths and S3 keys for the transformed files
transformed_files = [
    {'local_file': '<OUTPUT_FILE_NAME>.jsonl', 'key': '<FOLDER_NAME>/<OUTPUT_FILE_NAME>.jsonl'},
    {'local_file': '<OUTPUT_FILE_NAME>.jsonl', 'key': '<FOLDER_NAME>/<OUTPUT_FILE_NAME>.jsonl'}
]

# Upload files to S3
for file in transformed_files:
    s3.upload_file(file['local_file'], bucket_name, file['key'])
    print(f"Uploaded {file['local_file']} to s3://{bucket_name}/{file['key']}")

Set up a RAG evaluation job

Our RAG evaluation setup begins with establishing core configurations for the Amazon Bedrock evaluation job, including the selection of evaluation and generation models (Anthropic’s Claude 3 Haiku and Amazon Nova Micro, respectively). The implementation incorporates a hybrid search strategy with a retrieval depth of 10 results, providing comprehensive coverage of the knowledge base during evaluation. To maintain organization and traceability, each evaluation job is assigned a unique identifier with timestamp information, and input data and results are systematically managed through designated S3 paths. See the following code:

import boto3
from datetime import datetime

# Generate unique name for the job
job_name = f"rag-eval-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure knowledge base and model settings
knowledge_base_id = "<KNOWLEDGE_BASE_ID>"
evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
generator_model = "amazon.nova-micro-v1:0"
role_arn = "<IAM_ROLE_ARN>"

# Specify S3 locations
input_data = "<INPUT_S3_PATH>"
output_path = "<OUTPUT_S3_PATH>"

# Configure retrieval settings
num_results = 10
search_type = "HYBRID"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

With the core configurations in place, we initiate the evaluation job using the Amazon Bedrock create_evaluation_job API, which orchestrates a comprehensive assessment of our RAG system’s performance. The evaluation configuration specifies five key metrics—correctness, completeness, helpfulness, logical coherence, and faithfulness—providing a multi-dimensional analysis of the generated radiology impressions. The job is structured to use the knowledge base for retrieval and generation tasks, with the specified models handling their respective roles: Amazon Nova Micro for generation and Anthropic’s Claude 3 Haiku for evaluation, and the results are systematically stored in the designated S3 output location for subsequent analysis. See the following code:

retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

Evaluation results and metrics comparisons

The evaluation results for the healthcare RAG applications, using datasets dev1 and dev2, demonstrate strong performance across the specified metrics. For the dev1 dataset, the scores were as follows: correctness at 0.98, completeness at 0.95, helpfulness at 0.83, logical coherence at 0.99, and faithfulness at 0.79. Similarly, the dev2 dataset yielded scores of 0.97 for correctness, 0.95 for completeness, 0.83 for helpfulness, 0.98 for logical coherence, and 0.82 for faithfulness. These results indicate that the RAG system effectively retrieves and uses medical information to generate accurate and contextually appropriate responses, with particularly high scores in correctness and logical coherence, suggesting robust factual accuracy and logical consistency in the generated content.

The following screenshot shows the evaluation summary for the dev1 dataset.

The following screenshot shows the evaluation summary for the dev2 dataset.

Additionally, as shown in the following screenshot, the LLM-as-a-judge framework allows for the comparison of multiple evaluation jobs across different models, datasets, and prompts, enabling detailed analysis and optimization of the RAG system’s performance.

Additionally, you can perform a detailed analysis by drilling down and investigating the outlier cases with least performance metrics such as correctness, as shown in the following screenshot.

Metrics explainability

The following screenshot showcases the detailed metrics explainability interface of the evaluation system, displaying example conversations with their corresponding metrics assessment. Each conversation entry includes four key columns: Conversation input, Generation output, Retrieved sources, and Ground truth, along with a Score column. The system provides a comprehensive view of 1,000 examples, with navigation controls to browse through the dataset. Of particular note is the retrieval depth indicator showing 10 for each conversation, demonstrating consistent knowledge base utilization across examples.

The evaluation framework enables detailed tracking of generation metrics and provides transparency into how the knowledge base arrives at its outputs. Each example conversation presents the complete chain of information, from the initial prompt through to the final assessment. The system displays the retrieved context that informed the generation, the actual generated response, and the ground truth for comparison. A scoring mechanism evaluates each response, with a detailed explanation of the decision-making process visible through an expandable interface (as shown by the pop-up in the screenshot). This granular level of detail allows for thorough analysis of the RAG system’s performance and helps identify areas for optimization in both retrieval and generation processes.

In this specific example from the Indiana University Medical System dataset (dev2), we see a clear assessment of the system’s performance in generating a radiology impression for chest X-ray findings. The knowledge base successfully retrieved relevant context (shown by 10 retrieved sources) to generate an impression stating “Normal heart size and pulmonary vascularity 2. Unremarkable mediastinal contour 3. No focal consolidation, pleural effusion, or pneumothorax 4. No acute bony findings.” The evaluation system scored this response with a perfect correctness score of 1, noting in the detailed explanation that the candidate response accurately summarized the key findings and correctly concluded there was no acute cardiopulmonary process, aligning precisely with the ground truth response.

In the following screenshot, the evaluation system scored this response with a low score of 0.5, noting in the detailed explanation the ground truth response provided is “Moderate hiatal hernia. No definite pneumonia.” This indicates that the key findings from the radiology report are the presence of a moderate hiatal hernia and the absence of any definite pneumonia. The candidate response covers the key finding of the moderate hiatal hernia, which is correctly identified as one of the impressions. However, the candidate response also includes additional impressions that are not mentioned in the ground truth, such as normal lung fields, normal heart size, unfolded aorta, and degenerative changes in the spine. Although these additional impressions might be accurate based on the provided findings, they are not explicitly stated in the ground truth response. Therefore, the candidate response is partially correct and partially incorrect based on the ground truth.

Clean up

To avoid incurring future charges, delete the S3 bucket, knowledge base, and other resources that were deployed as part of the post.

Conclusion

The implementation of LLM-as-a-judge for evaluating healthcare RAG applications represents a significant advancement in maintaining the reliability and accuracy of AI-generated medical content. Through this comprehensive evaluation framework using Amazon Bedrock Knowledge Bases, we’ve demonstrated how automated assessment can provide detailed insights into the performance of medical RAG systems across multiple critical dimensions. The high-performance scores across both datasets indicate the robustness of this approach, though these metrics are just the beginning.

Looking ahead, this evaluation framework can be expanded to encompass broader healthcare applications while maintaining the rigorous standards essential for medical applications. The dynamic nature of medical knowledge and clinical practices necessitates an ongoing commitment to evaluation, making continuous assessment a cornerstone of successful implementation.

Through this series, we’ve demonstrated how you can use Amazon Bedrock to create and evaluate healthcare generative AI applications with the precision and reliability required in clinical settings. As organizations continue to refine these tools and methodologies, prioritizing accuracy, safety, and clinical utility in healthcare AI applications remains paramount.

About the Authors

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Priya Padate is a Senior Partner Solution Architect supporting healthcare and life sciences worldwide at Amazon Web Services. She has over 20 years of healthcare industry experience leading architectural solutions in areas of medical imaging, healthcare related AI/ML solutions and strategies for cloud migrations. She is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Dr. Ekta Walia Bhullar is a principal AI/ML/GenAI consultant with AWS Healthcare and Life Sciences business unit. She has extensive experience in development of AI/ML applications for healthcare especially in Radiology. During her tenure at AWS she has actively contributed to applications of AI/ML/GenAI within lifescience domain such as for clinical, drug development and commercial lines of business.

AWS DeepRacer: Closing time at AWS re:Invent 2024 –How did that physical racing go?

February 27, 2025

by Lars Lorentz Ludvigsen Amazon AWS

Having spent the last years studying the art of AWS DeepRacer in the physical world, the author went to AWS re:Invent 2024. How did it go?

In AWS DeepRacer: How to master physical racing?, I wrote in detail about some aspects relevant to racing AWS DeepRacer in the physical world. We looked at the differences between the virtual and the physical world and how we could adapt the simulator and the training approach to overcome the differences. The previous post was left open-ended—with one last Championship Final left, it was too early to share all my secrets.

Now that AWS re:Invent is over, it’s time to share my strategy, how I prepared, and how it went in the end.

Strategy

Going into the 2024 season, I was reflecting on my performance from 2022 and 2023. In 2022, I had unstable models that were unable to do fast laps on the new re:Invent 2022 Championship track, not even making the last 32. In 2023, things went slightly better, but it was clear that there was potential to improve.

Specifically, I wanted a model that:

Goes straight on the straights and corners with precision
Has a survival instinct and avoids going off-track even in a tight spot
Can ignore the visual noise seen around the track

Combine that with the ability to test the models before showing up at the Expo, and success seemed possible!

Implementation

In this section, I will explain my thinking about why physical racing is so different than virtual racing, as well as describe my approach to training a model that overcomes those differences.

How hard can it be to go straight?

If you have watched DeepRacer over the years, you have probably seen that most models struggle to go straight on the straights and end up oscillating left and right. The question has always been: why is it like that? This behavior causes two issues: the distance driven increases (result: slower lap time) and the car potentially enters the next turn in a way it can’t handle (result: off-track).

A few theories emerged:

Sim-to-real issues – The steering response isn’t matching the simulator, both with regards to the steering geometry and latency (time from picture to servo command, as well as the time it takes the servo to actually actuate). Therefore, when the car tries to adjust the direction on the straight, it doesn’t get the response it expects.
Model issues – A combination of the model not actually using the straight action, and not having access to angles needed to dampen oscillations (2.5–5.0 degrees).
Calibration issues – If the car isn’t calibrated to go straight when given a 0-degree action, and the left/right max values are either too high (tendency to oversteer) or too low (tendency to understeer), you are likely to get control issues and unstable behavior.

My approach:

Use the Ackermann steering geometry patch. With it, the car will behave more realistically, and the turning radius will decrease for a given angle. As a result, the action space can be limited to angles up to about 20 degrees. This roughly matches with the real car’s steering angle.
Include stabilizing steering angles (2.5 and 5.0) in the action space, allowing for minor corrections on the straights.
Use relatively slow speeds (0.8–1.3 m/s) to avoid slipping in the simulator. My theory is that the 15 fps simulator and the 30 fps car actually translates 1.2 mps in the simulator into effectively 2.4 mps in the real world.
By having an inverted chevron action space giving higher speeds for straights, nudge the car to use the straight actions, rather than oscillating left-right actions.
Try out v3, v4, and v5 physical models—test on a real track to see what works best.
Otherwise, the reward function was the same progress-based reward function I also use in virtual racing.

The following figure illustrates the view of testing in the garage, going straight at least one frame.

Be flexible

Virtual racing is (almost) deterministic, and over time, the model will converge and the car will take a narrow path, reducing the variety in the situations it sees. Early in training, it will frequently be in odd positions, almost going off-track, and it remembers how to get out of these situations. As it converges, the frequency at which it must handle these reduces, and the theory is that the memory fades, and at some point, it forgets how to get out of a tight spot.

My approach:

Diversify training to teach the car to handle a variety of corners, in both directions:
- Consistently train models going both clockwise and counterclockwise.
- Use tracks—primarily the 2022 Championship track—that are significantly more complex than the Forever Raceway.
- Do final optimization on the Forever Raceway—again in both directions.
Take several snapshots during training; don’t go below 0.5 in entropy.
Test on tracks the car has never seen. The simulator has many suitable, narrow tracks—the hallmark of a generalized model is one that can handle tracks it has never seen during training.

Stay focused on the track

In my last post, I looked at the visual differences between the virtual and real worlds. The question is what to do about it. The goal is to trick the model into ignoring the noise and focus on what is important: the track.

My approach:

Train in an environment with significantly more visual noise. The tracks in the custom track repository have added noise through additional lights, buildings, and different walls (and some even come with shadows).
Alter the environment during training to avoid overfitting to the added noise. The custom tracks were made in such a way that different objects (buildings, walls, and lines) could be made invisible at runtime. I had a cron job randomizing the environment every 5 minutes.

The following figure illustrates the varied training environment.

What I didn’t consider this year was simulating blurring during training. I attempted this previously by averaging the current camera frame with the previous one before inferencing. It didn’t seem to help.

Lens distortion is a topic I have observed, but not fully investigated. The original camera has a distinct fish-eye distortion, and Gazebo would be able to replicate it, but it would require some work to actually determine the coefficients. Equally, I have never tried to replicate the rolling motions of the real car.

Testing

Testing took place in the garage on the Trapezoid Narrow track. The track is obviously basic, but with two straights and two 180-degree turns with different radii, it had to do the job. The garage track also had enough visual noise to see if the models were robust enough.

The method was straightforward: try all models both clockwise and counterclockwise. Using the logs captured by the custom car stack, I spent the evening looking through the video of each run to determine which model I liked the best—looking at stability, handling (straight on straights plus precision cornering), and speed.

re:Invent 2024

The track for re:Invent 2024 was the Forever Raceway. The shape of the track isn’t new; it shares the centerline with the 2022 Summit Speedway, but being only 76 cm wide (the original was 1.07 cm), the turns become more pronounced, making it a significantly more difficult track.

The environment

The environment is classic re:Invent: a smooth track with very little shine combined with smooth, fairly tall walls surrounding the track. The background is what often causes trouble—this year, a large lit display hung under the ceiling at the far end of the track, and as the following figure shows, it was attracting quite some attention from the GradCam.

Similarly, the pit crew cage, where cars are maintained, attracted attention.

The results

So where did I end up, and why? In Round 1, I ended up at place 14, with a best average of 10.072 seconds, and a best lap time of 9.335 seconds. Not great, but also not bad—almost 1 second outside top 8.

Using the overhead camera provided by AWS through the Twitch stream, it’s possible to create a graphical view showing the path the car took, as shown in the following figure.

If we compare this with how the same model liked to drive in training, we see a bit of a difference.

What becomes obvious quite quickly is that although I succeeded in going straight on the (upper) straight, the car didn’t corner as tightly as during training, making the bottom half of the track a bit of a mess. Nevertheless, the car demonstrated the desired survival instinct and stayed on track even when faced with unexpectedly sharp corners.

Why did this happen:

20 degrees of turning using Ackermann steering is too much; the real car isn’t capable of doing it in the real world
The turning radius is increasing as the speed goes up due to slipping, caused both by low friction and lack of grip due to rolling
The reaction time plays more of a role as the speed increases, and my model acted too late, overshooting into the corner

The combined turning radius and reaction time effect also caused issues at the start. If the car goes slowly, it turns much faster—and ends up going off-track on the inside—causing issues for me and others.

My takeaways:

Overall, the training approach seemed to work well. Well-calibrated cars went straight on the straights, and background noise didn’t seem to bother my models much.
I need to get closer to the car’s actual handling characteristics at speed during training by increasing the max speed and reducing the max angle in the action space.
Physical racing is still not well understood—and it’s a lot about model-meets-car. Some models thrive on objectively perfectly calibrated cars, whereas others work great when matched with a particular one.
Track is king—those that had access to the track, either through their employer or having built one at home, had a massive advantage, even if almost everyone said that they were surprised by which model worked in the end.

Now enjoy the inside view of a car at re:Invent, and see if you can detect any of the issues that I have discussed. The video was recorded after I had been knocked out of the competition using a car with the custom car software.

Closing time: Where do we go from here?

This section is best enjoyed with Semisonic’s Closing Time as a soundtrack.

As we all wrapped up at the Expo after an intense week of racing, re:Invent literally being dismantled around us, the question was: what comes next?

This was the last DeepRacer Championship, but the general sentiment was that whereas nobody will really miss virtual racing—it is a problem solved—physical racing is still a whole lot of fun, and the community is not yet ready to move on. Since re:Invent several initiatives have gained traction with a common goal to make DeepRacer more accessible:

By enrolling cars with the DeepRacer Custom Car software stack into DeepRacer Event Manager you can capture car logs and generate the analytics videos, as shown in this article, directly during your event!
- DeepRacer Pi and DeepRacer Custom Car initiatives allow racers to build cars at home:Use off-the-shelf components for a 1:18 scale racer, or
Combine off-the-shelf components with a custom circuit board to build the 1:28 scale DeepRacer Pi Mini Both options are compatible with already trained models, including integration with DeepRacer Event Manager.

DeepRacer Custom Console will be a drop-in replacement for the current car UI with a beautiful UI designed in Cloudscape, aligning the design with DREM and the AWS Console.

Prototype DeepRacer Pi Mini – 1 :28 scale

Closing Words

DeepRacer is a fantastic way to teach AI in a very physical and visual way, and is suitable for older kids, students, and adults in the corporate setting alike. It will be interesting to see how AWS, its corporate partners, and the community will continue the journey in the years ahead.

A big thank you goes to all of those that have been involved in DeepRacer from its inception to today—too many to be named—it has been a wonderful experience. A big congratulations goes out to this years’ winners!

Closing time, every new beginning comes from some other beginning’s end…

About the Author

Lars Lorentz Ludvigsen is a technology enthusiast who was introduced to AWS DeepRacer in late 2019 and was instantly hooked. Lars works as a Managing Director at Accenture, where he helps clients build the next generation of smart connected products. In addition to his role at Accenture, he’s an AWS Community Builder who focuses on developing and maintaining the AWS DeepRacer community’s software solutions.

How Pattern PXM’s Content Brief is driving conversion on ecommerce marketplaces using AI

February 26, 2025

by Parker Bradshaw Amazon AWS

Brands today are juggling a million things, and keeping product content up-to-date is at the top of the list. Between decoding the endless requirements of different marketplaces, wrangling inventory across channels, adjusting product listings to catch a customer’s eye, and trying to outpace shifting trends and fierce competition, it’s a lot. And let’s face it—staying ahead of the ecommerce game can feel like running on a treadmill that just keeps speeding up. For many, it results in missed opportunities and revenue that doesn’t quite hit the mark.

“Managing a diverse range of products and retailers is so challenging due to the varying content requirements, imagery, different languages for different regions, formatting and even the target audiences that they serve.”

– Martin Ruiz, Content Specialist, Kanto

Pattern is a leader in ecommerce acceleration, helping brands navigate the complexities of selling on marketplaces and achieve profitable growth through a combination of proprietary technology and on-demand expertise. Pattern was founded in 2013 and has expanded to over 1,700 team members in 22 global locations, addressing the growing need for specialized ecommerce expertise.

Pattern has over 38 trillion proprietary ecommerce data points, 12 tech patents and patents pending, and deep marketplace expertise. Pattern partners with hundreds of brands, like Nestle and Philips, to drive revenue growth. As the top third-party seller on Amazon, Pattern uses this expertise to optimize product listings, manage inventory, and boost brand presence across multiple services simultaneously.

In this post, we share how Pattern uses AWS services to process trillions of data points to deliver actionable insights, optimizing product listings across multiple services.

Content Brief: Data-backed content optimization for product listings

Pattern’s latest innovation, Content Brief, is a powerful AI-driven tool designed to help brands optimize their product listings and accelerate growth across online marketplaces. Using Pattern’s dataset of over 38 trillion ecommerce data points, Content Brief provides actionable insights and recommendations to create standout product content that drives traffic and conversions.

Content Brief analyzes consumer demographics, discovery behavior, and content performance to give brands a comprehensive understanding of their product’s position in the marketplace. What would normally require months of research and work is now done in minutes. Content Brief takes the guesswork out of product strategy with tools that do the heavy lifting. Its attribute importance ranking shows you which product features deserve the spotlight, and the image archetype analysis makes sure your visuals engage customers.

As shown in the following screenshot, the image archetype feature shows attributes that are driving sales in a given category, allowing brands to highlight the most impactful features in the image block and A+ image content.

Content Brief incorporates review and feedback analysis capabilities. It uses sentiment analysis to process customer reviews, identifying recurring themes in both positive and negative feedback, and highlights areas for potential improvement.

Content Brief’s Search Family analysis groups similar search terms together, helping brands understand distinct customer intent and tailor their content accordingly. This feature combined with detailed persona insights helps marketers create highly targeted content for specific segments. It also offers competitive analysis, providing side-by-side comparisons with competing products, highlighting areas where a brand’s product stands out or needs improvement.

“This is the thing we need the most as a business. We have all of the listening tools, review sentiment, keyword things, but nothing is in a single place like this and able to be optimized to my listing. And the thought of writing all those changes back to my PIM and then syndicating to all of my retailers, this is giving me goosebumps.”

– Marketing executive, Fortune 500 brand

Brands using Content Brief can more quickly identify opportunities for growth, adapt to change, and maintain a competitive edge in the digital marketplace. From search optimization and review analysis to competitive benchmarking and persona targeting, Content Brief empowers brands to create compelling, data-driven content that drives both traffic and conversions.

Select Brands looked to improve their Amazon performance and partnered with Pattern. Content Brief’s insights led to updates that caused a transformation for their Triple Buffet Server listing’s image stack. Their old image stack was created for marketplace requirements, whereas the new image stack was optimized with insights based on product attributes to highlight from category and sales data. The updated image stack featured bold product highlights and captured shoppers with lifestyle imagery. The results were a 21% MoM revenue surge, 14.5% more traffic, and a 21 bps conversion lift.

“Content Brief is a perfect example of why we chose to partner with Pattern. After just one month of testing, we see how impactful it can be for driving incremental growth—even on products that are already performing well. We have a product that, together with Pattern, we were able to grow into a top performer in its category in less than 2 years, and it’s exciting to see how adding this additional layer can grow revenue even for that product, which we already considered to be strong.”

– Eric Endres, President, Select Brands

To discover how Content Brief helped Select Brands boost their Amazon performance, refer to the full case study.

The AWS backbone of Content Brief

At the heart of Pattern’s architecture lies a carefully orchestrated suite of AWS services. Amazon Simple Storage Service (Amazon S3) serves as the cornerstone for storing product images, crucial for comprehensive ecommerce analysis. Amazon Textract is employed to extract and analyze text from these images, providing valuable insights into product presentation and enabling comparisons with competitor listings. Meanwhile, Amazon DynamoDB acts as the powerhouse behind Content Brief’s rapid data retrieval and processing capabilities, storing both structured and unstructured data, including content brief object blobs.

Pattern’s approach to data management is both innovative and efficient. As data is processed and analyzed, they create a shell in DynamoDB for each content brief, progressively injecting data as it’s processed and refined. This method allows for rapid access to partial results and enables further data transformations as needed, making sure that brands have access to the most up-to-date insights.

The following diagram illustrates the pipeline workflow and architecture.

Scaling to handle 38 trillion data points

Processing over 38 trillion data points is no small feat, but Pattern has risen to the challenge with a sophisticated scaling strategy. At the core of this strategy is Amazon Elastic Container Store (Amazon ECS) with GPU support, which handles the computationally intensive tasks of natural language processing and data science. This setup allows Pattern to dynamically scale resources based on demand, providing optimal performance even during peak processing times.

To manage the complex flow of data between various AWS services, Pattern employs Apache Airflow. This orchestration tool manages the intricate dance of data with a primary DAG, creating and managing numerous sub-DAGs as needed. This innovative use of Airflow allows Pattern to efficiently manage complex, interdependent data processing tasks at scale.

But scaling isn’t just about processing power—it’s also about efficiency. Pattern has implemented batching techniques in their AI model calls, resulting in up to 50% cost reduction for two-batch processing while maintaining high throughput. They’ve also implemented cross-region inference to improve scalability and reliability across different geographical areas.

To keep a watchful eye on their system’s performance, Pattern employs LLM observability techniques. They monitor AI model performance and behavior, enabling continuous system optimization and making sure that Content Brief is operating at peak efficiency.

Using Amazon Bedrock for AI-powered insights

A key component of Pattern’s Content Brief solution is Amazon Bedrock, which plays a pivotal role in their AI and machine learning (ML) capabilities. Pattern uses Amazon Bedrock to implement a flexible and secure large language model (LLM) strategy.

Model flexibility and optimization

Amazon Bedrock offers support for multiple foundation models (FMs), which allows Pattern to dynamically select the most appropriate model for each specific task. This flexibility is crucial for optimizing performance across various aspects of Content Brief:

Natural language processing – For analyzing product descriptions, Pattern uses models optimized for language understanding and generation.
Sentiment analysis – When processing customer reviews, Amazon Bedrock enables the use of models fine-tuned for sentiment classification.
Image analysis – Pattern currently uses Amazon Textract for extracting text from product images. However, Amazon Bedrock also offers advanced vision-language models that could potentially enhance image analysis capabilities in the future, such as detailed object recognition or visual sentiment analysis.

The ability to rapidly prototype on different LLMs is a key component of Pattern’s AI strategy. Amazon Bedrock offers quick access to a variety of cutting-edge models o facilitate this process, allowing Pattern to continuously evolve Content Brief and use the latest advancements in AI technology. Today, this allows the team to build seamless integration and use various state-of-the-art language models tailored to different tasks, including the new, cost-effective Amazon Nova models.

Prompt engineering and efficiency

Pattern’s team has developed a sophisticated prompt engineering process, continually refining their prompts to optimize both quality and efficiency. Amazon Bedrock offers support for custom prompts, which allows Pattern to tailor the model’s behavior precisely to their needs, improving the accuracy and relevance of AI-generated insights.

Moreover, Amazon Bedrock offers efficient inference capabilities that help Pattern optimize token usage, reducing costs while maintaining high-quality outputs. This efficiency is crucial when processing the vast amounts of data required for comprehensive ecommerce analysis.

Security and data privacy

Pattern uses the built-in security features of Amazon Bedrock to uphold data protection and compliance. By employing AWS PrivateLink, data transfers between Pattern’s virtual private cloud (VPC) and Amazon Bedrock occur over private IP addresses, never traversing the public internet. This approach significantly enhances security by reducing exposure to potential threats.

Furthermore, the Amazon Bedrock architecture makes sure that Pattern’s data remains within their AWS account throughout the inference process. This data isolation provides an additional layer of security and helps maintain compliance with data protection regulations.

“Amazon Bedrock’s flexibility is crucial in the ever-evolving landscape of AI, enabling Pattern to utilize the most effective and efficient models for their diverse ecommerce analysis needs. The service’s robust security features and data isolation capabilities give us peace of mind, knowing that our data and our clients’ information are protected throughout the AI inference process.”

– Jason Wells, CTO, Pattern

Building on Amazon Bedrock, Pattern has created a secure, flexible, and efficient AI-powered solution that continuously evolves to meet the dynamic needs of ecommerce optimization.

Conclusion

Pattern’s Content Brief demonstrates the power of AWS in revolutionizing data-driven solutions. By using services like Amazon Bedrock, DynamoDB, and Amazon ECS, Pattern processes over 38 trillion data points to deliver actionable insights, optimizing product listings across multiple services.

Inspired to build your own innovative, high-performance solution? Explore AWS’s suite of services at aws.amazon.com and discover how you can harness the cloud to bring your ideas to life. To learn more about how Content Brief could help your brand optimize its ecommerce presence, visit pattern.com.

About the Author

Parker Bradshaw is an Enterprise SA at AWS who focuses on storage and data technologies. He helps retail companies manage large data sets to boost customer experience and product quality. Parker is passionate about innovation and building technical communities. In his free time, he enjoys family activities and playing pickleball.

How to configure cross-account model deployment using Amazon Bedrock Custom Model Import

February 26, 2025

by Hrushikesh Gangur Amazon AWS

In enterprise environments, organizations often divide their AI operations into two specialized teams: an AI research team and a model hosting team. The research team is dedicated to developing and enhancing AI models using model training and fine-tuning techniques. Meanwhile, a separate hosting team is responsible for deploying these models across their own development, staging, and production environments.

With Amazon Bedrock Custom Model Import, the hosting team can import and serve custom models using supported architectures such as Meta Llama 2, Llama 3, and Mistral using On-Demand pricing. Teams can import models with weights in Hugging Face safetensors format from Amazon SageMaker or from Amazon Simple Storage Service (Amazon S3). These imported custom models work alongside existing Amazon Bedrock foundation models (FMs) through a single, unified API in a serverless manner, alleviating the need to manage model deployment and scaling.

However, in such enterprise environments, these teams often work in separate AWS accounts for security and operational reasons. The model development team’s training results, known as model artifacts, for example model weights, are typically stored in S3 buckets within the research team’s AWS account, but the hosting team needs to access these artifacts from another account to deploy models. This creates a challenge: how do you securely share model artifacts between accounts?

This is where cross-account access becomes important. With Amazon Bedrock Custom Model Import cross-account support, we can help you configure direct access between the S3 buckets storing model artifacts and the hosting account. This streamlines your operational workflow while maintaining security boundaries between teams. One of our customers quotes:

Bedrock Custom Model Import cross-account support helped AI Platform team to simplify the configuration, reduce operational overhead and secure models in the original location.

– Scott Chang, Principal Engineer, AI Platform at Salesforce

In this guide, we walk you through step-by-step instructions for configuring cross-account access for Amazon Bedrock Custom Model Import, covering both non-encrypted and AWS Key Management Service (AWS KMS) based encrypted scenarios.

Example scenario

For this walkthrough, consider two AWS accounts:

Model Development account (111122223333):
- Stores model artifacts (custom weights and configurations) in an S3 bucket called model-artifacts-111122223333
- Optionally encrypts artifacts using AWS KMS customer managed key kms-cmk-111122223333
Model Hosting account (777788889999):
- Hosts models using Amazon Bedrock Custom Model Import
- Uses a new AWS Identity and Access Management (IAM) execution role BedrockCMIExecutionRole-777788889999
- Can optionally encrypt artifacts using AWS KMS key kms-cmk-777788889999

The following figure illustrates this setup, showing how the cross-account access is configured between the S3 bucket, KMS keys, and Amazon Bedrock Custom Model Import.

To successfully implement the described scenario while adhering to the principle of least privilege access, the following steps must be executed:

The Model Development account must provide access to the Model Hosting account’s IAM role BedrockCMIExecutionRole-777788889999, allowing it to utilize their S3 bucket and, if applicable, the encryption key, using resource-based policies.
The Model Hosting account should establish an IAM role, such as BedrockCMIExecutionRole-777788889999. The identity-based policies needed would be for the Model Development S3 bucket and customer managed keys for decrypting model artifacts, like using kms-cmk-111122223333.
The Model Hosting account must enable the Amazon Bedrock service to assume the IAM role BedrockCMIExecutionRole-777788889999, created in step 2, by including the Amazon Bedrock service as a trusted entity. This IAM role will be utilized by the Model Hosting account to initiate the custom model import job.

Prerequisites

Before you can start a custom model import job, you need to fulfill the following prerequisites:

If you’re importing your model from an S3 bucket, prepare your model files in the Hugging Face weights format. For more information refer to Import source.
(Optional) Set up extra security configurations.
- You can encrypt input and output data, import jobs, or inference requests made to imported models. For more information refer to Encryption of custom model import.
- You can create a virtual private cloud (VPC) to protect your customization jobs. For more information, refer to (Optional) Protect custom model import jobs using a VPC.

Step-by-step execution

The following section provides the step-by-step execution of the previously outlined high-level process, from the perspective of an administrator managing both accounts:

Step 1: Set up the S3 bucket policy (in the Model Development account) to enable access for the Model Hosting account’s IAM role:

Sign in to the AWS Management Console for account 111122223333, then access the Amazon S3 console.
On the General purpose buckets view, locate model-artifacts-111122223333, the bucket used by the model development team to store their model artifacts.

On the Permissions tab, select Edit in the Bucket policy section, and insert the following IAM resource-based policy. Be sure to update the AWS account IDs (shown in red) in the policy with your information.

{
    "Version": "2012-10-17",
    "Id": "AllowCrossAccountS3Access",
    "Statement": [
        {
            "Sid": "cross-account-list-get",
            "Effect": "Allow",
            "Principal": {
 "AWS": "arn:aws:iam::777788889999:root"             },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
 "arn:aws:s3:::model-artifacts-111122223333", "arn:aws:s3:::model-artifacts-111122223333/*"             ],
            "Condition": {
                "ArnLike": {
 "aws:PrincipalArn": "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999*"                 }
            }
        }
    ]
}

Step 2: Establish an IAM role (in the Model Hosting account) and authorize Amazon Bedrock to assume this role:

In the left navigation pane, select Policies and then choose Create policy. Within the Policy Editor, switch to the JSON tab and insert the following identity-based policy. This policy is designed for read-only access, enabling users or a role to list and download objects from a specified S3 bucket, but only if the bucket is owned by account 111122223333. Customize the AWS account ID and S3 bucket name/prefix (shown in red) with your information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
 "arn:aws:s3:::model-artifacts-111122223333", "arn:aws:s3:::model-artifacts-111122223333/*"             ],
            "Condition": {
                "StringEquals": {
  "aws:ResourceAccount": "111122223333"                 }
            }
        }
    ]
}

Choose Next, assign the policy name as BedrockCMIExecutionPolicy-777788889999, and finalize by choosing Create policy.

In the left navigation pane, choose Roles and select Custom trust policy as the Trusted entity type. Insert the following trusted entity policy, which restricts the role assumption to the Amazon Bedrock service, specifically for model import jobs in account 777788889999 located in the US East (N. Virginia) us-east-1 Region. Modify the AWS account ID and Region (shown in red) with your information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
 "aws:SourceAccount": "777788889999"                 },
                "ArnEquals": {
 "aws:SourceArn": "arn:aws:bedrock:us-east-1:777788889999:model-import-job/*"                 }
            }
        }
    ]
}

Choose Next and in the Add permissions section, search for the policy created in the previous step BedrockCMIExecutionPolicy-777788889999, select the checkbox, and proceed by choosing Next.
Assign the Role name as BedrockCMIExecutionRole-777788889999, provide a Description as “IAM execution role to be used by CMI jobs,” and finalize by choosing Create role.

Important: If you’re using an AWS KMS encryption key for model artifacts in the Model Development account or for imported model artifacts with the Amazon Bedrock managed AWS account, proceed with steps 3 through 5. If not, skip to step 6.

Step 3: Adjust the AWS KMS key policy (in the Model Development account) to allow the Amazon Bedrock CMI execution IAM role to decrypt model artifacts:

Transition back to the Model Development account and find the AWS KMS key named kms-cmk-111122223333 in the AWS KMS console. Note the AWS KMS key Amazon Resource Name (ARN).

On the Key policy tab, switch to the Policy view, and incorporate the following resource-based policy statement to enable the Model Hosting account’s IAM role BedrockCMIExecutionRole-777788889999 to decrypt model artifacts. Revise items in red with your information.

{
      "Sid": "Allow use of the key by the destination account",
      "Effect": "Allow",
      "Principal": {
 "AWS": "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999"       },
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "*"
}

Step 4: Set the AWS KMS key policy (in the Model Hosting account) for the CMI execution IAM role to encrypt and decrypt model artifacts to securely store in the Amazon Bedrock AWS account:

Return to the Model Hosting account and locate the AWS KMS key named kms-cmk-777788889999 in the AWS KMS console. Note the AWS KMS key ARN.

Insert the following statement into the AWS KMS key’s resource-based policy to enable the BedrockCMIExecutionRole-777788889999 IAM role to encrypt and decrypt model artifacts at rest in the Amazon Bedrock managed AWS account. Revise items in red with your information.

{
      "Sid": "Allow use of the key",
      "Effect": "Allow",
      "Principal": {
 "AWS": "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999"       },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": "*"
}

Step 5: Modify the CMI execution role’s permissions (in the Model Hosting account) to provide access to encryption keys:

Access the IAM console and find the IAM policy BedrockCMIExecutionPolicy-777788889999. To the existing identity-based policy, append the following statements (replace the ARNs in red with one noted in steps 4 and 5):

{
    "Effect": "Allow",
    "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
    ],
 "Resource": "arn:aws:kms:us-east-1:111122223333:key/b5b6e052-fb27-4dbb-bf0d-daf3375a9fda" },
{
    "Effect": "Allow",
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
    ],
 "Resource": "arn:aws:kms:us-east-1:777788889999:key/6cd5d3bf-3d9b-4d1c-83d5-8df6284435a1" }

Step 6: Initiate the Model import job (in the Model Hosting account)

In this step, we execute the model import job using the AWS Command Line Interface (AWS CLI) command. You can also use AWS SDKs or APIs for the same purpose. Run the following command from your terminal session with an IAM user or role that has the necessary privileges to create a custom model import job. You don’t need to explicitly provide an ARN or details of the CMK used by the Model Development team.

aws bedrock create-model-import-job 
    --job-name "cmi-job-777788889999-01" 
    --imported-model-name "mistral-777788889999-01" 
    --role-arn "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999" 
    --model-data-source "s3DataSource={s3Uri="s3://model-artifacts-111122223333/mistral-model-weights/"}"

When encrypting model artifacts with Amazon Bedrock Custom Model Import, use the --imported-model-kms-key-id flag and specify the ARN of the Model Hosting account’s CMK key.

aws bedrock create-model-import-job 
    --job-name "cmi-job-777788889999-04" 
    --imported-model-name "mistral-777788889999-01" 
    --role-arn "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999" 
    --model-data-source "s3DataSource={s3Uri="s3://model-artifacts-111122223333/mistral-model-weights/"}" 
    --imported-model-kms-key-id "arn:aws:kms:us-east-1:777788889999:key/6cd5d3bf-3d9b-4d1c-83d5-8df6284435a1"

Cross-account access to the S3 bucket using the custom model import job is only supported through AWS CLI, AWS SDKs, or APIs. Console support is not yet available.

Troubleshooting

When IAM policy misconfigurations prevent a custom model import job, you might encounter an error like:

Amazon Bedrock does not have access to the S3 location (s3://model-artifacts-111122223333/mistral-model-weights). Update the permissions and try again.

To resolve this, manually verify access to Model Development’s S3 bucket from the Model Hosting account by assuming the BedrockCMIExecutionRole-777788889999. Follow these steps:

Step 1: Identify the current IAM role or user in the CLI with the following and copy the ARN from the output:

aws sts get-caller-identity

Step 2: Update trust relationships. Append the trust policy of the BedrockCMIExecutionRole-777788889999 to allow the current user or IAM role to assume this role:

{
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:sts::777788889999:role/current-user-role"
    },
    "Action": "sts:AssumeRole"
}

Step 3: List or copy the S3 bucket contents assuming the Amazon Bedrock Custom Model Import execution role

Assume the CMI execution role (replace the ARN with your information):

aws sts assume-role 
    --role-arn "arn:aws:iam::776941257690:role/BedrockCMIExecutionRole-777788889999" 
    --role-session-name "BedrockCMISession"

Export the returned temporary credentials as environment variables:

export AWS_ACCESS_KEY_ID="ASIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_SESSION_TOKEN="..."

Run commands to troubleshoot permission issues:

aws s3 ls s3://model-artifacts-111122223333/mistral-model-weights/
aws s3 cp s3://model-artifacts-111122223333/mistral-model-weights/config.json .

If errors persist, consider using Amazon Q Developer or refer to additional resources outlined in the IAM User Guide.

Cleanup

There is no additional charge to import a custom model to Amazon Bedrock (refer to step 6 in the Step-by-step execution section). However, if your model isn’t in use for inference, and you want to avoid paying storage costs (refer to Amazon Bedrock pricing), delete the imported model using the AWS console or AWS CLI reference or API Reference. For example (replace the text in red with your imported model name):

aws bedrock delete-imported-model 
    --model-identifier "mistral-777788889999-01"

Conclusion

By using cross-account access in Amazon Bedrock Custom Model Import, organizations can significantly streamline their AI model deployment workflows.

Amazon Bedrock Custom Model Import is generally available today in Amazon Bedrock in the US East (N. Virginia) us-east-1 and US West (Oregon) us-west-2 AWS Regions. Refer to the full Region list for future updates. To learn more, refer to the Amazon Bedrock Custom Model Import product page and Amazon Bedrock pricing page. Give Amazon Bedrock Custom Model Import a try in the Amazon Bedrock console today and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

Thank you to our contributors Scott Chang (Salesforce), Raghav Tanaji (Salesforce), Rupinder Grewal (AWS), Ishan Singh (AWS), and Dharinee Gupta (AWS)

About the Authors

Hrushikesh Gangur is a Principal Solutions Architect at AWS. Based in San Francisco, California, Hrushikesh is an expert in AWS machine learning. As a thought leader in the field of generative AI, Hrushikesh has contributed to AWS’s efforts in helping startups and ISVs build and deploy AI applications. His expertise extends to various AWS services, including Amazon SageMaker, Amazon Bedrock, and accelerated computing which are crucial for building AI applications.

Sai Darahas Akkineni is a Software Development Engineer at AWS. He holds a master’s degree in Computer Engineering from Cornell University, where he worked in the Autonomous Systems Lab with a specialization in computer vision and robot perception. Currently, he helps deploy large language models to optimize throughput and latency.

Prashant Patel is a Senior Software Development Engineer in AWS. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.

ByteDance processes billions of daily videos using their multimodal video understanding models on AWS Inferentia2

February 26, 2025

by Wangpeng An Amazon AWS

This is a guest post authored by the team at ByteDance.

ByteDance is a technology company that operates a range of content platforms to inform, educate, entertain, and inspire people across languages, cultures, and geographies. Users trust and enjoy our content platforms because of the rich, intuitive, and safe experiences they provide. These experiences are made possible by our machine learning (ML) backend engine, with ML models built for video understanding, search, recommendation, advertising, and novel visual effects.

In support of its mission to “Inspire Creativity and Enrich Life,” we’ve made it straightforward and fun for people to engage with, create, and consume content. People can also discover and transact with a suite of more than a dozen products and services, such as CapCut, e-Shop, Lark, Pico, and Mobile Legends: Bang Bang.

At ByteDance, we collaborated with Amazon Web Services (AWS) to deploy multimodal large language models (LLMs) for video understanding using AWS Inferentia2 across multiple AWS Regions around the world. By using sophisticated ML algorithms, the platform efficiently scans billions of videos each day. We use this process to identify and flag content that violates community guidelines, enabling a better experience for all users. By using Amazon EC2 Inf2 instances for these video understanding workloads, we were able to cut the inference cost by half.

In this post, we discuss the use of multimodal LLMs for video understanding, the solution architecture, and techniques for performance optimization.

Overcoming video understanding hurdles with multimodal LLMs

Multimodal LLMs enable better understanding of the world, enabling various forms of digital content as inputs to the LLM, greatly increasing the range of useful applications we can now build. The need for AI systems capable of processing various content forms has become increasingly apparent. Multimodal LLMs have risen to meet this challenge by taking multiple data modalities, including text, images, audio, and video (refer to the following diagram), which allows for full understanding of content, mimicking human perception and interaction with the world. The enhanced capabilities of these models are evident in their performance, which far surpasses that of traditional models in tasks ranging from sophisticated virtual assistant to advanced content creation. By expanding the boundaries of AI capabilities and paving the way for more natural and intuitive interactions with technology, these models aren’t just improving existing applications but opening doors to entirely new possibilities in the realm of AI and user experience.

In our operations, the implementation of multimodal LLMs for video understanding represents a significant shift in thinking about AI-driven content analysis. This innovation addresses the daily challenge of processing billions of volumes of video content, overcoming the efficiency limits of traditional AI models. We’ve developed our own multimodal LLM architecture, designed to achieve state-of-the-art performance across single-image, multi-image, and video applications. Unlike traditional ML models, this new generative AI–enabled system integrates multiple input streams into a unified representational space. Cross-modal attention mechanisms facilitate information exchange between modalities, and fusion layers combine representations from different modalities. The decoder then generates output based on the fused multimodal representation, enabling a more nuanced and context-aware analysis of content.

Solution overview

We’ve collaborated with AWS since the first generation of Inferentia chips. Our video understanding department has been committed to finding more cost-efficient solutions that deliver higher performance to better meet ever-growing business needs. During this period, we found that AWS has been continually inventing and adding features and capabilities to its AWS Neuron software development kit (SDK), the software enabling high-performance workloads on the Inferentia chips. The popular Meta Llama and Mistral models were well supported with high performance on Inferentia2 shortly after their open source release. Therefore, we began to evaluate the Inferentia2 based solution, illustrated in the following diagram.

We made the strategic decision to deploy a fine-tuned middle-sized LLM on Inferentia2, to provide a performant and cost-effective solution capable of processing billions of videos daily. The process was a comprehensive effort aimed at optimizing end-to-end response time for our video understanding workload. The team explored a wide range of parameters, including tensor parallel sizes, compile configurations, sequence lengths, and batch sizes. We employed various parallelization techniques, such as multi-threading and model replication (for non-LLM models) across multiple NeuronCores. Through these optimizations, which included parallelizing sequence steps, reusing devices, and using auto-benchmark and profiling tools, we achieved a substantial performance boost, maintaining our position at the forefront of industry performance standards

We used tensor parallelism to effectively distribute and scale the model across multiple accelerators in an Inf2 instance. We used static batching, which improved the latency and throughput of our models by making sure that data is processed in uniform, fixed-size batches during inference. Using repeated n-grams filtering significantly improved the quality of automatically generated text and reduced inference time. Quantizing the weights of the multimodal model from FP16/BF16 to INT8 format allowed it to run more efficiently on Inferentia2 with less device memory usage, without compromising on accuracy. Using these techniques and model serialization, we optimized the throughput on inf2.48xlarge instance by maximizing the batch size such that the model could still fit on a single accelerator in an instance so we could deploy multiple model replicas on the same instance. This comprehensive optimization strategy helped us meet our latency requirements while providing optimal throughput and cost reduction. Notably, the Inferentia2 based solution cut the cost by half compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances, highlighting the significant economic advantages of using Inferentia2 chips for large-scale video understanding tasks.

The following diagram shows how we deploy our LLM container on Amazon EC2 Inf2 instances using Neuron.

In summary, our collaboration with AWS has revolutionized video understanding, setting new industry standards for efficiency and accuracy. The multimodal LLM’s ability to adapt to global market demands and its scalable performance on Inferentia2 chips underscore the profound impact of this technology in safeguarding the platform’s global community.

Future plans

Looking further ahead, the development of a unified multimodal LLM represents an important shift in video understanding technology. This ambitious project aims to create a universal content tokenizer capable of processing all content types and aligning them within a common semantic space. After it’s tokenized, the content will be analyzed by advanced large models, generating appropriate content understanding outputs regardless of the original format (as shown in the following diagram). This unified approach can streamline the content understanding process, potentially improving both efficiency and consistency across diverse content types.

For additional learning, refer to the paper The Evolution of Multimodal Model Architectures.

The implementation of this comprehensive strategy sets new benchmarks in video understanding technology, striking a balance between accuracy, speed, and cultural sensitivity in an increasingly complex digital ecosystem. This forward-looking approach not only addresses current challenges in video understanding but also positions the system at the forefront of AI-driven content analysis and management for the foreseeable future.

By using cutting-edge AI techniques and a holistic approach to content understanding, this next-generation content understanding system aims to set new industry standards, providing safer and more inclusive online environments while adapting to the ever-evolving landscape of digital communication. At the same time, AWS is investing in next-generation AI chips such as AWS Trainium2, which will continue to push the performance boundaries while keeping costs under control. At ByteDance, we’re planning to test out this new generation of AWS AI chips and adopt them appropriately as the models and workloads continue to evolve.

Conclusion

The collaboration between ByteDance and AWS has revolutionized video understanding through the deployment of multimodal LLMs on Inferentia2 chips. This partnership has yielded remarkable results, the ability to process billions of videos daily, and significant cost reductions and higher performance over comparable EC2 instances.

As ByteDance continues to innovate with projects such as the unified multimodal large model, we remain committed to pushing the boundaries of AI-driven content analysis. Our goal is to make sure our platforms remain safe, inclusive, and creative spaces for our global community, setting new industry standards for efficient video understanding.

To learn more about Inf2 instances, refer to Amazon EC2 Inf2 Architecture.

About the Authors

Wangpeng An, Principal Algorithm Engineer at TikTok, specializes in multimodal LLMs for video understanding, advertising, and recommendations. He has led key projects in model acceleration, content moderation, and Ads LLM pipelines, enhancing TikTok’s real-time machine learning systems.

Haotian Zhang is a Tech Lead MLE at TikTok, specializing in content understanding, search, and recommendation. He received an ML PhD from University of Waterloo. At TikTok, he leads a group of engineers to improve the efficiency, robustness, and effectiveness of training and inference for LLMs and multimodal LLMs, especially for large distributed ML systems.

Xiaojie Ding is a senior engineer at TikTok, focusing on content moderation system development, model resource and deployment optimization, and algorithm engineering stability construction. In his free time, he likes to play single-player games.

Nachuan Yang is a senior engineer at TikTok, focusing on content security and moderation. He has successively been engaged in the construction of moderation systems, model applications, and deployment and performance optimization.

Kairong Sun is a Senior SRE on the AML Team at ByteDance. His role focuses on maintaining the seamless operation and efficient allocation of resources within the cluster, specializing in cluster machine maintenance and resource optimization.

The authors would like to thank other ByteDance and AWS team members for their contributions: Xi Dai, Kaili Zhao, Zhixin Zhang, Jin Ye, and Yann Xia from ByteDance; Jia Dong, Bingyang Huang, Kamran Khan, Shruti Koparkar, and Diwakar Bansal from AWS.

How IDIADA optimized its intelligent chatbot with Amazon Bedrock

February 25, 2025

by Xavier Vizcaino Amazon AWS

This post is co-written with Xavier Vizcaino, Diego Martín Montoro, and Jordi Sánchez Ferrer from Applus+ Idiada.

In 2021, Applus+ IDIADA, a global partner to the automotive industry with over 30 years of experience supporting customers in product development activities through design, engineering, testing, and homologation services, established the Digital Solutions department. This strategic move aimed to drive innovation by using digital tools and processes. Since then, we have optimized data strategies, developed customized solutions for customers, and prepared for the technological revolution reshaping the industry.

AI now plays a pivotal role in the development and evolution of the automotive sector, in which Applus+ IDIADA operates. Within this landscape, we developed an intelligent chatbot, AIDA (Applus Idiada Digital Assistant)— an Amazon Bedrock powered virtual assistant serving as a versatile companion to IDIADA’s workforce.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

With Amazon Bedrock, AIDA assists with a multitude of tasks, from addressing inquiries to tackling complex technical challenges spanning code, mathematics, and translation. Its capabilities are truly boundless.

With AIDA, we take another step towards our vision of providing global and integrated digital solutions that add value for our customers. Its internal deployment strengthens our leadership in developing data analysis, homologation, and vehicle engineering solutions. Additionally, in the medium term, IDIADA plans to offer AIDA as an integrable product for customers’ environments and develop “light” versions seamlessly integrable into existing systems.

In this post, we showcase the research process undertaken to develop a classifier for human interactions in this AI-based environment using Amazon Bedrock. The objective was to accurately identify the type of interaction received by the intelligent agent to route the request to the appropriate pipeline, providing a more specialized and efficient service.

The challenge: Optimize intelligent chatbot responses, allocate resources more effectively, and enhance the overall user experience

Built on a flexible and secure architecture, AIDA offers a versatile environment for integrating multiple data sources, including structured data from enterprise databases and unstructured data from internal sources like Amazon Simple Storage Service (Amazon S3). It boasts advanced capabilities like chat with data, advanced Retrieval Augmented Generation (RAG), and agents, enabling complex tasks such as reasoning, code execution, or API calls.

As AIDA’s interactions with humans proliferated, a pressing need emerged to establish a coherent system for categorizing these diverse exchanges.

Initially, users were making simple queries to AIDA, but over time, they started to request more specific and complex tasks. These included document translations, inquiries about IDIADA’s internal services, file uploads, and other specialized requests.

The main reason for this categorization was to develop distinct pipelines that could more effectively address various types of requests. By sorting interactions into categories, AIDA could be optimized to handle specific kinds of tasks more efficiently. This approach allows for tailored responses and processes for different types of user needs, whether it’s a simple question, a document translation, or a complex inquiry about IDIADA’s services.

The primary objective is to offer a more specialized service through the creation of dedicated pipelines for various contexts, such as conversation, document translation, and services to provide more accurate, relevant, and efficient responses to users’ increasingly diverse and specialized requests.

Solution overview

By categorizing the interactions into three main groups—conversation, services, and document translation—the system can better understand the user’s intent and respond accordingly. The Conversation class encompasses general inquiries and exchanges, the Services class covers requests for specific functionalities or support, and the Document_Translation class handles text translation needs.

The specialized pipelines, designed specifically for each use case, allow for a significant increase in efficiency and accuracy of AIDA’s responses. This is achieved in several ways:

Enhanced efficiency – By having dedicated pipelines for specific types of tasks, AIDA can process requests more quickly. Each pipeline is optimized for its particular use case, which reduces the computation time needed to generate an appropriate response.
Increased accuracy – The specialized pipelines are equipped with specific tools and knowledge for each type of task. This allows AIDA to provide more accurate and relevant responses, because it uses the most appropriate resources for each type of request.
Optimized resource allocation – By classifying interactions, AIDA can allocate computational resources more efficiently, directing the appropriate processing power to each type of task.
Improved response time – The combination of greater efficiency and optimized resource allocation results in faster response times for users.
Enhanced adaptability – This system allows AIDA to better adapt to different types of requests, from simple queries to complex tasks such as document translations or specialized inquiries about IDIADA services.

The research and development of this large language model (LLM) based classifier is an important step in the continuous improvement of the intelligent agent’s capabilities within the Applus IDIADA environment.

For this occasion, we use a set of 1,668 examples of pre-classified human interactions. These have been divided into 666 for training and 1,002 for testing. A 40/60 split has been applied, giving significant importance to the test set.

The following table shows some examples.

SAMPLE	CLASS
Can you make a summary of this text? “Legislation for the Australian Government’s …”	Conversation
No, only focus on this sentence : Braking technique to enable maximum brake application speed	Conversation
In a factory give me synonyms of a limiting resource of activities	Conversation
We need a translation of the file “Company_Bylaws.pdf” into English, could you handle it?	Document_Translation
Please translate the file “Product_Manual.xlsx” into English	Document_Translation
Could you convert the document “Data_Privacy_Policy.doc’ into English, please?	Document_Translation
Register my username in the IDIADA’s human resources database	Services
Send a mail to random_user@mail.com to schedule a meeting for the next weekend	Services
Book an electric car charger for me at IDIADA	Services

We present three different classification approaches: two based on LLMs and one using a classic machine learning (ML) algorithm. The aim is to understand which approach is most suitable for addressing the presented challenge.

LLM-based classifier: Simple prompt

In this case, we developed an LLM-based classifier to categorize inputs into three classes: Conversation, Services, and Document_Translation. Instead of relying on predefined, rigid definitions, our approach follows the principle of understanding a set. This principle involves analyzing the common characteristics and patterns present in the examples or instances that belong to each class. By studying the shared traits of inputs within a class, we can derive an understanding of the class itself, without being constrained by preconceived notions.

It’s important to note that the learned definitions might differ from common expectations. For instance, the Conversation class encompasses not only typical conversational exchanges but also tasks like text summarization, which share similar linguistic and contextual traits with conversational inputs.

By following this data-driven approach, the classifier can accurately categorize new inputs based on their similarity to the learned characteristics of each class, capturing the nuances and diversity within each category.

The code consists of the following key components: libraries, a prompt, model invocation, and an output parser.

Libraries

The programming language used in this code is Python, complemented by the LangChain module, which is specifically designed to facilitate the integration and use of LLMs. This module provides a comprehensive set of tools and abstractions that streamline the process of incorporating and deploying these advanced AI models.

To take advantage of the power of these language models, we use Amazon Bedrock. The integration with Amazon Bedrock is achieved through the Boto3 Python module, which serves as an interface to the AWS, enabling seamless interaction with Amazon Bedrock and the deployment of the classification model.

Prompt

The task is to assign one of three classes (Conversation, Services, or Document_Translation) to a given sentence, represented by question:

Conversation class – This class encompasses casual messages, summarization requests, general questions, affirmations, greetings, and similar types of text. It also includes requests for text translation, summarization, or explicit inquiries about the meaning of words or sentences in a specific language.
Services class – Texts belonging to this class consist of explicit requests for services such as room reservations, hotel bookings, dining services, cinema information, tourism-related inquiries, and similar service-oriented requests.
Document_Translation class – This class is characterized by requests for the translation of a document to a specific language. Unlike the Conversation class, these requests don’t involve summarization. Additionally, the name of the document to be translated and the target language are specified.

The prompt suggests a hierarchical approach to the classification process. First, the sentence should be evaluated to determine if it can be classified as a conversation. If the sentence doesn’t fit the Conversation class, one of the other two classes (Services or Document_Translation) should be assigned.

The priority for the Conversation class stems from the fact that 99% of the interactions are actually simple questions regarding various matters.

Model invocation

We use Anthropic’s Claude 3 Sonnet model for the natural language processing task. This LLM model has a context window of 200,000 tokens, enabling it to manage different languages and retrieve highly accurate answers. We use two key parameters:

max_tokens – This parameter limits the maximum number of tokens (words or subwords) that the language model can generate in its output to 50.
temperature – This parameter controls the randomness of the language model’s output. A temperature of 0.0 means that the model will produce the most likely output according to its training, without introducing randomness.

Output parser

Another important component is the output parser, which allows us to gather the desired information in JSON format. To achieve this, we use the LangChain parameter output_parsers.

The following code illustrates a simple prompt approach:

def classify_interaction(question):
   response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()
    prompt =f"""
    We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
    Conversation:text consist of casual messages, summarization requests, general questions, afirmations, greetings, 
    and similar. Requests for text translation, text summarisation or explicit text translation requests, 
    questions about the meaning of words or sentences in a concrete language.
    Services:the text consist of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
    Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested. 
    The length of the document is specified.
    Assign a class to the following sentence.
    {question}
    Try to understand the sentence as a Conversation one, if you can't, then asign one of the other classes.
    {format_instructions} 
    """          
    response = bedrock_runtime.invoke_model(
            modelId='anthropic.claude-3-sonnet-20240229-v1:0',
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 20,
                    "temperature":0,
                    "messages": [
                        {
                            "role": "user",
                            "content": [{"type": "text", "text": prompt}],
                        }
                    ],
                }
            ),
        )
    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    try:
        output_dict = output_parser.parse(texto.replace(''', '"'))['class']
    except:
        output_dict = 'Conversation' 
    return output_dict

LLM-based classifier: Example augmented inference

We use RAG techniques to enhance the model’s response capabilities. Instead of relying solely on compressed definitions, we provide the model with a quasi-definition by extension. Specifically, we present the model with a diverse set of examples for each class, allowing it to learn the inherent characteristics and patterns that define each category. For instance, in addition to a concise definition of the Conversation class, the model is exposed to various conversational inputs, enabling it to identify common traits such as informal language, open-ended questions, and back-and-forth exchanges. This example-driven approach complements the initial descriptions provided, allowing the model to capture the nuances and diversity within each class. By combining concise definitions with representative examples, the RAG technique helps the model develop a more comprehensive understanding of the classes, enhancing its ability to accurately categorize new inputs based on their inherent nature and characteristics.

The following code provides examples in JSON format for RAG:

{
    "Conversation":[
       ""Could you give me examples of how to solve it?",
       "cool but anything short and sweet",
       "..."
    ],
    "Services":[
       "make a review of my investments in the eBull.com platform",
       "I need a room in IDIADA",
       "schedule a meeting with",
       "..."
    ]"Document_Translation":[
       "Translate the file into Catalan",
       "Could you translate the document I added earlier into Swedish?",
       "Translate the Guía_Rápida.doc file into Romanian",
       "..."
    ]
 }

The total number of examples provided for each class is as follows:

Conversation – 500 examples. This is the most common class, and only 500 samples are given to the model due to the vast amount of information, which could cause infrastructure overflow (very high delays, throttling, connection shutouts). This is a crucial point to note because it represents a significant bottleneck. Providing more examples to this approach could potentially improve performance, but the question remains: How many examples? Surely, a substantial amount would be required.
Services – 26 examples. This is the least common class, and in this case, all available training data has been used.
Document_Translation – 140 examples. Again, all available training data has been used for this class.

One of the key challenges with this approach is scalability. Although the model’s performance improves with more training examples, the computational demands quickly become overwhelming for our current infrastructure. The sheer volume of data required can lead to quota issues with Amazon Bedrock and unacceptably long response times. Rapid response times are essential for providing a satisfactory user experience, and this approach falls short in that regard.

In this case, we need to modify the code to embed all the examples. The following code shows the changes applied to the first version of the classifier. The prompt is modified to include all the examples in JSON format under the “Here you have some examples” section.

def classify_interaction(question, agent_examples):
    response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()
    prompt =f"""
    We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
    Conversation:text consist of casual messages, summarization requests, general questions, afirmations, greetings, 
    and similar. Requests for text translation, text summarisation or explicit text translation requests, 
    questions about the meaning of words or sentences in a concrete language.
    Services:the text consist of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
    Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested. 
    The length of the document is specified.
    
    Here you have some examples:
    {agent_examples}

    Assign a class to the following sentence.
    {question}

    Try to understand the sentence as a Conversation one, if you can't, then asign one of the other classes.
    {format_instructions}   
    """
    
    response = bedrock_runtime.invoke_model(
            modelId='anthropic.claude-3-sonnet-20240229-v1:0',
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 50,
                    "messages": [
                        {
                            "role": "user",
                            "content": [{"type": "text", "text": prompt}],
                        }
                    ],
                }
            ),
        )

    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    output_dict = output_parser.parse(texto.replace(''', '"'))['class']
    
    return output_dict

K-NN-based classifier: Amazon Titan Embeddings

In this case, we take a different approach by recognizing that despite the multitude of possible interactions, they often share similarities and repetitive patterns. Instead of treating each input as entirely unique, we can use a distance-based approach like k-nearest neighbors (k-NN) to assign a class based on the most similar examples surrounding the input. To make this work, we need to transform the textual interactions into a format that allows algebraic operations. This is where embeddings come into play. Embeddings are vector representations of text that capture semantic and contextual information. We can calculate the semantic similarity between different interactions by converting text into these vector representations and comparing their vectors and determining their proximity in the embedding space.

To accommodate this approach, we need to modify the code accordingly:

from langchain.embeddings import BedrockEmbeddings
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)
df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We used the Amazon Titan Text Embeddings G1 model, which generates vectors of 1,536 dimensions. This model is trained to accept multiple languages while retaining the semantic meaning of the embedded phrases.

For the classfier, we employed a classic ML algorithm, k-NN, using the scikit-learn Python module. This method takes a parameter, which we set to 3.

The following figure illustrates the F1 scores for each class plotted against the number of neighbors (k) used in the k-NN algorithm. As the graph shows, the optimal value for k is 3, which yields the highest F1 score for the most prevalent class, Document_Translation. Although it’s not the absolute highest score for the Services class, Document_Translation is significantly more common, making k=3 the best overall choice to maximize performance across all classes.

K-NN-based classifier: Cohere’s multilingual embeddings model

In the previous section, we used the popular Amazon Titan Text Embeddings G1 model to generate text embeddings. However, other models might offer different advantages. In this section, we explore the use of Cohere’s multilingual model on Amazon Bedrock for generating embeddings. We chose the Cohere model due to its excellent capability in handling multiple languages without compromising the vectorization of phrases. As we will demonstrate, this model doesn’t introduce significant differences in the generated vectors compared to other models, making it more suitable for use in a multilingual environment like AIDA.

To use the Cohere model, we need to change the model_id:

from langchain.embeddings import BedrockEmbeddings
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id=" cohere.embed-multilingual-v3",
)
df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')
data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
X_test = df_test['sample'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

neigh = KNeighborsClassifier(n_neighbors=11)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We use Cohere’s multilingual embeddings model to generate vectors with 1,024 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.

For the classifier, we employ k-NN, using the scikit-learn Python module. This method takes a parameter, which we have set to 11.

The following figure illustrates the F1 scores for each class plotted against the number of neighbors used. As depicted, the optimal point is k=11, achieving the highest value for Document_Translation and the second-highest for Services. In this instance, the trade-off between Documents_Translation and Services is favorable.

Amazon Titan Embeddings vs. Cohere’s multilingual embeddings model

In this section, we delve deeper into the embeddings generated by both models, aiming to understand their nature and consequently comprehend the results obtained. To achieve this, we have performed dimensionality reduction to visualize the vectors obtained in both cases in 2D.

Cohere’s multilingual embeddings model has a limitation on the size of the text it can vectorize, posing a significant constraint. Therefore, in the implementation showcased in the previous section, we applied a filter to only include interactions up to 1,500 characters (excluding cases that exceed this limit).

The following figure illustrates the vector spaces generated in each case.

As we can observe, the generated vector spaces are relatively similar, initially appearing to be analogous spaces with a rotation between one another. However, upon closer inspection, it becomes evident that the direction of maximum variance in the case of Cohere’s multilingual embeddings model is distinct (deducible from observing the relative position and shape of the different groups). This type of situation, where high class overlap is observed, presents an ideal case for applying algorithms such as k-NN.

As mentioned in the introduction, most human interactions with AI are very similar to each other within the same class. This would explain why k-NN-based models outperform LLM-based models.

SVM-based classifier: Amazon Titan Embeddings

In this scenario, it is likely that user interactions belonging to the three main categories (Conversation, Services, and Document_Translation) form distinct clusters or groups within the embedding space. Each category possesses particular linguistic and semantic characteristics that would be reflected in the geometric structure of the embedding vectors. The previous visualization of the embeddings space displayed only a 2D transformation of this space. This doesn’t imply that clusters coudn’t be highly separable in higher dimensions.

Classification algorithms like support vector machines (SVMs) are especially well-suited to use this implicit geometry of the data. SVMs seek to find the optimal hyperplane that separates the different groups or classes in the embedding space, maximizing the margin between them. This ability of SVMs to use the underlying geometric structure of the data makes them an intriguing option for this user interaction classification problem.

Furthermore, SVMs are a robust and efficient algorithm that can effectively handle high-dimensional datasets, such as text embeddings. This makes them particularly suitable for this scenario, where the embedding vectors of the user interactions are expected to have a high dimensionality.

The following code illustrates the implementation:

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='eu-central-1'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
X_test = df_test['sample'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())

f1 = make_scorer(f1_score , average='weighted')
parameters = {'kernel':('linear', 'rbf','poly', 'sigmoid'), 
              'C':[1, 2, 4, 6, 8, 10],
              'class_weight':[None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters,cv=10,n_jobs= -1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use Amazon Titan Text Embeddings G1. This model generates vectors of 1,536 dimensions, and is trained to accept several languages and to retain the semantic meaning of the phrases embedded.

To implement the classifier, we employed a classic ML algorithm, SVM, using the scikit-learn Python module. The SVM algorithm requires the tuning of several parameters to achieve optimal performance. To determine the best parameter values, we conducted a grid search with 10-fold cross-validation, using the F1 multi-class score as the evaluation metric. This systematic approach allowed us to identify the following set of parameters that yielded the highest performance for our classifier:

C – We set this parameter to 1. This parameter controls the trade-off between allowing training errors and forcing rigid margins. It acts as a regularization parameter. A higher value of C (for example, 10) indicates a higher penalty for misclassification errors. This results in a more complex model that tries to fit the training data more closely. A higher C value can be beneficial when the classes in the data are well separated, because it allows the algorithm to create a more intricate decision boundary to accurately classify the samples. On the other hand, a C value of 1 indicates a reasonable balance between fitting the training set and the model’s generalization ability. This value might be appropriate when the data has a simple structure, and a more flexible model isn’t necessary to capture the underlying relationships. In our case, the selected C value of 1 suggests that the data has a relatively simple structure, and a balanced model with moderate complexity is sufficient for accurate classification.
class_weight – We set this parameter to None. This parameter adjusts the weights of each class during the training process. Setting class_weight to balanced automatically adjusts the weights inversely proportional to the class frequencies in the input data. This is particularly useful when dealing with imbalanced datasets, where one class is significantly more prevalent than the others. In our case, the value of None for the class_weight parameter suggests that the minor classes don’t have much relevance or impact on the overall classification task. This choice implies that the implicit geometry or decision boundaries learned by the model might not be optimized for separating the different classes effectively.
Kernel – We set this parameter to linear. This parameter specifies the type of kernel function to be used by the SVC algorithm. The linear kernel is a simple and efficient choice because it assumes that the decision boundary between classes can be represented by a linear hyperplane in the feature space. This value suggests that, in a higher dimension vector space, the categories could be linearly separated by an hyperplane.

SVM-based classifier: Cohere’s multilingual embeddings model

The implementation details of the classifier are presented in the following code:

from langchain.embeddings import BedrockEmbeddings

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

f1 = make_scorer(f1_score , average='weighted')

parameters = {'kernel':('linear', 'rbf','poly', 'sigmoid'), 
              'C':[1, 2, 4, 6, 8, 10],
              'class_weight':[None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters,cv=10,n_jobs= -1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use the Amazon Titan Text Embeddings G1 model, which generates vectors of 1,536 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.

For the classifier, we employ SVM, using the scikit-learn Python module. To obtain the optimal parameters, we performed a grid search with 10-fold cross-validation based on the multi-class F1 score, resulting in the following selected parameters (as detailed in the previous section):

C – We set this parameter to 1, which indicates a reasonable balance between fitting the training set and the model’s generalization ability. This setting suggests that the data has a simple structure and that a more flexible model might not be necessary to capture the underlying relationships.
class_weight – We set this parameter to None. A value of None suggests that the minor classes don’t have much relevance, which in turn implies that the implicit geometry might not be suitable for separating the different classes.
kernel – We set this parameter to linear. This value suggests that in a higher-dimensional vector space, the categories could be linearly separated by a hyperplane.

ANN-based classifier: Amazon Titan and Cohere’s multilingual embeddings model

Given the promising results obtained with SVMs, we decided to explore another geometry-based method by employing an Artificial Neural Network (ANN) approach.

In this case, we performed normalization of the input vectors to use the advantages of normalization when using neural networks. Normalizing the input data is a crucial step when working with ANNs, because it can help improve the model’s during training. We applied min/max scaling for normalization.

The use of an ANN-based approach provides the ability to capture complex non-linear relationships in the data, which might not be easily modeled using traditional linear methods like SVMs. The combination of the geometric insights and the normalization of inputs can potentially lead to improved predictive accuracy compared to the previous SVM results.

This approach consists of the following parameters:

Model definition – We define a sequential deep learning model using the Keras library from TensorFlow.
Model architecture – The model consists of three densely connected layers. The first layer has 16 neurons and uses the ReLU activation function. The second layer has 8 neurons and employs the ReLU activation function. The third layer has 3 neurons and uses the softmax activation function.
Model compilation – We compile the model using the categorical_crossentropy loss function, the Adam optimizer with a learning rate of 0.01, and the categorical_accuracy. We incorporate an EarlyStopping callback to stop the training if the categorical_accuracy metric doesn’t improve for 25 epochs.
Model training – We train the model for a maximum of 500 epochs using the training set and validate it on the test set. The batch size is set to 64. The performance metric used is the maximum classification accuracy (categorical_accuracy) obtained during the training.

We applied the same methodology, but using the embeddings generated by Cohere’s multilingual embeddings model after being normalized through min/max scaling. In both cases, we employed the same preprocessing steps:

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

df_train['sample'] = [s[:1500] for s in df_train['sample']]
df_test['sample'] = [s[:1500] for s in df_test['sample']]

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()

y_train_ohe = [ [int(y=='Conversation'), int(y=='Document_Translation'), int(y=='Services')] for y in y_train]
y_test = df_test['agent'].values.tolist()
y_test = [ ['Conversation', 'Document_Translation', 'Services'].index(y) for y in y_test]
X_test = df_test['sample'].values.tolist()

To help avoid ordinal assumptions, we employed a one-hot encoding representation for the output of the network. One-hot encoding doesn’t make any assumptions about the inherent order or hierarchy among the categories. This is particularly useful when the categorical variable doesn’t have a clear ordinal relationship, because the model can learn the relationships without being biased by any assumed ordering.

The following code illustrates the implementation:

def train_model( X, y, n_hebras = 10, reps = 30, train_size = 0.7, tipo_optimizacion = "low"):
    import threading

    reps_por_hebra = int(reps/n_hebras)
    hebras = [0]*n_hebras
    results = [0]*reps
    models = [0]*reps    
    
    for i in range(len(hebras)):
        hebras[i] = threading.Thread(target=eval_model_rep_times,
            args=(X, y, train_size, reps_por_hebra, i*reps_por_hebra, models, results))
        hebras[i].start()
        
    for i in range(len(hebras)):
        hebras[i].join()
        
    if tipo_optimizacion == "low":
        result = models[np.argmin(results)], min(results)
    else:
        result = models[np.argmax(results)], max(results)
    return result

def eval_model_rep_times(X, y, train_size, reps, index, models, results):
    for rep in range(reps):
        X_train, X_test, y_train, y_test = train_test_split( X, y, train_size = train_size)
        model, metric = create_and_fit_model(X_train, y_train, X_test, y_test) 
        models[index+rep] = model
        results[index+rep] = metric

def create_and_fit_model(X_train, y_train, X_test, y_test):
    ### DEFINITION GOES HERE ###
    model = Sequential() 
    model.add(Dense(16, input_shape = (len(X_train[0]),), activation='relu')  )
    model.add(Dense(8, activation='relu')  )
    model.add(Dense(3, activation='softmax' ))
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['categorical_accuracy'])
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="categorical_accuracy", patience=25, mode = 'max')
    ### DEFINITION GOES HERE ###
  
    ### TRAINING GOES HERE ###
    history = model.fit(X_train,
              y_train,
              epochs=500,
              validation_data = (X_test, y_test),
              batch_size=64,
              callbacks= early_stopping,
              verbose=0)
    ### TRAINING GOES HERE ###
    
    metrica = max(history.history['categorical_accuracy'])
    
    #ALWAYS RETURN THE MODEL
    return model, metrica

model, mse = train_model(X_train_emb_norm, y_train_ohe, 5, 20, tipo_optimizacion="high")
y_pred = [ est.argmax() for est in model.predict(X_test_emb_norm) ]

Results

We conducted a comparative analysis using the previously presented code and data. The models were assessed based on their F1 scores for the conversation, services, and document translation tasks, as well as their runtimes. The following table summarizes our results.

MODEL	CONVERSATION F1	SERVICES F1	DOCUMENT_ TRANSLATION F1	RUNTIME (Seconds)
LLM	0.81	0.22	0.46	1.2
LLM with examples	0.86	0.13	0.68	18
KNN – Amazon Titan Embedding	0.98	0.57	0.88	0.35
KNN – Cohere Embedding	0.96	0.72	0.72	0.35
SVM Amazon Titan Embedding	0.98	0.69	0.82	0.3
SVM Cohere Embedding	0.99	0.80	0.93	0.3
ANN Amazon Titan Embedding	0.98	0.60	0.87	0.15
ANN Cohere Embedding	0.99	0.77	0.96	0.15

As illustrated in the table, the SVM and ANN models using Cohere’s multilingual embeddings model demonstrated the strongest overall performance. The SVM with Cohere’s multilingual embeddings model achieved the highest F1 scores in two out of three tasks, reaching 0.99 for Conversation, 0.80 for Services, and 0.93 for Document_Translation. Similarly, the ANN with Cohere’s multilingual embeddings model also performed exceptionally well, with F1 scores of 0.99, 0.77, and 0.96 for the respective tasks.

In contrast, the LLM exhibited relatively lower F1 scores, particularly for the services (0.22) and document translation (0.46) tasks. However, the performance of the LLM improved when provided with examples, with the F1 score for document translation increasing from 0.46 to 0.68.

Regarding runtime, the k-NN, SVM, and ANN models demonstrated significantly faster inference times compared to the LLM. The k-NN and SVM models with both Amazon Titan and Cohere’s multilingual embeddings model had runtimes of approximately 0.3–0.35 seconds. The ANN models were even faster, with runtimes of approximately 0.15 seconds. In contrast, the LLM required approximately 1.2 seconds for inference, and the LLM with examples took around 18 seconds.

These results suggest that the SVM and ANN models using Cohere’s multilingual embeddings model offer the best balance of performance and efficiency for the given tasks. The superior F1 scores of these models, coupled with their faster runtimes, make them promising candidates for application. The potential benefits of providing examples to the LLM model are also noteworthy, because this approach can help improve its performance on specific tasks.

Conclusion

The optimization of AIDA, Applus IDIADA’s intelligent chatbot powered by Amazon Bedrock, has been a resounding success. By developing dedicated pipelines to handle different types of user interactions—from general conversations to specialized service requests and document translations—AIDA has significantly improved its efficiency, accuracy, and overall user experience. The innovative use of LLMs, embeddings, and advanced classification algorithms has allowed AIDA to adapt to the evolving needs of IDIADA’s workforce, providing a versatile and reliable virtual assistant. AIDA now handles over 1,000 interactions per day, with a 95% accuracy rate in routing requests to the appropriate pipeline and driving a 20% increase in their team’s productivity.

Looking ahead, IDIADA plans to offer AIDA as an integrated product for customer environments, further expanding the reach and impact of this transformative technology.

Amazon Bedrock offers a comprehensive approach to security, compliance, and responsible AI development that empowers IDIADA and other customers to harness the full potential of generative AI without compromising on safety and trust. As this advanced technology continues to rapidly evolve, Amazon Bedrock provides the transparent framework needed to build innovative applications that inspire confidence.

Unlock new growth opportunities by creating custom, secure AI models tailored to your organization’s unique needs. Take the first step in your generative AI transformation—connect with an AWS expert today to begin your journey.

About the Authors

Xavier Vizcaino is the head of the DataLab, in the Digital Solutions department of Applus IDIADA. DataLab is the unit focused on the development of solutions for generating value from the exploitation of data through artificial intelligence.

Diego Martín Montoro is an AI Expert and Machine Learning Engineer at Applus+ Idiada Datalab. With a Computer Science degree and a Master’s in Data Science, Diego has built his career in the field of artificial intelligence and machine learning. His experience includes roles as a Machine Learning Engineer at companies like AppliedIT and Applus+ IDIADA, where he has worked on developing advanced AI systems and anomaly detection solutions.

Jordi Sánchez Ferrer is the current Product Owner of the Datalab at Applus+ Idiada. A Computer Engineer with a Master’s degree in Data Science, Jordi’s trajectory includes roles as a Business Intelligence developer, Machine Learning engineer, and lead developer in Datalab. In his current role, Jordi combines his technical expertise with product management skills, leading strategic initiatives that align data science and AI projects with business objectives at Applus+ Idiada.

Daniel Colls is a professional with more than 25 years of experience who has lived through the digital transformation and the transition from the on-premises model to the cloud from different perspectives in the IT sector. For the past 3 years, as a Solutions Architect at AWS, he has made this experience available to his customers, helping them successfully implement or move their workloads to the cloud.

Accelerate IaC troubleshooting with Amazon Bedrock Agents

February 25, 2025

by Akhil Raj Yallamelli Amazon AWS

Troubleshooting infrastructure as code (IaC) errors often consumes valuable time and resources. Developers can spend multiple cycles searching for solutions across forums, troubleshooting repetitive issues, or trying to identify the root cause. These delays can lead to missed security errors or compliance violations, especially in complex, multi-account environments.

This post demonstrates how you can use Amazon Bedrock Agents to create an intelligent solution to streamline the resolution of Terraform and AWS CloudFormation code issues through context-aware troubleshooting. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents is a fully managed service that helps developers create AI agents that can break down complex tasks into steps and execute them using FMs and APIs to accomplish specific business objectives.

Our solution uses Amazon Bedrock Agents to analyze error messages and code context, generating detailed troubleshooting steps for IaC errors. In organizations with multi-account AWS environments, teams often maintain a centralized AWS environment for developers to deploy applications. This setup makes sure that AWS infrastructure deployments using IaC align with organizational security and compliance measures. For specific IaC errors related to these compliance measures, such as those involving service control policies (SCPs) or resource-based policies, our solution intelligently directs developers to contact appropriate teams like Security or Enablement. This targeted guidance maintains security protocols and makes sure that sensitive issues are handled by the right experts. The solution is flexible and can be adapted for similar use cases beyond these examples.

Although we focus on Terraform Cloud workspaces in this example, the same principles apply to GitLab CI/CD pipelines or other continuous integration and delivery (CI/CD) approaches executing IaC code. By automating initial error analysis and providing targeted solutions or guidance, you can improve operational efficiency and focus on solving complex infrastructure challenges within your organization’s compliance framework.

Solution overview

Before we dive into the deployment process, let’s walk through the key steps of the architecture as illustrated in the following figure.

The workflow for the Terraform solution is as follows:

Initial input through the Amazon Bedrock Agents chat console – The user begins by entering details about their Terraform error into the chat console for Amazon Bedrock Agents. This typically includes the Terraform Cloud workspace URL where the error occurred, and optionally, a Git repository URL and branch name if additional context is needed.
Error retrieval and context gathering – The Amazon Bedrock agent forwards these details to an action group that invokes the first AWS Lambda function (see the following Lambda function code). This function invokes another Lambda function (see the following Lambda function code) which retrieves the latest error message from the specified Terraform Cloud workspace. If a Git repository URL is provided, it also retrieves relevant Terraform files from the repository. This contextual information is then sent back to the first Lambda function.
Error analysis and response generation – Lambda function would then construct a detailed prompt that includes the error message, repository files (if available), and specific use case instructions. It then uses the Amazon Bedrock model to analyze the error and generate either troubleshooting steps or guidance to contact specific teams.
Interaction and user guidance – The agent displays the generated response to the user. For most Terraform errors, this includes detailed troubleshooting steps. For specific cases related to organizational policies (for example, service control policies or resource-based policies), the response directs the user to contact the appropriate team, such as Security or Enablement.
Continuous improvement – The solution can be continually updated with new specific use cases and organizational guidelines, making sure that the troubleshooting advice stays current with the organization’s evolving infrastructure and compliance requirements. For example:
1. SCP or IAM policy violations – Guides developers when they encounter permission issues due to SCPs or strict AWS Identity and Access Management (IAM) boundaries, offering alternatives or escalation paths.
2. VPC and networking restrictions – Flags non-compliant virtual private cloud (VPC) or subnet configurations (such as public subnets) and suggests security-compliant adjustments.
3. Encryption requirements – Detects missing or incorrect encryption for Amazon Simple Storage Service (Amazon S3) or Amazon Elastic Block Store (Amazon EBS) resources and recommends the appropriate configurations to align with compliance standards.

The following diagram illustrates the step-by-step process of how the solution works.

This solution streamlines the process of resolving Terraform errors, providing immediate, context-aware guidance to developers while making sure that sensitive or complex issues are directed to the appropriate teams. By using the capabilities of Amazon Bedrock Agents, it offers a scalable and intelligent approach to managing IaC challenges in large, multi-account AWS environments.

Prerequisites

To implement the solution, you need the following:

An understanding of Amazon Bedrock Agents, prompt engineering, Amazon Bedrock Knowledge Bases, Lambda functions, and IAM
An AWS account with appropriate IAM permissions to create agents and knowledge bases in Amazon Bedrock, Lambda functions, and IAM roles
A service role created for Amazon Bedrock Agents
Model access enabled for Amazon Bedrock
A GitLab account with a repository and a personal access token to access the repository

Create the Amazon Bedrock agent

To create and configure the Amazon Bedrock agent, complete the following steps:

On the Amazon Bedrock console, choose Agents in the navigation pane.
Choose Create agent.
Provide agent details, including agent name and description (optional).
Grant the agent permissions to AWS services through the IAM service role. This gives your agent access to required services, such as Lambda.
Select an FM from Amazon Bedrock (such as Anthropic’s Claude 3 Sonnet).
For troubleshooting Terraform errors through Amazon Bedrock Agents, attach the following instruction to the agent. This instruction makes sure that the agent gathers the required input from the user and executes the action group to provide detailed troubleshooting steps.

“You are a terraform code error specialist. Greet the user and ask for terraform workspace url, branch name, code repository url. Once received, trigger troubleshooting action group. Provide the troubleshooting steps to the user.”

Configure the Lambda function for the action group

After you configure the initial agent and add the preceding instruction to the agent, you need to create two Lambda functions:

The first Lambda function will be added to the action group, which is invoked by the Amazon Bedrock agent, and will subsequently trigger the second Lambda function using the invoke method. Refer to the Lambda function code for more details. Make sure the LAMBDA_2_FUNCTION_NAME environment variable is set.
The second Lambda function will handle fetching the Terraform workspace error and the associated Terraform code from GitLab. Refer to the Lambda function code. Make sure that the TERRAFORM_API_URL, TERRAFORM_SECRET_NAME, and VCS_SECRET_NAME environment variables are set.

After the Terraform workspace error and code details are retrieved, these details will be passed back to the first Lambda function, which will use the Amazon Bedrock API with an FM to generate and provide the appropriate troubleshooting steps based on the error and code information.

Add the action group to the Amazon Bedrock agent

Complete the following steps to add the action group to the Amazon Bedrock agent:

Add an action group to the Amazon Bedrock agent.
Assign a descriptive name (for example, troubleshooting) to the action group and provide a description. This helps clarify the purpose of the action group within the workflow.
For Action group type, select Define with function details.

For more details, see Define function details for your agent’s action groups in Amazon Bedrock.

For Action group invocation, choose the first Lambda function that you created previously.

This function runs the business logic required when an action is invoked. Make sure to choose the correct version of the first Lambda function. For more details on how to configure Lambda functions for action groups, see Configure Lambda functions to send information that an Amazon Bedrock agent elicits from the user.

For Action group function 1, provide a name and description.
Add the following parameters.

Name	Description	Type	Required
workspace_url	Terraform workspace url	string	True
repo_url	Code repository URL	string	True
branch_name	Code repository branch name	string	True

Test the solution

The following example is of a Terraform error due to a service control polcy. The troubleshooting steps provided would be aligned to address those specific constraints. The action group triggers the Lambda function, which follows structured single-shot prompting by passing the complete context—such as the error message and repository contents—in a single input to the Amazon Bedrock model to generate precise troubleshooting steps.

Example 1: The following screenshot shows an example of a Terraform error caused by an SCP limitation managed by the security team.

The following screenshot shows an example of the user interaction with Amazon Bedrock Agents and the troubleshooting steps provided.

Example 2: The following screenshot shows an example of a Terraform error due to a missing variable value.

The following screenshot shows an example of the user interaction with Amazon Bedrock Agents and the troubleshooting steps provided.

Clean up

The services used in this demo can incur costs. Complete the following steps to clean up your resources:

Delete the Lambda functions if they are no longer required.
Delete the action group and Amazon Bedrock agent you created.

Conclusion

IaC offers flexibility for managing cloud environments, but troubleshooting code errors can be time-consuming, especially in environments with strict organizational guardrails. This post demonstrated how Amazon Bedrock Agents, combined with action groups and generative AI models, streamlines and accelerates the resolution of Terraform errors while maintaining compliance with environment security and operational guidelines.

Using the capabilities of Amazon Bedrock Agents, developers can receive context-aware troubleshooting steps tailored to environment-related issues such as SCP or IAM violations, VPC restrictions, and encryption policies. The solution provides specific guidance based on the error’s context and directs users to the appropriate teams for issues that require further escalation. This reduces the time spent on IaC errors, improves developer productivity, and maintains organizational compliance.

Are you ready to streamline your cloud deployment process with the generative AI of Amazon Bedrock? Start by exploring the Amazon Bedrock User Guide to see how it can facilitate your organization’s transition to the cloud. For specialized assistance, consider engaging with AWS Professional Services to maximize the efficiency and benefits of using Amazon Bedrock.

About the Authors

Akhil Raj Yallamelli is a Cloud Infrastructure Architect at AWS, specializing in architecting cloud infrastructure solutions for enhanced data security and cost efficiency. He is experienced in integrating technical solutions with business strategies to create scalable, reliable, and secure cloud environments. Akhil enjoys developing solutions focusing on customer business outcomes, incorporating generative AI (Gen AI) technologies to drive innovation and cloud enablement. He holds an MS degree in Computer Science. Outside of his professional work, Akhil enjoys watching and playing sports.

Ebbey Thomas is a Senior Generative AI Specialist Solutions Architect at AWS. He designs and implements generative AI solutions that address specific customer business problems. He is recognized for simplifying complexity and delivering measurable business outcomes for clients. Ebbey holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University.

Derive generative AI powered insights from Alation Cloud Services using Amazon Q Business Custom Connector

February 25, 2025

by Gene Arnold Amazon AWS

This blog post is co-written with Gene Arnold from Alation.

To build a generative AI-based conversational application integrated with relevant data sources, an enterprise needs to invest time, money, and people. First, you would need build connectors to the data sources. Next you need to index this data to make it available for a Retrieval Augmented Generation (RAG) approach where relevant passages are delivered with high accuracy to a large language model (LLM). To do this, you need to select an index that provides the capabilities to index the content for semantic and vector search, build the infrastructure to retrieve data, rank the answers, and build a feature rich web application. Additionally, you might need to hire and staff a large team to build, maintain, and manage such a system.

Amazon Q Business is a fully managed generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code, and enterprise systems. To do this Amazon Q Business provides out-of-the-box native data source connectors that can index content into a built-in retriever and uses an LLM to provide accurate, well written answers. A data source connector is a component of Amazon Q Business that helps to integrate and synchronize data from multiple repositories into one index. Amazon Q Business offers multiple prebuilt connectors to a large number of data sources, including ServiceNow, Atlassian Confluence, Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, and many more. For a full list of supported data source connectors, see Amazon Q Business connectors.

However, many organizations store relevant information in the form of unstructured data on company intranets or within file systems on corporate networks that are inaccessible to Amazon Q Business using its native data source connectors. You can now use the custom data source connector within Amazon Q Business to upload content to your index from a wider range of data sources.

Using an Amazon Q Business custom data source connector, you can gain insights into your organization’s third party applications with the integration of generative AI and natural language processing. This post shows how to configure an Amazon Q Business custom connector and derive insights by creating a generative AI-powered conversation experience on AWS using Amazon Q Business while using access control lists (ACLs) to restrict access to documents based on user permissions.

Alation is a data intelligence company serving more than 600 global enterprises, including 40% of the Fortune 100. Customers rely on Alation to realize the value of their data and AI initiatives. Headquartered in Redwood City, California, Alation is an AWS Specialization Partner and AWS Marketplace Seller with Data and Analytics Competency. Organizations trust Alation’s platform for self-service analytics, cloud transformation, data governance, and AI-ready data, fostering innovation at scale. In this post, we will showcase a sample of how Alation’s business policies can be integrated with an Amazon Q Business application using a custom data source connector.

Finding accurate answers from content in custom data sources using Amazon Q Business

After you integrate Amazon Q Business with data sources such as Alation, users can ask questions from the description of the document. For example,

What are the top sections of the HR benefits policies?
Who are the data stewards for my proprietary database sources?

Overview of a custom connector

A data source connector is a mechanism for integrating and synchronizing data from multiple repositories into one container index. Amazon Q Business offers multiple pre-built data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. However, if you have valuable data residing in spots for which those pre-built connectors cannot be used, you can use a custom connector.

When you connect Amazon Q Business to a data source and initiate the data synchronization process, Amazon Q Business crawls and adds documents from the data source to its index.

You would typically use an Amazon Q Business custom connector when you have a repository that Amazon Business doesn’t yet provide a data source connector for. Amazon Q Business only provides metric information that you can use to monitor your data source sync jobs. You must create and run the crawler that determines the documents your data source indexes. A simple architectural representation of the steps involved is shown in the following figure.

Solution overview

The solution shown of integrating Alation’s business policies is for demonstration purposes only. We recommend running similar scripts only on your own data sources after consulting with the team who manages them, or be sure to follow the terms of service for the sources that you’re trying to fetch data from. The steps involved for other custom data sources are very similar except the part where we connect to Alation and fetch data from it. To crawl and index contents in Alation you configure an Amazon Q Business custom connector as a data source in your Amazon Q Business application.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
Access to the Alation service with the ability to create new policies and access tokens. You can verify if you have access by navigating to https://[[your-domain]].alationcloud.com/admin/auth/ and see the OAuth Client Applications. Alation admins can navigate to https://[[your-domain]].alationcloud.com/admin/users/ and change user access if needed.
Privileges to create an Amazon Q Business application, AWS resources, and AWS Identity and Access Management (IAM) roles and policies.
Basic knowledge of AWS services and working knowledge of Alation or other data sources of choice.
Set up AWS IAM Identity Center integration with Amazon Q Business for user management.
Set up SageMaker Studio notebook and ensure the execution role on it has the necessary privileges to access both the Amazon Q Business application (specifically StartDataSourceSyncJob, BatchPutDocument, and StopDataSourceSyncJob permissions) and the AWS Secrets Manager secret (GetSecretValue). Additionally, it’s recommended that the policy restricts access to only the Amazon Q Business application Amazon Resource Name (ARN) and the Secrets Manager secret created in the following steps.

Configure your Alation connection

In your Alation cloud account, create an OAuth2 client application that can be consumed from an Amazon Q Business application.

In Alation, sign in as a user with administrator privileges, navigate to the settings page, and choose Authentication (https://[[your-domain]].alationcloud.com/admin/auth/).

In the OAuth Client Applications section, choose Add.

Enter an easily identifiable application name, and choose Save.

Take note of the OAuth client application data—the Client ID and the Client Secret—created and choose Close.

As a security best practice, storing the client application data in Secrets Manager is recommended. In AWS console, navigate to AWS Secrets Manager and add a new secret. Key in the Client_Id and Client_Secret values copied from the previous step.

Provide a name and description for the secret and choose Next.

Leave the defaults and choose Next.

Choose Store in the last page.

Create sample Alation policies

In our example, you would create three different sets of Alation policies for a fictional organization named Unicorn Rentals. Grouped as Workplace, HR, and Regulatory, each policy contains a rough two-page summary of crucial organizational items of interest. You can find details on how to create policies on Alation documentation.

On the Amazon Q Business side, let’s assume that we want to ensure that the following access policies are enforced. Users and access are setup via code illustrated in later sections.

#	First name	Last name	Policies authorized for access
1	Alejandro	Rosalez	Workplace, HR, and Regulatory
2	Sofia	Martinez	Workplace and HR
3	Diego	Ramirez	Workplace and Regulatory

Create an Amazon Q Business application

On the Amazon Q Business console, choose Get Started.

On the Applications page, choose Create application.

In the first step of the Create application wizard, enter the default values. Additionally, you need to choose a list of users who require access to the Amazon Q Business application by including them through the IAM Identity Center settings.

In the access management settings page, you would create and add users via AWS IAM Identity Center.

Once all users are added, choose Create.

After the application is created, take note of the Application ID value from the landing page.

Next is to choose an index type for the Amazon Q Business application. Choose the native retriever option.

After the index is created, verify that the status has changed to Active. You can then take a note of the Index ID.

Next step is for you to add the custom data source.

Search for Custom data source and choose the plus sign next to it.

Provide a name and description for the custom data source.

Once done, choose Add data source.

After the data source is added and its status is Active, take note of the Data source ID.

Load policy data from Alation to Amazon Q Business using the custom connector

Now let’s load the Alation data into Amazon Q Business using the correct access permissions. The code examples that follow are also available on the accompanying GitHub code repository.

With the connector ready, move over to the SageMaker Studio notebook and perform data synchronization operations by invoking Amazon Q Business APIs.

To start, retrieve the Alation OAuth client application credentials stored in Secrets Manager.

secrets_manager_client = boto3.client('secretsmanager')
secret_name = "alation_test"

try:
    get_secret_value_response = secrets_manager_client.get_secret_value(
        SecretId=secret_name
    )
    secret = eval(get_secret_value_response['SecretString'])

except ClientError as e:
        raise e

Next, initiate the connection using the OAuth client application credentials from Alation.

base_url = "https://[[your-domain]].alationcloud.com"
token_url = "/oauth/v2/token/"
introspect_url = "/oauth/v2/introspect/"
jwks_url = "/oauth/v2/.well-known/jwks.json/"

api_url = base_url + token_url
data = {
        "grant_type": "client_credentials",
       }
client_id = secret['Client_Id']
client_secret = secret['Client_Secret']

auth = HTTPBasicAuth(username=client_id, password=client_secret)
response = requests.post(url=api_url, data=data, auth=auth)
print(response.json())

access_token = response.json().get('access_token','')
api_url = base_url + introspect_url + "?verify_token=true"
data = {
        "token": access_token,
       }
response = requests.post(url=api_url, data=data, auth=auth)

You then configure policy type level user access. This section can be customized based on how user access information is stored on any data sources. Here, we assume a pre-set access based on the user’s email IDs.

primary_principal_list = []
workplace_policy_principals = []
hr_policy_principals = []
regulatory_policy_principals = []

principal_user_email_ids = ['alejandro_rosalez@example.com', ‘sofia_martinez@example.com', ‘diego_martinez@example.com']

workplace_policy_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com', 'diego_ramirez@example.com']
hr_policy_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com']
regulatory_policy_email_ids = ['alejandro_rosalez@example.com', 'diego_ramirez@example.com']

for workplace_policy_member in workplace_policy_email_ids:
    workplace_policy_members_dict = { 'user': { 'id': workplace_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE' }}
    workplace_policy_principals.append(workplace_policy_members_dict)
    if workplace_policy_member not in primary_principal_list:
        primary_principal_list.append(workplace_policy_member)

for hr_policy_member in hr_policy_email_ids:
    hr_policy_members_dict = { 'user': { 'id': hr_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE' }}
    hr_policy_principals.append(hr_policy_members_dict)
    if hr_policy_member not in primary_principal_list:
        primary_principal_list.append(hr_policy_member)
        
for regulatory_policy_member in regulatory_policy_email_ids:
    regulatory_policy_members_dict = { 'user': { 'id': regulatory_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE' }}
    regulatory_policy_principals.append(regulatory_policy_members_dict)
    if regulatory_policy_member not in primary_principal_list:
        primary_principal_list.append(regulatory_policy_member)

You then pull individual policy details from Alation. This step can be repeated for all three policy types: Workplace, HR, and regulatory

url = "https://[[your-domain]].com/integration/v1/business_policies/?limit=200&skip=0&search=[[Workplace/HR/Regulatory]]&deleted=false"

headers = {
    "accept": "application/json",
    "TOKEN": access_token
}

response = requests.get(url, headers=headers)
policy_data = ""

for policy in json.loads(response.text):
    if policy["title"] is not None:
        policy_title = cleanhtml(policy["title"])
    else:
        policy_title = "None"
    if policy["description"] is not None:
        policy_description = cleanhtml(policy["description"])
    else:
        policy_description = "None"
    temp_data = policy_title + ":n" + policy_description + "nn"
    policy_data += temp_data

The next step is to define the Amazon Q Business application, index, and data source information that you created in the previous steps.

qbusiness_client = boto3.client('qbusiness')
application_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
index_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
data_source_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

Now you explicitly create the users in Amazon Q Business. Individual user access to different policy type data sets is configured later.

for principal in primary_principal_list:
    create_user_response = qbusiness_client.create_user(
        applicationId=application_id,
        userId=principal,
        userAliases=[
            {
                'indexId': index_id,
                'dataSourceId': data_source_id,
                'userId': principal
            },
        ],
    )

for principal in primary_principal_list:
    get_user_response = qbusiness_client.get_user(
        applicationId=application_id,
        userId=principal
    )
    for user_alias in get_user_response['userAliases']:
        if "dataSourceId" in user_alias:
            print(user_alias['userId'])

For each policy type data set (Workplace, HR, and Regulatory), we execute the following three steps.

Start an Amazon Q Business data source sync job.

start_data_source_sync_job_response = qbusiness_client.start_data_source_sync_job(
    dataSourceId = data_source_id,
    indexId = index_id,
    applicationId = application_id
)
job_execution_id = start_data_source_sync_job_response['executionId']

Encode and batch upload data with user access mapping.

workplace_policy_document_id = hashlib.shake_256(policy_data.encode('utf-8')).hexdigest(128)
    docs = [ {
        "id": policy_document_id,
        "content" : {
            'blob': policy_data.encode('utf-8')
        },
        "contentType": "PLAIN_TEXT",
        "title": "Unicorn Rentals – Workplace/HR/Regulatory Policy",
        "accessConfiguration" : { 'accessControls': [ { 'principals': [[xx]]_policy_principals } ] }   
    }    
    ]
    
    batch_put_document_response = qbusiness_client.batch_put_document(
        applicationId = application_id,
        indexId = index_id,
        dataSourceSyncId = job_execution_id,
        documents = docs,
    )

Stop the data source sync job and wait for the data set to be indexed.

stop_data_source_sync_job_response = qbusiness_client.stop_data_source_sync_job(
        dataSourceId = data_source_id,
        indexId = index_id,
        applicationId = application_id
    )
    max_time = time.time() + 1*60*60
    found = False
    while time.time() < max_time and bool(found) == False:
        list_documents_response = qbusiness_client.list_documents(
            applicationId=application_id,
            indexId=index_id
        )
        if list_documents_response:
            for document in list_documents_response["documentDetailList"]:
                if document["documentId"] == workplace_policy_document_id:
                    status = document["status"]
                    print(status)
                    if status == "INDEXED" or status == "FAILED" or status == "DOCUMENT_FAILED_TO_INDEX" or status == "UPDATED":
                        found = True        
                    else:
                        time.sleep(10)        
except:
    print("Exception when calling API")

Go back to the Amazon Q Business console and see if the data uploads were successful.

Find and open the custom data source from the list of data sources.

Ensure the ingested documents are added in the Sync history tab and are in the Completed status.

Also ensure the Last sync status for the custom data source connector is Completed.

Run queries with the Amazon Q Business web experience

Now that the data synchronization is complete, you can start exploring insights from Amazon Q Business. With the newly created Amazon Q Business application, select the Web Application settings tab and navigate to the auto-created URL. This will open a new tab with a preview of the user interface and options that you can customize to fit your use case.

Sign in as user Alejandro Rosales. As you might recall, Alejandro has access to all three policy type data sets (Workplace, HR and Regulator).
1. Start by asking a question about HR policy, such as “Per the HR Payroll Policy of Unicorn Rents, what are some additional voluntary deductions taken from employee paychecks.” Note how Q Business provides an answers and also shows where it pulled the answer from.
1. Next, ask a question about a Regulatory policy: “Per the PCI DSS compliance policy of Unicorn Rentals, how is the third-party service provider access to cardholder information protected?” The result includes the summarized answer on PCI DSS compliance and also shows sources where it gathered the data from.
1. Lastly, see how Amazon Q Business responds when asked a question about generic workplace policy. “What does Unicorn Rentals do to protect information of children under the age of 13.” In this case, the application returns the answer and marks it as a Workplace policy question.

Let’s next sign in as Sofia Martinez. Sofia has access to HR and Workplace policy types, but not to Regulatory policies.
1. Start by asking a question about HR policy: “Per the HR Payroll Policy of Unicorn Rentals, list the additional voluntary deductions taken from employee paychecks.” Note how Q Business list the deductions and cite policy where the answer is gathered from.
1. Next, ask a Regulatory policy question: “What are the record keeping requirements mentioned in the ECOA compliance policy of Unicorn Rentals?”. Note how Amazon Q Business contextually answers the question mentioning Sofia does not have access to that data –

Finally, sign in as Diego Ramirez. Diego has access to Workplace and Regulatory policies but not to HR policies.
1. Start by asking the same Regulatory policy question that: “Per the PCI DSS compliance policy of Unicorn Rentals, how is third-party service provider access to cardholder information protected?”. Since Diego has access to Regulatory policy data, expected answer is generated.
1. Next, when Diego asks a question about a HR policy: “Per the HR Compensation Policy of Unicorn Rentals, how is job pricing determined?.” Note how Amazon Q Business contextually answers the question mentioning Diego does not have access to that data.

Troubleshooting

If you’re unable to get answers to any of your questions and get the message “Sorry, I could not find relevant information to complete your request,” check to see if any of the following issues apply:

No permissions: ACLs applied to your account doesn’t allow you to query certain data sources. If this is the case, please reach out to your application administrator to ensure your ACLs are configured to access the data sources.
EmailID not matching UserID: In rare scenarios, a user might have a different email ID associated with the Amazon Q Business Identity Center connection than is associated in the data source’s user profile. Make sure that the Amazon Q Business user profile is updated to recognize the email ID using the update-user CLI command or the related API call.
Data connector sync failed: Data connector fails to synchronize information from the source to Amazon Q Business application. Verify the data connectors sync run schedule and sync history to help ensure that the synchronization is successful.
Empty or private data sources: Private or empty projects will not be crawled during the synchronization run.

If none of the above are true then open a support case to get this resolved.

Clean up

To avoid incurring future charges, clean up any resources created as part of this solution. Delete the Amazon Q Business custom connector data source and client application created in Alation and the Amazon Q Business application. Next, delete the Secrets Manager secret with Alation OAuth client application credential data. Also, delete the user management setup in IAM Identity Center and the SageMaker Studio domain.

Conclusion

In this post, we discussed how to configure the Amazon Q Business custom connector to crawl and index tasks from Alation as a sample. We showed how you can use Amazon Q Business generative AI-based search to enable your business leaders and agents discover insights from your enterprise data.

To learn more about the Amazon Q Business custom connector, see the Amazon Q Business developer guide. To learn more about Alation Data Catalog, which is available for purchase through AWS Marketplace. Speak to your Alation account representative for custom purchase options. For any additional information, contact your Alation business partner.

Alation – AWS Partner Spotlight

Alation is an AWS Specialization Partner that has pioneered the modern data catalog and is making the leap into a full-service source for data intelligence. Alation is passionate about helping enterprises create thriving data cultures where anyone can find, understand, and trust data.

Contact Alation | Partner Overview | AWS Marketplace

About the Authors

Gene Arnold is a Product Architect with Alation’s Forward Deployed Engineering team. A curious learner with over 25 years of experience, Gene focuses how to sharpen selling skills and constantly explores new product lines.

Prabhakar Chandrasekaran is a Senior Technical Account Manager with AWS Enterprise Support. Prabhakar enjoys helping customers build cutting-edge AI/ML solutions on the cloud. He also works with enterprise customers providing proactive guidance and operational assistance, helping them improve the value of their solutions when using AWS. Prabhakar holds eight AWS and seven other professional certifications. With over 21 years of professional experience, Prabhakar was a data engineer and a program leader in the financial services space prior to joining AWS.

Sindhu Jambunathan is a Senior Solutions Architect at AWS, specializing in supporting ISV customers in the data and generative AI vertical to build scalable, reliable, secure, and cost-effective solutions on AWS. With over 13 years of industry experience, she joined AWS in May 2021 after a successful tenure as a Senior Software Engineer at Microsoft. Sindhu’s diverse background includes engineering roles at Qualcomm and Rockwell Collins, complemented by a Master’s of Science in Computer Engineering from the University of Florida. Her technical expertise is balanced by a passion for culinary exploration, travel, and outdoor activities.

Prateek Jain is a Sr. Solutions Architect with AWS, based out of Atlanta Georgia. He is passionate about GenAI and helping customers build amazing solutions on AWS. In his free time, he enjoys spending time with Family and playing tennis.

Mistral-Small-24B-Instruct-2501 is now available on SageMaker Jumpstart and Amazon Bedrock Marketplace

February 24, 2025

by Niithiyn Vijeaswaran Amazon AWS

Today, we’re excited to announce that Mistral-Small-24B-Instruct-2501—a twenty-four billion parameter large language model (LLM) from Mistral AI that’s optimized for low latency text generation tasks—is available for customers through Amazon SageMaker JumpStart and Amazon Bedrock Marketplace. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that developers can use to discover, test, and use over 100 popular, emerging, and specialized foundation models (FMs) alongside the current selection of industry-leading models in Amazon Bedrock. These models are in addition to the industry-leading models that are already available on Amazon Bedrock. You can also use this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference. In this post, we walk through how to discover, deploy, and use Mistral-Small-24B-Instruct-2501.

Overview of Mistral Small 3 (2501)

Mistral Small 3 (2501), a latency-optimized 24B-parameter model released under Apache 2.0 maintains a balance between performance and computational efficiency. Mistral offers both the pretrained (Mistral-Small-24B-Base-2501) and instruction-tuned (Mistral-Small-24B-Instruct-2501) checkpoints of the model under Apache 2.0. Mistral Small 3 (2501) features a 32 k token context window. According to Mistral, the model demonstrates strong performance in code, math, general knowledge, and instruction following compared to its peers. Mistral Small 3 (2501) is designed for the 80% of generative AI tasks that require robust language and instruction following performance with very low latency. The instruction-tuning process is focused on improving the model’s ability to follow complex directions, maintain coherent conversations, and generate accurate, context-aware responses. The 2501 version follows previous iterations (Mistral-Small-2409 and Mistral-Small-2402) released in 2024, incorporating improvements in instruction-following and reliability. Currently, the instruct version of this model, Mistral-Small-24B-Instruct-2501 is available for customers to deploy and use on SageMaker JumpStart and Bedrock Marketplace.

Optimized for conversational assistance

Mistral Small 3 (2501) excels in scenarios where quick, accurate responses are critical, such as in virtual assistants. This includes virtual assistants where users expect immediate feedback and near real-time interactions. Mistral Small 3 (2501) can handle rapid function execution when used as part of automated or agentic workflows. The architecture is designed to typically respond in less than 100 milliseconds, according to Mistral, making it ideal for customer service automation, interactive assistance, live chat, and content moderation.

Performance metrics and benchmarks

According to Mistral, the instruction-tuned version of the model achieves over 81% accuracy on Massive Multitask Language Understanding (MMLU) with 150 tokens per second latency, making it currently the most efficient model in its category. In third-party evaluations conducted by Mistral, the model demonstrates competitive performance against larger models such as Llama 3.3 70B and Qwen 32B. Notably, Mistral claims that the model performs at the same level as Llama 3.3 70B instruct and is more than three times faster on the same hardware.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

You can now discover and deploy Mistral models in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and under your VPC controls, helping to support data security for enterprise security needs.

Prerequisites

To try Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, see Identity and Access Management for Amazon SageMaker.
Access to Amazon SageMaker Studio, a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the model.

Amazon Bedrock Marketplace overview

To get started, in the AWS Management Console for Amazon Bedrock, select Model catalog in the Foundation models section of the navigation pane. Here, you can search for models that help you with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. You can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).

Deploy Mistral-Small-24B-Instruct-2501 in Amazon Bedrock Marketplace

To access Mistral-Small-24B-Instruct-2501 in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, select Model catalog under Foundation models in the navigation pane.

At the time of writing this post, you can use the InvokeModel API to invoke the model. It doesn’t support Converse APIs or other Amazon Bedrock tooling.

Filter for Mistral as a provider and select the Mistral-Small-24B-Instruct-2501

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

The page also includes deployment options and licensing information to help you get started with Mistral-Small-24B-Instruct-2501 in your applications.

To begin using Mistral-Small-24B-Instruct-2501, choose Deploy.
You will be prompted to configure the deployment details for Mistral-Small-24B-Instruct-2501. The model ID will be pre-populated.
1. For Endpoint name, enter an endpoint name (up to 50 alphanumeric characters).
2. For Number of instances, enter a number between 1and 100.
3. For Instance type, select your instance type. For optimal performance with Mistral-Small-24B-Instruct-2501, a GPU-based instance type such as ml.g6.12xlarge is recommended.
4. Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.
Choose Deploy to begin using the model.

When the deployment is complete, you can test Mistral-Small-24B-Instruct-2501 capabilities directly in the Amazon Bedrock playground.

Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters such as temperature and maximum length.

When using Mistral-Small-24B-Instruct-2501 with the Amazon Bedrock InvokeModel and Playground console, use DeepSeek’s chat template for optimal results. For example, <｜begin▁of▁sentence｜><｜User｜>content for inference<｜Assistant｜>.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.

You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with Amazon Bedrock APIs, you need to get the endpoint Amazon Resource Name (ARN).

Discover Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart

You can access Mistral-Small-24B-Instruct-2501 through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more information about how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.

In the SageMaker Studio console, access SageMaker JumpStart by choosing JumpStart in the navigation pane.
Select HuggingFace.
From the SageMaker JumpStart landing page, search for Mistral-Small-24B-Instruct-2501 using the search box.
Select a model card to view details about the model such as license, data used to train, and how to use the model. Choose Deploy to deploy the model and create an endpoint.

Deploy Mistral-Small-24B-Instruct-2501 with the SageMaker SDK

Deployment starts when you choose Deploy. After deployment finishes, you will see that an endpoint is created. Test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

To deploy using the SDK, start by selecting the Mistral-Small-24B-Instruct-2501 model, specified by the model_id with the value mistral-small-24B-instruct-2501. You can deploy your choice of the selected models on SageMaker using the following code. Similarly, you can deploy Mistral-Small-24b-Instruct-2501 using its model ID.
```
from sagemaker.jumpstart.model import JumpStartModel 

accept_eula = True 

model = JumpStartModel(model_id="huggingface-llm-mistral-small-24b-instruct-2501") 
predictor = model.deploy(accept_eula=accept_eula)
```

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The EULA value must be explicitly defined as True to accept the end-user license agreement (EULA). See AWS service quotas for how to request a service quota increase.

After the model is deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

prompt = "Hello!"
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 4000,
    "temperature": 0.1,
    "top_p": 0.9,
}
    
response = predictor.predict(payload)
print(response['choices'][0]['message']['content'])

Retail math example

Here’s an example of how Mistral-Small-24B-Instruct-2501 can break down a common shopping scenario. In this case, you ask the model to calculate the final price of a shirt after applying multiple discounts—a situation many of us face while shopping. Notice how the model provides a clear, step-by-step solution to follow.

prompt = "A store is having a 20% off sale, and you have an additional 10% off coupon. If you buy a shirt that originally costs $50, how much will you pay?"
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 1000,
    "temperature": 0.1,
    "top_p": 0.9,
}
    
response = predictor.predict(payload)
print(response['choices'][0]['message']['content'])

The following is the output:

First, we'll apply the 20% off sale discount to the original price of the shirt.

20% of $50 is calculated as:
0.20 * $50 = $10

So, the price after the 20% discount is:
$50 - $10 = $40

Next, we'll apply the additional 10% off coupon to the new price of $40.

10% of $40 is calculated as:
0.10 * $40 = $4

So, the price after the additional 10% discount is:
$40 - $4 = $36

Therefore, you will pay $36 for the shirt.

The response shows clear step-by-step reasoning without introducing incorrect information or hallucinated facts. Each mathematical step is explicitly shown, making it simple to verify the accuracy of the calculations.

Clean up

To avoid unwanted charges, complete the following steps in this section to clean up your resources.

Delete the Amazon Bedrock Marketplace deployment

If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, select Marketplace deployments.
In the Managed deployments section, locate the endpoint you want to delete.
Select the endpoint, and on the Actions menu, select Delete.
Verify the endpoint details to make sure you’re deleting the correct deployment:
1. Endpoint name
2. Model name
3. Endpoint status
Choose Delete to delete the endpoint.
In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor

After you’re done running the notebook, make sure to delete all resources that you created in the process to avoid additional billing. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Mistral-Small-24B-Instruct-2501 in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

For more Mistral resources on AWS, check out the Mistral-on-AWS GitHub repo.

About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Shane Rai is a Principal Generative AI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML services offered by AWS, including model offerings from top tier foundation model providers.

Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

How Rocket Companies modernized their data science solution on AWS

February 21, 2025

by Dian Xu, Joel Hawkins Amazon AWS

This post was written with Dian Xu and Joel Hawkins of Rocket Companies.

Rocket Companies is a Detroit-based FinTech company with a mission to “Help Everyone Home”. With the current housing shortage and affordability concerns, Rocket simplifies the homeownership process through an intuitive and AI-driven experience. This comprehensive framework streamlines every step of the homeownership journey, empowering consumers to search, purchase, and manage home financing effortlessly. Rocket integrates home search, financing, and servicing in a single environment, providing a seamless and efficient experience.

The Rocket brand is a synonym for offering simple, fast, and trustworthy digital solutions for complex transactions. Rocket is dedicated to helping clients realize their dream of homeownership and financial freedom. Since its inception, Rocket has grown from a single mortgage lender to an network of businesses that creates new opportunities for its clients.

Rocket takes a complicated process and uses technology to make it simpler. Applying for a mortgage can be complex and time-consuming. That’s why we use advanced technology and data analytics to streamline every step of the homeownership experience, from application to closing. By analyzing a wide range of data points, we’re able to quickly and accurately assess the risk associated with a loan, enabling us to make more informed lending decisions and get our clients the financing they need.

Our goal at Rocket is to provide a personalized experience for both our current and prospective clients. Rocket’s diverse product offerings can be customized to meet specific client needs, while our team of skilled bankers must match with the best client opportunities that align with their skills and knowledge. Maintaining strong relationships with our large, loyal client base and hedge positions to cover financial obligations is key to our success. With the volume of business we do, even small improvements can have a significant impact.

In this post, we share how we modernized Rocket’s data science solution on AWS to increase the speed to delivery from eight weeks to under one hour, improve operational stability and support by reducing incident tickets by over 99% in 18 months, power 10 million automated data science and AI decisions made daily, and provide a seamless data science development experience.

Rocket’s legacy data science environment challenges

Rocket’s previous data science solution was built around Apache Spark and combined the use of a legacy version of the Hadoop environment and vendor-provided Data Science Experience development tools. The Hadoop environment was hosted on Amazon Elastic Compute Cloud (Amazon EC2) servers, managed in-house by Rocket’s technology team, while the data science experience infrastructure was hosted on premises. Communication between the two systems was established through Kerberized Apache Livy (HTTPS) connections over AWS PrivateLink.

Data exploration and model development were conducted using well-known machine learning (ML) tools such as Jupyter or Apache Zeppelin notebooks. Apache Hive was used to provide a tabular interface to data stored in HDFS, and to integrate with Apache Spark SQL. Apache HBase was employed to offer real-time key-based access to data. Model training and scoring was performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which was part of the Hadoop implementation.

Despite the benefits of this architecture, Rocket faced challenges that limited its effectiveness:

Accessibility limitations: The data lake was stored in HDFS and only accessible from the Hadoop environment, hindering integration with other data sources. This also led to a backlog of data that needed to be ingested.
Steep learning curve for data scientists: Many of Rocket’s data scientists did not have experience with Spark, which had a more nuanced programming model compared to other popular ML solutions like scikit-learn. This created a challenge for data scientists to become productive.
Responsibility for maintenance and troubleshooting: Rocket’s DevOps/Technology team was responsible for all upgrades, scaling, and troubleshooting of the Hadoop cluster, which was installed on bare EC2 instances. This resulted in a backlog of issues with both vendors that remained unresolved.
Balancing development vs. production demands: Rocket had to manage work queues between development and production, which were always competing for the same resources.
Deployment challenges: Rocket sought to support more real-time and streaming inferencing use cases, but this was limited by the capabilities of MLeap for real-time models and Spark Streaming for streaming use cases, which were still experimental at that time.
Inadequate data security and DevOps support – The previous solution lacked robust security measures, and there was limited support for development and operations of the data science work.

Rocket’s legacy data science architecture is shown in the following diagram.

The diagram depicts the flow; the key components are detailed below:

Data Ingestion: Data is ingested into the system using Attunity data ingestion in Spark SQL.
Data Storage and Processing: All compute is done as Spark jobs inside of a Hadoop cluster using Apache Livy and Spark. Data is stored in HDFS and is accessed via Hive, which provides a tabular interface to the data and integrates with Spark SQL. HBase is employed to offer real-time key-based access to data.
Model Development: Data exploration and model development are conducted using tools such as Jupyter or Orchestration, which communicate with the Spark server over Kerberized Livy connection.
Model Training and Scoring: Model training and scoring is performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which is part of the Hadoop implementation.

Rocket’s migration journey

At Rocket, we believe in the power of continuous improvement and constantly seek out new opportunities. One such opportunity is using data science solutions, but to do so, we must have a strong and flexible data science environment.

To address the legacy data science environment challenges, Rocket decided to migrate its ML workloads to the Amazon SageMaker AI suite. This would allow us to deliver more personalized experiences and understand our customers better. To promote the success of this migration, we collaborated with the AWS team to create automated and intelligent digital experiences that demonstrated Rocket’s understanding of its clients and kept them connected.

We implemented an AWS multi-account strategy, standing up Amazon SageMaker Studio in a build account using a network-isolated Amazon VPC. This allows us to separate development and production environments, while also improving our security stance.

We moved our new work to SageMaker Studio and our legacy Hadoop workloads to Amazon EMR, connecting to the old Hadoop cluster using Livy and SageMaker notebooks to ease the transition. This gives us access to a wider range of tools and technologies, enabling us to choose the most appropriate ones for each problem we’re trying to solve.

In addition, we moved our data from HDFS to Amazon Simple Storage Service (Amazon S3), and now use Amazon Athena and AWS Lake Formation to provide proper access controls to production data. This makes it easier to access and analyze the data, and to integrate it with other systems. The team also provides secure interactive integration through Amazon Elastic Kubernetes Service (Amazon EKS), further improving the company’s security stance.

SageMaker AI has been instrumental in empowering our data science community with the flexibility to choose the most appropriate tools and technologies for each problem, resulting in faster development cycles and higher model accuracy. With SageMaker Studio, our data scientists can seamlessly develop, train, and deploy models without the need for additional infrastructure management.

As a result of this modernization effort, SageMaker AI enabled Rocket to scale our data science solution across Rocket Companies and integrate using a hub-and-spoke model. The ability of SageMaker AI to automatically provision and manage instances has allowed us to focus on our data science work rather than infrastructure management, increasing the number of models in production by five times and data scientists’ productivity by 80%.

Our data scientists are empowered to use the most appropriate technology for the problem at hand, and our security stance has improved. Rocket can now compartmentalize data and compute, as well as compartmentalize development and production. Additionally, we are able to provide model tracking and lineage using Amazon SageMaker Experiments and artifacts discoverable using the SageMaker model registry and Amazon SageMaker Feature Store. All the data science work has now been migrated onto SageMaker, and all the old Hadoop work has been migrated to Amazon EMR.

Overall, SageMaker AI has played a critical role in enabling Rocket’s modernization journey by building a more scalable and flexible ML framework, reducing operational burden, improving model accuracy, and accelerating deployment times.

The successful modernization allowed Rocket to overcome our previous limitations and better support our data science efforts. We were able to improve our security stance, make work more traceable and discoverable, and give our data scientists the flexibility to choose the most appropriate tools and technologies for each problem. This has helped us better serve our customers and drive business growth.

Rocket’s new data science solution architecture on AWS is shown in the following diagram.

The solution consists of the following components:

Data ingestion: Data is ingested into the data account from on-premises and external sources.
Data refinement: Raw data is refined into consumable layers (raw, processed, conformed, and analytical) using a combination of AWS Glue extract, transform, and load (ETL) jobs and EMR jobs.
Data access: Refined data is registered in the data account’s AWS Glue Data Catalog and exposed to other accounts via Lake Formation. Analytic data is stored in Amazon Redshift. Lake Formation makes this data available to both the build and compute accounts. For the build account, access to production data is restricted to read-only.
Development: Data science development is done using SageMaker Studio. Data engineering development is done using AWS Glue Studio. Both disciplines have access to Amazon EMR for Spark development. Data scientists have access to the entire SageMaker ecosystem in the build account.
Deployment: SageMaker trained models developed in the build account are registered with an MLFlow instance. Code artifacts for both data science activities and data engineering activities are stored in Git. Deployment initiation is controlled as part of CI/CD.
Workflows: We have a number of workflow triggers. For online scoring, we typically provide an external-facing endpoint using Amazon EKS with Istio. We have numerous jobs that are launched by AWS Lambda functions that in turn are triggered by timers or events. Processes that run may include AWS Glue ETL jobs, EMR jobs for additional data transformations or model training and scoring activities, or SageMaker pipelines and jobs performing training or scoring activities.

Migration impact

We’ve evolved a long way in modernizing our infrastructure and workloads. We started our journey supporting six business channels and 26 models in production, with dozens in development. Deployment times stretched for months and required a team of three system engineers and four ML engineers to keep everything running smoothly. Despite the support of our internal DevOps team, our issue backlog with the vendor was an unenviable 200+.

Today, we are supporting nine organizations and over 20 business channels, with a whopping 210+ models in production and many more in development. Our average deployment time has gone from months to just weeks—sometimes even down to mere days! With just one part-time ML engineer for support, our average issue backlog with the vendor is practically non-existent. We now support over 120 data scientists, ML engineers, and analytical roles. Our framework mix has expanded to include 50% SparkML models and a diverse range of other ML frameworks, such as PyTorch and scikit-learn. These advancements have given our data science community the power and flexibility to tackle even more complex and challenging projects with ease.

The following table compares some of our metrics before and after migration.

.	Before Migration	After Migration
Speed to Delivery	New data ingestion project took 4–8 weeks	Data-driven ingestion takes under one hour
Operation Stability and Supportability	Over a hundred incidents and tickets in 18 months	Fewer incidents: one per 18 months
Data Science	Data scientists spent 80% of their time waiting on their jobs to run	Seamless data science development experience
Scalability	Unable to scale	Powers 10 million automated data science and AI decisions made daily

Lessons learned

Throughout the journey of modernizing our data science solution, we’ve learned valuable lessons that we believe could be of great help to other organizations who are planning to undertake similar endeavors.

First, we’ve come to realize that managed services can be a game changer in optimizing your data science operations.

The isolation of development into its own account while providing read-only access to production data is a highly effective way of enabling data scientists to experiment and iterate on their models without putting your production environment at risk. This is something that we’ve achieved through the combination of SageMaker AI and Lake Formation.

Another lesson we learned is the importance of training and onboarding for teams. This is particularly true for teams that are moving to a new environment like SageMaker AI. It’s crucial to understand the best practices of utilizing the resources and features of SageMaker AI, and to have a solid understanding of how to move from notebooks to jobs.

Lastly, we found that although Amazon EMR still requires some tuning and optimization, the administrative burden is much lighter compared to hosting directly on Amazon EC2. This makes Amazon EMR a more scalable and cost-effective solution for organizations who need to manage large data processing workloads.

Conclusion

This post provided overview of the successful partnership between AWS and Rocket Companies. Through this collaboration, Rocket Companies was able to migrate many ML workloads and implement a scalable ML framework. Ongoing with AWS, Rocket Companies remains committed to innovation and staying at the forefront of customer satisfaction.

Don’t let legacy systems hold back your organization’s potential. Discover how AWS can assist you in modernizing your data science solution and achieving remarkable results, similar to those achieved by Rocket Companies.

About the Authors

Dian Xu is the Senior Director of Engineering in Data at Rocket Companies, where she leads transformative initiatives to modernize enterprise data platforms and foster a collaborative, data-first culture. Under her leadership, Rocket’s data science, AI & ML platforms power billions of automated decisions annually, driving innovation and industry disruption. A passionate advocate for Gen AI and cloud technologies, Xu is also a sought-after speaker at global forums, inspiring the next generation of data professionals. Outside of work, she channels her love of rhythm into dancing, embracing styles from Bollywood to Bachata as a celebration of cultural diversity.

Joel Hawkins is a Principal Data Scientist at Rocket Companies, where he is responsible for the data science and MLOps platform. Joel has decades of experience developing sophisticated tooling and working with data at large scales. A driven innovator, he works hand in hand with data science teams to ensure that we have the latest technologies available to provide cutting edge solutions. In his spare time, he is an avid cyclist and has been known to dabble in vintage sports car restoration.

Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services. He partners with North American FinTech companies like Rocket and other financial services organizations to drive cloud and AI strategy, accelerating AI adoption at scale. With deep expertise in AI & ML, Generative AI, and cloud-native architecture, he helps financial institutions unlock new revenue streams, optimize operations, and drive impactful business transformation. Sajjan collaborates closely with Rocket Companies to advance its mission of building an AI-fueled homeownership platform to Help Everyone Home. Outside of work, he enjoys traveling, spending time with his family, and is a proud father to his daughter.

Alak Eswaradass is a Principal Solutions Architect at AWS based in Chicago, IL. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges and is enthusiastic about solving a variety of ML use cases for AWS customers. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.