Accelerating generative AI development with fully managed MLflow 3.0 on Amazon SageMaker AI

Amazon SageMaker now offers fully managed support for MLflow 3.0, which streamlines AI experimentation and accelerates your generative AI journey from idea to production. This release expands managed MLflow from experiment tracking to end-to-end observability, reducing time-to-market for generative AI development.

As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Data scientists and developers struggle to effectively analyze the performance of their models and AI applications from experimentation to production, making it hard to find root causes and resolve issues. Teams spend more time integrating tools than improving the quality of their models or generative AI applications.

With the launch of fully managed MLflow 3.0 on Amazon SageMaker AI, you can accelerate generative AI development by making it easier to track experiments and observe behavior of models and AI applications using a single tool. Tracing capabilities in fully managed MLflow 3.0 provide customers the ability to record the inputs, outputs, and metadata at every step of a generative AI application, so developers can quickly identify the source of bugs or unexpected behaviors. By maintaining records of each model and application version, fully managed MLflow 3.0 offers traceability to connect AI responses to their source components, which means developers can quickly trace an issue directly to the specific code, data, or parameters that generated it. With these capabilities, customers using Amazon SageMaker HyperPod to train and deploy foundation models (FMs) can now use managed MLflow to track experiments, monitor training progress, gain deeper insights into the behavior of models and AI applications, and manage their machine learning (ML) lifecycle at scale. This reduces troubleshooting time and enables teams to focus more on innovation.

This post walks you through the core concepts of fully managed MLflow 3.0 on SageMaker and provides technical guidance on how to use the new features to help accelerate your next generative AI application development.

Getting started

You can get started with fully managed MLflow 3.0 on Amazon SageMaker to track experiments, manage models, and streamline your generative AI/ML lifecycle through the AWS Management Console, AWS Command Line Interface (AWS CLI), or API.

Prerequisites

To get started, you need:

  • An AWS account with access to Amazon SageMaker AI and a SageMaker Studio domain with a user profile
  • An Amazon Simple Storage Service (Amazon S3) bucket to store experiment artifacts
  • An AWS Identity and Access Management (IAM) execution role with permissions to access the S3 bucket

Configure your environment to use SageMaker managed MLflow Tracking Server

To perform the configuration, follow these steps:

  1. In the SageMaker Studio UI, in the Applications pane, choose MLflow and choose Create.
  2. Enter a unique name for your tracking server and specify the Amazon Simple Storage Service (Amazon S3) URI where your experiment artifacts will be stored. When you’re ready, choose Create. By default, SageMaker will select version 3.0 to create the MLflow tracking server.
  3. Optionally, you can choose Update to adjust settings such as server size, tags, or AWS Identity and Access Management (IAM) role.

The server will now be provisioned and started automatically, typically within 25 minutes. After setup, you can launch the MLflow UI from SageMaker Studio to start tracking your ML and generative AI experiments. For more details on tracking server configurations, refer to Machine learning experiments using Amazon SageMaker AI with MLflow in the SageMaker Developer Guide.
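If you prefer to automate this step, you can also create the tracking server programmatically. The following is a minimal boto3 sketch; the server name, S3 bucket, and IAM role ARN are placeholders you must replace, and the server size is one possible choice:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Create a managed MLflow tracking server
# MlflowVersion is omitted here; per this post, SageMaker defaults to version 3.0
response = sagemaker_client.create_mlflow_tracking_server(
    TrackingServerName="my-mlflow-3-server",                            # placeholder name
    ArtifactStoreUri="s3://<your-bucket>/mlflow-artifacts",             # S3 URI for experiment artifacts
    RoleArn="arn:aws:iam::<account_id>:role/<mlflow-execution-role>",   # IAM role with access to the bucket
    TrackingServerSize="Small",                                         # Small | Medium | Large
)
print(response["TrackingServerArn"])  # use this ARN with mlflow.set_tracking_uri later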

To begin tracking your experiments with your newly created SageMaker managed MLflow tracking server, you need to install both the MLflow and AWS SageMaker MLflow Python packages in your environment. You can use SageMaker Studio managed JupyterLab, SageMaker Studio Code Editor, a local integrated development environment (IDE), or another supported environment where your AI workloads operate to track with the SageMaker managed MLflow tracking server.

To install both Python packages using pip:

pip install mlflow==3.0 sagemaker-mlflow==0.1.0

To connect and start logging your AI experiments, parameters, and models directly to managed MLflow on SageMaker, set the tracking URI to the Amazon Resource Name (ARN) of your SageMaker MLflow tracking server:

import mlflow

# SageMaker MLflow ARN
tracking_server_arn = "arn:aws:sagemaker:<Region>:<Account_id>:mlflow-tracking-server/<Name>" # Enter ARN
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment("customer_support_genai_app")

Now your environment is configured and ready to track your experiments with your SageMaker managed MLflow tracking server.
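As a quick check that logging works end to end, you can record a test run; the parameter and metric names below are arbitrary examples:

# Log a simple test run to verify connectivity to the tracking server
with mlflow.start_run(run_name="connectivity-check"):
    mlflow.log_param("model_provider", "bedrock")  # example parameter
    mlflow.log_metric("prompt_tokens", 42)         # example metric
# The run appears under the customer_support_genai_app experiment in the MLflow UI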

Implement generative AI application tracing and version tracking

Generative AI applications have multiple components, including code, configurations, and data, which can be challenging to manage without systematic versioning. A LoggedModel entity in managed MLflow 3.0 represents your AI model, agent, or generative AI application within an experiment. It provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle. A trace is a log of inputs, outputs, and intermediate steps from a single application execution. Traces provide insights into application performance, execution flow, and response quality, enabling debugging and evaluation. With LoggedModel, you can track and compare different versions of your application, making it easier to identify issues, deploy the best version, and maintain a clear record of what was deployed and when.

To implement version tracking and tracing with managed MLflow 3.0 on SageMaker, you can establish a versioned model identity using a Git commit hash, set this as the active model context so all subsequent traces will be automatically linked to this specific version, enable automatic logging for Amazon Bedrock interactions, and then make an API call to Anthropic’s Claude 3.5 Sonnet that will be fully traced with inputs, outputs, and metadata automatically captured within the established model context. Managed MLflow 3.0 tracing is already integrated with various generative AI libraries and provides a one-line automatic tracing experience for all supported libraries. For information about supported libraries, refer to Supported Integrations in the MLflow documentation.

import subprocess

import boto3
import mlflow

# 1. Define your application version using the git commit
# (Example: derive the short commit hash of the current repository)
git_commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()
logged_model = "customer_support_agent"
logged_model_name = f"{logged_model}-{git_commit}"

# 2. Set the active model context - traces will be linked to this
mlflow.set_active_model(name=logged_model_name)

# 3. Set auto logging for your model provider
mlflow.bedrock.autolog()

# 4. Chat with your LLM provider
# Ensure that your boto3 client has the necessary auth information
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="<REPLACE_WITH_YOUR_AWS_REGION>",
)

model = "anthropic.claude-3-5-sonnet-20241022-v2:0"
messages = [{"role": "user", "content": [{"text": "Hello!"}]}]
# All intermediate executions within the chat session will be logged
bedrock.converse(modelId=model, messages=messages)

After logging this information, you can track these generative AI experiments and the logged model for the agent in the managed MLflow 3.0 tracking server UI, as shown in the following screenshot.

In addition to the one-line auto tracing functionality, MLflow offers a Python SDK for manually instrumenting your code and manipulating traces. Refer to the code sample notebook sagemaker_mlflow_strands.ipynb in the aws-samples GitHub repository, where we use MLflow manual instrumentation to trace Strands Agents. With tracing capabilities in fully managed MLflow 3.0, you can record the inputs, outputs, and metadata associated with each intermediate step of a request, so you can pinpoint the source of bugs and unexpected behaviors.
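The following is a minimal sketch of manual instrumentation using the MLflow tracing decorator and span API; the agent and retrieval functions are hypothetical placeholders rather than the code from the sample notebook:

import mlflow

@mlflow.trace(name="retrieve_account_info", span_type="RETRIEVER")
def retrieve_account_info(customer_id: str) -> dict:
    # Hypothetical lookup step; inputs and outputs are captured on the trace
    return {"customer_id": customer_id, "tier": "premium"}

@mlflow.trace(name="handle_request", span_type="CHAIN")
def handle_request(customer_id: str, question: str) -> str:
    account = retrieve_account_info(customer_id)
    # Manually create a child span for an intermediate step
    with mlflow.start_span(name="compose_answer") as span:
        span.set_inputs({"question": question, "tier": account["tier"]})
        answer = f"Hello {account['customer_id']}, regarding: {question}"
        span.set_outputs({"answer": answer})
    return answer

handle_request("C-1001", "What is my current balance?")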

These capabilities provide observability in your AI workload by capturing detailed information about the execution of the workload services, nodes, and tools that you can see under the Traces tab.

You can inspect each trace, as shown in the following image, by choosing the request ID in the traces tab for the desired trace.

Fully managed MLflow 3.0 on Amazon SageMaker also introduces the capability to tag traces. Tags are mutable key-value pairs you can attach to traces to add valuable metadata and context. Trace tags make it straightforward to organize, search, and filter traces based on criteria such as user session, environment, model version, or performance characteristics. You can add, update, or remove tags at any stage—during trace execution using mlflow.update_current_trace() or after a trace is logged using the MLflow APIs or UI. Managed MLflow 3.0 makes it seamless to search and analyze traces, helping teams quickly pinpoint issues, compare agent behaviors, and optimize performance. The tracing UI and Python API both support powerful filtering, so you can drill down into traces based on attributes such as status, tags, user, environment, or execution time as shown in the screenshot below. For example, you can instantly find all traces with errors, filter by production environment, or search for traces from a specific request. This capability is essential for debugging, cost analysis, and continuous improvement of generative AI applications.
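For example, the following sketch tags the active trace from inside an instrumented function; the tag values are illustrative and pair with the search filter shown later in this section:

import mlflow

@mlflow.trace(name="support_agent")
def answer(question: str) -> str:
    # Attach metadata to the trace currently being recorded
    mlflow.update_current_trace(tags={
        "environment": "production",    # illustrative tag values
        "model_version": "v2.1",
        "user_session": "session-1234",
    })
    return f"Answering: {question}"

answer("How do I reset my password?")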

The following screenshot displays the traces returned when searching for the tag ‘Production’.

The following code snippet shows how you can search for all traces in production with a successful status:

# Search for traces in production environment with successful status
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK' AND tags.environment = 'production'"
)

Generative AI use case walkthrough with MLflow tracing

Building and deploying generative AI agents such as chat-based assistants, code generators, or customer support assistants requires deep visibility into how these agents interact with large language models (LLMs) and external tools. In a typical agentic workflow, the agent loops through reasoning steps, calling LLMs and using tools or subsystems such as search APIs or Model Context Protocol (MCP) servers until it completes the user’s task. These complex, multistep interactions make debugging, optimization, and cost tracking especially challenging.

Traditional observability tools fall short in generative AI because agent decisions, tool calls, and LLM responses are dynamic and context-dependent. Managed MLflow 3.0 tracing provides comprehensive observability by capturing every LLM call, tool invocation, and decision point in your agent’s workflow. You can use this end-to-end trace data to:

  • Debug agent behavior – Pinpoint where an agent’s reasoning deviates or why it produces unexpected outputs.
  • Monitor tool usage – Discover how and when external tools are called and analyze their impact on quality and cost.
  • Track performance and cost – Measure latency, token usage, and API costs at each step of the agentic loop.
  • Audit and govern – Maintain detailed logs for compliance and analysis.

Imagine a real-world scenario using the managed MLflow 3.0 tracing UI for a sample finance customer support agent equipped with a tool to retrieve financial data from a datastore. While you’re developing the agent or analyzing its behavior in production, you can observe how the agent responds and whether the execution calls a product database tool for more accurate recommendations. For illustration, the first trace, shown in the following screenshot, shows the agent handling a user query without invoking any tools. The trace captures the prompt, the agent response, and the agent’s decision points. The agent’s response lacks product-specific details, and the trace makes it clear that no external tool was called, so you can quickly identify this behavior in the agent’s reasoning chain.

The second trace, shown in the following screenshot, captures the same agent, but this time it decides to call the product database tool. The trace logs the tool invocation, the returned product data, and how the agent incorporates this information into its final response. Here, you can observe improved answer quality, a slight increase in latency, and additional API cost with higher token usage.

By comparing these traces side by side, you can debug why the agent sometimes skips using the tool, optimize when and how tools are called, and balance quality against latency and cost. MLflow’s tracing UI makes these agentic loops transparent, actionable, and seamless to analyze at scale. This post’s sample agent and all necessary code is available on the aws-samples GitHub repository, where you can replicate and adapt it for your own applications.

Cleanup

After it’s created, a SageMaker managed MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop tracking servers when they’re not in use to save costs, or you can delete them using API or the SageMaker Studio UI. For more details on pricing, refer to Amazon SageMaker pricing.
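For example, the following is a minimal boto3 sketch for stopping or deleting a tracking server; the server name is a placeholder:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Stop the tracking server to pause billing while keeping its configuration
sagemaker_client.stop_mlflow_tracking_server(TrackingServerName="my-mlflow-3-server")

# Or delete it entirely when it's no longer needed
# sagemaker_client.delete_mlflow_tracking_server(TrackingServerName="my-mlflow-3-server")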

Conclusion

Fully managed MLflow 3.0 on Amazon SageMaker AI is now available. Get started with sample code in the aws-samples GitHub repository. We invite you to explore this new capability and experience the enhanced efficiency and control it brings to your ML projects. To learn more, visit Machine Learning Experiments using Amazon SageMaker with MLflow.

For more information, visit the SageMaker Developer Guide and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.


About the authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!

Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval-Augmented Generation (RAG), GenAI agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.

Amit Modi is the product leader for SageMaker AIOps and Governance, and Responsible AI at AWS. With over a decade of B2B experience, he builds scalable products and teams that drive innovation and deliver value to customers globally.

Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the SageMaker AIOps team. With over 15 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.

Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle

Today, we’re excited to announce that Amazon SageMaker HyperPod now supports deploying foundation models (FMs) from Amazon SageMaker JumpStart, as well as custom or fine-tuned models from Amazon S3 or Amazon FSx. With this launch, you can train, fine-tune, and deploy models on the same HyperPod compute resources, maximizing resource utilization across the entire model lifecycle.

SageMaker HyperPod offers resilient, high-performance infrastructure optimized for large-scale model training and tuning. Since its launch in 2023, SageMaker HyperPod has been adopted by foundation model builders who are looking to lower costs, minimize downtime, and accelerate time to market. With Amazon EKS support in SageMaker HyperPod, you can orchestrate your HyperPod clusters with EKS. Customers like Perplexity, Hippocratic, Salesforce, and Articul8 use HyperPod to train their foundation models at scale. With the new deployment capabilities, customers can now leverage HyperPod clusters across the full generative AI development lifecycle, from model training and tuning to deployment and scaling.

Many customers use Kubernetes as part of their generative AI strategy, to take advantage of its flexibility, portability, and open source frameworks. You can orchestrate your HyperPod clusters with Amazon EKS support in SageMaker HyperPod so you can continue working with familiar Kubernetes workflows while gaining access to high-performance infrastructure purpose-built for foundation models. Customers benefit from support for custom containers, compute resource sharing across teams, observability integrations, and fine-grained scaling controls. HyperPod extends the power of Kubernetes by streamlining infrastructure setup and allowing customers to focus more on delivering models rather than managing backend complexity.

New Features: Accelerating Foundation Model Deployment with SageMaker HyperPod

Customers prefer Kubernetes for flexibility, granular control over infrastructure, and robust support for open source frameworks. However, running foundation model inference at scale on Kubernetes introduces several challenges. Organizations must securely download models, identify the right containers and frameworks for optimal performance, configure deployments correctly, select appropriate GPU types, provision load balancers, implement observability, and add auto-scaling policies to meet demand spikes. To address these challenges, we’ve launched SageMaker HyperPod capabilities to support the deployment, management, and scaling of generative AI models:

  1. One-click foundation model deployment from SageMaker JumpStart: You can now deploy over 400 open-weights foundation models from SageMaker JumpStart on HyperPod with just a click, including the latest state-of-the-art models like DeepSeek-R1, Mistral, and Llama4. SageMaker JumpStart models will be deployed on HyperPod clusters orchestrated by EKS and will be made available as SageMaker endpoints or Application Load Balancers (ALB).
  2. Deploy fine-tuned models from S3 or FSx for Lustre: You can seamlessly deploy your custom models from S3 or FSx. You can also deploy models from Jupyter notebooks with provided code samples.
  3. Flexible deployment options for different user personas: We’re providing multiple ways to deploy models on HyperPod to support teams that have different preferences and expertise levels. Beyond the one-click experience available in the SageMaker JumpStart UI, you can also deploy models using native kubectl commands, the HyperPod CLI, or the SageMaker Python SDK—giving you the flexibility to work within your preferred environment.
  4. Dynamic scaling based on demand: HyperPod inference now supports automatic scaling of your deployments based on metrics from Amazon CloudWatch and Prometheus with KEDA. With automatic scaling your models can handle traffic spikes efficiently while optimizing resource usage during periods of lower demand.
  5. Efficient resource management with HyperPod Task Governance: One of the key benefits of running inference on HyperPod is the ability to efficiently utilize accelerated compute resources by allocating capacity for both inference and training in the same cluster. You can use HyperPod Task Governance for efficient resource allocation, prioritization of inference tasks over lower priority training tasks to maximize GPU utilization, and dynamic scaling of inference workloads in near real-time.
  6. Integration with SageMaker endpoints: With this launch, you can deploy AI models to HyperPod and register them with SageMaker endpoints. This allows you to use similar invocation patterns as SageMaker endpoints along with integration with other open-source frameworks.
  7. Comprehensive observability: We’ve added the capability to get observability into the inference workloads hosted on HyperPod, including built-in capabilities to scrape metrics and export them to your observability platform. This capability provides visibility into both:
    1. Platform-level metrics such as GPU utilization, memory usage, and node health
    2. Inference-specific metrics like time to first token, request latency, throughput, and model invocations

“With Amazon SageMaker HyperPod, we built and deployed the foundation models behind our agentic AI platform using the same high-performance compute. This seamless transition from training to inference streamlined our workflow, reduced time to production, and ensured consistent performance in live environments. HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency.”
–Laurent Sifre, Co-founder & CTO, H.AI

Deploying models on HyperPod clusters

In this launch, we are providing new operators that manage the complete lifecycle of your generative AI models in your HyperPod cluster. These operators will provide a simplified way to deploy and invoke your models in your cluster.

Prerequisites: You need a SageMaker HyperPod cluster orchestrated by Amazon EKS, and the HyperPod inference operator installed and running in the cluster. You can install the inference operator using its Helm chart, for example:

helm install hyperpod-inference-operator ./sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator \
     -n kube-system \
     --set region="${REGION}" \
     --set eksClusterName="${EKS_CLUSTER_NAME}" \
     --set hyperpodClusterArn="${HP_CLUSTER_ARN}" \
     --set executionRoleArn="${HYPERPOD_INFERENCE_ROLE_ARN}" \
     --set s3.serviceAccountRoleArn="${S3_CSI_ROLE_ARN}" \
     --set s3.node.serviceAccount.create=false \
     --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::${ACCOUNT_ID}:role/keda-operator-role" \
     --set tlsCertificateS3Bucket="${TLS_BUCKET_NAME}" \
     --set alb.region="${REGION}" \
     --set alb.clusterName="${EKS_CLUSTER_NAME}" \
     --set alb.vpcId="${VPC_ID}" \
     --set jumpstartGatedModelDownloadRoleArn="${JUMPSTART_GATED_ROLE_ARN}"

Architecture:

  • When you deploy a model using the HyperPod inference operator, the operator will identify the right instance type in the cluster, download the model from the provided source, and deploy it.
  • The operator will then provision an Application Load Balancer (ALB) and add the model’s pod IP as the target. Optionally, it can register the ALB with a SageMaker endpoint.
  • The operator will also generate a TLS certificate for the ALB, which is saved in S3 at the location specified by the tlsCertificateS3Bucket parameter. The operator also imports the certificate into AWS Certificate Manager (ACM) to associate it with the ALB. This allows clients to connect to the ALB over HTTPS after adding the certificate to their trust store.
  • If you register with a SageMaker endpoint, the operator will allow you to invoke the model using the SageMaker runtime client and handle authentication and security aspects.
  • Metrics can be exported to CloudWatch and Prometheus and accessed with Grafana dashboards.

Deployment sources

Once you have the operators running in your cluster, you can deploy AI models from multiple sources: SageMaker JumpStart, S3, or FSx.

SageMaker JumpStart 

Models hosted in SageMaker JumpStart can be deployed to your HyperPod cluster. Navigate to SageMaker Studio, go to SageMaker JumpStart, select the open-weights model you want to deploy, and then select SageMaker HyperPod. Once you provide the necessary details, choose Deploy. The inference operator running in the cluster will initiate a deployment in the namespace provided.

Once deployed, you can monitor deployments in SageMaker Studio.

Alternatively, you can deploy a JumpStart model using kubectl with a YAML manifest. For example, the following snippet will deploy DeepSeek-R1 Distill Qwen 1.5B from SageMaker JumpStart on an ml.g5.8xlarge instance:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: JumpStartModel
metadata:
  name: deepseek-llm-r1-distill-qwen-1-5b-july03
  namespace: default
spec:
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
    modelVersion: 2.0.7
  sageMakerEndpoint:
    name: deepseek-llm-r1-distill-qwen-1-5b
  server:
    instanceType: ml.g5.8xlarge
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://<bucket_name>/certificates

Deploying a model from S3

You can deploy model artifacts directly from S3 to your HyperPod cluster using the InferenceEndpointConfig resource. The inference operator will use the S3 CSI driver to provide the model files to the pods in the cluster. Using this configuration, the operator downloads the files located under the prefix deepseek15b, as set by the modelLocation parameter. Here is the complete YAML example and documentation:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: deepseek15b
  namespace: default
spec:
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelName: deepseek15b
  modelSourceConfig:
    modelLocation: deepseek15b
    modelSourceType: s3
    s3Storage:
      bucketName: mybucket
      region: us-west-2

Deploying a model from FSx

Models can also be deployed from FSx for Lustre volumes, high-performance storage that can be used to save model checkpoints. This lets you launch a model without downloading artifacts from S3, saving the time otherwise spent downloading models during deployment or scale-up. Setup instructions for FSx in a HyperPod cluster are provided in the Set Up an FSx for Lustre File System workshop. Once set up, you can deploy models using InferenceEndpointConfig. Here is the complete YAML file and a sample:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: deepseek15b
  namespace: default
spec:
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelName: deepseek15b
  modelSourceConfig:
    fsxStorage:
      fileSystemId: fs-abcd1234
    modelLocation: deepseek-1-5b
    modelSourceType: fsx

Deployment experiences

We are providing multiple deployment experiences: kubectl, the HyperPod CLI, and the Python SDK. All deployment options require the HyperPod inference operator to be installed and running in the cluster.

Deploying with kubectl 

You can deploy models using native kubectl with YAML files as shown in the previous sections.

To deploy, run kubectl apply -f <manifest_name>.yaml.

Once deployed, you can monitor the status with:

  • kubectl get inferenceendpointconfig will show all InferenceEndpointConfig resources.
  • kubectl describe inferenceendpointconfig <name> will give detailed status information.
  • If using SageMaker JumpStart, kubectl get jumpstartmodels will show all deployed JumpStart models.
  • kubectl describe jumpstartmodel <name> will give detailed status information.
  • kubectl get sagemakerendpointregistrations and kubectl describe sagemakerendpointregistration <name> will provide information on the status of the generated SageMaker endpoint and the ALB.

Other resources that are generated are deployments, services, pods, and ingress. Each resource will be visible from your cluster.

To control the invocation path on your container, you can modify the invocationEndpoint parameter. Your ALB can route requests that are sent to alternate paths such as /v1/chat/completions. To modify the health check path for the container to another path such as /health, you can annotate the generated Ingress object with:

kubectl annotate ingress --overwrite <name> alb.ingress.kubernetes.io/healthcheck-path=/health

Deploying with the HyperPod CLI

The SageMaker HyperPod CLI also offers a method of deploying using the CLI. Once you set your context, you can deploy a model, for example:

!hyp create hyp-jumpstart-endpoint \
  --version 1.0 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --model-version 2.0.4 \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-test-jscli \
  --tls-certificate-output-s3-uri s3://<bucket_name>/

For more information, see Installing the SageMaker HyperPod CLI and SageMaker HyperPod deployment documentation.

Deploying with Python SDK

The SageMaker Python SDK also provides support to deploy models on HyperPod clusters. Using the Model, Server, and SageMakerEndpoint configurations, you can construct a specification to deploy on a cluster. An example notebook for deploying with the Python SDK is provided here. For example:

from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
# create configs
model=Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b',
    model_version='2.0.4',
)
server=Server(
    instance_type='ml.g5.8xlarge',
)
endpoint_name=SageMakerEndpoint(name='deepseklr1distill-qwen')
tls_config=TlsConfig(tls_certificate_output_s3_uri='s3://<bucket_name>')

# create spec
js_endpoint=HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config,
)

# use spec to deploy
js_endpoint.create()

Run inference with deployed models

Once the model is deployed, you can access the model by invoking the model with a SageMaker endpoint or invoking directly using the ALB.

Invoking the model with a SageMaker endpoint

Once a model has been deployed and the SageMaker endpoint is created successfully, you can invoke your model with the SageMaker Runtime client. You can check the status of the deployed SageMaker endpoint by going to the SageMaker AI console, choosing Inference, and then Endpoints. For example, given an input file input.json, you can invoke a SageMaker endpoint using the AWS CLI. This routes the request to the model hosted on HyperPod:

!aws sagemaker-runtime invoke-endpoint \
        --endpoint-name "<ENDPOINT NAME>" \
        --body fileb://input.json \
        --content-type application/json \
        --accept application/json \
        output2.json
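Alternatively, the following is a minimal boto3 sketch of the same invocation; the endpoint name and request payload are placeholders that depend on the model container you deployed:

import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="<REPLACE_WITH_YOUR_AWS_REGION>")

payload = {"messages": [{"role": "user", "content": "Hello!"}]}  # placeholder request body

response = runtime.invoke_endpoint(
    EndpointName="<ENDPOINT NAME>",
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/json",
)
print(response["Body"].read().decode("utf-8"))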

Invoke the model directly using ALB

You can also invoke the load balancer directly instead of using the SageMaker endpoint. You must download the generated certificate from S3 and then you can include it in your trust store or request. You can also bring your own certificates.

For example, you can invoke a vLLM container after setting the invocationEndpoint value in the deployment YAML shown in the previous section to /v1/chat/completions.

For example, using curl:

curl --cacert /path/to/cert.pem https://<name>.<region>.elb.amazonaws.com/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
        "model": "/opt/ml/model",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
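The following is a minimal Python equivalent using the requests library; the ALB DNS name, certificate path, and payload mirror the placeholders in the curl example:

import json
import requests

url = "https://<name>.<region>.elb.amazonaws.com/v1/chat/completions"  # placeholder ALB DNS name
payload = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
}

# verify points at the TLS certificate downloaded from S3 so the ALB certificate is trusted
response = requests.post(url, json=payload, verify="/path/to/cert.pem", timeout=60)
print(json.dumps(response.json(), indent=2))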

User experience

These capabilities are designed with different user personas in mind:

  • Administrators: Administrators create the required infrastructure for HyperPod clusters, such as provisioning VPCs, subnets, security groups, and EKS clusters. Administrators also install required operators in the cluster to support deployment of models and allocation of resources across the cluster.
  • Data scientists: Data scientists deploy foundation models using familiar interfaces—whether that’s the SageMaker console, Python SDK, or Kubectl, without needing to understand all Kubernetes concepts. Data scientists can deploy and iterate on FMs efficiently, run experiments, and fine-tune model performance without needing deep infrastructure expertise.
  • Machine Learning Operations (MLOps) engineers: MLOps engineers set up observability and autoscaling policies in the cluster to meet SLAs. They identify the right metrics to export, create the dashboards, and configure autoscaling based on metrics.

Observability

Amazon SageMaker HyperPod now provides a comprehensive, out-of-the-box observability solution that delivers deep insights into inference workloads and cluster resources. This unified observability solution automatically publishes key metrics from multiple sources including inference containers, NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, and Kueue to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana dashboards. With a one-click installation of this HyperPod EKS add-on, users gain access to resource and cluster utilization metrics along with critical inference metrics:

  • model_invocations_total – Total number of invocation requests to the model
  • model_errors_total – Total number of errors during model invocation
  • model_concurrent_requests – Active concurrent model requests
  • model_latency_milliseconds – Model invocation latency in milliseconds
  • model_ttfb_milliseconds – Model time to first byte latency in milliseconds

These metrics capture model inference request and response data regardless of your model type or serving framework when models are deployed using the inference operators with metrics enabled. You can also expose container-specific metrics provided by the model container, such as TGI, LMI, and vLLM.

You can enable metrics in JumpStart deployments by setting the metrics.enabled: true parameter:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: JumpStartModel
metadata:
  name: mistral-model
  namespace: ns-team-a
spec:
  model:
    modelId: "huggingface-llm-mistral-7b-instruct"
    modelVersion: "3.19.0"
  metrics:
    enabled: true # Default: true (can be set to false to disable)

You can enable metrics for fine-tuned models deployed from S3 and FSx using the following configuration. The defaults for the metrics scrape interval, port, and path are noted in the comments:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: inferenceendpoint-deepseeks
  namespace: ns-team-a
spec:
  modelName: deepseeks
  modelVersion: 1.0.1
  metrics:
    enabled: true # Default: true (can be set to false to disable)
    metricsScrapeIntervalSeconds: 30 # Optional: if overriding the default 15s
    modelMetricsConfig:
        port: 8000 # Optional: if overriding the default 8080
        path: "/custom-metrics" # Optional: if overriding the default "/metrics"

For more details, check out the blog post on HyperPod observability and documentation.

Autoscaling

Effective autoscaling handles unpredictable traffic patterns with sudden spikes during peak hours, promotional events, or weekends. Without dynamic autoscaling, organizations must either overprovision resources, leading to significant costs, or risk service degradation during peak loads. LLMs require more sophisticated autoscaling approaches than traditional applications due to several unique characteristics. These models can take minutes to load into GPU memory, necessitating predictive scaling with appropriate buffer time to avoid cold-start penalties. Equally important is the ability to scale in when demand decreases to save costs. Two types of autoscaling are supported: the HyperPod inference operator and KEDA.

Autoscaling provided by HyperPod inference operator

The HyperPod inference operator provides built-in autoscaling capabilities for model deployments using metrics from Amazon CloudWatch and Amazon Managed Service for Prometheus (AMP). This provides a simple and quick way to set up autoscaling for models deployed with the inference operator. Check out the complete autoscaling example in the SageMaker documentation.

Autoscaling with KEDA

If you need more flexibility for complex scaling capabilities and need to manage autoscaling policies independently from model deployment specs, you can use Kubernetes Event-driven Autoscaling (KEDA). KEDA ScaledObject configurations support a wide range of scaling triggers including Amazon CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource-based metrics like GPU and memory utilization. You can apply these configurations to existing model deployments by referencing the deployment name in the scaleTargetRef section of the ScaledObject specification. For more information, see the Autoscaling documentation.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nd-deepseek-llm-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: deepseek-llm-r1-distill-qwen-1-5b
    apiVersion: apps/v1
    kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 30     # seconds between checks
  cooldownPeriod: 300     # seconds before scaling down
  triggers:
    - type: aws-cloudwatch
      metadata:
        namespace: AWS/ApplicationELB        # or your metric namespace
        metricName: RequestCount              # or your metric name
        dimensionName: LoadBalancer           # or your dimension key
        dimensionValue: app/k8s-default-albnddee-cc02b67f20/0991dc457b6e8447
        statistic: Sum
        threshold: "3"                        # change to your desired threshold
        minMetricValue: "0"                   # optional floor
        region: us-east-2                     # your AWS region
        identityOwner: operator               # use the IRSA SA bound to keda-operator

Task governance

With HyperPod task governance, you can optimize resource utilization by implementing priority-based scheduling. With this approach you can assign higher priority to inference workloads to maintain low-latency requirements during traffic spikes, while still allowing training jobs to utilize available resources during quieter periods. Task governance leverages Kueue for quota management, priority scheduling, and resource sharing policies. Through ClusterQueue configurations, administrators can establish flexible resource sharing strategies that balance dedicated capacity requirements with efficient resource utilization.

Teams can configure priority classes to define their resource allocation preferences. For example, teams should create a dedicated priority class for inference workloads, such as inference with a weight of 100, to ensure they are admitted and scheduled ahead of other task types. By giving inference pods the highest priority, they are positioned to preempt lower-priority jobs when the cluster is under load, which is essential for meeting low-latency requirements during traffic surges. Additionally, teams must appropriately size their quotas. If inference spikes are expected within a shared cluster, the team should reserve a sufficient amount of GPU resources in their ClusterQueue to handle these surges. When the team is not experiencing high traffic, unused resources within their quota can be temporarily allocated to other teams’ tasks. However, once inference demand returns, those borrowed resources can be reclaimed to prioritize pending inference pods.

The following sample screenshot shows both training and deployment workloads running in the same cluster. Deployments use the inference-priority class, which is higher than the training-priority class, so a spike in inference requests has suspended the training job to scale up the deployments and handle the traffic.

For more information, see the SageMaker HyperPod documentation.

Cleanup

You will incur costs for the instances running in your cluster. You can scale down or delete the instances in your cluster to stop accruing costs.

Conclusion

With this launch, you can quickly deploy open-weights foundation models from SageMaker JumpStart, as well as custom models from S3 and FSx, to your SageMaker HyperPod cluster. SageMaker automatically provisions the infrastructure, deploys the model on your cluster, enables auto-scaling, and configures the SageMaker endpoint. You can use SageMaker to scale the compute resources up and down through HyperPod task governance as the traffic on model endpoints changes, and automatically publish metrics to the HyperPod observability dashboard to provide full visibility into model performance. With these capabilities you can seamlessly train, fine-tune, and deploy models on the same HyperPod compute resources, maximizing resource utilization across the entire model lifecycle.

You can start deploying models to HyperPod today in all AWS Regions where SageMaker HyperPod is available. To learn more, visit the Amazon SageMaker HyperPod documentation or try the HyperPod inference getting started guide in the AWS Management Console.

Acknowledgements:

We would like to acknowledge the key contributors for this launch: Pradeep Cruz, Amit Modi, Miron Perel, Suryansh Singh, Shantanu Tripathi, Nilesh Deshpande, Mahadeva Navali Basavaraj, Bikash Shrestha, Rahul Sahu.


About the authors

Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling Gen AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and as a management consultant at McKinsey.

Piyush Daftary is a Senior Software Engineer at AWS, working on Amazon SageMaker. His interests include databases, search, machine learning, and AI. He currently focuses on building performant, scalable inference systems for large language models. Outside of work, he enjoys traveling, hiking, and spending time with family.

Chaitanya Hazarey leads software development for inference on SageMaker HyperPod at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.

Andrew Smith is a Senior Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.

Supercharge your AI workflows by connecting to SageMaker Studio from Visual Studio Code

AI developers and machine learning (ML) engineers can now use the capabilities of Amazon SageMaker Studio directly from their local Visual Studio Code (VS Code). With this capability, you can use your customized local VS Code setup, including AI-assisted development tools, custom extensions, and debugging tools while accessing compute resources and your data in SageMaker Studio. By accessing familiar model development features, data scientists can maintain their established workflows, preserve their productivity tools, and seamlessly develop, train, and deploy machine learning, deep learning and generative AI models.

In this post, we show you how to remotely connect your local VS Code to SageMaker Studio development environments to use your customized development environment while accessing Amazon SageMaker AI compute resources.

The local integrated development environment (IDE) connection capability delivers three key benefits for developers and data scientists:

  • Familiar development environment with scalable compute: Work in your familiar IDE environment while harnessing the purpose-built model development environment of SageMaker AI. Keep your preferred themes, shortcuts, extensions, productivity, and AI tools while accessing SageMaker AI features.
  • Simplify operations: With a few clicks, you can minimize the complex configurations and administrative overhead of setting up remote access to SageMaker Studio spaces. The integration provides direct access to Studio spaces from your IDE.
  • Enterprise-grade security: Benefit from secure connections between your IDE and SageMaker AI through automatic credentials management and session maintenance. In addition, code execution remains within the controlled boundaries of SageMaker AI.

This feature bridges the gap between local development preferences and cloud-based machine learning resources, so that teams can improve their productivity while using the features of Amazon SageMaker AI.

Solution overview

The following diagram showcases the interaction between your local IDE and SageMaker Studio spaces.

The solution architecture consists of three main components:

  • Local computer: Your development machine running VS Code with AWS Toolkit extension installed.
  • SageMaker Studio: A unified, web-based ML development environment to seamlessly build, train, deploy, and manage machine learning and analytics workflows at scale using integrated AWS tools and secure, governed access to your data.
  • AWS Systems Manager: A secure, scalable remote access and management service that enables seamless connectivity between your local VS Code and SageMaker Studio spaces to streamline ML development workflows.

The connection flow supports two options:

  • Direct launch (deep link): Users can initiate the connection directly from the SageMaker Studio web interface by choosing Open in VS Code, which automatically launches their local VS Code instance.
  • AWS Toolkit connection: Users can connect through AWS Toolkit extension in VS Code by browsing available SageMaker Studio spaces and selecting their target environment.

In addition to the preceding options, users can connect to their space directly from their IDE terminal using SSH. For instructions on connecting using SSH, refer to the documentation here.

After connecting, developers can:

  • Use their custom VS Code extensions and tools
  • Remotely access and use their space’s storage
  • Run their AI and ML workloads in SageMaker compute environments
  • Work with notebooks in their preferred IDE
  • Maintain the same security parameters as the SageMaker Studio web environment

Solution implementation

Prerequisites

To try the remote IDE connection, you must meet the following prerequisites:

  1. You have access to a SageMaker Studio domain with connectivity to the internet. For domains set up in VPC-only mode, your domain should have a route out to the internet through a proxy, or a NAT gateway. If your domain is completely isolated from the internet, see Connect to VPC with subnets without internet access for setting up the remote connection. If you do not have a Studio domain, you can create one using the quick setup or custom setup option.
  2. You have permissions to update the SageMaker Studio domain or user execution role in AWS Identity and Access Management (IAM).
  3. You have the latest stable VS Code with Microsoft Remote SSH (version 0.74.0 or later), and AWS Toolkit extension (version v3.68.0 or later) installed on your local machine. Optionally, if you want to connect to SageMaker spaces directly from VS Code, you should be authenticated to access AWS resources using IAM or AWS IAM Identity Center credentials. See the administrator documentation for AWS Toolkit authentication support.
  4. You use compatible SageMaker Distribution images (2.7+ and 3.1+) for running SageMaker Studio spaces, or a custom image.
  5. If you’re initiating the connection from the IDE, you already have a user profile in the SageMaker Studio domain you want to connect to, and the spaces are already created using the Studio UI or through APIs. The AWS Toolkit does not allow creation or deletion of spaces.

Set up necessary permissions

We’ve launched the StartSession API for remote IDE connectivity. Add the sagemaker:StartSession permission to your user’s role so that they can remotely connect to a space.

For the deep-linking experience, the user starts the remote session from the Studio UI. Hence, the domain default execution role or the user’s execution role should allow the user to call the StartSession API. Modify the permissions on your domain or user execution role by adding the following policy statement:

{
    "Version": "2012-10-17", 
    "Statement": [
        {
            "Sid": "RestrictStartSessionOnSpacesToUserProfile",
            "Effect": "Allow",
            "Action": [
                "sagemaker:StartSession"
            ],
            "Resource": "arn:*:sagemaker:${aws:Region}:${aws:AccountId}:space/${sagemaker:DomainId}/*",
            "Condition": {
                "ArnLike": {
                    "sagemaker:ResourceTag/sagemaker:user-profile-arn": "arn:*:sagemaker:${aws:Region}:${aws:AccountId}:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
                }
            }
        }
    ]
}

If you’re initializing the connection to SageMaker Studio spaces directly from VS Code, your AWS credentials should allow the user to list the spaces, start or stop a space, and initiate a connection to a running space. Make sure that your AWS credentials allow the following API actions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListSpaces",
                "sagemaker:DescribeSpace",
                "sagemaker:UpdateSpace",
                "sagemaker:ListApps",
                "sagemaker:CreateApp",
                "sagemaker:DeleteApp",
                "sagemaker:DescribeApp",
                "sagemaker:StartSession",
                "sagemaker:DescribeDomain",
                "sagemaker:AddTags"
            ],
            "Resource": "*"
        }
    ]
}

This initial IAM policy provides a quick-start foundation for testing SageMaker features. Organizations can implement more granular access controls using resource Amazon Resource Name (ARN) constraints or attribute-based access control (ABAC). With the introduction of the StartSession API, you can restrict access by defining space ARNs in the resource section or implementing condition tags according to your specific security needs, as shown in the following example.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRemoteAccessByTag",
            "Effect": "Allow",
            "Action": [
                "sagemaker:StartSession"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/User": <user-identifier>
                }
            }
        }
    ]
}

Enable remote connectivity and launch VS Code from SageMaker Studio

To connect to a SageMaker space remotely, the space must have remote access enabled.

  1. Before running a space on the Studio UI, you can toggle Remote access on to enable the feature, as shown in the following screenshot.

  1. After the feature is enabled, choose Run space to start the space. After the space is running, choose Open in VS Code to launch VS Code.

  1. The first time you choose this option, you’ll be prompted by your browser to confirm opening VS Code. Select the checkbox Always allow studio to confirm and then choose Open Visual Studio Code.

  1. This will open VS Code, and you will be prompted to update your SSH configuration. Choose Update SSH config to complete the connection. This is also a one-time setup, and you will not be prompted for future connections.

  1. On successful connection, a new window launches that is connected to the SageMaker Studio space and has access to the Studio space’s storage.

Connect to the space from VS Code

Using the AWS Toolkit, you can list spaces, start and connect to a space, or connect to a running space that has remote connection enabled. If a running space doesn’t have remote connectivity enabled, you can stop the space from the AWS Toolkit and then select the Connect icon to automatically turn on remote connectivity and start the space. The following section describes the experience in detail.

  1. After you’re authenticated into AWS, from the AWS Toolkit, select the AWS Region where your SageMaker Studio domain is located. You will now see a SageMaker AI section. Choose the SageMaker AI section to list the spaces in your Region. If you’re connected using IAM, the toolkit lists the spaces across domains and users in your Region. See the Optional: Filter spaces to a specific domain or user section below for instructions on viewing spaces for a particular user profile. For Identity Center users, the list is already filtered to display only the spaces owned by you.

  1. After you identify the space, choose the connectivity icon as shown in the screenshot below to connect to the space.

Optional: Filter spaces to a specific domain or user

When connecting to an account using IAM, you will see a list of spaces in the account and region. This can be overwhelming if the account has tens or hundreds of domains, users and spaces. The toolkit provides a filter utility that helps you quickly filter the list of spaces to a specific user profile or a list of user profiles.

  1. Next to SageMaker AI, choose the filter icon as shown in the following screenshot.

  1. You will now see a list of user profiles and domains. Scroll through the list or enter user profile or domain name, and then select or unselect to filter the list of spaces by domain or user profile.

Use cases

The following use cases demonstrate how AI developers and ML engineers can use the local IDE connection capability.

Connecting to a notebook kernel

After you’re connected to the space, you can start creating and running notebooks and scripts right from your local development environment. With this method, you can use the managed infrastructure provided by SageMaker for resource-intensive AI tasks while coding in a familiar environment. You can run notebook cells on SageMaker Distribution or custom image kernels, and choose the IDE that maximizes your productivity. Use the following steps to create and connect your notebook to a remote kernel:

  1. On your VS Code file explorer, choose the plus (+) icon to create a new file and name it remote-kernel.ipynb.
  2. Open the notebook and run a cell (for example, print("Hello from remote IDE")). VS Code will show a pop-up for installing the Python and Jupyter extensions.
  3. Choose Install/Enable suggested extensions.
  4. After the extensions are installed, VS Code will automatically launch the kernel selector. You can also choose Select Kernel on the right to view the list of kernels.

For the next steps, follow the directions for the space you’re connected to.

Code Editor spaces:

  1. Select Python environments… and choose from a list of provided Python environments. After you are connected, you can start running the cells in your notebook.

JupyterLab spaces:

  1. Select the Existing Jupyter Server… option to have the same kernel experience as the JupyterLab environment.
    If this is the first time connecting to JupyterLab spaces, you will need to configure the Jupyter server to view the same kernels as the remote server using the following steps.

    1. Choose Enter the URL of the running Jupyter Server and enter http://localhost:8888/jupyterlab/default/lab as the URL and press Enter.
    2. Enter a custom server display name, for example, JupyterLab Space Default Server, and press Enter. You will now be able to view the list of kernels available on the remote Jupyter server. For subsequent connections, this display name will be available for you to choose from when you select the existing Jupyter server option.

The following graphic shows the entire workflow. In this example, we’re running a JupyterLab space with the SageMaker Distribution image, so we can view the list of kernels available in the image.

You can choose the kernel of your choice, for example, the Python 3 kernel, and you can start running the notebook cells on the remote kernel. With access to the SageMaker managed kernels, you can now focus on model development rather than infrastructure and runtime management, while using the development environment you know and trust.

Best practices and guardrails

  1. Follow the principle of least privilege when allowing users to connect remotely to SageMaker Studio spaces applications. SageMaker Studio supports custom tag propagation; we recommend tagging each user with a unique identifier and using that tag to restrict the StartSession API to only their private applications.
  2. As an administrator, if you want to disable this feature for your users, you can enforce it using the sagemaker:RemoteAccess condition key. The following is an example policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateSpaceWithRemoteAccessDisabled",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateSpace",
                "sagemaker:UpdateSpace"
                ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "sagemaker:RemoteAccess": [
                        "DISABLED"
                    ]
                }
            }
        },
        {
            "Sid": "AllowCreateSpaceWithNoRemoteAccess",
            "Effect": "Allow",
            "Action":  [
                "sagemaker:CreateSpace",
                "sagemaker:UpdateSpace"
                ],
            "Resource": "*",
            "Condition": {
                "Null": {
                    "sagemaker:RemoteAccess": "true"
                }
            }
        }
    ]
}
  3. When connecting remotely to the SageMaker Studio spaces from your local IDE, be aware of bandwidth constraints. For optimal performance, avoid using the remote connection to transfer or access large datasets. Instead, use data transfer methods built for the cloud and in-place data processing to facilitate a smooth user experience. We recommend an instance with at least 8 GB of storage to start with, and the SageMaker Studio UI will throw an exception if you choose a smaller instance.

Cleanup

If you have created a SageMaker Studio domain for the purposes of this post, remember to delete the applications, spaces, user profiles, and the domain. For instructions, see Delete a domain.

For the SageMaker Studio spaces, use the idle shutdown functionality to avoid incurring charges for compute when it is not in use.

Conclusion

The remote IDE connection feature for Amazon SageMaker Studio bridges the gap between local development environments and powerful ML infrastructure of SageMaker AI. With direct connections from local IDEs to SageMaker Studio spaces, developers and data scientists can now:

  • Maintain their preferred development environment while using the compute resources of SageMaker AI
  • Use custom extensions, debugging tools, and familiar workflows
  • Access governed data and ML resources within existing security boundaries
  • Choose between convenient deep linking or AWS Toolkit connection methods
  • Operate within enterprise-grade security controls and permissions

This integration minimizes the productivity barriers of context switching while facilitating secure access to SageMaker AI resources. Get started today with the SageMaker Studio remote IDE connection to link your local development environment to SageMaker Studio and experience streamlined ML development workflows using your familiar tools, backed by the powerful ML infrastructure of SageMaker AI.


About the authors


Durga Sury
 is a Senior Solutions Architect at Amazon SageMaker, where she helps enterprise customers build secure and scalable AI/ML systems. When she’s not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.

Edward Sun is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions and simplifying the customer experience to integrate SageMaker Studio with popular technologies in the data engineering and ML landscape. In his spare time, Edward is a big fan of camping, hiking, and fishing, and enjoys spending time with his family.

Raj Bagwe is a Senior Solutions Architect at Amazon Web Services, based in San Francisco, California. With over 6 years at AWS, he helps customers navigate complex technological challenges and specializes in Cloud Architecture, Security and Migrations. In his spare time, he coaches a robotics team and plays volleyball. He can be reached at X handle @rajesh_bagwe.

Sri Aakash Mandavilli is a Software Engineer on the Amazon SageMaker Studio team, where he has been building innovative products since 2021. He specializes in developing various solutions across the Studio service to enhance the machine learning development experience. Outside of work, SriAakash enjoys staying active through hiking, biking, and taking long walks.

Read More

Use K8sGPT and Amazon Bedrock for simplified Kubernetes cluster maintenance

Use K8sGPT and Amazon Bedrock for simplified Kubernetes cluster maintenance

As Kubernetes clusters grow in complexity, managing them efficiently becomes increasingly challenging. Troubleshooting modern Kubernetes environments requires deep expertise across multiple domains—networking, storage, security, and the expanding ecosystem of CNCF plugins. With Kubernetes now hosting mission-critical workloads, rapid issue resolution has become paramount to maintaining business continuity.

Integrating advanced generative AI tools like K8sGPT and Amazon Bedrock can revolutionize Kubernetes cluster operations and maintenance. These solutions go far beyond simple AI-powered troubleshooting, offering enterprise-grade operational intelligence that transforms how teams manage their infrastructure. Through pre-trained knowledge and both built-in and custom analyzers, these tools enable rapid debugging, continuous monitoring, and proactive issue identification—allowing teams to resolve problems before they impact critical workloads.

K8sGPT, a CNCF sandbox project, revolutionizes Kubernetes management by scanning clusters and providing actionable insights in plain English through cutting-edge AI models including Anthropic’s Claude, OpenAI, and Amazon SageMaker custom and open source models. Beyond basic troubleshooting, K8sGPT features sophisticated auto-remediation capabilities that function like an experienced Site Reliability Engineer (SRE), tracking change deltas against current cluster state, enforcing configurable risk thresholds, and providing rollback mechanisms through Mutation custom resources. Its Model Context Protocol (MCP) server support enables structured, real-time interaction with AI assistants for persistent cluster analysis and natural language operations. Amazon Bedrock complements this ecosystem by providing fully managed access to foundation models with seamless AWS integration. This approach represents a paradigm shift from reactive troubleshooting to proactive operational intelligence, where AI assists in resolving problems with enterprise-grade controls and complete audit trails.

This post demonstrates the best practices to run K8sGPT in AWS with Amazon Bedrock in two modes: K8sGPT CLI and K8sGPT Operator. It showcases how the solution can help SREs simplify Kubernetes cluster management through continuous monitoring and operational intelligence.

Solution overview

K8sGPT operates in two modes: the K8sGPT CLI for local, on-demand analysis, and the K8sGPT Operator for continuous in-cluster monitoring. The CLI offers flexibility through command-line interaction, and the Operator integrates with Kubernetes workflows, storing results as custom resources and enabling automated remediation. Both operational models can invoke Amazon Bedrock models to provide detailed analysis and recommendations.

K8sGPT CLI architecture

The following architecture diagram shows that after a user’s role is authenticated through AWS IAM Identity Center, the user runs the K8sGPT CLI to scan Amazon Elastic Kubernetes Service (Amazon EKS) resources and invoke an Amazon Bedrock model for analysis. The K8sGPT CLI provides an interactive interface for retrieving scan results, and model invocation logs are sent to Amazon CloudWatch for further monitoring. This setup facilitates troubleshooting and analysis of Kubernetes resources in the CLI, with Amazon Bedrock models offering insights and recommendations on the Amazon EKS environment.

The K8sGPT CLI comes with rich features, including a custom analyzer, filters, anonymization, remote caching, and integration options. See the Getting Started Guide for more details.

K8sGPT Operator architecture

The following architecture diagram shows a solution where the K8sGPT Operator installed in the EKS cluster uses Amazon Bedrock models to analyze and explain findings from the EKS cluster in real time, helping users understand issues and optimize workloads. The user retrieves these insights from the K8sGPT Operator by querying through a standard Kubernetes method such as kubectl. Model invocation logs, including detailed findings from the K8sGPT Operator, are logged in CloudWatch for further analysis.

In this mode, no additional CLI tools other than kubectl are required. In addition, the single sign-on (SSO) role that the user assumed doesn’t need Amazon Bedrock access, because the K8sGPT Operator will assume an AWS Identity and Access Management (IAM) machine role to invoke the Amazon Bedrock large language model (LLM).

When to use which modes

The following comparison summarizes the two modes and their common use cases.

K8sGPT CLI:

  • Access management – Human role (IAM Identity Center)
  • Features – Rich feature set: analyzer, filters, anonymization, and integration
  • Common use cases – Integration with supported tooling (such as Prometheus and Grafana), custom analyzer and filtering for detailed and custom analysis, anonymization requirements, and user-based troubleshooting

K8sGPT Operator:

  • Access management – Machine role (IAM)
  • Features – Continuous scan and error reconciliation, straightforward integration with AWS services, and flexibility in IAM permission changes
  • Common use cases – Continuous monitoring and operation, Kubernetes operational dashboards and Business as Usual (BAU) operation, and integration with observability tools or additional custom analyzers

In the following sections, we walk you through the two installation modes of K8sGPT.

Install the K8sGPT CLI

Complete the following steps to install the K8sGPT CLI:

  1. Enable Amazon Bedrock in the US West (Oregon) AWS Region. Make sure your role’s attached policies include the following AWS Marketplace actions so you can request or modify access to Amazon Bedrock FMs:
    1. aws-marketplace:Subscribe
    2. aws-marketplace:Unsubscribe
    3. aws-marketplace:ViewSubscriptions
  2. Request access to Amazon Bedrock FMs in US West (Oregon) Region:
    1. On the Amazon Bedrock console, in the navigation pane, under Bedrock configurations, choose Model access.
    2. On the Model access page, choose Enable specific models.
    3. Select the models, then choose Next and Submit to request access.
  3. Install K8sGPT following the official instructions.
  4. Add Amazon Bedrock and the FM as an AI backend provider to the K8sGPT configuration:
k8sgpt auth add --backend amazonbedrock --model anthropic.claude-3-5-sonnet-20240620-v1:0 --providerRegion <region-name>

Note: At the time of writing, K8sGPT includes support for Anthropic’s state-of-the-art Claude 4 Sonnet and 3.7 Sonnet models.

  5. Make the Amazon Bedrock backend default:
k8sgpt auth default -p amazonbedrock
  6. Update Kubeconfig to connect to an EKS cluster:
aws eks update-kubeconfig --region <region-name> --name my-cluster
  7. Analyze issues within the cluster using Amazon Bedrock:
k8sgpt analyze --explain --backend amazonbedrock

Install the K8sGPT Operator

To install the K8sGPT Operator, first complete the following prerequisites:

  1. Install the latest version of Helm. To check your version, run helm version.
  2. Install the latest version of eksctl. To check your version, run eksctl version.

Create the EKS cluster

Create an EKS cluster with eksctl with the pre-defined eksctl config file:

cat >cluster-config.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks
  region: us-west-2
  version: "1.32"
  tags:
    environment: eks
iam:
  withOIDC: true
  podIdentityAssociations: 
  - namespace: kube-system
    serviceAccountName: cluster-autoscaler
    roleName: pod-identity-role-cluster-autoscaler
    wellKnownPolicies:
      autoScaler: true  
managedNodeGroups:
  - name: managed-ng
    instanceType: m5.large
    minSize: 2
    desiredCapacity: 3
    maxSize: 5
    privateNetworking: true
    volumeSize: 30
    volumeType: gp3 
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/eks: "owned"      
addonsConfig:
  autoApplyPodIdentityAssociations: true
addons:
  - name: eks-pod-identity-agent 
    tags:
      team: eks
  - name: vpc-cni
    version: latest  
  - name: aws-ebs-csi-driver
    version: latest  
  - name: coredns
    version: latest 
  - name: kube-proxy
    version: latest
cloudWatch:
 clusterLogging:
   enableTypes: ["*"]
   logRetentionInDays: 30
EOF

eksctl create cluster -f cluster-config.yaml

You should get the following expected output:
EKS cluster "eks" in "us-west-2" region is ready

Create an Amazon Bedrock and CloudWatch VPC private endpoint (optional)

To keep communication between Amazon EKS and Amazon Bedrock, as well as CloudWatch, private, we recommend using a virtual private cloud (VPC) interface endpoint. This keeps the traffic within the VPC, providing a secure and private channel.

Refer to Create a VPC endpoint to set up the Amazon Bedrock and CloudWatch VPC endpoints.
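
If you prefer to create the endpoints programmatically, the following is a minimal boto3 sketch for the Amazon Bedrock runtime endpoint; the VPC, subnet, and security group IDs are placeholders, and a similar call with the service name com.amazonaws.us-west-2.logs covers CloudWatch Logs.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Placeholder IDs: substitute the VPC, private subnets, and security group used by your EKS cluster
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.bedrock-runtime",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])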

Create an IAM policy, trust policy, and role

Complete the following steps to create an IAM policy, trust policy, and role to only allow the K8sGPT Operator to interact with Amazon Bedrock for least privilege:

  1. Create a role policy with Amazon Bedrock permissions:
cat >k8sgpt-bedrock-permission.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0" 
    }
  ]
}
EOF
  2. Create a permission policy:
aws iam create-policy \
    --policy-name bedrock-k8sgpt-policy \
    --policy-document file://k8sgpt-bedrock-permission.json
  3. Create a trust policy:
cat >k8sgpt-bedrock-Trust-Policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}
EOF
  4. Create a role and attach the trust policy:
aws iam create-role \
    --role-name k8sgpt-bedrock \
    --assume-role-policy-document file://k8sgpt-bedrock-Trust-Policy.json
aws iam attach-role-policy --role-name k8sgpt-bedrock --policy-arn=arn:aws:iam::123456789:policy/bedrock-k8sgpt-policy

Install Prometheus

Prometheus will be used for monitoring. Use the following command to install Prometheus using Helm in the k8sgpt-operator-system namespace:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update 
helm install prometheus prometheus-community/kube-prometheus-stack -n k8sgpt-operator-system --create-namespace

Install the K8sGPT Operator through Helm

Install the K8sGPT Operator through Helm with Prometheus and Grafana enabled:

helm upgrade --install release k8sgpt/k8sgpt-operator -n k8sgpt-operator-system --set serviceAccount.annotations."eks.amazonaws.com/role-arn"=arn:aws:iam::123456789:role/k8sgpt-bedrock --set serviceMonitor.enabled=true --set grafanaDashboard.enabled=true

Patch the K8sGPT controller manager to be recognized by the Prometheus operator:

kubectl -n k8sgpt-operator-system patch serviceMonitor release-k8sgpt-operator-controller-manager-metrics-monitor -p '{"metadata":{"labels":{"release":"prometheus"}}}' --type=merge

Associate EKS Pod Identity

EKS Pod Identity is an AWS feature that simplifies how Kubernetes applications obtain IAM permissions by empowering cluster administrators to associate IAM roles that have least privileged permissions with Kubernetes service accounts directly through Amazon EKS. It provides a simple way to allow EKS pods to call AWS services such as Amazon Simple Storage Service (Amazon S3). Refer to Learn how EKS Pod Identity grants pods access to AWS services for more details.

Use the following command to perform the association:

aws eks create-pod-identity-association \
          --cluster-name eks \
          --namespace k8sgpt-operator-system \
          --service-account k8sgpt-k8sgpt-operator-system \
          --role-arn arn:aws:iam::123456789:role/k8sgpt-bedrock

Scan the cluster with Amazon Bedrock as the backend

Complete the following steps:

  1. Deploy a K8sGPT resource using the following YAML, using Anthropic’s Claude 3.5 model on Amazon Bedrock as the backend:
cat > k8sgpt-bedrock.yaml<<EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: bedrock
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: anthropic.claude-3-5-sonnet-20240620-v1:0
    region: us-west-2
    backend: amazonbedrock
    language: english
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.4.12
EOF

kubectl apply -f k8sgpt-bedrock.yaml
  2. When the k8sgpt-bedrock pod is running, use the following command to check the list of scan results:
kubectl get results -n k8sgpt-operator-system
  3. Use the following command to check the details of each scan result (a Python alternative follows these steps):
kubectl get results <scanresult> -n k8sgpt-operator-system -o json
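
If you want to consume scan results programmatically, for example to feed them into a ticketing or reporting workflow, the following is a minimal sketch using the official Kubernetes Python client. It assumes the Result custom resources live in the k8sgpt-operator-system namespace under the core.k8sgpt.ai/v1alpha1 API group shown in the K8sGPT resource you deployed; field names such as spec.details may vary by operator version.

from kubernetes import client, config

# Use the same credentials and context as kubectl
config.load_kube_config()
api = client.CustomObjectsApi()

# List the K8sGPT Result custom resources created by the operator
results = api.list_namespaced_custom_object(
    group="core.k8sgpt.ai",
    version="v1alpha1",
    namespace="k8sgpt-operator-system",
    plural="results",
)

for item in results.get("items", []):
    name = item["metadata"]["name"]
    details = item.get("spec", {}).get("details", "")
    print(f"{name}: {details[:200]}")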

Set up Amazon Bedrock invocation logging

Complete the following steps to enable Amazon Bedrock invocation logging, forwarding to CloudWatch or Amazon S3 as log destinations:

  1. Create a CloudWatch log group:
    1. On the CloudWatch console, choose Log groups under Logs in the navigation pane.
    2. Choose Create log group.
    3. Provide details for the log group, then choose Create.

  2. Enable model invocation logging:
    1. On the Amazon Bedrock console, under Bedrock configurations in the navigation pane, choose Settings.
    2. Enable Model invocation logging.
    3. Select which data requests and responses you want to publish to the logs.
    4. Select CloudWatch Logs only under Select the logging destinations and enter the invocation logs group name.
    5. For Choose a method to authorize Bedrock, select Create and use a new role.
    6. Choose Save settings.
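
As an alternative to the console steps above, you can enable invocation logging with a short boto3 call. The following sketch assumes you have already created the log group and an IAM role that Amazon Bedrock can assume to write to CloudWatch Logs; both names are placeholders.

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

# Placeholder names: replace with your log group and a role trusted by bedrock.amazonaws.com
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "bedrock-invocation-logs",
            "roleArn": "arn:aws:iam::123456789:role/bedrock-invocation-logging-role",
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)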

Use case: Continuously scan the EKS cluster with the K8sGPT Operator

This section demonstrates how to leverage the K8sGPT Operator for continuous monitoring of your Amazon EKS cluster. By integrating with popular observability tools, the solution provides comprehensive cluster health visibility through two key interfaces: a Grafana dashboard that visualizes scan results and cluster health metrics, and CloudWatch logs that capture detailed AI-powered analysis and recommendations from Amazon Bedrock. This automated approach eliminates the need for manual kubectl commands while ensuring proactive identification and resolution of potential issues. The integration with existing monitoring tools streamlines operations and helps maintain optimal cluster health through continuous assessment and intelligent insights.

Observe the health status of your EKS cluster through Grafana

Run the following port-forward command, then log in to the Grafana dashboard at localhost:3000 using the default credentials shown below:

kubectl port-forward service/prometheus-grafana -n k8sgpt-operator-system 3000:80
admin-password: prom-operator
admin-user: admin

The following screenshot showcases the K8sGPT Overview dashboard.

The dashboard features the following:

  • The Result Kind types section represents the breakdown of the different Kubernetes resource types, such as services, pods, or deployments, that experienced issues based on the K8sGPT scan results
  • The Analysis Results section represents the number of scan results based on the K8sGPT scan
  • The Results over time section represents the count of scan results change over time
  • The rest of the metrics showcase the performance of the K8sGPT controller over time, which help in monitoring the operational efficiency of the K8sGPT Operator

Use a CloudWatch dashboard to check identified issues and get recommendations

Amazon Bedrock model invocation logs are logged into CloudWatch, which we set up previously. You can use a CloudWatch Logs Insights query to filter model invocation input and output for cluster scan recommendations and output as a dashboard for quick access. Complete the following steps:

  1. On the CloudWatch console, create a dashboard.

  2. On the CloudWatch console, choose the CloudWatch log group and run the following query to filter the scan results produced by the K8sGPT Operator:
fields @timestamp, input.inputBodyJson.prompt, output.outputBodyJson.completion
| sort @timestamp desc
| filter identity.arn like "k8sgpt-bedrock"
  3. Choose Create Widget to save the dashboard.

The dashboard will automatically show the model invocation logs with input and output from the K8sGPT Operator. You can expand a log entry to check the model input for errors and the output for recommendations from the Amazon Bedrock backend.
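
You can also run the same Logs Insights query programmatically, which is useful for exporting recommendations or wiring them into alerts. The following is a minimal sketch; the log group name is a placeholder for the invocation log group created earlier.

import time
import boto3

logs = boto3.client("logs", region_name="us-west-2")

# Placeholder log group name: use the invocation log group created earlier
query = logs.start_query(
    logGroupName="bedrock-invocation-logs",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        'fields @timestamp, input.inputBodyJson.prompt, output.outputBodyJson.completion '
        '| sort @timestamp desc '
        '| filter identity.arn like "k8sgpt-bedrock"'
    ),
)

# Poll until the query finishes, then print each row as a dictionary
while True:
    response = logs.get_query_results(queryId=query["queryId"])
    if response["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in response.get("results", []):
    print({field["field"]: field["value"] for field in row})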

Extend K8sGPT with custom analyzers

K8sGPT’s custom analyzers feature enables teams to create specialized checks for their Kubernetes environments, extending beyond the built-in analysis capabilities. This powerful extension mechanism allows organizations to codify their specific operational requirements and best practices into K8sGPT’s scanning process, making it possible to monitor aspects of cluster health that aren’t covered by default analyzers.

You can create custom analyzers to monitor various aspects of your cluster health. For example, you might want to monitor Linux disk usage on nodes – a common operational concern that could impact cluster stability. The following steps demonstrate how to implement and deploy such an analyzer:

First, create the analyzer code:

package analyzer

import (
    "context"
    "fmt"

    rpc "buf.build/gen/go/k8sgpt-ai/k8sgpt/grpc/go/schema/v1/schemav1grpc"
    v1 "buf.build/gen/go/k8sgpt-ai/k8sgpt/protocolbuffers/go/schema/v1"
    "github.com/ricochet2200/go-disk-usage/du"
)

// Run reports root filesystem disk usage as a K8sGPT analyzer result.
func (a *Handler) Run(context.Context, *v1.RunRequest) (*v1.RunResponse, error) {
    // Calculate the percentage of used space on the root filesystem
    usage := du.NewDiskUsage("/")
    diskUsage := int((usage.Size() - usage.Free()) * 100 / usage.Size())
    return &v1.RunResponse{
        Result: &v1.Result{
            Name:    "diskuse",
            Details: fmt.Sprintf("Disk usage is %d%%", diskUsage),
            Error: []*v1.ErrorDetail{{
                Text: fmt.Sprintf("High disk usage detected: %d%%", diskUsage),
            }},
        },
    }, nil
}

Build your analyzer into a Docker image and deploy it to your cluster using the following manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: diskuse-analyzer
  namespace: k8sgpt-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: diskuse-analyzer
  template:
    metadata:
      labels:
        app: diskuse-analyzer
    spec:
      containers:
      - name: diskuse-analyzer
        image: <your-registry>/diskuse-analyzer:latest
        ports:
        - containerPort: 8085
---
apiVersion: v1
kind: Service
metadata:
  name: diskuse-analyzer
  namespace: k8sgpt-system
spec:
  selector:
    app: diskuse-analyzer
  ports:
    - protocol: TCP
      port: 8085
      targetPort: 8085

Finally, configure K8sGPT to use your custom analyzer:

apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-instance
  namespace: k8sgpt-system
spec:
  customAnalyzers:
    - name: diskuse
      connection:
        url: diskuse-analyzer
        port: 8085

This approach allows you to extend K8sGPT’s capabilities while maintaining its integration within the Kubernetes ecosystem. Custom analyzers can be used to implement specialized health checks, security scans, or any other cluster analysis logic specific to your organization’s needs. When combined with K8sGPT’s AI-powered analysis through Amazon Bedrock, these custom checks provide detailed, actionable insights in plain English, helping teams quickly understand and resolve potential issues.

K8sGPT privacy considerations

K8sGPT collects data through its analyzers, including container status messages and pod details, which can be displayed to users or sent to an AI backend when the --explain flag is used. Data sharing with the AI backend occurs only if the user opts in by using this flag and authenticates with the backend. To enhance privacy, you can anonymize sensitive data such as deployment names and namespaces with the --anonymize flag before sharing. K8sGPT doesn’t collect logs or API server data beyond what is necessary for its analysis functions. These practices make sure users have control over their data and that it is handled securely and transparently. For more information, refer to Privacy in the K8sGPT documentation.

Clean up

Complete the following steps to clean up your resources:

  1. Run the following command to delete the EKS cluster:
eksctl delete cluster -f cluster-config.yaml
  2. Delete the IAM role (k8sgpt-bedrock).
  3. Delete the CloudWatch logs and dashboard.

Conclusion

The K8sGPT and Amazon Bedrock integration can revolutionize Kubernetes maintenance using AI for cluster scanning, issue diagnosis, and actionable insights. The post discussed best practices for K8sGPT on Amazon Bedrock in CLI and Operator modes and highlighted use cases for simplified cluster management. This solution combines K8sGPT’s SRE expertise with Amazon Bedrock FMs to automate tasks, predict issues, and optimize resources, reducing operational overhead and enhancing performance.

You can use these best practices to identify and implement the most suitable use cases for your specific operational and management needs. By doing so, you can effectively improve Kubernetes management efficiency and achieve higher productivity in your DevOps and SRE workflows.

To learn more about K8sGPT and Amazon Bedrock, refer to the following resources:


About the authors

Angela Wang is a Technical Account Manager based in Australia with over 10 years of IT experience, specializing in cloud-native technologies and Kubernetes. She works closely with customers to troubleshoot complex issues, optimize platform performance, and implement best practices for cost optimized, reliable and scalable cloud-native environments. Her hands-on expertise and strategic guidance make her a trusted partner in navigating modern infrastructure challenges.

Haofei Feng is a Senior Cloud Architect at AWS with over 18 years of expertise in DevOps, IT Infrastructure, Data Analytics, and AI. He specializes in guiding organizations through cloud transformation and generative AI initiatives, designing scalable and secure GenAI solutions on AWS. Based in Sydney, Australia, when not architecting solutions for clients, he cherishes time with his family and Border Collies.

Eva Li is a Technical Account Manager at AWS located in Australia with over 10 years of experience in the IT industry. Specializing in IT infrastructure, cloud architecture and Kubernetes, she guides enterprise customers to navigate their cloud transformation journeys and optimize their AWS environments. Her expertise in cloud architecture, containerization, and infrastructure automation helps organizations bridge the gap between business objectives and technical implementation. Outside of work, she enjoys yoga and exploring Australia’s bush walking trails with friends.

Alex Jones is a Principal Engineer at AWS. His career has focused largely on highly constrained environments for physical and digital infrastructure. Working at companies such as Microsoft, Canoncial and American Express, he has been both an engineering leader and individual contributor. Outside of work he has founded several popular projects such as OpenFeature and more recently the GenAI accelerator for Kubernetes, K8sGPT. Based in London, Alex has a partner and two children.

Read More

How Rocket streamlines the home buying experience with Amazon Bedrock Agents

How Rocket streamlines the home buying experience with Amazon Bedrock Agents

Rocket Companies is a Detroit-based FinTech company with a mission to “Help Everyone Home.” Although known to many as a mortgage lender, Rocket’s mission extends to the entire home ownership journey from finding the perfect home, purchasing, financing, and using your home equity. Rocket has grown by making the complex simple, empowering clients to navigate the home ownership journey through intuitive, technology-driven solutions. Rocket’s web and mobile app brings together home search, financing, and servicing in one seamless experience. By combining data analytics and their 11PB of data with advanced automation, Rocket speeds up everything from loan approval to servicing, while maintaining a personalized touch at scale.

Rocket’s client-first approach is central to everything they do. With customizable digital tools and expert guidance from skilled mortgage bankers, Rocket aims to match every client with the right product and the right support quickly, accurately, and securely.

With the advent of generative AI, Rocket recognized an opportunity to go further. Buying a home can still feel overwhelming. This led Rocket to ask: How can we offer the same trusted guidance our clients expect at any hour, on any channel? The result is Rocket AI Agent, a conversational AI assistant designed to transform how clients engage with Rocket’s digital properties. Built on Amazon Bedrock Agents, the Rocket AI Agent combines deep domain knowledge, personalized guidance, and the ability to perform meaningful actions on behalf of clients. Since its launch, it has become a central part of Rocket’s client experience. Clients who interact with Rocket AI Agent are three times more likely to close a loan compared to those who don’t.

Because it’s embedded directly into Rocket’s web and mobile services, it delivers support exactly when and where clients need it. This post explores how Rocket brought that vision to life using Amazon Bedrock Agents, powering a new era of AI-driven support that is consistently available, deeply personalized, and built to take action.

Introducing Rocket AI Agent: A personalized AI homeownership guide

Rocket AI Agent is now available across the majority of Rocket’s web pages and mobile apps. It’s helping clients during loan origination, in servicing, and even within Rocket’s third-party broker system (Rocket Pro), essentially meeting clients wherever they interact with Rocket digitally.

The Rocket AI Agent is a purpose-built AI agent designed to do more than answer questions. It delivers real-time, personalized guidance and takes action when needed. It offers:

  • 24/7, multilingual assistance through Rocket’s website and mobile services
  • Contextual awareness, so Rocket AI Agent knows what page the client is viewing and tailors its responses based on this context
  • Real-time answers about mortgage options, rates, documents, and processes
  • Guided self-service actions, such as filling out preapproval forms or scheduling payments
  • Personalized experiences using Rocket’s proprietary data and user context
  • Seamless transitions to Rocket Mortgage bankers when human support is needed

Whether someone wants to know why their escrow changed or how to qualify for a refinance, Rocket AI Agent is designed to respond with clarity, confidence, and action.

Amazon Bedrock Agents

Amazon Bedrock Agents is a fully managed, cloud-based capability that customers use to quickly build, test, and scale agentic AI applications on Amazon Web Services (AWS). With built-in integrations and security, customers like Rocket use Amazon Bedrock Agents to accelerate from proof-of-concept to production securely and reliably. These agents extend foundation models (FMs) using the Reasoning and acting (ReAct) framework, allowing them to interpret user intent, plan and execute tasks, and integrate seamlessly with enterprise data and APIs much like a skilled digital assistant.

Agents use the FM to analyze a user’s request, break it into actionable steps, retrieve relevant data, and trigger downstream APIs to complete tasks. This allows Rocket AI Agent to move beyond passive support into proactive assistance, helping clients navigate complex financial processes in real time.

Key capabilities of Amazon Bedrock Agents used in Rocket AI Agent include:

  • Agent instructions – Set the agent’s objective and role (for example, a mortgage servicing expert), enabling goal-oriented behavior
  • Amazon Bedrock Knowledge Bases – Provide fast, accurate retrieval of information from Rocket’s Learning Center and other proprietary documents
  • Action group – Define secure operations—such as submitting leads or scheduling payments—that the agent can execute by interacting with Rocket’s backend services
  • Agent memory – Memory retention allows Rocket AI Agent to maintain contextual awareness across multiple turns, enhancing user experience with more natural, personalized interactions.
  • Amazon Bedrock Guardrails – Supports Rocket’s responsible AI goals by making sure that the agent stays within appropriate topic boundaries.

By combining structured reasoning with the ability to act across systems, Amazon Bedrock Agents empower Rocket AI Agent to deliver outcomes, not just answers.
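
To make the interaction pattern concrete, the following is a generic, illustrative sketch of invoking an Amazon Bedrock agent with boto3; the agent ID, alias ID, Region, and prompt are placeholders and do not reflect Rocket’s actual configuration.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholder agent and alias IDs; reuse the same sessionId across turns to keep conversational memory
response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId="client-session-123",
    inputText="Why did my escrow payment change this month?",
)

# The response is an event stream; concatenate the returned text chunks
answer = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        answer += chunk["bytes"].decode("utf-8")

print(answer)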

How the Rocket AI Agent works: Architecture overview

The Rocket AI Agent is a centralized capability deployed across Rocket’s suite of digital properties, designed for scale, flexibility, and job-specific precision. At the core of its architecture is a growing network of domain-specific agents (currently eight), each focused on distinct functions such as loan origination, servicing, or broker support. These agents work together behind a unified interface to provide seamless, context-aware assistance. The following diagram shows the solution architecture.

The following elements and workflow steps shape Rocket AI Agent’s architecture:

  1. Client initiation: The client uses the chat function within Rocket’s mobile app or web page
  2. Rocket AI Agent API: Rocket’s AI Agent API provides a unified API interface to the agents supporting the chat functionality
  3. Agent routing: The AI Agent API routes the request to the correct Amazon Bedrock agent based on static criteria, such as the web or mobile property through which the client entered the chat, or on LLM-based intent identification
  4. Agent processing: The agent breaks the task into subtasks, determines the right sequence, and executes actions and knowledge as it works
  5. Task execution: The agent uses Rocket data in knowledge bases to find information, sends results to the client, and performs actions to get work done
  6. Guardrails: Enforce Rocket’s responsible AI policies by blocking topics and language that deviate from the goals of the experience
  7. Prompt management: Helps Rocket manage a library of prompts for its AI agents and optimize prompts for particular FMs

This modular, scalable design has allowed Rocket to serve diverse client needs efficiently and consistently across services and across the homeownership lifecycle.

Impact and outcomes

Since launching Rocket AI Agent, we’ve seen transformative improvements across the client journey and internal operations:

  • Threefold increase in conversion rates from web traffic to closed loans, as Rocket AI Agent captures leads around the clock even outside traditional business hours.
  • Operational efficiency gains, particularly through chat containment. With the implementation of the AI assistant to support prospective clients exploring Rocket’s offerings, Rocket saw an 85% decrease in transfer to customer care and a 45% decrease in transfer to servicing specialists. This reduction in handoffs to human agents has freed up team capacity to focus on more complex, high-impact client needs.
  • Higher customer satisfaction (CSAT) scores, with 68% of clients providing high satisfaction ratings across servicing and origination chat interactions. Top drivers include quick response times, clear communication, and accurate information, all contributing to greater client trust and reduced friction.
  • Stronger client engagement, with users completing more tasks independently, driven by intuitive, personalized self-service capabilities.
  • Greater personalization and flexibility. Rocket AI Agents adapt to each client’s stage in the homeownership journey and their preferences, offering the ability to escalate to a banker on their terms. This personalized support reflects Rocket’s core mission to “Help Everyone Home,” by meeting clients where they are and giving them the confidence to move forward.
  • Expanded language support, including Spanish-language assistance, to better serve a diverse and growing demographic.

Rocket has deployed Rocket AI Agent across its digital services, including the servicing portal and third-party broker systems, providing continuity of experience wherever clients engage. By delivering consistent, on-brand support across these touchpoints, Rocket is transforming the way clients experience homeownership. Through the personalization capabilities of Amazon Bedrock Agents, Rocket can tailor every interaction to a client’s context and preferences, bringing its mission to “Help Everyone Home” to life through scalable, intelligent engagement.

Lessons learned

Throughout the development and deployment of the Rocket AI Agent, the Rocket team uncovered several key lessons that shaped both its technical strategy and the overall client experience. These insights can serve as valuable guidance for other organizations building generative AI applications at scale:

  • Curate your data carefully: The quality of responses generated by generative AI is closely tied to the quality and structure of its source data. Rocket built their enterprise knowledge base using Amazon Bedrock Knowledge Bases, which internally uses Amazon Kendra for retrieval across Rocket’s content libraries, including FAQs, compliance documents, and servicing workflows.
  • Limit the agent’s scope per task: Rocket found that assigning each agent a tight scope of 3–5 actions led to more maintainable, testable, and high-performing agents. For example, the payment agent focuses only on tasks like scheduling payments and providing due dates, while the refinance agent handles rate simulations and lead capture. Each agent’s capabilities are implemented as Amazon Bedrock action groups with well-documented interfaces, and task resolution rates are monitored separately for each agent.
  • Prioritize graceful escalation: Escalation isn’t failure, it’s a critical part of user trust. Rocket implemented uncertainty thresholds using confidence scores and specific keyword triggers to detect when an interaction might require human assistance. In those cases, Rocket AI Agent proactively transitions the session to a live support agent or gives the user the option to escalate. This avoids frustrating conversational loops and makes sure that complex or sensitive interactions receive the appropriate level of human care.
  • Expect user behavior to evolve: Real-world usage is dynamic. Clients will interact with the system in unexpected ways, and patterns change over time. Investing in observability and user feedback loops is essential for adapting quickly.
  • Use cross-Region inference from the start: To provide scalable, resilient model performance, Rocket enabled cross-Region inference early in development. This allows inference requests to be routed to the optimal AWS Region within the supported geography, improving latency and model availability by automatically distributing load based on capacity. During peak traffic windows, such as product launches or interest rate shifts, this architecture has allowed Rocket to avoid Regional service quota bottlenecks, maintain responsiveness, and increase throughput by taking advantage of compute capacity across multiple AWS Regions. The result is a smoother, more consistent user experience even under bursty, unpredictable load conditions.

These lessons are a reminder that although generative AI can unlock powerful capabilities, thoughtful implementation is key to delivering sustainable value and trusted experiences.

What’s Next: Moving toward multi-agent collaboration

Rocket is just beginning to realize the potential of agentic AI. Building on the success of domain-specific agents, the next phase focuses on scaling these capabilities through multi-agent collaboration powered by Amazon Bedrock Agents. This evolution will allow Rocket to orchestrate agents across domains and deliver intelligent, end-to-end experiences that mirror the complexity of real client journeys.

By enabling agents to work together seamlessly, Rocket is laying the groundwork for a future where AI not only responds to questions but proactively navigates entire workflows from discovery and qualification to servicing and beyond.

Benefits for Rocket

Multi-agent collaboration marks a transformative step forward in Rocket’s journey to build agentic AI–powered experiences that reimagine homeownership from the very first question to the final signature. By enabling multiple specialized agents to coordinate within a single conversation, Rocket can unlock a new level of intelligence, automation, and personalization across its digital services.

  • End-to-end personalization: By allowing multiple domain-specific agents (such as refinance, servicing, and loan options) to share context and coordinate, Rocket can deliver more tailored, intelligent responses that evolve with the client’s homeownership journey in real time.
  • Back-office integration: With agents capable of invoking secure backend APIs and workflows, Rocket can begin to automate parts of its back-office operations, such as document verification, follow-ups, and lead routing, improving speed, accuracy, and operational efficiency.
  • Context switching: Move fluidly between servicing, origination, and refinancing within one chat.
  • Orchestration: Handle multistep tasks that span multiple Rocket business units.

With multi-agent orchestration, Rocket is laying the foundation for a consistently available, deeply personalized assistant that not only answers questions but drives meaningful outcomes from home search to loan closing and beyond. It represents the next chapter in Rocket’s mission to “Help Everyone Home.”

Conclusion

Rocket AI Agent is more than a digital assistant. It’s a reimagined approach to client engagement, powered by agentic AI. By combining Amazon Bedrock Agents with Rocket’s proprietary data and backend systems, Rocket has created a smarter, more scalable, and more human experience available 24/7, without the wait.

To dive deeper into building intelligent, multi-agent applications with Amazon Bedrock Agents, explore the AWS workshop, Unified User Experiences with Hierarchical Multi-Agent Collaboration. This hands-on workshop includes open source code and best practices drawn from real-world financial services implementations, demonstrating how multi-agent systems can automate complex workflows to deliver next-generation customer experience.

Rocket puts it simply: “Together with AWS, we’re getting started. Our goal is to empower every client to move forward with confidence and, ultimately, to Help Everyone Home.”


About the authors

Manali Sapre is a Senior Director at Rocket Mortgage, bringing over 20 years of experience leading transformative technology initiatives across the company. She has been at the forefront of innovation—spearheading Rocket’s first-generation AI chat platform, building the company’s original digital mortgage application, and launching scalable lead generation systems. Manali has also led multiple AI-driven initiatives focused on banker efficiency and internal productivity, helping to embed smart, human-centric technology into the daily workflows of team members. Her passion lies in solving complex challenges through collaboration, mentoring the next generation of tech leaders, and creating intuitive, high-impact experiences. Outside of work, Manali enjoys hiking, traveling, and spending quality time with her family.

Seshidhar Raghupathi is a software architect at Rocket with over 12 years of experience driving innovation, scalability, and system resilience across AI and client communication platforms. He was instrumental in developing Rocket’s first cloud-based digital mortgage application and has since led several impactful initiatives to enhance intelligent, personalized client experiences. His expertise spans backend architecture, AI integration, platform modernization, and cross-team enablement. He is known for his ability to execute tactically while aligning with long-term strategic goals, particularly in enhancing security, scalability, and user experience. Outside of work, Seshi enjoys spending time with family, playing sports, and connecting with friends.

Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services, driving AI-led transformation across North America’s FinTech sector. He partners with organizations to design and execute cloud and AI strategies that speed up innovation and deliver measurable business impact. His work has consistently translated into millions in value through enhanced efficiency and additional revenue streams. With deep expertise in AI/ML, Generative AI, and cloud-native architectures, Sajjan enables financial institutions to achieve scalable, data-driven outcomes. When not architecting the future of finance, he enjoys traveling and spending time with family. Connect with him on LinkedIn.

Axel Larsson is a Principal Solutions Architect at AWS based in the greater New York City area. He supports FinTech customers and is passionate about helping them transform their business through cloud and AI technology. Outside of work, he is an avid tinkerer and enjoys experimenting with home automation.

Read More

Build an MCP application with Mistral models on AWS

Build an MCP application with Mistral models on AWS

This post is cowritten with Siddhant Waghjale and Samuel Barry from Mistral AI.

Model Context Protocol (MCP) is a standard that has been gaining significant traction in recent months. At a high level, it consists of a standardized interface designed to streamline and enhance how AI models interact with external data sources and systems. Instead of hardcoding retrieval and action logic or relying on one-time tools, MCP offers a structured way to pass contextual data (for example, user profiles, environment metadata, or third-party content) into a large language model (LLM) context and to route model outputs to external systems. For developers, MCP abstracts away integration complexity and creates a unified layer for injecting external knowledge and executing model actions, making it more straightforward to build robust and efficient agentic AI systems that remain decoupled from data-fetching logic.

Mistral AI is a frontier research lab that emerged in 2023 as a leading open source contender in the field of generative AI. Mistral has released many state-of-the-art models, from Mistral 7B and Mixtral in the early days up to the recently announced Mistral Medium 3 and Small 3—effectively popularizing the mixture of experts architecture along the way. Mistral models are generally described as extremely efficient and versatile, frequently reaching state-of-the-art levels of performance at a fraction of the cost. These models are now seamlessly integrated into Amazon Web Services (AWS) services, unlocking powerful deployment options for developers and enterprises. Through Amazon Bedrock, users can access Mistral models using a fully managed API, enabling rapid prototyping without managing infrastructure. Amazon Bedrock Marketplace further extends this by allowing quick model discovery, licensing, and integration into existing workflows. For power users seeking fine-tuning or custom training, Amazon SageMaker JumpStart offers a streamlined environment to customize Mistral models with their own data, using the scalable infrastructure of AWS. This integration makes it faster than ever to experiment, scale, and productionize Mistral models across a wide range of applications.

This post demonstrates building an intelligent AI assistant using Mistral AI models on AWS and MCP, integrating real-time location services, time data, and contextual memory to handle complex multimodal queries. This use case, restaurant recommendations, serves as an example, but this extensible framework can be adapted for enterprise use cases by modifying MCP server configurations to connect with your specific data sources and business systems.

Solution overview

This solution uses Mistral models on Amazon Bedrock to understand user queries and route the query to relevant MCP servers to provide accurate and up-to-date answers. The system follows this general flow:

  1. User input – The user sends a query (text, image, or both) through either a terminal-based or web-based Gradio interface
  2. Image processing – If an image is detected, the system processes and optimizes it for the AI model
  3. Model request – The query is sent to the Amazon Bedrock Converse API with appropriate system instructions
  4. Tool detection – If the model determines it needs external data, it requests a tool invocation
  5. Tool execution – The system routes the tool request to the appropriate MCP server and executes it
  6. Response generation – The model incorporates the tool’s results to generate a comprehensive response
  7. Response delivery – The final answer is displayed to the user

In this example, we demonstrate the MCP framework using a general use case of restaurant or location recommendation and route planning. Users can provide multimodal input (such as text plus image), and the application integrates Google Maps, Time, and Memory MCP servers. Additionally, this post showcases how to use the Strands Agent framework as an alternative approach to build the same MCP application with significantly reduced complexity and code. Strands Agent is an open source, multi-agent coordination framework that simplifies the development of intelligent, context-aware agent systems across various domains. You can build your own MCP application by modifying the MCP server configurations to suit your specific needs. You can find the complete source code for this example in our Git repository. The following diagram is the solution architecture.

MCP Module architecture with Host, Clients, Servers components bridging UI and Bedrock foundation models

Prerequisites

Before implementing the example, you need to set up the account and environment. Use the following steps.

To set up the AWS account:

  1. Create an AWS account. If you don’t already have one, sign up at https://aws.amazon.com
  2. To enable Amazon Bedrock access, go to the Amazon Bedrock console and request access to the models you plan to use (for this walkthrough, request access to Mistral Pixtral Large). Or deploy Mistral Small 3 model from Amazon Bedrock Marketplace. (For more details, refer to the Mistral Model Deployments on AWS section later in this post.) When your request is approved, you’ll be able to use these models through the Amazon Bedrock Converse API

To set up the local environment:

  1. Install the required tools:
    1. Python 3.10 or later
    2. Node.js (required for MCP tool servers)
    3. AWS Command Line Interface (AWS CLI), which is needed for configuration
  2. Clone the Repository:
git clone https://github.com/aws-samples/mistral-on-aws.git
cd mistral-on-aws/MCP/MCP_Mistral_app_demo/
  3. Install Python dependencies:
pip install -r requirements.txt
  4. Configure AWS credentials:
aws configure

Then enter your AWS access key ID, secret access key, and preferred AWS Region.

  5. Set up MCP tool servers. The server configurations are provided in the file server_configs.py. The system uses Node.js-based MCP servers, which are installed automatically through NPM the first time you run the application. You can add other MCP server configurations to this file to quickly adapt and extend the solution to your business requirements (a sketch of a possible entry follows this list).
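
The following is a hypothetical sketch of what an entry in server_configs.py might look like, using the MCP Python SDK’s StdioServerParameters to launch Node.js reference servers through npx; the exact structure, variable names, and server packages in the repository may differ.

from mcp import StdioServerParameters

# Hypothetical example entries; adjust packages and environment variables to your needs
SERVER_CONFIGS = [
    # Google Maps reference MCP server for location and route lookups
    StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-google-maps"],
        env={"GOOGLE_MAPS_API_KEY": "<your-api-key>"},
    ),
    # Memory server that gives the assistant simple contextual recall
    StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-memory"],
    ),
]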

Mistral model deployments on AWS

Mistral models can be accessed or deployed using the following methods. To use foundation models (FMs) in MCP applications, the models must support tool use functionality.

Amazon Bedrock serverless (Pixtral Large)

To enable this model, follow these steps:

  1. Go to the Amazon Bedrock console.
  2. From the left navigation pane, select Model access.
  3. Choose Manage model access.
  4. Search for the model using the keyword Pixtral, select it, and choose Next, as shown in the following screenshot. The model will then be ready to use.

This model has cross-Region inference enabled. When using the model ID, always add the Region prefix eu or us before the model ID, such as eu.mistral.pixtral-large-2502-v1:0. Provide this model ID in config.py. You can now test the example with the Gradio web-based app.

Amazon Bedrock interface for managing base model access with Pixtral Large model highlighted
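
To verify access before running the app, you can call the model directly through the Converse API. The following is a minimal sketch, assuming the eu cross-Region profile and an EU Region such as eu-west-1; adjust the prefix and Region to match your setup.

import boto3

client = boto3.client("bedrock-runtime", region_name="eu-west-1")

# Cross-Region inference profile ID for Pixtral Large (use the us prefix for US Regions)
response = client.converse(
    modelId="eu.mistral.pixtral-large-2502-v1:0",
    messages=[{"role": "user", "content": [{"text": "Suggest a restaurant near the Eiffel Tower."}]}],
)
print(response["output"]["message"]["content"][0]["text"])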

Amazon Bedrock Marketplace (Mistral-Small-24B-Instruct-2501)

Amazon Bedrock Marketplace and SageMaker JumpStart deployments are dedicated instances (serverful) and incur charges as long as the instance remains deployed. For more information, refer to Amazon Bedrock pricing and Amazon SageMaker pricing.

To enable this model, follow these steps:

  1. Go to the Amazon Bedrock console
  2. In the left navigation pane, select Model catalog
  3. In the search bar, search for “Mistral-Small-24B-Instruct-2501,” as shown in the following screenshot

Amazon Bedrock UI with model catalog, filters, and Mistral-Small-24B-Instruct-2501 model spotlight

  4. Select the model and select Deploy.
  5. In the configuration page, you can keep all fields as default. This endpoint requires an instance type ml.g6.12xlarge. Check service quotas under the Amazon SageMaker service to make sure you have more than two instances available for endpoint usage (you’ll use another instance for Amazon SageMaker JumpStart deployment). If you don’t have more than two instances, request a quota increase for this instance type. Then choose Deploy. The model deployment might take a few minutes.
  6. When the model is in service, copy the endpoint Amazon Resource Name (ARN), as shown in the following screenshot, and add it to the config.py file in the model_id field. Then you can test the solution with the Gradio web-based app.

The Mistral-Small-24B-Instruct-2501 model doesn’t support image input, so only text-based Q&A is supported.

AWS Bedrock marketplace deployments interface with workflow steps and active Mistral endpoint

Amazon SageMaker JumpStart (Mistral-Small-24B-Instruct-2501)

To enable this model, follow these steps:

  1. Go to the Amazon SageMaker console
  2. Create a domain and user profile
  3. Under the created user profile, launch Studio
  4. In the left navigation pane, select JumpStart, then search for “Mistral”
  5. Select Mistral-Small-24B-Instruct-2501, then choose Deploy

This deployment might take a few minutes. The following screenshot shows that this model is marked as Bedrock ready. This means you can register this model as an Amazon Bedrock Marketplace deployment and use Amazon Bedrock APIs to invoke this Amazon SageMaker endpoint.

Dark-themed SageMaker dashboard displaying Mistral AI models with Bedrock ready status

  6. After the model is in service, copy its endpoint ARN from the Amazon Bedrock Marketplace deployment, as shown in the following screenshot, and provide it to the config.py file in the model_id field. Then you can test the solution with the Gradio web-based app.

The Mistral-Small-24B-Instruct-2501 model doesn’t support image input, so only text-based Q&A is supported.

SageMaker real-time inference endpoint for Mistral small model with AllTraffic variant on ml.g6 instance

Build an MCP application with Mistral models on AWS

The following sections provide detailed insights into building MCP applications from the ground up using a component-level approach. We explore how to implement the three core MCP components (the MCP host, MCP client, and MCP servers), so you have complete control over and a clear understanding of the underlying architecture.

MCP host component

The MCP is designed to facilitate seamless interaction between AI models and external tools, systems, and data sources. In this architecture, the MCP host plays a pivotal role in managing the lifecycle and orchestration of MCP clients and servers, enabling AI applications to access and utilize external resources effectively. The MCP host is responsible for integration with FMs, providing context, capabilities discovery, initialization, and MCP client management. In this solution, we have three files to provide this capability.

The first file is agent.py. The BedrockConverseAgent class in agent.py is the core component that manages communication with Amazon Bedrock and provides the FM integration. The constructor initializes the agent with model settings and sets up the Amazon Bedrock runtime client.

def __init__(self, model_id, region, system_prompt='You are a helpful assistant.'):
    """
    Initialize the Bedrock agent with model configuration.
    
    Args:
        model_id (str): The Bedrock model ID to use
        region (str): AWS region for Bedrock service
        system_prompt (str): System instructions for the model
    """
    self.model_id = model_id
    self.region = region
    self.client = boto3.client('bedrock-runtime', region_name=self.region)
    self.system_prompt = system_prompt
    self.messages = []
    self.tools = None

Then, the agent intelligently handles multimodal inputs with its image processing capabilities. This method validates image URLs provided by the user, downloads images, detects and normalizes image formats, resizes large images to meet API constraints, and converts incompatible formats to JPEG.

async def _fetch_image_from_url(self, image_url):
    # Download image from URL
    # Process and optimize for model compatibility
    # Return binary image data with MIME type
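
The excerpt above only outlines the method. As a rough illustration, a standalone helper that performs the same steps might look like the following sketch; it assumes the aiohttp and Pillow libraries and an arbitrary size cap, none of which are confirmed by the original code.

import io

import aiohttp
from PIL import Image

async def fetch_image_from_url(image_url):
    """Standalone sketch of the image handling described above."""
    # Download the image bytes (aiohttp is an assumption; the original may use a different HTTP client)
    async with aiohttp.ClientSession() as http:
        async with http.get(image_url) as resp:
            resp.raise_for_status()
            data = await resp.read()
    image = Image.open(io.BytesIO(data))
    image_format = (image.format or "JPEG").lower()
    # Shrink very large images; the 1,568-pixel cap is an example value, not a documented limit
    image.thumbnail((1568, 1568))
    # Convert formats the Converse API doesn't accept to JPEG
    if image_format not in ("jpeg", "png", "gif", "webp"):
        image = image.convert("RGB")
        image_format = "jpeg"
    buffer = io.BytesIO()
    image.save(buffer, format=image_format.upper())
    return {"format": image_format, "bytes": buffer.getvalue()}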

When users enter a prompt, the agent detects whether it contains an uploaded image or an image URL and processes it accordingly in the invoke_with_prompt function. This way, users can paste an image URL in their query or upload an image from their local device and have it analyzed by the AI model.

async def invoke_with_prompt(self, prompt):
    # First check for a direct image upload (handling elided in this excerpt)
    # ...
    # Then check whether the prompt contains an image URL
    has_image_url, image_url = self._is_image_url(prompt)
    if has_image_url:
        # Build multimodal content from the image at image_url (elided in this excerpt)
        content = [...]
    else:
        # Standard text-only prompt
        content = [{'text': prompt}]
    return await self.invoke(content)

The most powerful feature is the agent’s ability to use external tools provided by MCP servers. When the model wants to use a tool, the agent detects the tool_use stop reason from Amazon Bedrock and extracts the tool request details, including the tool name and inputs. It then executes the tool through the UtilityHelper and returns the tool results to the model. The MCP host then continues the conversation with the tool results incorporated.

async def _handle_response(self, response):
    # Add the response to the conversation history
    self.messages.append(response['output']['message'])
    # Check the stop reason
    stop_reason = response['stopReason']
    if stop_reason == 'tool_use':
        # Extract tool use details and execute
        tool_response = []
        for content_item in response['output']['message']['content']:
            if 'toolUse' in content_item:
                tool_request = {
                    "toolUseId": content_item['toolUse']['toolUseId'],
                    "name": content_item['toolUse']['name'],
                    "input": content_item['toolUse']['input']
                }
                tool_result = await self.tools.execute_tool(tool_request)
                tool_response.append({'toolResult': tool_result})
        # Continue conversation with tool results
        return await self.invoke(tool_response)

The second file is utility.py. The UtilityHelper class in utility.py serves as a bridge between Amazon Bedrock and external tools. It manages tool registration, formats tool specifications for Amazon Bedrock compatibility, and executes tools.

def register_tool(self, name, func, description, input_schema):
    corrected_name = UtilityHelper._correct_name(name)
    self._name_mapping[corrected_name] = name
    self._tools[corrected_name] = {
        "function": func,
        "description": description,
        "input_schema": input_schema,
        "original_name": name,
    }

For Amazon Bedrock to understand available tools from MCP servers, the utility module generates tool specifications by providing name, description, and inputSchema in the following function:

def get_tools(self):
    tool_specs = []
    for corrected_name, tool in self._tools.items():
        # Ensure the inputSchema.json.type is explicitly set to 'object'
        input_schema = tool["input_schema"].copy()
        if 'json' in input_schema and 'type' not in input_schema['json']:
            input_schema['json']['type'] = 'object'
        tool_specs.append(
            {
                "toolSpec": {
                    "name": corrected_name,
                    "description": tool["description"],
                    "inputSchema": input_schema,
                }
            }
        )
    return {"tools": tool_specs}

When the model requests a tool, the utility module executes it and formats the result:

async def execute_tool(self, payload):
    tool_use_id = payload["toolUseId"]
    corrected_name = payload["name"]
    tool_input = payload["input"]
    # Find and execute the tool
    tool_func = self._tools[corrected_name]["function"]
    original_name = self._tools[corrected_name]["original_name"]
    # Execute the tool
    result_data = await tool_func(original_name, tool_input)
    # Format and return the result
    return {
        "toolUseId": tool_use_id,
        "content": [{"text": str(result_data)}],
    }

The final component in the MCP host is the gradio_app.py file, which implements a web-based interface for our AI assistant using Gradio. First, it initializes the model configuration and the agent, then connects to the MCP servers and retrieves their available tools.

async def initialize_agent():
  """Initialize Bedrock agent and connect to MCP tools"""
  # Initialize model configuration from config.py
  model_id = AWS_CONFIG["model_id"]
  region = AWS_CONFIG["region"]
  # Set up the agent and tool manager
  agent = BedrockConverseAgent(model_id, region)
  agent.tools = UtilityHelper()
  # Define the agent's behavior through system prompt
  agent.system_prompt = """
  You are a helpful assistant that can use tools to help you answer questions and perform tasks.
  Please remember and save user's preferences into memory based on user questions and conversations.
  """
  # Connect to MCP servers and register tools
  # ...
  return agent, mcp_clients, available_tools

When a user sends a message, the app processes it through the agent’s invoke_with_prompt() function. The model’s response is displayed in the Gradio UI:

async def process_message(message, history):
  """Process a message from the user and get a response from the agent"""
  global agent
  if agent is None:
      # First-time initialization
      agent, mcp_clients, available_tools = await initialize_agent()
  try:
      # Process message and get response
      response = await agent.invoke_with_prompt(message)
      # Return the response
      return response
  except Exception as e:
      logger.error(f"Error processing message: {e}")
      return f"I encountered an error: {str(e)}"
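
For completeness, wiring process_message into a chat UI could be as simple as the following sketch. gr.ChatInterface is standard Gradio and accepts async functions with the (message, history) signature used above, but the actual layout in the sample repository (image upload controls, for example) may differ.

import gradio as gr

# Minimal sketch: the sample app's real UI likely exposes more controls than a plain chat box
demo = gr.ChatInterface(
    fn=process_message,  # async (message, history) -> str, as defined above
    title="Mistral MCP assistant",
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)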

MCP client implementation

MCP clients serve as intermediaries between the AI model and the MCP server. Each client maintains a one-to-one session with a server, managing the lifecycle of interactions, including handling interruptions, timeouts, and reconnections. MCP clients route protocol messages bidirectionally between the host application and the server. They parse responses, handle errors, and make sure that the data is relevant and appropriately formatted for the AI model. They also facilitate the invocation of tools exposed by the MCP server and manage the context so that the AI model has access to the necessary resources and tools for its tasks.

The following function in the mcpclient.py file is designed to establish connections to MCP servers and manage connection sessions.

async def connect(self):
  """
  Establishes connection to MCP server.
  Sets up stdio client, initializes read/write streams,
  and creates client session.
  """
  # Initialize stdio client with server parameters
  self._client = stdio_client(self.server_params)
  # Get read/write streams
  self.read, self.write = await self._client.__aenter__()
  # Create and initialize session
  session = ClientSession(self.read, self.write)
  self.session = await session.__aenter__()
  await self.session.initialize()

After it’s connected, each client lists the available tools from its MCP server along with their specifications:

async def get_available_tools(self):
    """List available tools from the MCP server."""
    if not self.session:
        raise RuntimeError("Not connected to MCP server")
    response = await self.session.list_tools()
    # Extract and format tools
    tools = response.tools if hasattr(response, 'tools') else []
    formatted_tools = [
        {
            'name': tool.name,
            'description': str(tool.description),
            'inputSchema': {
                'json': {
                    'type': 'object',
                    'properties': tool.inputSchema.get('properties', {}),
                    'required': tool.inputSchema.get('required', [])
                }
            }
        }
        for tool in tools
    ]
    return formatted_tools

When a tool is called, the client first validates that the session is active, then executes the tool through the MCP session established between the client and server. Finally, it returns the structured response.

async def call_tool(self, tool_name, arguments):
    # Execute tool
    start_time = time.time()
    result = await self.session.call_tool(tool_name, arguments=arguments)
    execution_time = time.time() - start_time
    # Augment result with server info
    return {
        "result": result,
        "tool_info": {
            "tool_name": tool_name,
            "server_name": server_name,
            "server_info": server_info,
            "execution_time": f"{execution_time:.2f}s"
        }
    }

MCP server configuration

The server_configs.py file defines the MCP tool servers that our application connects to. This configuration sets up a Google Maps MCP server with an API key, adds a time server for date and time operations, and includes a memory server for storing conversation context. Each server is defined as a StdioServerParameters object, which specifies how to launch the server process with Node.js (through npx). You can add or remove MCP server configurations based on your application objectives and requirements.

from mcp import StdioServerParameters
SERVER_CONFIGS = [
    StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-google-maps"],
        env={"GOOGLE_MAPS_API_KEY": "<ADD_GOOGLE_API_KEY>"}
    ),
    StdioServerParameters(
        command="npx",
        args=["-y", "time-mcp"],
    ),
    StdioServerParameters(
        command="npx",
        args=["@modelcontextprotocol/server-memory"]
    )
]
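
To show how the host consumes these configurations, the following is a minimal sketch of the connection loop that initialize_agent elides earlier in this post. It only uses the MCPClient and UtilityHelper methods already shown, but the constructor argument and exact wiring are assumptions, so check the sample repository for the authoritative version.

async def connect_mcp_servers(agent):
    """Sketch of the elided loop: connect to each MCP server and register its tools."""
    mcp_clients = []
    available_tools = []
    for server_config in SERVER_CONFIGS:
        # Assumes the MCPClient constructor takes the StdioServerParameters defined above
        client = MCPClient(server_config)
        await client.connect()
        mcp_clients.append(client)
        for tool in await client.get_available_tools():
            # Register each MCP tool with the Bedrock-facing tool manager (UtilityHelper)
            agent.tools.register_tool(
                name=tool["name"],
                func=client.call_tool,
                description=tool["description"],
                input_schema=tool["inputSchema"],
            )
            available_tools.append(tool)
    return mcp_clients, available_tools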

Alternative implementation: Strands Agents framework

For developers seeking a more streamlined approach to building MCP-powered applications, the Strands Agents framework provides an alternative that significantly reduces implementation complexity while maintaining full MCP compatibility. This section demonstrates how the same functionality can be achieved with substantially less code using Strands Agents. The code sample is available in this Git repository.

First, initialize the model and provide the Mistral model ID on Amazon Bedrock.

from strands import Agent
from strands.tools.mcp import MCPClient
from strands.models import BedrockModel
# Initialize the Bedrock model
bedrock_model = BedrockModel(
    model_id="us.mistral.pixtral-large-2502-v1:0",
    streaming=False
)

The following code creates multiple MCP clients from server configurations, automatically manages their lifecycle using context managers, collects available tools from each client, and initializes an AI agent with the unified set of tools.

from contextlib import ExitStack
from mcp import stdio_client
# Create MCP clients with automatic lifecycle management
mcp_clients = [
    MCPClient(lambda cfg=server_config: stdio_client(cfg))
    for server_config in SERVER_CONFIGS
]
with ExitStack() as stack:
    # Enter all MCP clients automatically
    for mcp_client in mcp_clients:
        stack.enter_context(mcp_client)
    
    # Aggregate tools from all clients
    tools = []
    for i, mcp_client in enumerate(mcp_clients):
        client_tools = mcp_client.list_tools_sync()
        tools.extend(client_tools)
        logger.info(f"Loaded {len(client_tools)} tools from client {i+1}")
    
    # Create agent with unified tool registry
    agent = Agent(model=bedrock_model, tools=tools, system_prompt=system_prompt)

The following function processes user messages with optional image inputs by formatting them for multimodal AI interaction, sending them to an agent that handles tool routing and response generation, and returning the agent’s text response:

def process_message(message, image=None):
    """Process user message with optional image input"""
    try:
        if image is not None:
            # Convert PIL image to Bedrock format
            image_data = convert_image_to_bytes(image)
            if image_data:
                # Create multimodal message structure
                multimodal_message = {
                    "role": "user",
                    "content": [
                        {
                            "image": {
                                "format": image_data['format'],
                                "source": {"bytes": image_data['bytes']}
                            }
                        },
                        {
                            "text": message if message.strip() else "Please analyze the content of the image."
                        }
                    ]
                }
                agent.messages.append(multimodal_message)
        
        # Single call handles tool routing and response generation
        response = agent(message)
        
        # Extract response content
        return response.text if hasattr(response, 'text') else str(response)
        
    except Exception as e:
        return f"Error: {str(e)}"

The Strands Agents approach streamlines MCP integration by reducing code complexity, automating resource management, and unifying tools from multiple servers into a single interface. It also offers built-in error handling and native multimodal support, minimizing manual effort and enabling more robust, efficient development.

Demo

This demo showcases an intelligent food recognition application with integrated location services. Users can submit an image of a dish, and the AI assistant:

    1. Accurately identifies the cuisine from the image
    2. Provides restaurant recommendations based on the identified food
    3. Offers route planning powered by the Google Maps MCP server

The application demonstrates sophisticated multi-server collaboration to answer complex queries such as “Is the restaurant open when I arrive?” To answer this, the system:

  1. Determines the current time in the user’s location using the time MCP server
  2. Retrieves restaurant operating hours and calculates travel time using the Google Maps MCP server
  3. Synthesizes this information to provide a clear, accurate response

We encourage you to modify the solution by adding additional MCP server configurations tailored to your specific personal or business requirements.

MCP application demo

Clean up

When you finish experimenting with this example, delete the SageMaker endpoints that you created in the process:

  1. Go to the Amazon SageMaker console
  2. In the left navigation pane, choose Inference, and then choose Endpoints
  3. From the endpoints list, delete the endpoints that you created from Amazon Bedrock Marketplace and SageMaker JumpStart
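
If you prefer to clean up programmatically, a boto3 sketch like the following also works. The endpoint names are placeholders for the ones you created, and deleting the endpoint configuration afterward is optional housekeeping.

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")  # adjust the Region as needed

# Replace the placeholders with the endpoint names you deployed
for endpoint_name in ["<marketplace-endpoint-name>", "<jumpstart-endpoint-name>"]:
    endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    sagemaker.delete_endpoint(EndpointName=endpoint_name)
    # Optionally remove the associated endpoint configuration as well
    sagemaker.delete_endpoint_config(EndpointConfigName=endpoint["EndpointConfigName"])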

Conclusion

This post covers how integrating MCP with Mistral AI models on AWS enables the rapid development of intelligent applications that interact seamlessly with external systems. By standardizing tool use, developers can focus on core logic while keeping AI reasoning and tool execution cleanly separated, improving maintainability and scalability. The Strands Agents framework enhances this by streamlining implementation without sacrificing MCP compatibility. With AWS offering flexible deployment options, from Amazon Bedrock to Amazon Bedrock Marketplace and SageMaker, this approach balances performance and cost. The solution demonstrates how even lightweight setups can connect AI to real-time services.

We encourage developers to build upon this foundation by incorporating additional MCP servers tailored to their specific requirements. As the landscape of MCP-compatible tools continues to expand, organizations can create increasingly sophisticated AI assistants that effectively reason over external knowledge and take meaningful actions, accelerating the adoption of practical, agentic AI systems across industries while reducing implementation barriers.

Ready to implement MCP in your own projects? Explore the official AWS MCP server repository for examples and reference implementations. For more information about the Strands Agents framework, which simplifies agent building with its intuitive, code-first approach to data source integration, visit Strands Agents. Finally, dive deeper into open protocols for agent interoperability in the recent AWS blog post: Open Protocols for Agent Interoperability, which explores how these technologies are shaping the future of AI agent development.


About the authors

Ying Hou, PhD, is a Sr. Specialist Solution Architect for Gen AI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in Gen AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications.

Siddhant Waghjale is an Applied AI Engineer at Mistral AI, where he works on challenging customer use cases and applied science, helping customers achieve their goals with Mistral models. He’s passionate about building solutions that bridge AI capabilities with actual business applications, specifically in agentic workflows and code generation.

Samuel Barry is an Applied AI Engineer at Mistral AI, where he helps organizations design, deploy, and scale cutting-edge AI systems. He partners with customers to deliver high-impact solutions across a range of use cases, including RAG, agentic workflows, fine-tuning, and model distillation. Alongside engineering efforts, he also contributes to applied research initiatives that inform and strengthen production use cases.

Preston Tuggle is a Sr. Specialist Solutions Architect with the Third-Party Model Provider team at AWS. He focuses on working with model providers across Amazon Bedrock and Amazon SageMaker, helping them accelerate their go-to-market strategies through technical scaling initiatives and customer engagement.

Read More

Build real-time conversational AI experiences using Amazon Nova Sonic and LiveKit

Build real-time conversational AI experiences using Amazon Nova Sonic and LiveKit

The rapid growth of generative AI technology has been a catalyst for business productivity growth, creating new opportunities for greater efficiency, enhanced customer service experiences, and more successful customer outcomes. Today’s generative AI advances are helping existing technologies achieve their long-promised potential. For example, voice-first applications have been gaining traction across industries for years—from customer service to education to personal voice assistants and agents. But early versions of this technology struggled to interpret human speech or mimic real conversation. Building real-time, natural-sounding, low-latency voice AI has until recently remained complex, especially when working with streaming infrastructure and speech foundation models (FMs).

The rapid progress of conversational AI technology has led to the development of powerful models that address the historical challenges of traditional voice-first applications. Amazon Nova Sonic is a state-of-the-art speech-to-speech FM designed to build real-time conversational AI applications in Amazon Bedrock. This model offers industry-leading price-performance and low latency. The Amazon Nova Sonic architecture unifies speech understanding and generation into a single model, to enable real, human-like voice conversations in AI applications.

Amazon Nova Sonic accommodates the breadth and richness of human language. It can understand speech in different speaking styles and generate speech in expressive voices, including both masculine-sounding and feminine-sounding voices. Amazon Nova Sonic can also adapt the patterns of stress, intonation, and style of the generated speech response to align with the context and content of the speech input. Additionally, Amazon Nova Sonic supports function calling and knowledge grounding with enterprise data using Retrieval-Augmented Generation (RAG). To further simplify the process of getting the most from this technology, Amazon Nova Sonic is now integrated with LiveKit’s WebRTC framework, a widely used platform that enables developers to build real-time audio, video, and data communication applications. This integration makes it possible for developers to build conversational voice interfaces without needing to manage complex audio pipelines or signaling protocols. In this post, we explain how this integration works, how it addresses the historical challenges of voice-first applications, and some initial steps to start using this solution.

Solution overview

LiveKit is a popular open source WebRTC platform that provides scalable, multi‑user real‑time video, audio, and data communication. Designed as a full-stack solution, it offers a Selective Forwarding Unit (SFU) architecture; modern client SDKs across web, mobile, and server environments; and built‑in features such as speaker detection, bandwidth optimization, simulcast support, and seamless room management. You can deploy it as a self-hosted system or on AWS, so developers can focus on application logic without managing the underlying media infrastructure.

Building real-time, voice-first AI applications requires developers to manage multiple layers of infrastructure—from handling audio capture and streaming protocols to coordinating signaling, routing, and event-driven state management. Working with bidirectional streaming models such as Amazon Nova Sonic often meant setting up custom pipelines, managing audio buffers, and working to maintain low-latency performance across diverse client environments. These tasks added development overhead and required specialized knowledge in networking and real-time systems, making it difficult to quickly prototype or scale production-ready voice AI solutions. To address this complexity, we implemented a real-time plugin for Amazon Nova Sonic in the LiveKit Agent SDK. This solution removes the need for developers to manage audio signaling, streaming protocols, or custom transport layers. LiveKit handles real-time audio routing and session management, and Amazon Nova Sonic powers speech understanding and generation. Together, LiveKit and Amazon Nova Sonic provide a streamlined, production-ready setup for building voice-first AI applications. Features such as full-duplex audio, voice activity detection, and noise suppression are available out of the box, so developers can focus on application logic rather than infrastructure orchestration.

The following video shows Amazon Nova Sonic and LiveKit in action. You can find the code for this example in the LiveKit Examples GitHub repo.

The following diagram illustrates the solution architecture of Amazon Nova Sonic deployed as a voice agent in the LiveKit framework on AWS.

Diagram illustrates the solution architecture of Amazon Nova Sonic

Prerequisites

To implement the solution, you must have the following prerequisites:

  • Python version 3.12 or higher
  • An AWS account with appropriate Identity and Access Management (IAM) permissions for Amazon Bedrock
  • Access to Amazon Nova Sonic on Amazon Bedrock
  • A web browser (such as Google Chrome or Mozilla Firefox) with WebRTC support

Deploy the solution

Complete the following steps to get started talking to Amazon Nova Sonic through LiveKit:

  1. Install the necessary dependencies:
brew install livekit livekit-cli
curl -LsSf https://astral.sh/uv/install.sh | sh

uv is a fast, drop-in replacement for pip, used in the LiveKit Agents SDK (you can also choose to use pip).

  2. Set up a new local virtual environment:
uv init sonic_demo
cd sonic_demo
uv venv --python 3.12
uv add livekit-agents python-dotenv 'livekit-plugins-aws[realtime]'

  3. To run the LiveKit server locally, open a new terminal (for example, a new UNIX process) and run the following command:
livekit-server --dev

You must keep the LiveKit server running for the entire duration that the Amazon Nova Sonic agent is running, because it’s responsible for proxying data between parties.

  4. Generate an access token using the following code. The default values for api-key and api-secret are devkey and secret, respectively. When creating an access token for permission to join a LiveKit room, you must specify the room name and user identity.
lk token create \
  --api-key devkey --api-secret secret \
  --join --room my-first-room --identity user1 \
  --valid-for 24h

  5. Create environment variables. You must specify the AWS credentials:
vim .env

# contents of the .env file
AWS_ACCESS_KEY_ID=<aws access key id>
AWS_SECRET_ACCESS_KEY=<aws secret access key>

# if using a permanent identity (e.g. IAM user)
# then session token is optional
AWS_SESSION_TOKEN=<aws session token>
LIVEKIT_API_KEY=devkey
LIVEKIT_API_SECRET=secret

  6. Create the main.py file:
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentSession, Agent, AutoSubscribe
from livekit.plugins.aws.experimental.realtime import RealtimeModel

load_dotenv()

async def entrypoint(ctx: agents.JobContext):
    # Connect to the LiveKit server
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    
    # Initialize the Amazon Nova Sonic agent
    agent = Agent(instructions="You are a helpful voice AI assistant.")
    session = AgentSession(llm=RealtimeModel())
    
    # Start the session in the specified room
    await session.start(
        room=ctx.room,
        agent=agent,
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

  7. Run the main.py file:
uv run python main.py connect --room my-first-room

Now you’re ready to connect to the agent frontend.

  8. Go to https://agents-playground.livekit.io/.
  9. Choose Manual.
  10. In the first text field, enter ws://localhost:7880.
  11. In the second text field, enter the access token you generated.
  12. Choose Connect.

You should now be able to talk to Amazon Nova Sonic in real time.

If you’re disconnected from the LiveKit room, you will have to restart the agent process (main.py) to talk to Amazon Nova Sonic again.

Clean up

This example runs locally, so there are no special teardown steps required for cleanup. You can simply exit the agent and LiveKit server processes. The only costs incurred are from the calls made to Amazon Bedrock to talk to Amazon Nova Sonic. After you disconnect from the LiveKit room, you will no longer incur charges and no AWS resources remain in use.

Conclusion

Thanks to generative AI, the qualitative benefits long promised by voice-first applications can now be realized. By combining Amazon Nova Sonic with LiveKit’s WebRTC infrastructure, developers can build real-time, voice-first AI applications with less complexity and faster deployment. The integration reduces the need for custom audio pipelines, so teams can focus on building engaging conversational experiences.

“Our goal with this integration is to simplify the development of real-time voice applications,” said Josh Wulf, CEO of LiveKit. “By combining LiveKit’s robust media routing and session management with Nova Sonic’s speech capabilities, we’re helping developers move faster—no need to manage low-level infrastructure, so they can focus on building the conversation.”

To learn more about Amazon Nova Sonic, read the AWS News Blog, Amazon Nova Sonic product page, and Amazon Nova Sonic User Guide. To get started with Amazon Nova Sonic in Amazon Bedrock, visit the Amazon Bedrock console.


About the authors

Glen Ko is an AI developer at AWS Bedrock, where his focus is on enabling the proliferation of open source AI tooling and supporting open source innovation.

Anuj Jauhari is a Senior Product Marketing Manager at Amazon Web Services, where he helps customers realize value from innovations in generative AI.

Osman Ipek is a Solutions Architect on Amazon’s AGI team focusing on Nova foundation models. He guides teams to accelerate development through practical AI implementation strategies, with expertise spanning voice AI, NLP, and MLOps.

Read More

AWS AI infrastructure with NVIDIA Blackwell: Two powerful compute solutions for the next frontier of AI

AWS AI infrastructure with NVIDIA Blackwell: Two powerful compute solutions for the next frontier of AI

Imagine a system that can explore multiple approaches to complex problems, drawing on its understanding of vast amounts of data, from scientific datasets to source code to business documents, and reasoning through the possibilities in real time. This lightning-fast reasoning isn’t waiting on the horizon. It’s happening today in our customers’ AI production environments. The scale of the AI systems that our customers are building today—across drug discovery, enterprise search, software development, and more—is truly remarkable. And there’s much more ahead.

To accelerate innovation across emerging generative AI developments such as reasoning models and agentic AI systems, we’re excited to announce general availability of P6e-GB200 UltraServers, accelerated by NVIDIA Grace Blackwell Superchips. P6e-GB200 UltraServers are designed for training and deploying the largest, most sophisticated AI models. Earlier this year, we launched P6-B200 instances, accelerated by NVIDIA Blackwell GPUs, for diverse AI and high-performance computing workloads.

In this post, we share how these powerful compute solutions build on everything we’ve learned about delivering secure, reliable GPU infrastructure at a massive scale, so that customers can confidently push the boundaries of AI.

Meeting the expanding compute demands of AI workloads

P6e-GB200 UltraServers represent our most powerful GPU offering to date, featuring up to 72 NVIDIA Blackwell GPUs interconnected using fifth-generation NVIDIA NVLink—all functioning as a single compute unit. Each UltraServer delivers a massive 360 petaflops of dense FP8 compute and 13.4 TB of total high bandwidth GPU memory (HBM3e)—which is over 20 times the compute and over 11 times the memory in a single NVLink domain compared to P5en instances. P6e-GB200 UltraServers support up to 28.8 Tbps aggregate bandwidth of fourth-generation Elastic Fabric Adapter (EFAv4) networking.

P6-B200 instances are a versatile option for a broad range of AI use cases. Each instance provides 8 NVIDIA Blackwell GPUs interconnected using NVLink with 1.4 TB of high bandwidth GPU memory, up to 3.2 Tbps of EFAv4 networking, and fifth-generation Intel Xeon Scalable processors. P6-B200 instances offer up to 2.25 times the GPU TFLOPs, 1.27 times the GPU memory size, and 1.6 times the GPU memory bandwidth compared to P5en instances.

How do you choose between P6e-GB200 and P6-B200? This choice comes down to your specific workload requirements and architectural needs:

  • P6e-GB200 UltraServers are ideal for the most compute and memory intensive AI workloads, such as training and deploying frontier models at the trillion-parameter scale. Their NVIDIA GB200 NVL72 architecture really shines at this scale. Imagine all 72 GPUs working as one, with a unified memory space and coordinated workload distribution. This architecture enables more efficient distributed training by reducing communication overhead between GPU nodes. For inference workloads, the ability to fully contain trillion-parameter models within a single NVLink domain means faster, more consistent response times at scale. When combined with optimization techniques such as disaggregated serving with NVIDIA Dynamo, the large domain size of GB200 NVL72 architecture unlocks significant inference efficiencies for various model architectures such as mixture of experts models. GB200 NVL72 is particularly powerful when you need to handle extra-large context windows or run high-concurrency applications in real time.
  • P6-B200 instances support a broad range of AI workloads and are an ideal option for medium to large-scale training and inference workloads. If you want to port your existing GPU workloads, P6-B200 instances offer a familiar 8-GPU configuration that minimizes code changes and simplifies migration from current generation instances. Additionally, although NVIDIA’s AI software stack is optimized for both Arm and x86, if your workloads are specifically built for x86 environments, P6-B200 instances, with their Intel Xeon processors, will be your ideal choice.

Innovation built on AWS core strengths

Bringing NVIDIA Blackwell to AWS isn’t about a single breakthrough—it’s about continuous innovation across multiple layers of infrastructure. By building on years of learning and innovation across compute, networking, operations, and managed services, we’ve brought NVIDIA Blackwell’s full capabilities with the reliability and performance customers expect from AWS.

Robust instance security and stability

When customers tell me why they choose to run their GPU workloads on AWS, one crucial point comes up consistently: they highly value our focus on instance security and stability in the cloud. The specialized hardware, software, and firmware of the AWS Nitro System are designed to enforce restrictions so that nobody, including anyone in AWS, can access your sensitive AI workloads and data. Beyond security, the Nitro System fundamentally changes how we maintain and optimize infrastructure. The Nitro System, which handles networking, storage, and other I/O functions, makes it possible to deploy firmware updates, bug fixes, and optimizations while it remains operational. This ability to update without system downtime, which we call live update, is crucial in today’s AI landscape, where any interruption significantly impacts production timelines. P6e-GB200 and P6-B200 both feature the sixth generation of the Nitro System, but these security and stability benefits aren’t new—our innovative Nitro architecture has been protecting and optimizing Amazon Elastic Compute Cloud (Amazon EC2) workloads since 2017.

Reliable performance at massive scale

In AI infrastructure, the challenge isn’t just reaching massive scale—it’s delivering consistent performance and reliability at that scale. We’ve deployed P6e-GB200 UltraServers in third-generation EC2 UltraClusters, which creates a single fabric that can encompass our largest data centers. Third-generation UltraClusters cut power consumption by up to 40% and reduce cabling requirements by more than 80%—not only improving efficiency, but also significantly reducing potential points of failure.

To deliver consistent performance at this massive scale, we use Elastic Fabric Adapter (EFA) with its Scalable Reliable Datagram protocol, which intelligently routes traffic across multiple network paths to maintain smooth operation even during congestion or failures. We’ve continuously improved EFA’s performance across four generations. P6e-GB200 and P6-B200 instances with EFAv4 show up to 18% faster collective communications in distributed training compared to P5en instances that use EFAv3.

Infrastructure efficiency

Whereas P6-B200 instances use our proven air-cooling infrastructure, P6e-GB200 UltraServers use liquid cooling, which enables higher compute density in large NVLink domain architectures, delivering higher system performance. P6e-GB200 UltraServers are liquid cooled with novel mechanical cooling solutions that provide configurable liquid-to-chip cooling in both new and existing data centers, so we can support both liquid-cooled accelerators and air-cooled network and storage infrastructure in the same facility. With this flexible cooling design, we can deliver maximum performance and efficiency at the lowest cost.

Getting started with NVIDIA Blackwell on AWS

We’ve made it simple to get started with P6e-GB200 UltraServers and P6-B200 instances through multiple deployment paths, so you can quickly begin using Blackwell GPUs while maintaining the operational model that works best for your organization.

Amazon SageMaker HyperPod

If you’re accelerating your AI development and want to spend less time managing infrastructure and cluster operations, that’s exactly where Amazon SageMaker HyperPod excels. It provides managed, resilient infrastructure that automatically handles provisioning and management of large GPU clusters. We keep enhancing SageMaker HyperPod, adding innovations like flexible training plans to help you gain predictable training timelines and run training workloads within your budget requirements.

SageMaker HyperPod will support both P6e-GB200 UltraServers and P6-B200 instances, with optimizations to maximize performance by keeping workloads within the same NVLink domain. We’re also building in a comprehensive, multi-layered recovery system: SageMaker HyperPod will automatically replace faulty instances with preconfigured spares in the same NVLink domain. Built-in dashboards will give you visibility into everything from GPU utilization and memory usage to workload metrics and UltraServer health status.

Amazon EKS

For large-scale AI workloads, if you prefer to manage your infrastructure using Kubernetes, Amazon Elastic Kubernetes Service (Amazon EKS) is often the control plane of choice. We continue to drive innovations in Amazon EKS with capabilities like Amazon EKS Hybrid Nodes, which enable you to manage both on-premises and EC2 GPUs in a single cluster—delivering flexibility for AI workloads.

Amazon EKS will support both P6e-GB200 UltraServers and P6-B200 instances with automated provisioning and lifecycle management through managed node groups. For P6e-GB200 UltraServers, we’re building in topology awareness that understands the GB200 NVL72 architecture, automatically labeling nodes with their UltraServer ID and network topology information to enable optimal workload placement. You will be able to span node groups across multiple UltraServers or dedicate them to individual UltraServers, giving you flexibility in organizing your training infrastructure. Amazon EKS monitors GPU and accelerator errors and relays them to the Kubernetes control plane for optional remediation.

NVIDIA DGX Cloud on AWS

P6e-GB200 UltraServers will also be available through NVIDIA DGX Cloud. DGX Cloud is a unified AI platform optimized at every layer with multi-node AI training and inference capabilities and NVIDIA’s complete AI software stack. You benefit from NVIDIA’s latest optimizations, benchmarking recipes, and technical expertise to improve efficiency and performance. It offers flexible term lengths along with comprehensive NVIDIA expert support and services to help you accelerate your AI initiatives.

This launch announcement is an important milestone, and it’s just the beginning. As AI capabilities evolve rapidly, you need infrastructure built not just for today’s demands but for all the possibilities that lie ahead. With innovations across compute, networking, operations, and managed services, P6e-GB200 UltraServers and P6-B200 instances are ready to enable these possibilities. We can’t wait to see what you will build with them.

Resources


About the author

David Brown is the Vice President of AWS Compute and Machine Learning (ML) Services. In this role he is responsible for building all AWS Compute and ML services, including Amazon EC2, Amazon Container Services, AWS Lambda, Amazon Bedrock and Amazon SageMaker. These services are used by all AWS customers but also underpin most of AWS’s internal Amazon applications. He also leads newer solutions, such as AWS Outposts, that bring AWS services into customers’ private data centers.

David joined AWS in 2007 as a Software Development Engineer based in Cape Town, South Africa, where he worked on the early development of Amazon EC2. In 2012, he relocated to Seattle and continued to work in the broader Amazon EC2 organization. Over the last 11 years, he has taken on larger leadership roles as more of the AWS compute and ML products have become part of his organization.

Prior to joining Amazon, David worked as a Software Developer at a financial industry startup. He holds a Computer Science & Economics degree from the Nelson Mandela University in Port Elizabeth, South Africa.

Read More

Unlock retail intelligence by transforming data into actionable insights using generative AI with Amazon Q Business

Unlock retail intelligence by transforming data into actionable insights using generative AI with Amazon Q Business

Businesses often face challenges in managing and deriving value from their data. According to McKinsey, 78% of organizations now use AI in at least one business function (as of 2024), showing the growing importance of AI solutions in business. Additionally, 21% of organizations using generative AI have fundamentally redesigned their workflows, showing how AI is transforming business operations.

Gartner identifies AI-powered analytics and reporting as a core investment area for retail organizations, with most large retailers expected to deploy or scale such solutions within the next 12–18 months. The retail sector’s data complexity demands sophisticated solutions that can integrate seamlessly with existing systems. Amazon Q Business offers features that can be tailored to meet specific business needs, including integration capabilities with popular retail management systems, point-of-sale systems, inventory management software, and ecommerce systems. Through advanced AI algorithms, the system analyzes historical data and current trends, helping businesses prepare effectively for seasonal fluctuations in demand and make data-driven decisions.

Amazon Q Business for Retail Intelligence is an AI-powered assistant designed to help retail businesses streamline operations, improve customer service, and enhance decision-making processes. This solution is specifically engineered to be scalable and adaptable to businesses of various sizes, helping them compete more effectively. In this post, we show how you can use Amazon Q Business for Retail Intelligence to transform your data into actionable insights.

Solution overview

Amazon Q Business for Retail Intelligence is a comprehensive solution that transforms how retailers interact with their data using generative AI. The solution architecture combines the powerful generative AI capabilities of Amazon Q Business and Amazon QuickSight visualizations to deliver actionable insights across the entire retail value chain. Our solution also uses Amazon Q Apps so retail personas and users can create custom AI-powered applications to streamline day-to-day tasks and automate workflows and business processes.

The following diagram illustrates the solution architecture.

SolutionArchitecture

The solution uses the AWS architecture above to deliver a secure, high-performance, and reliable solution for retail intelligence. Amazon Q Business serves as the primary generative AI engine, enabling natural language interactions and powering custom retail-specific applications. The architecture incorporates AWS IAM Identity Center for robust authentication and access control, and Amazon Simple Storage Service (Amazon S3) provides secure data lake storage for retail data sources. We use QuickSight for interactive visualizations, enhancing data interpretation. The solution’s flexibility is further enhanced by AWS Lambda for serverless processing, Amazon API Gateway for efficient endpoint management, and Amazon CloudFront for optimized content delivery. This solution uses the Amazon Q Business custom plugin to call the API endpoints to start the automated workflows directly from the Amazon Q Business web application interface based on customer queries and interactions.

This setup implements a three-tier architecture: a data integration layer that securely ingests data from multiple retail sources, a processing layer where Amazon Q Business analyzes queries and generates insights, and a presentation layer that delivers personalized, role-based insights through a unified interface.

We have provided an AWS CloudFormation template, sample datasets, and scripts that you can use to set up the environment for this demonstration.

In the following sections, we dive deeper on how this solution works.

Deployment

We have provided the Amazon Q Business for Retail Intelligence solution as open source—you can use it as a starting point for your own solution and help us make it better by contributing fixes and features through GitHub pull requests. Visit the GitHub repository to explore the code, choose Watch to be notified of new releases, and check the README for the latest documentation updates.

After you set up the environment, you can access the Amazon Q Business for Retail Intelligence dashboard, as shown in the following screenshot.

RetailItelligenceDashboard

You can interact with the QuickSight visualizations and Amazon Q Business chat interface to ask questions using natural language.

Key features and capabilities

Retail users can interact with this solution in many ways. In this section, we explore the key features.

For C-suite executives and senior leaders who want to know how the business is performing, the solution provides a single pane of glass that makes it straightforward to access and interact with the enterprise’s qualitative and quantitative data using natural language. For example, users can analyze quantitative data like product sales or marketing campaign performance through the interactive visualizations powered by QuickSight, and qualitative data like customer feedback through Amazon Q Business, all from a single interface.

Consider that you are a marketing analyst who wants to evaluate campaign performance and reach across channels and analyze ad spend vs. revenue. With Amazon Q Business, you can run complex queries as natural language questions and share the Q Apps with multiple teams. The solution provides automated insights about customer behavior and campaign effectiveness, helping marketing teams make faster decisions and quick adjustments to maximize ROI.

marketingCampaignInfo

Similarly, assume you are a merchandising planner or a vendor manager and you want to understand the impact of cost-prohibitive events on your international business that imports and exports goods and services. You can add inputs to Amazon Q Apps and get responses based on that specific product or product family.

AlternativeProducts

Users can also send requests through APIs using Amazon Q Business custom plugins for real-time interactions with downstream applications. For example, a store manager might want to know which items in the current inventory they need to replenish or rebalance for the next week based on weather predictions or local sporting events.

To learn more, refer to the following complete demo.

For this post, we haven’t used the generative business intelligence (BI) capabilities of Amazon Q with our QuickSight visualizations. To learn more, see Amazon Q in QuickSight.

Empowering retail personas with AI-driven intelligence

Amazon Q Business for Retail Intelligence transforms how retailers handle their data challenges through a generative AI-powered assistant. This solution integrates seamlessly with existing systems, using Retrieval Augmented Generation (RAG) to unify disparate data sources and deliver actionable insights in real time. The following are some of the key benefits for various roles:

  • C-Suite executives – Access comprehensive real-time dashboards for company-wide metrics and KPIs while using AI-driven recommendations for strategic decisions. Use predictive analytics to anticipate consumer shifts and enable proactive strategy adjustments for business growth.
  • Merchandisers – Gain immediate insights into sales trends, profit margins, and inventory turnover rates through automated analysis tools and AI-powered pricing strategies. Identify and capitalize on emerging trends through predictive analytics for optimal product mix and category management.
  • Inventory managers – Implement data-driven stock level optimization across multiple store locations while streamlining operations with automated reorder point calculations. Accurately predict and prepare for seasonal demand fluctuations to maintain optimal inventory levels during peak periods.
  • Store managers – Maximize operational efficiency through AI-predicted staffing optimization while accessing detailed insights about local conditions affecting store performance. Compare store metrics against other locations using sophisticated benchmarking tools to identify improvement opportunities.
  • Marketing analysts – Monitor and analyze marketing campaign effectiveness across channels in real time while developing sophisticated customer segments using AI-driven analysis. Calculate and optimize marketing ROI across channels for efficient budget allocation and improved campaign performance.

Amazon Q Business for Retail Intelligence makes complex data analysis accessible to different users through its natural language interface. This solution enables data-driven decision-making across organizations by providing role-specific insights that break down traditional data silos. By providing each retail persona with tailored analytics and actionable recommendations, organizations can achieve greater operational efficiency and maintain a competitive edge in the dynamic retail landscape.

Conclusion

Amazon Q Business for Retail Intelligence combines generative AI capabilities with powerful visualization tools to revolutionize retail operations. By enabling natural language interactions with complex data systems, this solution democratizes data access across organizational levels, from C-suite executives to store managers. The system’s ability to provide role-specific insights, automate workflows, and facilitate real-time decision-making positions it as a crucial tool for retail businesses seeking to maintain competitiveness in today’s dynamic landscape. As retailers continue to embrace AI-driven solutions, Amazon Q Business for Retail Intelligence can help meet the industry’s growing needs for sophisticated data analysis and operational efficiency.

To learn more about our solutions and offerings, refer to Amazon Q Business and Generative AI on AWS. For expert assistance, AWS Professional Services, AWS Generative AI partner solutions, and AWS Generative AI Competency Partners are here to help.


About the authors

Suprakash Dutta is a Senior Solutions Architect at Amazon Web Services, leading strategic cloud transformations for Fortune 500 retailers and large enterprises. He specializes in architecting mission-critical retail solutions that drive significant business outcomes, including cloud-native based systems, generative AI implementations, and retail modernization initiatives. He’s a multi-cloud certified architect and has delivered transformative solutions that modernized operations across thousands of retail locations while driving breakthrough efficiencies through AI-powered retail intelligence solutions.

Alberto Alonso is a Specialist Solutions Architect at Amazon Web Services. He focuses on generative AI and how it can be applied to business challenges.

Abhijit Dutta is a Sr. Solutions Architect in the Retail/CPG vertical at AWS, focusing on key areas like migration and modernization of legacy applications, data-driven decision-making, and implementing AI/ML capabilities. His expertise lies in helping organizations use cloud technologies for their digital transformation initiatives, with particular emphasis on analytics and generative AI solutions.

Ramesh Venkataraman is a Solutions Architect who enjoys working with customers to solve their technical challenges using AWS services. Outside of work, Ramesh enjoys following stack overflow questions and answers them in any way he can.

Girish Nazhiyath is a Sr. Solutions Architect in the Amazon Web Services Retail/CPG vertical. He enjoys working with retail/CPG customers to enable technology-driven retail innovation, with over 20 years of expertise in multiple retail segments and domains worldwide.

Krishnan Hariharan is a Sr. Manager, Solutions Architecture at AWS based out of Chicago. In his current role, he uses his diverse blend of customer, product, technology, and operations skills to help retail/CPG customers build the best solutions using AWS. Prior to AWS, Krishnan was President/CEO at Kespry, and COO at LightGuide. He has an MBA from The Fuqua School of Business, Duke University and a Bachelor of Science in Electronics from Delhi University.

Read More

Democratize data for timely decisions with text-to-SQL at Parcel Perform

Democratize data for timely decisions with text-to-SQL at Parcel Perform

This post was co-written with Le Vy from Parcel Perform.

Access to accurate data is often the true differentiator of excellent and timely decisions. This is even more crucial for customer-facing decisions and actions. A correctly implemented state-of-the-art AI can help your organization simplify access to data for accurate and timely decision-making for the customer-facing business team, while reducing the undifferentiated heavy lifting done by your data team. In this post, we share how Parcel Perform, a leading AI Delivery Experience Platform for e-commerce businesses worldwide, implemented such a solution.

Accurate post-purchase deliveries tracking can be crucial for many ecommerce merchants. Parcel Perform provides an AI-driven, intelligent end-to-end data and delivery experience and software as a service (SaaS) system for ecommerce merchants. The system uses AWS services and state-of-the-art AI to process hundreds of millions of daily parcel delivery movement data and provide a unified tracking capability across couriers for the merchants, with emphasis on accuracy and simplicity.

The business team in Parcel Perform often needs access to data to answer questions related to merchants’ parcel deliveries, such as “Did we see a spike in delivery delays last week? If so, in which transit facilities were this observed, and what was the primary cause of the issue?” Previously, the data team had to manually form the query and run it to fetch the data. With the new generative AI-powered text-to-SQL capability in Parcel Perform, the business team can self-serve their data needs by using an AI assistant interface. In this post, we discuss how Parcel Perform incorporated generative AI, data storage, and data access through AWS services to make timely decisions.

Data analytics architecture

The solution starts with data ingestion, storage, and access. Parcel Perform adopted the data analytics architecture shown in the following diagram.

Architecture diagram of the parcel event data ingestion at Parcel Perform

One key data type in the Parcel Perform parcel monitoring application is the parcel event data, which can reach billions of rows. This includes the parcel’s shipment status change, location change, and much more. This day-to-day data from multiple business units lands in relational databases hosted on Amazon Relational Database Service (Amazon RDS).

Although relational databases are suitable for rapid data ingestion and consumption from the application, a separate analytics stack is needed to handle analytics in a scalable and performant way without disrupting the main application. These analytics needs include answering aggregation queries from questions like “How many parcels were delayed last week?”

Parcel Perform uses Amazon Simple Storage Service (Amazon S3) with a query engine provided by Amazon Athena to meet their analytics needs. With this approach, Parcel Perform benefits from cost-effective storage while still being able to run SQL queries as needed on the data through Athena, which is priced on usage.
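For illustration, the following minimal sketch (the table, columns, region, and S3 locations are hypothetical) runs such an aggregation query through Athena with boto3 and polls for the result:

import time
import boto3

# Hypothetical database, table, and output location for illustration only
DATABASE = "parcel_analytics"
OUTPUT_LOCATION = "s3://example-athena-results/queries/"

athena = boto3.client("athena", region_name="ap-southeast-1")

query = """
SELECT courier, COUNT(*) AS delayed_parcels
FROM parcel_events
WHERE event_type = 'DELAYED'
  AND event_date >= date_add('day', -7, current_date)
GROUP BY courier
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # the first row holds the column headers
        print([col.get("VarCharValue") for col in row["Data"]])

Because Athena charges by the amount of data scanned, partitioning the table (covered next) helps keep such exploratory queries inexpensive.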

Data in Amazon S3 is stored in the Apache Iceberg table format, which supports data updates. This is useful here because parcel events sometimes get updated. Iceberg also supports partitioning for better query performance. Amazon S3 Tables, launched in late 2024, provides managed Iceberg tables and can also be an option for you.
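In Parcel Perform’s pipeline the table and its schema are maintained by the ingestion components described next, but for reference, a comparable partitioned Iceberg table could be declared directly in Athena as follows (names, columns, and locations are hypothetical):

# Hypothetical DDL for a partitioned Iceberg table in Athena; it could be submitted
# with start_query_execution exactly like the query in the previous sketch.
CREATE_PARCEL_EVENTS = """
CREATE TABLE parcel_events (
    parcel_id     string,
    event_type    string,
    courier       string,
    merchant_name string,
    operation     string,    -- 'c' = create, 'u' = update (CDC)
    event_date    date,
    updated_at    timestamp
)
PARTITIONED BY (month(event_date))
LOCATION 's3://example-parcel-lake/parcel_events/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""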

Parcel Perform uses an Apache Kafka cluster managed by Amazon Managed Streaming for Apache Kafka (Amazon MSK) as the stream to move the data from the source to the S3 bucket. Amazon MSK Connect with a Debezium connector streams data with change data capture (CDC) from Amazon RDS to Amazon MSK.

Apache Flink, running on Amazon Elastic Kubernetes Service (Amazon EKS), processes data streams from Amazon MSK. It writes this data to an S3 bucket in the Iceberg format and updates the data schema in the AWS Glue Data Catalog. The data schema enables Athena to correctly query the data in the S3 bucket.

Now that you understand how the data is ingested and stored, let’s show how the data is consumed using the generative AI-powered data serving assistant for the business teams in Parcel Perform.

AI agent that can query data

The users of the data serving AI agent in Parcel Perform are customer-facing business team members who often query the parcel event data to answer questions from ecommerce merchants regarding the parcel deliveries and to proactively assist them. The following screenshot shows the UI experience for the AI agent assistant, powered by text-to-SQL with generative AI.

A screenshot of the AI assistant

This functionality helped the Parcel Perform team and their customers save time, which we discuss later in this post. In the following section, we present the architecture that powers this feature.

Text-to-SQL AI agent architecture

The data serving AI assistant architecture in Parcel Perform is shown in the following diagram.

Architecture diagram of the AI assistant

The AI assistant UI is powered by an application built with the FastAPI framework hosted on Amazon EKS. It is also fronted by an Application Load Balancer to allow for potential horizontal scalability.

The application uses LangGraph to orchestrate the workflow of large language model (LLM) invocations, tool use, and memory checkpointing. The graph uses multiple tools, including those from the SQLDatabase Toolkit, to automatically fetch the data schema through Athena. The graph also uses an Amazon Bedrock Knowledge Bases retriever to retrieve business information from a knowledge base. Parcel Perform uses Anthropic’s Claude models in Amazon Bedrock to generate SQL.
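The exact wiring is internal to Parcel Perform, but a minimal sketch of how these building blocks can be assembled looks roughly like the following. The model ID, knowledge base ID, region, Athena database, and staging location are placeholders, and the PyAthena SQLAlchemy driver is assumed for the Athena connection:

from urllib.parse import quote_plus

from langchain_aws import ChatBedrockConverse
from langchain_aws.retrievers import AmazonKnowledgeBasesRetriever
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_community.utilities import SQLDatabase
from langgraph.checkpoint.memory import MemorySaver

# LLM used for SQL generation and answer synthesis (model ID is a placeholder)
llm = ChatBedrockConverse(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    region_name="ap-southeast-1",
)

# Athena exposed to LangChain through SQLAlchemy (assumes the PyAthena driver)
staging_dir = quote_plus("s3://example-athena-results/queries/")
db = SQLDatabase.from_uri(
    "awsathena+rest://@athena.ap-southeast-1.amazonaws.com:443/parcel_analytics"
    f"?s3_staging_dir={staging_dir}"
)

# Tools that list tables, fetch schemas, and run SQL against Athena
sql_tools = SQLDatabaseToolkit(db=db, llm=llm).get_tools()

# Retriever over the business-context knowledge base (ID is a placeholder)
kb_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="EXAMPLEKBID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

# Checkpointer that gives the graph per-conversation memory
checkpointer = MemorySaver()

The graph that connects these components is sketched after the numbered workflow steps later in this section.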

Although Athena’s role as the query engine for the parcel event data on Amazon S3 is clear, Parcel Perform still needs a knowledge base. In this use case, SQL generation performs better when the LLM has more business context to help it interpret database fields and translate logistics terminology into data representations. The following two examples illustrate this:

  • Parcel Perform’s data lake operations use specific codes: c for create and u for update. When analyzing data, Parcel Perform sometimes needs to focus only on initial creation records, where operation code is equal to c. Because this business logic might not be inherent in the training of LLMs in general, Parcel Perform explicitly defines this in their business context.
  • In logistics terminology, transit time has specific industry conventions. It’s measured in days, and same-day deliveries are recorded as transit_time = 0. Although this is intuitive for logistics professionals, an LLM might incorrectly interpret a request like “Get me all shipments with same-day delivery” by using WHERE transit_time = 1 instead of WHERE transit_time = 0 in the generated SQL statement.

Therefore, each incoming question goes to a Retrieval Augmented Generation (RAG) workflow to find potentially relevant stored business information, to enrich the context. This mechanism helps provide the specific rules and interpretations that even advanced LLMs might not be able to derive from general training data.

Parcel Perform uses Amazon Bedrock Knowledge Bases as a managed solution for the RAG workflow. They ingest business contextual information by uploading files to Amazon S3. Amazon Bedrock Knowledge Bases processes the files, converts them to chunks, uses embedding models to generate vectors, and stores the vectors in a vector database to make them searchable. The steps are fully managed by Amazon Bedrock Knowledge Bases. Parcel Perform stores the vectors in Amazon OpenSearch Serverless as the vector database of choice to simplify infrastructure management.
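A sketch of that ingestion flow with boto3 follows; the file name, bucket, knowledge base ID, and data source ID are placeholders:

import boto3

s3 = boto3.client("s3")
bedrock_agent = boto3.client("bedrock-agent", region_name="ap-southeast-1")

# Upload a business-context document to the knowledge base's source bucket
s3.upload_file(
    "transit_time_conventions.md",
    "example-kb-source-bucket",
    "business-context/transit_time_conventions.md",
)

# Ask Amazon Bedrock Knowledge Bases to chunk, embed, and index the new content
bedrock_agent.start_ingestion_job(
    knowledgeBaseId="EXAMPLEKBID",
    dataSourceId="EXAMPLEDSID",
)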

Amazon Bedrock Knowledge Bases provides the Retrieve API, which takes in an input (such as a question from the AI assistant), converts it into a vector embedding, searches for relevant chunks of business context information in the vector database, and returns the top relevant document chunks. It is integrated with the LangChain Amazon Bedrock Knowledge Bases retriever by calling the invoke method.
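Called directly with boto3, the Retrieve API looks roughly like this (the knowledge base ID and region are placeholders); in the graph, the equivalent call goes through the retriever’s invoke method:

import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="ap-southeast-1")

response = runtime.retrieve(
    knowledgeBaseId="EXAMPLEKBID",
    retrievalQuery={"text": "Get me all shipments with same-day delivery"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

# Join the top matching chunks into the {rag_context} placeholder of the prompt
rag_context = "\n\n".join(
    result["content"]["text"] for result in response["retrievalResults"]
)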

The next step involves invoking an AI agent with the supplied business contextual information and the SQL generation prompt. The prompt was inspired by a prompt in LangChain Hub. The following is a code snippet of the prompt:

You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
Relevant context:
{rag_context}
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
- Only use the below tools. Only use the information returned by the below tools to construct your final answer.
- DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.
- To start querying for final answer you should ALWAYS look at the tables in the database to see what you can query. Do NOT skip this step.
- Then you should query the schema of the most relevant tables

The prompt sample is part of the initial instruction for the agent. The data schema is automatically inserted by the tools from the SQLDatabase Toolkit at a later step of this agentic workflow. The following steps occur after a user enters a question in the AI assistant UI:

  1. The question triggers a run of the LangGraph graph.
  2. The following processes happen in parallel:
    1. The graph fetches the database schema from Athena through SQLDatabase Toolkit.
    2. The graph passes the question to the Amazon Bedrock Knowledge Bases retriever and gets a list of relevant business information regarding the question.
  3. The graph invokes an LLM using Amazon Bedrock by passing the question, the conversation context, data schema, and business context information. The result is the generated SQL.
  4. The graph uses SQLDatabase Toolkit again to run the SQL through Athena and fetch the data output.
  5. The data output is passed into an LLM to generate the final response based on the initial question asked. Amazon Bedrock Guardrails is used as a safeguard to avoid inappropriate inputs and responses.
  6. The final response is returned to the user through the AI assistant UI.

The following diagram illustrates these steps.

Architecture diagram of the AI assistant with numbered steps
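The production graph contains more detail, but a simplified LangGraph sketch of these six steps, with placeholder node bodies standing in for the SQLDatabase Toolkit, knowledge base retriever, and Amazon Bedrock calls shown earlier, might look like the following:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict, total=False):
    question: str
    schema: str
    rag_context: str
    sql: str
    data: str
    answer: str

# Placeholder node bodies; in practice these call the SQLDatabase Toolkit,
# the Amazon Bedrock Knowledge Bases retriever, and Amazon Bedrock models.
def fetch_schema(state: AgentState) -> dict:
    return {"schema": "-- table and column definitions fetched through Athena --"}

def retrieve_context(state: AgentState) -> dict:
    return {"rag_context": "-- relevant business rules from the knowledge base --"}

def generate_sql(state: AgentState) -> dict:
    # LLM call with the question, conversation context, schema, and business context
    return {"sql": "SELECT ... LIMIT 10"}

def run_sql(state: AgentState) -> dict:
    return {"data": "-- query results returned by Athena --"}

def summarize(state: AgentState) -> dict:
    # Final LLM call; inputs and outputs are screened by Amazon Bedrock Guardrails
    return {"answer": "-- natural language answer for the business user --"}

builder = StateGraph(AgentState)
builder.add_node("fetch_schema", fetch_schema)
builder.add_node("retrieve_context", retrieve_context)
builder.add_node("generate_sql", generate_sql)
builder.add_node("run_sql", run_sql)
builder.add_node("summarize", summarize)

# Steps 2a and 2b fan out in parallel, then fan in to SQL generation
builder.add_edge(START, "fetch_schema")
builder.add_edge(START, "retrieve_context")
builder.add_edge(["fetch_schema", "retrieve_context"], "generate_sql")
builder.add_edge("generate_sql", "run_sql")
builder.add_edge("run_sql", "summarize")
builder.add_edge("summarize", END)

graph = builder.compile()
result = graph.invoke({"question": "Did we see a spike in delivery delays last week?"})
print(result["answer"])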

This implementation demonstrates how Parcel Perform transforms raw inquiries into actionable data for timely decision-making. Security is also implemented in multiple components. From a network perspective, the EKS pods are placed in private subnets in Amazon Virtual Private Cloud (Amazon VPC) to improve network security of the AI assistant application. This AI agent is placed behind a backend layer that requires authentication. For data security, sensitive data is masked at rest in the S3 bucket. Parcel Perform also limits the permissions of the AWS Identity and Access Management (IAM) role used to access the S3 bucket so it can only access certain tables.
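As an illustration of that last point, a scoped-down policy attached to the assistant’s role might grant read access only to approved Glue tables and S3 prefixes; the account ID, names, and ARNs below are hypothetical, and Athena permissions are omitted for brevity:

# Hypothetical scoped-down policy for the assistant's IAM role; Athena permissions
# (StartQueryExecution, GetQueryResults, and so on) are omitted for brevity.
ASSISTANT_DATA_ACCESS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadApprovedGlueTablesOnly",
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
            "Resource": [
                "arn:aws:glue:ap-southeast-1:111122223333:catalog",
                "arn:aws:glue:ap-southeast-1:111122223333:database/parcel_analytics",
                "arn:aws:glue:ap-southeast-1:111122223333:table/parcel_analytics/parcel_events",
            ],
        },
        {
            "Sid": "ReadApprovedS3PrefixesOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-parcel-lake",
                "arn:aws:s3:::example-parcel-lake/parcel_events/*",
            ],
        },
    ],
}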

In the following sections, we discuss Parcel Perform’s approach to building this data transformation solution.

From idea to production

Parcel Perform started with the idea of freeing their data team from manually serving the request from the business team, while also improving the timeliness of the data availability to support the business team’s decision-making.

With the help of the AWS Solutions Architect team, Parcel Perform completed a proof of concept using AWS services and a Jupyter notebook in Amazon SageMaker Studio. After an initial success, Parcel Perform integrated the solution with their orchestration tool of choice, LangGraph.

Before going into production, Parcel Perform conducted extensive testing to verify the results were consistent. They added LangSmith Tracing to log the AI agent’s steps and results to evaluate its performance.

The Parcel Perform team discovered challenges during their journey, which we discuss in the following section. They performed prompt engineering to address those challenges. Eventually, the AI agent was integrated into production to be used by the business team. Afterward, Parcel Perform collected user feedback internally and monitored logs from LangSmith Tracing to verify performance was maintained.

The challenges

This journey isn’t free from challenges. First, some ecommerce merchants might have several records in the data lake under various names. For example, a merchant with the name “ABC” might have multiple records, such as “ABC Singapore Holdings Pte. Ltd.,” “ABC Demo Account,” “ABC Test Group,” and so on. For a question like “Was there any parcel shipment delay by ABC last week?”, the generated SQL contains WHERE merchant_name LIKE '%ABC%', which can match more than one of these records. During the proof of concept stage, this ambiguity caused incorrect matching in the results.

For this challenge, Parcel Perform relies on careful prompt engineering to instruct the LLM to identify when a name is potentially ambiguous. The AI agent then calls Athena again to look for matching names. The LLM decides which merchant name to use based on multiple factors, including its significance in data volume contribution and the account status in the data lake. In the future, Parcel Perform intends to implement a more sophisticated technique by prompting the user to resolve the ambiguity.
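The exact wording is Parcel Perform’s own; the following is only an illustrative sketch of this kind of instruction, with a hypothetical table and column:

# Hypothetical addition to the agent's instructions for ambiguous merchant names;
# the table and column names are illustrative only.
DISAMBIGUATION_RULES = """
If a merchant name could match several records (for example, demo or test
accounts in addition to the main entity), first run a query such as
    SELECT merchant_name, COUNT(*) AS events
    FROM parcel_events
    WHERE merchant_name LIKE '%<name>%'
    GROUP BY merchant_name
then choose the active account with the largest data volume before answering.
"""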

The second challenge involves unrestricted questions that can yield expensive queries scanning large amounts of data, resulting in long query wait times, because some of these questions don’t impose a LIMIT clause on the query. To solve this, Parcel Perform instructs the LLM to add a LIMIT clause with a maximum number of results if the user doesn’t specify the intended number of results. In the future, Parcel Perform plans to use query EXPLAIN results to identify heavy queries.
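Again, the instruction below is illustrative rather than Parcel Perform’s exact wording, and the specific limit is an assumption:

# Illustrative instruction that bounds result size when the user does not ask
# for a specific number of rows; the limit of 100 is an assumed default.
LIMIT_RULE = (
    "If the user does not specify how many results they want, always add "
    "LIMIT 100 to the generated SQL so that open-ended questions cannot "
    "return unbounded result sets."
)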

The third challenge relates to tracking the usage and cost incurred by this particular solution. Because Parcel Perform has started multiple generative AI projects using Amazon Bedrock, sometimes with the same LLM ID, they must distinguish the usage incurred by each project. Parcel Perform creates an inference profile for each project, associates the profile with tags, and includes that profile in each LLM call for that project. With this setup, Parcel Perform can segregate costs by project to improve cost visibility and monitoring.
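A sketch of this setup with boto3 follows; the profile name, model ARN, region, and tag values are placeholders:

import boto3

bedrock = boto3.client("bedrock", region_name="ap-southeast-1")

# Create an application inference profile tagged with the project name
profile = bedrock.create_inference_profile(
    inferenceProfileName="text-to-sql-assistant",
    modelSource={
        "copyFrom": "arn:aws:bedrock:ap-southeast-1::foundation-model/"
                    "anthropic.claude-3-5-sonnet-20240620-v1:0"
    },
    tags=[{"key": "project", "value": "text-to-sql-assistant"}],
)
profile_arn = profile["inferenceProfileArn"]

# Reference the profile ARN instead of the model ID so usage is attributed per project
runtime = boto3.client("bedrock-runtime", region_name="ap-southeast-1")
response = runtime.converse(
    modelId=profile_arn,
    messages=[{"role": "user", "content": [{"text": "Generate SQL for ..."}]}],
)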

The impact

Previously, to extract data, the business team had to clarify details with the data team, make a request, check feasibility, and wait for the data team’s bandwidth. This process lengthened when requirements came from customers or teams in different time zones, with each clarification adding 12–24 hours due to asynchronous communication. Simpler requests made early in the workday might have completed within 24 hours, whereas more complex requests or those during busy periods could take 3–5 business days.

With the text-to-SQL AI agent, this process is dramatically streamlined—minimizing the back-and-forth communication for requirement clarification, removing the dependency on data team bandwidth, and automating result interpretation.

Parcel Perform’s measurements show that the text-to-SQL AI agent reduces the average time-to-insight by 99%, from 2.3 days to an average of 10 minutes, saving approximately 3,850 total hours of wait time per month across requesters while maintaining data accuracy.

Users can directly query the data without intermediaries, receiving results in minutes rather than days. Teams across time zones can now access insights any time of day, alleviating the frustrating “wait until Asia wakes up” or “catch EMEA before they leave” delays, leading to happier customers and faster problem-solving.

This transformation has profoundly impacted the data analytics team’s capacity and focus, freeing the data team for more strategic work and helping everyone make faster, more informed decisions. Before, the analysts spent approximately 25% of their working hours handling routine data extraction requests—equivalent to over 260 hours monthly across the team. Now, with basic and intermediate queries automated, this number has dropped to just 10%, freeing up nearly 160 hours each month for high-impact work. Analysts now focus on complex data analysis rather than spending time on basic data retrieval tasks.

Conclusion

Parcel Perform’s solution demonstrates how you can use generative AI to enhance productivity and customer experience. Parcel Perform has built a text-to-SQL AI agent that transforms a business team’s question into SQL that can fetch the actual data. This improves the timeliness of data availability for decision-making that involves customers. Furthermore, the data team can avoid the undifferentiated heavy lifting to focus on complex data analysis tasks.

This solution uses multiple AWS services like Amazon Bedrock and tools like LangGraph. You can start with a proof of concept and consult your AWS Solutions Architect or engage with AWS Partners. If you have questions, post them on AWS re:Post. You can also make the development more straightforward with the help of Amazon Q Developer. When you face challenges, you can iterate to find the solution, which might include prompt engineering or adding additional steps to your workflow.

Security is a top priority. Make sure your AI assistant has proper guardrails in place to protect against prompt threats, inappropriate topics, profanity, leaked data, and other security issues. You can integrate Amazon Bedrock Guardrails with your generative AI application through an API. To learn more, refer to the following resources:


About the authors

Yudho Ahmad Diponegoro is a Senior Solutions Architect at AWS. Having been part of Amazon for over 10 years, he has held various roles from software development to solutions architecture. He helps startups in Singapore with architecting in the cloud. While he keeps his breadth of knowledge across technologies and industries, he focuses on AI and machine learning, where he has been guiding various startups in ASEAN to adopt machine learning and generative AI on AWS.

Le Vy is the AI Team Lead at Parcel Perform, where she drives the development of AI applications and explores emerging AI research. She started her career in data analysis and deepened her focus on AI through a Master’s in Artificial Intelligence. Passionate about applying data and AI to solve real business problems, she also dedicates time to mentoring aspiring technologists and building a supportive community for youth in tech. Through her work, Vy actively challenges gender norms in the industry and champions lifelong learning as a key to innovation.

Loke Jun Kai is a GenAI/ML Specialist Solutions Architect at AWS, covering strategic customers across the ASEAN region. He works with customers ranging from startups to enterprises to build cutting-edge use cases and scalable generative AI platforms. His passion for the AI space and his constant research and reading have led to many innovative solutions built with concrete business outcomes. Outside of work, he enjoys a good game of tennis and chess.
