Benchmarking customized models on Amazon Bedrock using LLMPerf and LiteLLM

Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can consume a significant portion of the effort, often around 30% of project time, because engineers must carefully optimize instance types and configure serving parameters through iterative testing. This process can be complex and time-consuming, requiring specialized knowledge to achieve the desired performance.

Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment, so that deployments are performant and cost-effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero: when there are no invocations for 5 minutes, it scales down to zero, and you pay only for what you use in 5-minute increments. It also scales up automatically, increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations looking to use custom models on Amazon Bedrock with simplicity and cost-efficiency.

Before deploying these models in production, it’s crucial to evaluate their performance using benchmarking tools. These tools help to proactively detect potential production issues such as throttling and verify that deployments can handle expected production loads.

This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers performance benchmarking of custom models in Amazon Bedrock using the popular open source tools LLMPerf and LiteLLM, and includes a notebook with step-by-step instructions to deploy and benchmark a DeepSeek-R1-Distill-Llama-8B model. The same steps apply to any other model supported by Amazon Bedrock Custom Model Import.

Prerequisites

This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.

Using open source tools LLMPerf and LiteLLM for performance benchmarking

To conduct performance benchmarking, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is its wide support of foundation model APIs, including LiteLLM, which supports all models available on Amazon Bedrock.

Setting up your custom model invocation with LiteLLM

LiteLLM is a versatile open source tool that can be used both as a Python SDK and a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and Converse APIs, and FMs available on Amazon Bedrock, including imported custom models.

To invoke a custom model with LiteLLM, you use the model parameter (see Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.

The provider_route indicates the LiteLLM implementation of request/response specification to use. DeepSeek R1 models can be invoked using their custom chat template using the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.

The model_arn is the model Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.

For example, the following script invokes the custom model using the DeepSeek R1 chat template.

import time
from litellm import completion

# model_id is the ARN of the imported model, for example:
# arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123
while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[{"role": "user", "content": """Given the following financial data:
        - Company A's revenue grew from $10M to $15M in 2023
        - Operating costs increased by 20%
        - Initial operating costs were $7M

        Calculate the company's operating margin for 2023. Please reason step by step."""},
                      {"role": "assistant", "content": "<think>"}],
            max_tokens=4096,
        )
        print(response['choices'][0]['message']['content'])
        break
    except Exception:
        # The model may still be scaling up from zero; wait and retry the invocation
        time.sleep(60)

After the invocation parameters for the imported model have been verified, you can configure LLMPerf for benchmarking.

Configuring a token benchmark test with LLMPerf

To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as Ray actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the clients, which allows for the simulation of various load scenarios and concurrent request patterns. Each client also collects performance metrics during the requests, including latency, throughput, and error rates.

Two critical metrics for performance include latency and throughput:

  • Latency refers to the time it takes for a single request to be processed.
  • Throughput measures the number of tokens that are generated per second.

Selecting the right configuration to serve FMs typically involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by offering pre-optimized serving configurations, it’s still crucial to verify your deployment’s latency and throughput.

Start by configuring token_benchmark_ray.py, a sample script that facilitates the configuration of a benchmarking test. For the script, you can define parameters such as:

  • LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
  • Model: Define the route, API, and model ARN to invoke similarly to the previous section.
  • Mean/standard deviation of input tokens: Parameters to use in the probability distribution from which the number of input tokens will be sampled.
  • Mean/standard deviation of output tokens: Parameters to use in the probability distribution from which the number of output tokens will be sampled.
  • Number of concurrent requests: The number of users that the application is likely to support when in use.
  • Number of completed requests: The total number of requests to send to the LLM API in the test.

The following command shows an example of how to run the benchmarking test. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.

python3 ${LLM_PERF_SCRIPT_DIR}/token_benchmark_ray.py \
--model "bedrock/llama/{model_id}" \
--mean-input-tokens {mean_input_tokens} \
--stddev-input-tokens {stddev_input_tokens} \
--mean-output-tokens {mean_output_tokens} \
--stddev-output-tokens {stddev_output_tokens} \
--max-num-completed-requests ${LLM_PERF_MAX_REQUESTS} \
--timeout 1800 \
--num-concurrent-requests ${LLM_PERF_CONCURRENT} \
--results-dir "${LLM_PERF_OUTPUT}" \
--llm-api litellm \
--additional-sampling-params '{}'

At the end of the test, LLMPerf will output two JSON files: one with aggregate metrics, and one with separate entries for every invocation.
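
As a rough sketch of post-processing, the following snippet loads the aggregate-metrics file and prints a few latency figures. The exact file names and key names depend on the model name and LLMPerf version, so the *_summary.json pattern and the key filter below are assumptions to adjust.

import glob
import json

# LLMPerf writes results into the directory passed via --results-dir
summary_path = glob.glob("results/*_summary.json")[0]  # file name pattern is an assumption
with open(summary_path) as f:
    summary = json.load(f)

# Print the aggregate latency metrics of interest
for key, value in summary.items():
    if "ttft" in key or "inter_token_latency" in key:
        print(key, value)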

Scale to zero and cold-start latency

One thing to remember is that because Amazon Bedrock Custom Model Import will scale down to zero when the model is unused, you need to first make a request to make sure that there is at least one active model copy. If you obtain an error indicating that the model isn’t ready, wait approximately 10 seconds and up to 1 minute for Amazon Bedrock to prepare at least one active model copy. When it’s ready, run a test invocation again, and proceed with benchmarking.
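
The following is a minimal warm-up sketch using boto3; it assumes the cold start surfaces as a ModelNotReadyException error code and that the imported model accepts a Llama-style request body, so adjust the exception handling, request format, and placeholder ARN for your deployment.

import json
import time
import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client("bedrock-runtime")
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"  # replace with your imported model ARN

def warm_up(max_wait_s=600):
    """Send a short prompt until at least one model copy is active."""
    deadline = time.time() + max_wait_s
    while time.time() < deadline:
        try:
            bedrock_runtime.invoke_model(
                modelId=model_arn,
                body=json.dumps({"prompt": "Hello", "max_gen_len": 16}),
            )
            print("Model copy is active; ready to benchmark.")
            return
        except ClientError as err:
            if err.response["Error"]["Code"] == "ModelNotReadyException":
                time.sleep(30)  # the copy is still being prepared; wait and retry
            else:
                raise
    raise TimeoutError("Model did not become ready in time.")

warm_up()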

Example scenario for DeepSeek-R1-Distill-Llama-8B

Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust parameters for token count for prompts and completions. For example:

  • Number of clients: 2
  • Mean input token count: 500
  • Standard deviation input token count: 25
  • Mean output token count: 1000
  • Standard deviation output token count: 100
  • Number of requests per client: 50

This illustrative test takes approximately 8 minutes. At the end of the test, you will obtain a summary of aggregate metrics:

inter_token_latency_s
    p25 = 0.010615988283217918
    p50 = 0.010694698716183695
    p75 = 0.010779359342088015
    p90 = 0.010945443657517748
    p95 = 0.01100556307365132
    p99 = 0.011071086908721675
    mean = 0.010710014800224604
    min = 0.010364670612635254
    max = 0.011485444453299149
    stddev = 0.0001658793389904756
ttft_s
    p25 = 0.3356793452499005
    p50 = 0.3783651359990472
    p75 = 0.41098671700046907
    p90 = 0.46655246950049334
    p95 = 0.4846706690498647
    p99 = 0.6790834719300077
    mean = 0.3837810468001226
    min = 0.1878921090010408
    max = 0.7590946710006392
    stddev = 0.0828713133225014
end_to_end_latency_s
    p25 = 9.885957818500174
    p50 = 10.561580732000039
    p75 = 11.271923759749825
    p90 = 11.87688222009965
    p95 = 12.139972019549713
    p99 = 12.6071144856102
    mean = 10.406450886010116
    min = 2.6196457750011177
    max = 12.626598834998731
    stddev = 1.4681851822617253
request_output_throughput_token_per_s
    p25 = 104.68609252502657
    p50 = 107.24619111072519
    p75 = 108.62997591951486
    p90 = 110.90675007239598
    p95 = 113.3896235445618
    p99 = 116.6688412475626
    mean = 107.12082450567561
    min = 97.0053466021563
    max = 129.40680882698936
    stddev = 3.9748004356837137
number_input_tokens
    p25 = 484.0
    p50 = 500.0
    p75 = 514.0
    p90 = 531.2
    p95 = 543.1
    p99 = 569.1200000000001
    mean = 499.06
    min = 433
    max = 581
    stddev = 26.549294727074212
number_output_tokens
    p25 = 1050.75
    p50 = 1128.5
    p75 = 1214.25
    p90 = 1276.1000000000001
    p95 = 1323.75
    p99 = 1372.2
    mean = 1113.51
    min = 339
    max = 1392
    stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034

In addition to the summary, you will receive metrics for individual requests that can be used to prepare detailed reports, such as histograms of time to first token and token throughput.
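
One way to produce such a histogram is sketched below using the per-invocation file; the *_individual_responses.json pattern and the ttft_s field name are assumptions about the LLMPerf output format, and matplotlib is an extra dependency.

import glob
import json
import matplotlib.pyplot as plt

# Load the per-invocation results written by LLMPerf into --results-dir
path = glob.glob("results/*_individual_responses.json")[0]  # file name pattern is an assumption
with open(path) as f:
    requests = json.load(f)

# Histogram of time to first token across all completed requests
ttft = [r["ttft_s"] for r in requests if "ttft_s" in r]
plt.hist(ttft, bins=20)
plt.xlabel("Time to first token (s)")
plt.ylabel("Request count")
plt.title("TTFT distribution")
plt.savefig("ttft_histogram.png")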

Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch

LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the specifics of the serving properties and configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end user experience of your application.

In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. By using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. Plotting the ModelCopy metric over the duration of the test assists in estimating costs, because billing is based on the number of active model copies at a given time.
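
The sketch below pulls the ModelCopy metric for the test window with boto3; the dimension name and the placeholder model ARN are assumptions, so check the metric in the CloudWatch console for the exact labels used by your deployment.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"  # your imported model ARN

# Maximum number of active model copies in each 5-minute period over the past hour
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    Dimensions=[{"Name": "ModelArn", "Value": model_arn}],  # dimension name is an assumption
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])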

Conclusion

While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance, and compare models across key metrics such as cost, latency, and throughput.

To learn more, try the example notebook with your custom model.


About the Authors

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.

Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.

Creating asynchronous AI agents with Amazon Bedrock

The integration of generative AI agents into business processes is poised to accelerate as organizations recognize the untapped potential of these technologies. Advancements in multimodal artificial intelligence (AI), where agents can understand and generate not just text but also images, audio, and video, will further broaden their applications. This post discusses agentic AI-driven architecture and ways of implementing it.

The emergence of generative AI agents in recent years has contributed to the transformation of the AI landscape, driven by advances in large language models (LLMs) and natural language processing (NLP). Companies like Anthropic, Cohere, and Amazon have made significant strides in developing powerful language models capable of understanding and generating human-like content across multiple modalities, revolutionizing how businesses integrate and utilize artificial intelligence in their processes.

These AI agents have demonstrated remarkable versatility, being able to perform tasks ranging from creative writing and code generation to data analysis and decision support. Their ability to engage in intelligent conversations, provide context-aware responses, and adapt to diverse domains has revolutionized how businesses approach problem-solving, customer service, and knowledge dissemination.

One of the most significant impacts of generative AI agents has been their potential to augment human capabilities through both synchronous and asynchronous patterns. In synchronous orchestration, just like in traditional process automation, a supervisor agent orchestrates the multi-agent collaboration, maintaining a high-level view of the entire process while actively directing the flow of information and tasks. This approach allows businesses to offload repetitive and time-consuming tasks in a controlled, predictable manner.

Alternatively, asynchronous choreography follows an event-driven pattern where agents operate autonomously, triggered by events or state changes in the system. In this model, agents publish events or messages that other agents can subscribe to, creating a workflow that emerges from their collective behavior. These patterns have proven particularly valuable in enhancing customer experiences, where agents can provide round-the-clock support, resolve issues promptly, and deliver personalized recommendations through either orchestrated or event-driven interactions, leading to increased customer satisfaction and loyalty.

Agentic AI architecture

Agentic AI architecture is a shift in process automation that moves from traditional autonomous agents toward the capabilities of AI, with the purpose of imitating cognitive abilities and enhancing the actions of those agents. This architecture can enable businesses to streamline operations, enhance decision-making processes, and automate complex tasks in new ways.

Much like traditional business process automation through technology, agentic AI architecture is the design of AI systems that resolve complex problems with limited or indirect human intervention. These systems are composed of multiple AI agents that converse with each other or execute complex tasks through a series of choreographed or orchestrated processes. This approach empowers AI systems to exhibit goal-directed behavior, learn from experience, and adapt to changing environments.

The difference between a single agent invocation and a multi-agent collaboration lies in the complexity and the number of agents involved in the process.

When you interact with a digital assistant like Alexa, you’re typically engaging with a single agent, also known as a conversational agent. This agent processes your request, such as setting a timer or checking the weather, and provides a response without needing to consult other agents.

Now, imagine expanding this interaction to include multiple agents working together. Let’s start with a simple travel booking scenario:

Your interaction begins with telling a travel planning agent about your desired trip. In this first step, the AI model, in this case an LLM, is acting as an interpreter and user experience interface between your natural language input and the structured information needed by the travel planning system. It’s processing your request, which might be a complex statement like “I want to plan a week-long beach vacation in Hawaii for my family of four next month,” and extracting key details such as the destination, duration, number of travelers, and approximate dates.

The LLM is also likely to infer additional relevant information that wasn’t explicitly stated, such as the need for family-friendly accommodations or activities. It might ask follow-up questions to clarify ambiguous points or gather more specific preferences. Essentially, the LLM is transforming your casual, conversational input into a structured set of travel requirements that can be used by the specialized booking agents in the subsequent steps of the workflow.

This initial interaction sets the foundation for the entire multi-agent workflow, making sure that the travel planning agent has a clear understanding of your needs before engaging other specialized agents.

By adding another agent, the flight booking agent, the travel planning agent can call upon it to find suitable flights. The travel planning agent needs to provide the flight booking agent with relevant information (dates, destinations), then wait for and process the flight booking agent’s response to incorporate the flight options into its overall plan.

Now, let’s add another agent to the workflow: a hotel booking agent to support finding accommodations. With this addition, the travel planning agent must also communicate with the hotel booking agent, which needs to make sure that the hotel dates align with the flight dates and provide its information back so the overall plan includes both flight and hotel options.

As we continue to add agents, such as a car rental agent or a local activities agent, each new addition receives relevant information from the travel planning agent, performs its specific task, and returns its results to be incorporated into the overall plan. The travel planning agent acts not only as the user experience interface, but also as a coordinator, deciding when to involve each specialized agent and how to combine their inputs into a cohesive travel plan.

This multi-agent workflow allows for more complex tasks to be accomplished by taking advantage of the specific capabilities of each agent. The system remains flexible, because agents can be added or removed based on the specific needs of each request, without requiring significant changes to the existing agents and minimal change to the overall workflow.

For more on the benefits of breaking tasks into agents, see How task decomposition and smaller LLMs can make AI more affordable.

Process automation with agentic AI architecture

The preceding scenario, just like in traditional process automation, is a common orchestration pattern, where the multi-agent collaboration is orchestrated by a supervisor agent. The supervisor agent acts like a conductor leading an orchestra, telling each instrument when to play and how to harmonize with others. For this approach, Amazon Bedrock Agents enables generative AI applications to execute multi-step tasks orchestrated by an agent and to create a multi-agent collaboration to solve complex tasks. This is done by designating an Amazon Bedrock agent as the supervisor agent and associating one or more collaborator agents with it. For more details, read about creating and configuring Amazon Bedrock Agents and Use multi-agent collaboration with Amazon Bedrock Agents.

The following diagram illustrates the supervisor agent methodology.

Supervisor agent methodology

Following traditional process automation patterns, the other end of the spectrum to synchronous orchestration would be asynchronous choreography: an asynchronous event-driven multi-agent workflow. In this approach, there would be no central orchestrating agent (supervisor). Agents operate autonomously where actions are triggered by events or changes in a system’s state and agents publish events or messages that other agents can subscribe to. In this approach, the workflow emerges from the collective behavior of the agents reacting to events asynchronously. It’s more like a jazz improvisation, where each musician responds to what others are playing without a conductor. The following diagram illustrates this event-driven workflow.

Event-driven workflow methodology

The event-driven pattern in asynchronous systems operates without predefined workflows, creating a dynamic and potentially chaotic processing environment. While agents subscribe to and publish messages through a central event hub, the flow of processing is determined organically by the message requirements and the available subscribed agents. Although the resulting pattern may resemble a structured workflow when visualized, it’s important to understand that this is emergent behavior rather than orchestrated design. The absence of centralized workflow definitions means that message processing occurs naturally based on publication timing and agent availability, creating a fluid and adaptable system that can evolve with changing requirements.
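
As a small illustration of this choreography, the sketch below shows how an agent might publish a task-completion event for others to subscribe to; the bus name, source, and detail fields are illustrative assumptions rather than a prescribed schema.

import json
import boto3

events = boto3.client("events")

# Publish a task-completion event that other subscribed agents can react to
events.put_events(
    Entries=[
        {
            "EventBusName": "agent-collaboration-bus",  # illustrative bus name
            "Source": "agents.flight-booking",          # illustrative source
            "DetailType": "task.completed",
            "Detail": json.dumps(
                {
                    "correlationId": "trip-12345",
                    "result": {"flight": "MEL-SYD", "departure": "2025-09-01"},
                }
            ),
        }
    ]
)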

The choice between synchronous orchestration and asynchronous event-driven patterns fundamentally shapes how agentic AI systems operate and scale. Synchronous orchestration, with its supervisor agent approach, provides precise control and predictability, making it ideal for complex processes requiring strict oversight and sequential execution. This pattern excels in scenarios where the workflow needs to be tightly managed, audited, and debugged. However, it can create bottlenecks as all operations must pass through the supervisor agent. Conversely, asynchronous event-driven systems offer greater flexibility and scalability through their distributed nature. By allowing agents to operate independently and react to events in real-time, these systems can handle dynamic scenarios and adapt to changing requirements more readily. While this approach may introduce more complexity in tracking and debugging workflows, it excels in scenarios requiring high scalability, fault tolerance, and adaptive behavior. The decision between these patterns often depends on the specific requirements of the system, balancing the need for control and predictability against the benefits of flexibility and scalability.

Getting the best of both patterns

You can use a single agent to route messages to other agents based on the context of the event data (message) at runtime, with no prior knowledge of the downstream agents and without having to rely on each agent subscribing to an event hub. This is traditionally known as the message broker or event broker pattern, which for the purposes of this post we call the agent broker pattern, to represent the brokering of messages to AI agents. The agent broker pattern is a hybrid approach that combines elements of both centralized synchronous orchestration and distributed asynchronous event-driven systems.

The key to this pattern is that a single agent acts as a central hub for message distribution but doesn’t control the entire workflow. The broker agent determines where to send each message based on its content or metadata, making routing decisions at runtime. The processing agents are decoupled from each other and from the message source, only interacting with the broker to receive messages. The agent broker pattern is different from the supervisor pattern because it routes a message to an agent without awaiting a response from the collaborating agents. The following diagram illustrates the agent broker methodology.

Agent broker methodology

Following an agent broker pattern, the system is still fundamentally event-driven, with actions triggered by the arrival of messages. New agents can be added to handle specific types of messages without changing the overall system architecture. How to implement this type of pattern is explained later in this post.

This pattern is often used in enterprise messaging systems, microservices architectures, and complex event processing systems. It provides a balance between the structure of orchestrated workflows and the flexibility of pure event-driven systems.

Agentic architecture with the Amazon Bedrock Converse API

Traditionally, we might have had to sacrifice some flexibility in the broker pattern by having to update the routing logic in the broker when adding additional processes (agents) to the architecture. This is, however, not the case when using the Amazon Bedrock Converse API. With the Converse API, we can call a tool to complete an Amazon Bedrock model response. The only change needed is to add the new agent to the collaboration as configuration stored outside of the broker.

To let a model use a tool to complete a response for a message, the message and the definitions for one or more tools (agents) are sent to the model. If the model determines that one of the tools can help generate a response, it returns a request to use the tool.

AWS AppConfig, a capability of AWS Systems Manager, is used to store each agent’s tool context data as a single configuration in a managed data store, to be sent with the Converse API tool request. By using AWS Lambda as the message broker to receive all messages and send requests to the Converse API with the tool context stored in AWS AppConfig, the architecture allows for adding agents to the system without updating the routing logic: agents are registered as tool context in the configuration stored in AWS AppConfig, which Lambda reads at runtime when an event message is received. For more information about when to use AWS AppConfig, see AWS AppConfig use cases.

Implementing the agent broker pattern

The following diagram demonstrates how Amazon EventBridge and Lambda act as a central message broker, using the Amazon Bedrock Converse API tool use capability to dynamically route messages to the appropriate AI agents.

Agent broker architecture

Messages sent to EventBridge are routed through an EventBridge rule to Lambda. There are three tasks the EventBridge Lambda function performs as the agent broker:

  1. Query AWS AppConfig for all agents’ tool context. An agent tool context is a description of the agent’s capability along with the Amazon Resource Name (ARN) or URL of the agent’s message ingress.
  2. Provide the agent tool context along with the inbound event message to the Amazon Bedrock LLM through the Converse API; in this example, using an Amazon Bedrock tools-compatible LLM. The LLM compares the event message against the agent tool context and returns a response to the requesting Lambda function containing the recommended tool or tools that should be used to process the message.
  3. Receive the response from the Converse API request, containing one or more tools that should be called to process the event message, and hand the event message to the ingress of the recommended tools, as shown in the sketch after this list.
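
Below is a condensed, illustrative sketch of these three tasks in a brokering Lambda function. It assumes the AWS AppConfig Lambda extension is attached (exposing a local endpoint on port 2772), that the stored configuration is already shaped as a Converse API toolConfig document, and that the application, environment, configuration, and model identifiers are placeholders to replace; route_to_agent is a hypothetical helper.

import json
import urllib.request
import boto3

bedrock = boto3.client("bedrock-runtime")

# AWS AppConfig Lambda extension local endpoint; application/environment/configuration names are placeholders
APPCONFIG_URL = "http://localhost:2772/applications/agent-broker/environments/prod/configurations/agent-tools"

def handler(event, context):
    # 1. Load the registered agents' tool context from AWS AppConfig
    with urllib.request.urlopen(APPCONFIG_URL) as resp:
        tool_config = json.loads(resp.read())  # expected shape: {"tools": [{"toolSpec": {...}}, ...]}

    # 2. Ask the model which tool(s) should process this event message
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any tools-compatible Bedrock model
        messages=[{"role": "user", "content": [{"text": json.dumps(event["detail"])}]}],
        toolConfig=tool_config,
    )

    # 3. Hand the event message to the ingress of each recommended tool (agent)
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            route_to_agent(block["toolUse"]["name"], event["detail"])

def route_to_agent(agent_name, payload):
    # Illustrative placeholder: look up the agent's ingress (for example, an SQS queue URL) and forward the message
    print(f"Routing event to {agent_name}: {payload}")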

In this example, the architecture demonstrates brokering messages asynchronously to an Amazon SageMaker based agent, an Amazon Bedrock agent, and an external third-party agent, all from the same agent broker.

Although the brokering Lambda function could connect directly to the SageMaker or Amazon Bedrock agent API, the architecture provides for adaptability and scalability in message throughput, allowing messages from the agent broker to be queued, in this example with Amazon Simple Queue Service (Amazon SQS), and processed according to the capability of the receiving agent. For adaptability, the Lambda function subscribed to the agent ingress queue provides additional system prompts (pre-prompting of the LLM for specific tool context), message formatting, and the functions required for the expected input and output of the agent request.

To add new agents to the system, the only integration requirements are to update AWS AppConfig with the new agent tool context (a description of the agent’s capability and ingress endpoint) and to make sure the brokering Lambda function has permissions to write to the agent ingress endpoint.

Agents can be added to the system without rewriting the Lambda function or performing an integration that requires downtime, allowing the new agent to be used on the next instantiation of the brokering Lambda function.

Implementing the supervisor pattern with an agent broker

Building upon the agent broker pattern, the architecture can be extended to handle more complex, stateful interactions. Although the broker pattern effectively uses AWS AppConfig and Amazon Bedrock’s Converse API tool use capability for dynamic routing, its unidirectional nature has limitations. Events flow in and are distributed to agents, but complex scenarios like travel booking require maintaining context across multiple agent interactions. This is where the supervisor pattern provides additional capabilities without compromising the flexible routing we achieved with the broker pattern.

Consider the travel booking example again: it has the broker agent and several task-based agents that events will be pushed to. When processing a request like "Book a 3-night trip to Sydney from Melbourne during the first week of September for 2 people", we encounter several challenges. Although this statement contains clear intent, it lacks critical details that the agent might need, such as:

  1. Specific travel dates
  2. Accommodation preferences and room configurations

The broker pattern alone can’t effectively manage these information gaps while maintaining context between agent interactions. This is where adding the capability of a supervisor to the broker agent provides:

  • Contextual awareness between events and agent invocations
  • Bi-directional information flow capabilities

The following diagram illustrates the supervisor pattern workflow.

Supervisor pattern architecture

When a new event enters the system, the workflow initiates the following steps:

  1. The event is assigned a unique identifier for tracking
  2. The supervisor performs the following actions:
    • Evaluates which agents to invoke (brokering)
    • Creates a new state record with the identifier and timestamp
    • Provides this contextual information to the selected agents along with their invocation parameters
  3. Agents process their tasks and emit ‘task completion’ events back to EventBridge
  4. The supervisor performs the following actions:
    • Collects and processes completed events
    • Evaluates the combined results and context
    • Determines if additional agent invocations are needed
    • Continues this cycle until all necessary actions are completed

This pattern handles scenarios where agents might return varying results or request additional information. The supervisor can:

  • Derive missing information from other agent responses
  • Request additional information from the source
  • Coordinate with other agents to resolve information gaps

To handle information gaps without architectural modifications, we can introduce an answers agent to the existing system. This agent operates within the same framework as other agents, but specializes in context resolution. When agents report incomplete information or require clarification, the answers agent can:

  • Process queries about missing information
  • Emit task completion events with enhanced context
  • Allow the supervisor to resume workflow execution with newly available information, the same way that it would after another agent emits its task-completion event.

This enhancement enables complex, multi-step workflows while maintaining the system’s scalability and flexibility. The supervisor can manage dependencies between agents, handle partial completions, and make sure that the necessary information is gathered before finalizing tasks.

Implementation considerations:

Implementing the supervisor pattern on top of the existing broker agent architecture provides the advantages of both the broker pattern and the complex state management of orchestration. The state management can be handled through Amazon DynamoDB, while maintaining the use of EventBridge for event routing and AWS AppConfig for agent configuration. The Amazon Bedrock Converse API continues to play a crucial role in agent selection, but now with added context from the supervisor’s state management. This allows you to preserve the dynamic routing capabilities we established with the broker pattern while adding the sophisticated workflow management needed for complex, multi-step processes.
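
A minimal sketch of that state management is shown below, assuming a DynamoDB table keyed on the event identifier; the table name and attribute names are illustrative, not a prescribed schema.

from datetime import datetime, timezone
import boto3

dynamodb = boto3.resource("dynamodb")
state_table = dynamodb.Table("agent-workflow-state")  # illustrative table name

def create_state_record(event_id, selected_agents):
    """Record which agents the supervisor invoked so completions can be correlated later."""
    state_table.put_item(
        Item={
            "eventId": event_id,
            "createdAt": datetime.now(timezone.utc).isoformat(),
            "pendingAgents": set(selected_agents),  # stored as a string set
            "status": "IN_PROGRESS",
        }
    )

def mark_agent_complete(event_id, agent_name):
    """Move an agent from pending to completed when its task-completion event arrives."""
    state_table.update_item(
        Key={"eventId": event_id},
        UpdateExpression="DELETE pendingAgents :a ADD completedAgents :a",
        ExpressionAttributeValues={":a": {agent_name}},
    )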

Conclusion

Agentic AI architecture, powered by Amazon Bedrock and AWS services, represents a leap forward in the evolution of automated AI systems. By combining the flexibility of event-driven systems with the power of generative AI, this architecture enables businesses to create more adaptive, scalable, and intelligent automated processes. The agent broker pattern offers a robust solution for dynamically routing complex tasks to specialized AI agents, and the agent supervisor pattern extends these capabilities to handle sophisticated, context-aware workflows.

These patterns take advantage of the strengths of the Amazon Bedrock Converse API, Lambda, EventBridge, and AWS AppConfig to create a flexible and extensible system. The broker pattern excels at dynamic routing and seamless agent integration, while the supervisor pattern adds crucial state management and contextual awareness for complex, multi-step processes. Together, they provide a comprehensive framework for building sophisticated AI systems that can handle both simple routing and complex, stateful interactions.

This architecture not only streamlines operations, but also opens new possibilities for innovation and efficiency across various industries. Whether implementing simple task routing or orchestrating complex workflows requiring maintained context, organizations can build scalable, maintainable AI systems that evolve with their needs while maintaining operational stability.

To get started with an agentic AI architecture, consider the following next steps:

  • Explore Amazon Bedrock – If you haven’t already, sign up for Amazon Bedrock and experiment with its powerful generative AI models and APIs. Familiarize yourself with the Converse API and its tool use capabilities.
  • Prototype your own agent broker – Use the architecture outlined in this post as a starting point to build a proof-of-concept agent broker system tailored to your organization’s needs. Start small with a few specialized agents and gradually expand.
  • Identify use cases – Analyze your current business processes to identify areas where an agentic AI architecture could drive significant improvements. Consider complex, multi-step tasks that could benefit from AI assistance.
  • Stay informed – Keep up with the latest developments in AI and cloud technologies. AWS regularly updates its offerings, so stay tuned for new features that could enhance your agentic AI systems.
  • Collaborate and share – Join AI and cloud computing communities to share your experiences and learn from others. Consider contributing to open-source projects or writing about your implementation to help advance the field.
  • Invest in training – Make sure your team has the necessary skills to work with these advanced AI technologies. Consider AWS training and certification programs to build expertise in your organization.

By embracing an agentic AI architecture, you’re not just optimizing your current processes – you’re positioning your organization at the forefront of the AI revolution. Start your journey today and unlock the full potential of AI-driven automation for your business.


About the Authors

Aaron Sempf is the Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large-scale, complex integration and event-driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI assisted business automation.

Joshua Toth is a Senior Prototyping Engineer with over a decade of experience in software engineering and distributed systems. He specializes in solving complex business challenges through technical prototypes, demonstrating the art of the possible. With deep expertise in proof of concept development, he focuses on bridging the gap between emerging technologies and practical business applications. In his spare time, he can be found developing next-generation interactive demonstrations and exploring cutting-edge technological innovations.

Sara van de Moosdijk, simply known as Moose, is an AI/ML Specialist Solution Architect at AWS. She helps AWS customers and partners build and scale AI/ML solutions through technical enablement, support, and architectural guidance. Moose spends her free time figuring out how to fit more books in her overflowing bookcase.

How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many of the publicly available chat models on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection can support over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, with the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, you compile the model to make sure that you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face along with the Optimum Neuron cache will transparently supply a compiled model when available. If you’re using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
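
If you do need to compile, the sketch below uses the Optimum Neuron Python API to export the model; the batch size, sequence length, core count, and cast type shown are assumptions that should match your serving configuration (they mirror the TGI settings used later in this post).

from optimum.neuron import NeuronModelForCausalLM

# Shapes and compiler arguments must match the serving configuration (these values are examples)
compiler_args = {"num_cores": 2, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 4, "sequence_length": 4096}

model = NeuronModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    export=True,
    **compiler_args,
    **input_shapes,
)

# Save the compiled artifacts so they can be loaded from a local path (for example, /data/exportedmodel)
model.save_pretrained("exportedmodel")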

You can deploy TGI as a docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you’ll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you’ll define all of the environment parameters that you’ll need to deploy your model for inference). You can copy the following files for this use case.

  1. Create a .env file with the following content:
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel' 
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096

  2. Create a file named docker-compose.yaml with the following content:
version: '3.7'

services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} #only needed for gated models
    volumes:
      - $PWD:/data #can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"
  3. Use docker compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up

  4. To confirm that the model deployed correctly, send a test prompt to the model:
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"Tell me about AWS.",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'
  5. To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
#"Tell me how to open an AWS account"
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"告诉我如何开设 AWS 账户。",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'

Option 2: Deploy TGI on SageMaker

You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker using instructions on the Hugging Face Model Hub.

  1. From the Qwen 2.5 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

How to deploy the model on Amazon SageMaker

How to find the code you'll need to deploy the model using AWS Inferentia and Trainium

  2. Copy the example code into a SageMaker notebook, then choose Run.
  3. The notebook you copied will look like the following:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console.
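
If you prefer to script the cleanup, a boto3 sketch is shown below; the instance ID is a placeholder to replace with your Inf2 instance.

import boto3

ec2 = boto3.client("ec2")
# Terminate the Inf2 instance used for the TGI deployment (replace with your instance ID)
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])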

Terminate a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.


About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor’s degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solution Architect Manager focusing on Startups at AWS. Paul created a team of AWS Startup Solution architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in Computer Science from Siena College and has multiple Cyber Security certifications.

Revolutionizing customer service: MaestroQA’s integration with Amazon Bedrock for actionable insight

This post is cowritten with Harrison Hunter, CTO and co-founder of MaestroQA.

MaestroQA augments call center operations by empowering the quality assurance (QA) process and customer feedback analysis to increase customer satisfaction and drive operational efficiencies. They assist with operations such as QA reporting, coaching, workflow automations, and root cause analysis.

In this post, we dive deeper into one of MaestroQA’s key features—conversation analytics, which helps support teams uncover customer concerns, address points of friction, adapt support workflows, and identify areas for coaching through the use of Amazon Bedrock. We discuss the unique challenges MaestroQA overcame and how they use AWS to build new features, drive customer insights, and improve operational efficiency.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The opportunity for open-ended conversation analysis at enterprise scale

MaestroQA serves a diverse clientele across various industries, including ecommerce, marketplaces, healthcare, talent acquisition, insurance, and fintech. All of these customers have a common challenge: the need to analyze a high volume of interactions with their customers. Analyzing these customer interactions is crucial to improving their products, improving their customer support, increasing customer satisfaction, and identifying key industry signals. However, customer interaction data such as call center recordings, chat messages, and emails are highly unstructured and require advanced processing techniques in order to accurately and automatically extract insights.

When customers receive incoming calls at their call centers, MaestroQA employs its proprietary transcription technology, built by enhancing open source transcription models, to transcribe the conversations. After the data is transcribed, MaestroQA uses technology they have developed in combination with AWS services such as Amazon Comprehend to run various types of analysis on the customer interaction data. For example, MaestroQA offers sentiment analysis for customers to identify the sentiment of their end customer during the support interaction, enabling MaestroQA’s customers to sort their interactions and manually inspect the best or worst interactions. MaestroQA also offers a logic/keyword-based rules engine for classifying customer interactions based on other factors such as timing or process steps including metrics like Average Handle Time (AHT), compliance or process checks, and SLA adherence.

MaestroQA’s customers love these analysis features because they allow them to continuously improve the quality of their support and identify areas where they can improve their product to better satisfy their end customers. However, they were also interested in more advanced analysis, such as asking open-ended questions like “How many times did the customer ask for an escalation?” MaestroQA’s existing rules engine couldn’t always answer these types of queries because end-users could ask for the same outcome in many different ways. For example, “Can I speak to your manager?” and “I would like to speak to someone higher up” don’t share the same keywords, but are both asking for an escalation. MaestroQA needed a way to accurately classify customer interactions based on open-ended questions.

MaestroQA faced an additional hurdle: the immense scale of customer interactions their clients manage. With clients handling anywhere from thousands to millions of customer engagements monthly, there was a pressing need for comprehensive analysis of support team performance across this vast volume of interactions. Consequently, MaestroQA had to develop a solution capable of scaling to meet their clients’ extensive needs.

To start developing this product, MaestroQA first rolled out a product called AskAI. AskAI allowed customers to run open-ended questions on a targeted list of up to 1,000 conversations. For example, a customer might use MaestroQA’s filters to find customer interactions in Oregon within the past two months and then run a root cause analysis query such as “What are customers frustrated about in Oregon?” to find churn risk anecdotes. Their customers really liked this feature and surprised MaestroQA with the breadth of use cases they covered, including analyzing marketing campaigns, service issues, and product opportunities. Customers started to request the ability to run this type of analysis across all of their transcripts, which could number in the millions, so they could quantify the impact of what they were seeing and find instances of important issues.

Solution overview

MaestroQA decided to use Amazon Bedrock to address their customers’ need for advanced analysis of customer interaction transcripts. Amazon Bedrock’s broad choice of FMs from leading AI companies, along with its scalability and security features, made it an ideal solution for MaestroQA.

MaestroQA integrated Amazon Bedrock into their existing architecture using Amazon Elastic Container Service (Amazon ECS). The customer interaction transcripts are stored in an Amazon Simple Storage Service (Amazon S3) bucket.

The following architecture diagram demonstrates the request flow for AskAI. When a customer submits an analysis request through MaestroQA’s web application, an ECS cluster retrieves the relevant transcripts from Amazon S3, cleans and formats the prompt, sends them to Amazon Bedrock for analysis using the customer’s selected FM, and stores the results in a database hosted in Amazon Elastic Compute Cloud (Amazon EC2), where they can be retrieved by MaestroQA’s frontend web application.

Solution architecture
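
The request flow can be approximated with the following sketch; the bucket, object key, prompt, and model ID are illustrative assumptions rather than MaestroQA’s actual implementation.

import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")

# 1. Retrieve a transcript from Amazon S3 (bucket and key are illustrative)
obj = s3.get_object(Bucket="example-transcripts", Key="tickets/12345.txt")
transcript = obj["Body"].read().decode("utf-8")

# 2. Build the analysis prompt and send it to the customer-selected FM through the Converse API
question = "How many times did the customer ask for an escalation?"
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example of a selectable FM
    messages=[{"role": "user", "content": [{"text": f"{question}\n\nTranscript:\n{transcript}"}]}],
)

# 3. Store or return the analysis result for the frontend
print(response["output"]["message"]["content"][0]["text"])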

MaestroQA offers their customers the flexibility to choose from multiple FMs available through Amazon Bedrock, including Anthropic’s Claude 3.5 Sonnet, Anthropic’s Claude 3 Haiku, Mistral 7b/8x7b, Cohere’s Command R and R+, and Meta’s Llama 3.1 models. This allows customers to select the model that best suits their specific use case and requirements.

The following screenshot shows how the AskAI feature allows MaestroQA’s customers to use the wide variety of FMs available on Amazon Bedrock to ask open-ended questions such as “What are some of the common issues in these tickets?” and generate useful insights from customer service interactions.

Product screenshot

To handle the high volume of customer interaction transcripts and provide low-latency responses, MaestroQA takes advantage of the cross-Region inference capabilities of Amazon Bedrock. Originally, they were doing the load balancing themselves, distributing requests between available AWS US Regions (us-east-1, us-west-2, and so on) and available EU Regions (eu-west-3, eu-central-1, and so on) for their North American and European customers, respectively. Now, the cross-Region inference capability of Amazon Bedrock enables MaestroQA to achieve twice the throughput compared to single-Region inference, a critical factor in scaling their solution to accommodate more customers. MaestroQA’s team no longer has to spend time and effort to predict their demand fluctuations, which is especially key when usage increases for their ecommerce customers around the holiday season. Cross-Region inference dynamically routes traffic across multiple Regions, providing optimal availability for each request and smoother performance during these high-usage periods. MaestroQA monitors this setup’s performance and reliability using Amazon CloudWatch.
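
In practice, cross-Region inference is used by invoking a geography-prefixed inference profile ID instead of a single-Region model ID, as in the sketch below; the specific profile ID is an example and availability varies by model and Region.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-3")

# The "eu." prefix identifies a cross-Region inference profile, letting Amazon Bedrock
# route the request across Regions in that geography on your behalf.
response = bedrock.converse(
    modelId="eu.anthropic.claude-3-5-sonnet-20240620-v1:0",  # example inference profile ID
    messages=[{"role": "user", "content": [{"text": "Summarize this support ticket."}]}],
)
print(response["output"]["message"]["content"][0]["text"])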

Benefits: How Amazon Bedrock added value

Amazon Bedrock has enabled MaestroQA to innovate faster and gain a competitive advantage by offering their customers powerful generative AI features for analyzing customer interaction transcripts. With Amazon Bedrock, MaestroQA can now provide their customers with the ability to run open-ended queries across millions of transcripts, unlocking valuable insights that were previously inaccessible.

The broad choice of FMs available through Amazon Bedrock allows MaestroQA to cater to their customers’ diverse needs and preferences. Customers can select the model that best aligns with their specific use case, finding the right balance between performance and price.

The scalability and cross-Region inference capabilities of Amazon Bedrock enable MaestroQA to handle high volumes of customer interaction transcripts while maintaining low latency, regardless of their customers’ geographical locations.

MaestroQA takes advantage of the robust security features and ethical AI practices of Amazon Bedrock to bolster customer confidence. These measures make sure that client data remains secure during processing and isn’t used for model training by third-party providers. Additionally, Amazon Bedrock availability in Europe, coupled with its geographic control capabilities, allows MaestroQA to seamlessly extend AI services to European customers. This expansion is achieved without introducing additional complexities, thereby maintaining operational efficiency while adhering to Regional data regulations.

The adoption of Amazon Bedrock proved to be a game changer for MaestroQA’s compact development team. Its serverless architecture allowed the team to rapidly prototype and refine their application without the burden of managing complex hardware infrastructure. This shift enabled MaestroQA to channel their efforts into optimizing application performance rather than grappling with resource allocation. Moreover, Amazon Bedrock offers seamless compatibility with their existing AWS environment, allowing for a smooth integration process and further streamlining their development workflow. MaestroQA was able to use their existing authentication process with AWS Identity and Access Management (IAM) to securely authenticate their application to invoke large language models (LLMs) within Amazon Bedrock. They were also able to use the familiar AWS SDK to quickly and effortlessly integrate Amazon Bedrock into their application.
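
To make this concrete, the following minimal sketch shows the kind of AWS SDK call involved: a boto3 request to the Amazon Bedrock Converse API that asks a model to analyze a transcript. The model ID, prompt, and Region are illustrative placeholders rather than MaestroQA’s actual implementation, and credentials are resolved through the standard IAM mechanisms described above.

import boto3

# Credentials come from the standard IAM chain (role, profile, or environment)
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

transcript = "Customer: My order arrived damaged..."  # placeholder transcript text

response = bedrock_runtime.converse(
    # Cross-Region inference profile ID shown for illustration; any Amazon Bedrock model ID works here
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": f"What are the main issues raised in this transcript?\n\n{transcript}"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])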

Overall, by using Amazon Bedrock, MaestroQA is able to provide their customers with a powerful and flexible solution for extracting valuable insights from their customer interaction data, driving continuous improvement in their products and support processes.

Success metrics

The early results have been remarkable.

A lending company uses MaestroQA to detect compliance risks on 100% of their conversations. Before, agents would raise internal escalations if a consumer complained about the loan or expressed being in a vulnerable state. However, this process was manual and error prone, and the lending company would miss many of these risks. Now, they are able to detect compliance risks with almost 100% accuracy.

A medical device company that is required to report device issues to the FDA no longer relies solely on agents to internally escalate customer-reported issues; it now uses this service to analyze all of its conversations and make sure every complaint is flagged.

An education company has been able to replace their manual survey scores with an automated customer sentiment score that increased their sample size from 15% to 100% of conversations.

The best is yet to come.

Conclusion

Using AWS, MaestroQA was able to innovate faster and gain a competitive advantage. Companies from different industries such as financial services, healthcare and life sciences, and EdTech all share the common desire to provide better customer services for their clients. MaestroQA was able to enable them to do that by quickly pivoting to offer powerful generative AI features that solved tangible business problems and enhanced overall compliance.

Check out MaestroQA’s feature AskAI and their LLM-powered AI Classifiers if you’re interested in better understanding your customer conversations and survey scores. For more about Amazon Bedrock, see Get started with Amazon Bedrock and learn about features such as cross-Region inference to help scale your generative AI features globally.


About the Authors

Carole Suarez is a Senior Solutions Architect at AWS, where she helps guide startups through their cloud journey. Carole specializes in data engineering and holds an array of AWS certifications on a variety of topics including analytics, AI, and security. She is passionate about learning languages and is fluent in English, French, and Tagalog.

Ben Gruher is a Generative AI Solutions Architect at AWS, focusing on startup customers. Ben graduated from Seattle University where he obtained bachelor’s and master’s degrees in Computer Science and Data Science.

Harrison Hunter is the CTO and co-founder of MaestroQA, where he leads the engineering and product teams. Prior to MaestroQA, Harrison studied computer science and AI at MIT.

Read More

Optimize hosting DeepSeek-R1 distilled models with Hugging Face TGI on Amazon SageMaker AI

Optimize hosting DeepSeek-R1 distilled models with Hugging Face TGI on Amazon SageMaker AI

DeepSeek-R1, developed by AI startup DeepSeek AI, is an advanced large language model (LLM) distinguished by its innovative, multi-stage training process. Instead of relying solely on traditional pre-training and fine-tuning, DeepSeek-R1 integrates reinforcement learning to achieve more refined outputs. The model employs a chain-of-thought (CoT) approach that systematically breaks down complex queries into clear, logical steps. Additionally, it uses NVIDIA’s parallel thread execution (PTX) constructs to boost training efficiency, and a combined framework of supervised fine-tuning (SFT) and group relative policy optimization (GRPO) makes sure its results are both transparent and interpretable.

In this post, we demonstrate how to optimize hosting DeepSeek-R1 distilled models with Hugging Face Text Generation Inference (TGI) on Amazon SageMaker AI.

Model Variants

The current DeepSeek model collection consists of the following models:

  • DeepSeek-V3 – An LLM that uses a Mixture-of-Experts (MoE) architecture. MoE models like DeepSeek-V3 and Mixtral replace the standard feed-forward neural network in transformers with a set of parallel sub-networks called experts. These experts are selectively activated for each input, allowing the model to efficiently scale to a much larger size without a corresponding increase in computational cost. For example, DeepSeek-V3 is a 671-billion-parameter model, but only 37 billion parameters (approximately 5%) are activated during the output of each token. DeepSeek-V3-Base is the base model from which the R1 variants are derived.
  • DeepSeek-R1-Zero – A fine-tuned variant of DeepSeek-V3 based on using reinforcement learning to guide CoT reasoning capabilities, without any SFT done prior. According to the DeepSeek R1 paper, DeepSeek-R1-Zero excelled at reasoning behaviors but encountered challenges with readability and language mixing.
  • DeepSeek-R1 – Another fine-tuned variant of DeepSeek-V3-Base, built similarly to DeepSeek-R1-Zero, but with a multi-step training pipeline. DeepSeek-R1 starts with a small amount of cold-start data prior to the GRPO process. It also incorporates SFT data through rejection sampling, combined with supervised data generated from DeepSeek-V3 to retrain DeepSeek-V3-base. After this, the retrained model goes through another round of RL, resulting in the DeepSeek-R1 model checkpoint.
  • DeepSeek-R1-Distill – Distilled variants of Alibaba’s Qwen and Meta’s Llama models, based on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. The distilled model variants are a result of fine-tuning Qwen or Llama models through knowledge distillation, where DeepSeek-R1 acts as the teacher and Qwen or Llama as the student. These models retain their existing architecture while gaining additional reasoning capabilities through the distillation process. They are exclusively fine-tuned using SFT and don’t incorporate any RL techniques.

The following figure illustrates the performance of DeepSeek-R1 compared to other state-of-the-art models on standard benchmark tests, such as MATH-500, MMLU, and more.

Hugging Face Text Generation Inference (TGI)

Hugging Face Text Generation Inference (TGI) is a high-performance, production-ready inference framework optimized for deploying and serving large language models (LLMs) efficiently. It is designed to handle the demanding computational and latency requirements of state-of-the-art transformer models, including Llama, Falcon, Mistral, Mixtral, and GPT variants. For a full list of models supported by TGI, refer to supported models.

Amazon SageMaker AI provides a managed way to deploy TGI-optimized models, offering deep integration with Hugging Face’s inference stack for scalable and cost-efficient LLM deployment. To learn more about Hugging Face TGI support on Amazon SageMaker AI, refer to the announcement post and the documentation on deploying models to Amazon SageMaker AI.

Key Optimizations in Hugging Face TGI

Hugging Face TGI is built to address the challenges associated with deploying large-scale text generation models, such as inference latency, token throughput, and memory constraints.

The benefits of using TGI include:

  • Tensor parallelism – Splits large models across multiple GPUs for efficient memory utilization and computation
  • Continuous batching – Dynamically batches multiple inference requests to maximize token throughput and reduce latency
  • Quantization – Lowers memory usage and computational cost by converting model weights to lower-precision formats such as INT8, INT4, or FP8
  • Speculative decoding – Uses a smaller draft model to speed up token prediction while maintaining accuracy
  • Key-value cache optimization – Reduces redundant computations for faster response times in long-form text generation
  • Token streaming – Streams tokens in real time for low-latency applications like chatbots and virtual assistants

Runtime TGI Arguments

TGI containers support runtime configurations that provide greater control over LLM deployments. These configurations allow you to adjust settings such as quantization, model parallel size (tensor parallel size), maximum tokens, data type (dtype), and more using container environment variables.

Notable runtime parameters influencing your model deployment include:

  1. HF_MODEL_ID: This parameter specifies the identifier of the model to load, which can be a model ID from the Hugging Face Hub (e.g., meta-llama/Llama-3.2-11B-Vision-Instruct) or an Amazon Simple Storage Service (Amazon S3) URI containing the model files.
  2. HF_TOKEN: This parameter provides the access token required to download gated models from the Hugging Face Hub, such as Llama or Mistral.
  3. SM_NUM_GPUS: This parameter specifies the number of GPUs to use for model inference, allowing the model to be sharded across multiple GPUs for improved performance.
  4. MAX_CONCURRENT_REQUESTS: This parameter controls the maximum number of concurrent requests that the server can handle, effectively managing the load and ensuring optimal performance.
  5. DTYPE: This parameter sets the data type for the model weights during loading, with options like float16 or bfloat16, influencing the model’s memory consumption and computational performance.

There are additional optional runtime parameters that are already pre-optimized in TGI containers to maximize performance on host hardware. However, you can modify them to exercise greater control over your LLM inference performance:

  1. MAX_TOTAL_TOKENS: This parameter sets the upper limit on the combined number of input and output tokens a deployment can handle per request, effectively defining the “memory budget” for client interactions.
  2. MAX_INPUT_TOKENS: This parameter specifies the maximum number of tokens allowed in the input prompt of each request, controlling the length of user inputs to manage memory usage and ensure efficient processing.
  3. MAX_BATCH_PREFILL_TOKENS: This parameter caps the total number of tokens processed during the prefill stage across all batched requests, a phase that is both memory-intensive and compute-bound, thereby optimizing resource utilization and preventing out-of-memory errors.

For a complete list of runtime configurations, please refer to text-generation-launcher arguments.
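
As an illustration (not a tuned recommendation), these runtime parameters are supplied to the TGI container as environment variables. A sketch of such a configuration, with example values you would adapt to your own model and hardware, might look like the following; the dictionary is passed as env= when constructing the model object shown later in this post.

# Example TGI runtime configuration passed as container environment variables (illustrative values)
tgi_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "SM_NUM_GPUS": "1",              # shard the model across this many GPUs
    "DTYPE": "bfloat16",             # data type used to load the weights
    "MAX_INPUT_TOKENS": "3072",      # longest allowed prompt per request
    "MAX_TOTAL_TOKENS": "4096",      # prompt plus completion budget per request
    "MAX_CONCURRENT_REQUESTS": "8",  # server-side concurrency cap
}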

DeepSeek Deployment Patterns with TGI on Amazon SageMaker AI

Amazon SageMaker AI offers a simple and streamlined approach to deploy DeepSeek-R1 models with just a few lines of code. Additionally, SageMaker endpoints support automatic load balancing and autoscaling, enabling your LLM deployment to scale dynamically based on incoming requests. During non-peak hours, the endpoint can scale down to zero, optimizing resource usage and cost efficiency.

The table below summarizes all DeepSeek-R1 models available on the Hugging Face Hub, as uploaded by the original model provider, DeepSeek.

Model # Total Params # Activated Params Context Length Download
DeepSeek-R1-Zero 671B 37B 128K 🤗 deepseek-ai/DeepSeek-R1-Zero
DeepSeek-R1 671B 37B 128K 🤗 deepseek-ai/DeepSeek-R1

DeepSeek AI also offers distilled versions of its DeepSeek-R1 model to offer more efficient alternatives for various applications.

Model Base Model Download
DeepSeek-R1-Distill-Qwen-1.5B Qwen2.5-Math-1.5B 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
DeepSeek-R1-Distill-Qwen-7B Qwen2.5-Math-7B 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Llama-8B Llama-3.1-8B 🤗 deepseek-ai/DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Qwen-14B Qwen2.5-14B 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
DeepSeek-R1-Distill-Qwen-32B Qwen2.5-32B 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
DeepSeek-R1-Distill-Llama-70B Llama-3.3-70B-Instruct 🤗 deepseek-ai/DeepSeek-R1-Distill-Llama-70B

There are two ways to deploy LLMs, such as DeepSeek-R1 and its distilled variants, on Amazon SageMaker:

Option 1: Direct Deployment from Hugging Face Hub

The easiest way to host DeepSeek-R1 in your AWS account is by deploying it (along with its distilled variants) using TGI containers. These containers simplify deployment through straightforward runtime environment specifications. The architecture diagram below shows a direct download from the Hugging Face Hub, ensuring seamless integration with Amazon SageMaker.

The following code shows how to deploy the DeepSeek-R1-Distill-Llama-8B model to a SageMaker endpoint, directly from the Hugging Face Hub.

import sagemaker
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)

role = sagemaker.get_execution_role()
session = sagemaker.Session()

# select the latest 3+ version container 
deploy_image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="3.0.1"
)

deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        ...
    },
    role=role,
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model"  # optional
)

pretrained_tgi_predictor = deepseek_tgi_model.deploy(
    endpoint_name="deepseek-r1-llama-8b-endpoint",  # optional
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=False,  # set to True to wait for the endpoint to reach InService
)

To deploy other distilled models, simply update the HF_MODEL_ID to any of the DeepSeek distilled model variants, such as deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, or deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
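
Once the endpoint reaches the InService state, you can invoke it with the SageMaker runtime client. The following is a minimal sketch that assumes the endpoint name used above and the typical TGI request/response format; adjust the generation parameters to your use case.

import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Explain the difference between supervised fine-tuning and reinforcement learning.",
    "parameters": {"max_new_tokens": 512, "temperature": 0.6, "top_p": 0.9},
}

response = smr_client.invoke_endpoint(
    EndpointName="deepseek-r1-llama-8b-endpoint",  # endpoint created above
    ContentType="application/json",
    Body=json.dumps(payload),
)

# TGI typically returns a list with a single generated_text field
print(json.loads(response["Body"].read())[0]["generated_text"])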

Option 2: Deployment from a Private S3 Bucket

To deploy models privately within your AWS account, upload the DeepSeek-R1 model weights to an S3 bucket and set HF_MODEL_ID to the corresponding S3 bucket prefix. TGI will then retrieve and deploy the model weights from S3, eliminating the need for internet downloads during each deployment. This approach reduces model loading latency by keeping the weights closer to your SageMaker endpoints and enables your organization’s security teams to perform vulnerability scans before deployment. SageMaker endpoints also support auto scaling, allowing DeepSeek-R1 to scale horizontally based on incoming request volume while seamlessly integrating with elastic load balancing.

Deploying a DeepSeek-R1 distilled model variant from S3 follows the same process as option 1, with one key difference: HF_MODEL_ID points to the S3 bucket prefix instead of the Hugging Face Hub. Before deployment, you must first download the model weights from the Hugging Face Hub and upload them to your S3 bucket.
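
One way to perform that preparation step (shown here as a sketch with a placeholder bucket name) is to download the weights with the huggingface_hub library and upload them with the SageMaker Python SDK’s S3 utilities:

from huggingface_hub import snapshot_download
from sagemaker.s3 import S3Uploader

# Download the model weights from the Hugging Face Hub to a local directory
local_dir = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    local_dir="./DeepSeek-R1-Distill-Llama-8B",
)

# Upload the weights to your private S3 bucket (placeholder URI)
s3_model_uri = S3Uploader.upload(
    local_path=local_dir,
    desired_s3_uri="s3://my-model-bucket/path/to/model",
)

After the upload completes, deploy as before, pointing HF_MODEL_ID at the S3 location: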

deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "s3://my-model-bucket/path/to/model",
        ...
    },
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"]
    },
    role=role,
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model-s3"  # optional
)

Deployment Best Practices

The following are some best practices to consider when deploying DeepSeek-R1 models on SageMaker AI:

  • Deploy within a private VPC – It’s recommended to deploy your LLM endpoints inside a virtual private cloud (VPC) and behind a private subnet, preferably with no egress. See the following code:
deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        ...
    },
    role=role,
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"]
    },
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model"  # optional
)
  • Implement guardrails for safety and compliance – Always apply guardrails to validate incoming and outgoing model responses for safety, bias, and toxicity. You can use Amazon Bedrock Guardrails to enforce these protections on your SageMaker endpoint responses.
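
For example, independently of where the model is hosted, you can call the Amazon Bedrock ApplyGuardrail API to check an endpoint response before returning it to users. The guardrail ID and version below are placeholders, and this is a sketch rather than a complete integration:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

model_output = "..."  # text returned by your SageMaker endpoint

guardrail_response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="<your-guardrail-id>",  # placeholder
    guardrailVersion="1",                       # placeholder
    source="OUTPUT",                            # use "INPUT" to validate prompts instead
    content=[{"text": {"text": model_output}}],
)

if guardrail_response["action"] == "GUARDRAIL_INTERVENED":
    print("Response was blocked or modified by the guardrail")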

Inference Performance Evaluation

This section presents examples of the inference performance of DeepSeek-R1 distilled variants on Amazon SageMaker AI. Evaluating LLM performance across key metrics—end-to-end latency, throughput, and resource efficiency—is crucial for ensuring responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly enhances user experience, system reliability, and deployment feasibility at scale.

All DeepSeek-R1 Qwen (1.5B, 7B, 14B, 32B) and Llama (8B, 70B) variants are evaluated against four key performance metrics:

  1. End-to-End Latency
  2. Throughput (Tokens per Second)
  3. Time to First Token
  4. Inter-Token Latency

Please note that the main purpose of this performance evaluation is to give you an indication of the relative performance of the distilled R1 models on different hardware. We didn’t try to optimize the performance for each model/hardware/use case combination. These results should not be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets and input/output sequence lengths.

If you are interested in running this evaluation job inside your own account, refer to our code on GitHub.

Scenarios

We tested the following scenarios:

  • Tokens – Two input token lengths were used to evaluate the performance of DeepSeek-R1 distilled variants hosted on SageMaker endpoints. Each test was executed 100 times, with concurrency set to 1, and the average values across key performance metrics were recorded. All models were run with dtype=bfloat16.
    • Short-length test – 512 input tokens, 256 output tokens.
    • Medium-length test – 3,072 input tokens, 256 output tokens.
  • Hardware – Several instance families were tested, including p4d (NVIDIA A100), g5 (NVIDIA A10G), g6 (NVIDIA L4), and g6e (NVIDIA L40s), each equipped with 1, 4, or 8 GPUs per instance. For additional details regarding pricing and instance specifications, refer to Amazon SageMaker AI pricing. In the following table, a green cell indicates a model was tested on a specific instance type, and a red cell indicates the model wasn’t tested because the instance type or size was excessive for the model or unable to run because of insufficient GPU memory.

Box Plots

In the following sections we use box plots to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers, using a box for the middle 50% of the data with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness.

DeepSeek-R1-Distill-Qwen-1.5B

A single GPU instance is sufficient to host one or multiple concurrent DeepSeek-R1-Distill-Qwen-1.5B serving workers on TGI. In this test, a single worker was deployed, and performance was evaluated across the four outlined metrics. Results show that the ml.g5.xlarge outperforms the ml.g6.xlarge across all metrics.

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge. The ml.g6e.xlarge instance performed the best, followed by ml.g5.2xlarge, ml.g5.xlarge, and the ml.g6 instances.

DeepSeek-R1-Distill-Llama-8B

Similar to the 7B variant, DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge, with ml.g6e.xlarge demonstrating the highest performance among all instances.

DeepSeek-R1-Distill-Qwen-14B

DeepSeek-R1-Distill-Qwen-14B was tested on ml.g5.12xlarge, ml.g6.12xlarge, ml.g6e.2xlarge, and ml.g6e.xlarge. The ml.g5.12xlarge exhibited the highest performance, followed by ml.g6.12xlarge. Although at a lower performance profile, DeepSeek-R1-Distill-Qwen-14B can also be deployed on the single-GPU g6e instances because of their larger GPU memory.

DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1-Distill-Qwen-32B requires more than 48GB of memory, making ml.g5.12xlarge, ml.g6.12xlarge, and ml.g6e.12xlarge suitable for performance comparison. In this test, ml.g6e.12xlarge delivered the highest performance, followed by ml.g5.12xlarge, with ml.g6.12xlarge ranking third.

DeepSeek-R1-Distill-Llama-70B

DeepSeek-R1-Distill-Llama-70B was tested on ml.g5.48xlarge, ml.g6.48xlarge, ml.g6e.12xlarge, ml.g6e.48xlarge, and ml.p4d.24xlarge. The best performance was observed on ml.p4d.24xlarge, followed by ml.g6e.48xlarge, ml.g6e.12xlarge, ml.g5.48xlarge, and finally ml.g6.48xlarge.

Clean Up

To avoid incurring cost after completing your evaluation, ensure you delete the endpoints you created earlier.

import boto3

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=<region>)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
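
Depending on how the endpoint was created, you may also want to remove the associated endpoint configuration and model. The following sketch assumes the default SageMaker SDK behavior in which the endpoint configuration shares the endpoint’s name, and reuses the model name from earlier in this post:

# Optionally remove the endpoint configuration and model as well
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sagemaker_client.delete_model(ModelName="deepseek-r1-llama-8b-model")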

Conclusion

In this blog, you learned about the current versions of DeepSeek models and how to use the Hugging Face TGI containers to simplify the deployment of DeepSeek-R1 distilled models (or any other LLM) on Amazon SageMaker AI to just a few lines of code. You also learned how to deploy models directly from the Hugging Face Hub for quick experimentation and from a private S3 bucket to provide enhanced security and model deployment performance.

Finally, you saw an extensive evaluation of all DeepSeek-R1 distilled models across four key inference performance metrics using 13 different NVIDIA accelerator instance types. This analysis provides valuable insights to help you select the optimal instance type for your DeepSeek-R1 deployment. All code used to analyze the DeepSeek-R1 distilled model variants is available on GitHub.


About the Authors

Pranav Murthy is a Worldwide Technical Lead and Sr. GenAI Data Scientist at AWS. He helps customers build, train, deploy, evaluate, and monitor Machine Learning (ML), Deep Learning (DL), and Generative AI (GenAI) workloads on Amazon SageMaker. Pranav specializes in multimodal architectures, with deep expertise in computer vision (CV) and natural language processing (NLP). Previously, he worked in the semiconductor industry, developing AI/ML models to optimize semiconductor processes using state-of-the-art techniques. In his free time, he enjoys playing chess, training models to play chess, and traveling. You can find Pranav on LinkedIn.

Simon Pagezy is a Cloud Partnership Manager at Hugging Face, dedicated to making cutting-edge machine learning accessible through open source and open science. With a background in AI/ML consulting at AWS, he helps organizations leverage the Hugging Face ecosystem on their platform of choice.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.

Read More

Exploring creative possibilities: A visual guide to Amazon Nova Canvas

Exploring creative possibilities: A visual guide to Amazon Nova Canvas

Compelling AI-generated images start with well-crafted prompts. In this follow-up to our Amazon Nova Canvas Prompt Engineering Guide, we showcase a curated gallery of visuals generated by Nova Canvas—categorized by real-world use cases—from marketing and product visualization to concept art and design exploration.

Each image is paired with the prompt and parameters that generated it, providing a practical starting point for your own AI-driven creativity. Whether you’re crafting specific types of images, optimizing workflows, or simply seeking inspiration, this guide will help you unlock the full potential of Amazon Nova Canvas.

Solution overview

Getting started with Nova Canvas is straightforward. You can access the model through the Image Playground on the AWS Management Console for Amazon Bedrock, or through APIs. For detailed setup instructions, including account requirements and necessary permissions, visit our documentation on Creative content generation with Amazon Nova. Our previous post on prompt engineering best practices provides comprehensive guidance on crafting effective prompts.

A visual guide to Amazon Nova Canvas

In this gallery, we showcase a diverse range of images and the prompts used to generate them, highlighting how Amazon Nova Canvas adapts to various use cases—from marketing and product design to storytelling and concept art.

All images that follow were generated using Nova Canvas at a 1280x720px resolution with a CFG scale of 6.5, seed of 0, and the Premium setting for image quality. This resolution also matches the image dimensions expected by Nova Reel, allowing you to take these images into Amazon Nova Reel to experiment with video generation.
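
If you prefer to work programmatically rather than in the Image Playground, the following sketch shows one way to reproduce these settings with the Amazon Bedrock InvokeModel API. The request shape reflects the Nova Canvas text-to-image format as we understand it, and the prompt is taken from the gallery that follows:

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

request_body = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": "Dramatic wide-angle shot of a rugged mountain range at sunset, with a lone tree silhouetted in the foreground, creating a striking focal point."
    },
    "imageGenerationConfig": {
        "width": 1280,
        "height": 720,
        "cfgScale": 6.5,
        "seed": 0,
        "quality": "premium",
        "numberOfImages": 1,
    },
}

response = bedrock_runtime.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps(request_body),
)

# The response contains base64-encoded images
image_b64 = json.loads(response["body"].read())["images"][0]
with open("nova_canvas_output.png", "wb") as f:
    f.write(base64.b64decode(image_b64))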

Landscapes

Overhead perspective of winding river delta, capturing intricate branching waterways and sediment patterns. Soft morning light revealing subtle color gradations between water and land. Revealing landscape’s hidden fluid dynamics from bird’s-eye view. Sparse arctic tundra landscape at twilight, expansive white terrain with isolated rock formations silhouetted against a deep blue sky. Low-contrast black and white composition capturing the infinite horizon, with subtle purple hues in the shadows. Ultra-wide-angle perspective emphasizing the vastness of negative space and geological simplicity.
Wide-angle aerial shot of patchwork agricultural terrain at golden hour, with long shadows accentuating the texture and topography of the land. Emphasis on the interplay of light and shadow across the geometric field divisions. Dynamic drone perspective of a dramatic shoreline at golden hour, capturing long shadows cast by towering sea stacks and coastal cliffs. Hyper-detailed imagery showcasing the interplay of warm sunlight on rocky textures and the cool, foamy edges of incoming tides.
Dramatic wide-angle shot of a rugged mountain range at sunset, with a lone tree silhouetted in the foreground, creating a striking focal point. Wide-angle capture of a hidden beach cove, surrounded by towering cliffs, with a shipwreck partially visible in the shallow waters.

Character portraits

A profile view of a weathered fisherman, silhouetted against a pastel dawn sky. The rim lighting outlines the shape of his beard and the texture of his knit cap. Rendered with high contrast to emphasize the rugged contours of his face and the determined set of his jaw.
A weathered fisherman with a thick gray beard and a knit cap, framed against the backdrop of a misty harbor at dawn. The image captures him in a medium shot, revealing more of his rugged attire. Cool, blue tones dominate the scene, contrasting with the warm highlights on his face. An intimate portrait of a seasoned fisherman, his face filling the frame. His thick gray beard is flecked with sea spray, and his knit cap is pulled low over his brow. The warm glow of sunset bathes his weathered features in golden light, softening the lines of his face while still preserving the character earned through years at sea. His eyes reflect the calm waters of the harbor behind him.
A seaside cafe at sunrise, with a seasoned barista’s silhouette visible through the window. Their kind smile is illuminated by the warm glow of the rising sun, creating a serene atmosphere. The image has a dreamy, soft-focus quality with pastel hues.
A dynamic profile shot of a barista in motion, captured mid-conversation with a customer. Their smile is genuine and inviting, with laugh lines accentuating their seasoned experience. The cafe’s interior is rendered in soft bokeh, maintaining the cinematic feel with a shallow depth of field. A front-facing portrait of an experienced barista, their welcoming smile framed by the sleek espresso machine. The background bustles with blurred cafe activity, while the focus remains sharp on the barista’s friendly demeanor. The lighting is contrasty, enhancing the cinematic mood.

Fashion photography

A model with sharp cheekbones and platinum pixie cut in a distressed leather bomber jacket stands amid red smoke in an abandoned subway tunnel. Wide-angle lens, emphasizing tunnel’s converging lines, strobed lighting creating a sense of motion.
A model with sharp cheekbones and platinum pixie cut wears a distressed leather bomber jacket, posed against a stark white cyclorama. Low-key lighting creates deep shadows, emphasizing the contours of her face. Shot from a slightly lower angle with a medium format camera, highlighting the jacket’s texture. Close-up portrait of a model with defined cheekbones and a platinum pixie cut, emerging from an infinity pool while wearing a wet distressed leather bomber jacket. Shot from a low angle with a tilt-shift lens, blurring the background for a dreamy fashion magazine aesthetic.
A model with sharp cheekbones and platinum pixie cut is wearing a distressed leather bomber jacket, caught mid-laugh at a backstage fashion show. Black and white photojournalistic style, natural lighting. Side profile of a model with defined cheekbones and a platinum pixie cut, standing still amidst the chaos of Chinatown at midnight. The distressed leather bomber jacket contrasts with the blurred neon lights in the background, creating a sense of urban solitude.

Product photography

A flat lay featuring a premium matte metal water bottle with bamboo accents, placed on a textured linen cloth. Eco-friendly items like a cork notebook, a sprig of eucalyptus, and a reusable straw are arranged around it. Soft, natural lighting casts gentle shadows, emphasizing the bottle’s matte finish and bamboo details. The background is an earthy tone like beige or light gray, creating a harmonious and sustainable composition. Angled perspective of the premium water bottle with bamboo elements, positioned on a natural jute rug. Surrounding it are earth-friendly items: a canvas tote bag, a stack of recycled paper notebooks, and a terracotta planter with air-purifying plants. Warm, golden hour lighting casts long shadows, emphasizing textures and creating a cozy, sustainable atmosphere. The scene evokes a sense of eco-conscious home or office living.
An overhead view of the water bottle’s bamboo cap, partially unscrewed to reveal the threaded metal neck. Soft, even lighting illuminates the entire scene, showcasing the natural variations in the bamboo’s color and grain. The bottle’s matte metal body extends out of frame, creating a minimalist composition that draws attention to the sustainable materials and precision engineering. An angled view of a premium matte metal water bottle with bamboo accents, showcasing its sleek profile. The background features a soft blur of a serene mountain lake. Golden hour sunlight casts a warm glow on the bottle’s surface, highlighting its texture. Captured with a shallow depth of field for product emphasis.
A pair of premium over-ear headphones with a matte black finish and gold accents, arranged in a flat lay on a clean white background. Organic leaves for accents. small notepad, pencils, and a carrying case are neatly placed beside the headphones, creating a symmetrical and balanced composition. Bright, diffused lighting eliminates shadows, emphasizing the sleek design without distractions. A shadowless, crisp aesthetic.
An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting accentuates the curves and edges, casting subtle shadows that highlight the product’s premium build quality. An extreme macro shot focusing on the junction where the leather ear cushion meets the metallic housing of premium over-ear headphones. Sharp details reveal the precise stitching and material textures, while selective focus isolates this area against a softly blurred, dark background, showcasing the product’s premium construction.
An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting casts long shadows, accentuating the curves of the headband and the depth of the ear cups against a minimalist white background. A dynamic composition of premium over-ear headphones floating in space, with the headband and ear cups slightly separated to showcase individual components. Rim lighting outlines each piece, while a gradient background adds depth and sophistication.
A smiling student holding up her smartphone, displaying a green matte screen for easy image replacement, in a classroom setting. Overhead view of a young man typing on a laptop with a green matte screen, surrounded by work materials on a wooden table.

Food photography

Monochromatic macarons arranged in precise geometric pattern. Strong shadow play. Architectural lighting. Minimal composition.
A pyramid of macarons in ombre pastels, arranged on a matte black slate surface. Dramatic side lighting from left. Close-up view highlighting texture of macaron shells. Garnished with edible gold leaf accents. Shot at f/2 aperture for shallow depth of field. Disassembled macaron parts in zero-g chamber. Textured cookie halves, viscous filling streams, and scattered almond slivers drifting. High-contrast lighting with subtle shadows on off-white. Wide-angle shot showcasing full dispersal pattern.

Architectural design

A white cubic house with floor-to-ceiling windows, interior view from living room. Double-height space, floating steel staircase, polished concrete floors. Late afternoon sunbeams streaming across minimal furnishings. Ultra-wide architectural lens. A white cubic house with floor-to-ceiling windows, kitchen and dining space. Monolithic marble island, integrated appliances, dramatic shadows from skylight above. Shot from a low angle with a wide-angle lens, emphasizing the height and openness of the space, late afternoon golden hour light streaming in.
An angular white modernist house featuring expansive glass walls, photographed for Architectural Digest’s cover. Misty morning atmosphere, elongated infinity pool creating a mirror image, three-quarter aerial view, lush coastal vegetation framing the scene.
A white cubic house with floor-to-ceiling windows presented as detailed architectural blueprints. Site plan view showing landscaping and property boundaries, technical annotations, blue background with white lines, precise measurements and zoning specifications visible. A white cubic house with floor-to-ceiling windows in precise isometric projection. X-ray style rendering revealing internal framework, electrical wiring, and plumbing systems. Technical cross-hatching on load-bearing elements and foundation.

Concept art

A stylized digital painting of a bustling plaza in a futuristic eco-city, with soft impressionistic brushstrokes. Crystalline towers frame the scene, while suspended gardens create a canopy overhead. Holographic displays and eco-friendly vehicles add life to the foreground. Dreamlike and atmospheric, with glowing highlights in sapphire and rose gold. A stylized digital painting of an elevated park in a futuristic eco-city, viewed from a high angle, with soft impressionistic brushstrokes. Crystalline towers peek through a canopy of trees, while winding elevated walkways connect floating garden platforms. People relax in harmony with nature. Dreamlike and atmospheric, with glowing highlights in jade and amber.
Concept art of a floating garden platform in a futuristic city, viewed from below. Translucent roots and hanging vines intertwine with advanced technology, creating a mesmerizing canopy. Soft bioluminescent lights pulse through the vegetation, casting ethereal patterns on the ocean’s surface. A gradient of deep purples and blues dominates the twilight sky.
An enchanted castle atop a misty cliff at sunrise, warm golden light bathing the ivy-covered spires. A wide-angle view capturing a flock of birds soaring past the tallest tower, set against a dramatic sky with streaks of orange and pink. Mystical ambiance and dynamic composition. A magical castle rising from morning fog on a rugged cliff face, bathed in cool blue twilight. A low-angle shot showcasing the castle’s imposing silhouette against a star-filled sky, with a crescent moon peeking through wispy clouds. Mysterious mood and vertical composition emphasizing height.
An enchanted fortress clinging to a mist-shrouded cliff, caught in the moment between night and day. A panoramic view from below, revealing the castle’s reflection in a tranquil lake at the base of the cliff. Ethereal pink and purple hues in the sky, with a V-formation of birds flying towards the castle. Serene atmosphere and balanced symmetry.

Illustration

Japanese ink wash painting of a cute baby dragon with pearlescent mint-green scales and tiny wings curled up in a nest made of cherry blossom petals. Delicate brushstrokes, emphasis on negative space. Art nouveau-inspired composition centered on an endearing dragon hatchling with gleaming mint-green scales. Sinuous morning glory stems and blossoms intertwine around the subject, creating a harmonious balance. Soft, dreamy pastels and characteristic decorative elements frame the scene.
Watercolor scene of a cute baby dragon with pearlescent mint-green scales crouched at the edge of a garden puddle, tiny wings raised. Soft pastel flowers and foliage frame the composition. Loose, wet-on-wet technique for a dreamy atmosphere, with sunlight glinting off ripples in the puddle.
A playful, hand-sculpted claymation-style baby dragon with pearlescent mint scales and tiny wings, sitting on a puffy marshmallow cloud. Its soft, rounded features and expressive googly eyes give it a lively, mischievous personality as it giggles and flaps its stubby wings, trying to take flight in a candy-colored sky. A whimsical, animated-style render of a baby dragon with pearlescent mint scales nestled in a bed of oversized, bioluminescent flowers. The floating island garden is bathed in the warm glow of sunset, with fireflies twinkling like stars. Dynamic lighting accentuates the dragon’s playful expression.

Graphic design

A set of minimalist icons for a health tracking app. Dual-line design with 1.5px stroke weight on solid backgrounds. Each icon uses teal for the primary line and a lighter shade for the secondary line, with ample negative space. Icons maintain consistent 64x64px dimensions with centered compositions. Clean, professional aesthetic suitable for both light and dark modes. Stylized art deco icons for fitness tracking. Geometric abstractions of health symbols with gold accents. Balanced designs incorporating circles, triangles, and zigzag motifs. Clean and sophisticated.
Set of charming wellness icons for digital health tracker. Organic, hand-drawn aesthetic with soft, curvy lines. Uplifting color combination of lemon yellow and fuchsia pink. Subtle size variations among icons for a dynamic, handcrafted feel.
Lush greenery tapestry in 16:9 panoramic view. Detailed monstera leaves overlap in foreground, giving way to intricate ferns and tendrils. Emerald and sage watercolor washes create atmospheric depth. Foliage density decreases towards center, suggesting an enchanted forest clearing. Modern botanical line drawing in 16:9 widescreen. Forest green single-weight outlines of stylized foliage. Negative space concentrated in the center for optimal text placement. Geometric simplification of natural elements with a focus on curves and arcs.
3D sculptural typography spelling out “BRAVE” with each letter made from a different material, arranged in a dynamic composition. Experimental typographic interpretation of “BRAVE” using abstract, interconnected geometric shapes that flow and blend organically. Hyper-detailed textures reminiscent of fractals and natural patterns create a mesmerizing, otherworldly appearance with sharp contrast.
A dreamy photograph overlaid with delicate pen-and-ink drawings, blending reality and fantasy to reveal hidden magic in ordinary moments. Surreal digital collage blending organic and technological elements in a futuristic style.
Abstract figures emerging from digital screens, gradient color transitions, mixed textures, dynamic composition, conceptual narrative style. Abstract humanoid forms materializing from multiple digital displays, vibrant color gradients flowing between screens, contrasting smooth and pixelated textures, asymmetrical layout with visual tension, surreal storytelling aesthetic.
Abstract figures emerging from digital screens, glitch art aesthetic with RGB color shifts, fragmented pixel clusters, high contrast scanlines, deep shadows cast by volumetric lighting.

Conclusion

The examples showcased here are just the beginning of what’s possible with Amazon Nova Canvas. For even greater control, you can guide generations with reference images, use custom color palettes, or make precise edits—such as swapping backgrounds or refining details—with simple inputs. Plus, with built-in safeguards such as watermarking and content moderation, Nova Canvas offers a responsible and secure creative experience. Whether you’re a professional creator, a marketing team, or an innovator with a vision, Nova Canvas provides the tools to bring your ideas to life.

We invite you to explore these possibilities yourself and discover how Nova Canvas can transform your creative process. Stay tuned for our next installment, where we’ll dive into the exciting world of video generation with Amazon Nova Reel.

Ready to start creating? Visit the Amazon Bedrock console today and bring your ideas to life with Nova Canvas. For more information about features, specifications, and additional examples, explore our documentation on creative content generation with Amazon Nova.


About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world class design. As Sr. Solutions Architect within Amazon AGI, he influences the development of Amazon’s first-party generative AI models. Kris is passionate about empowering users and creators of all types with generative AI tools and knowledge.

Sanju Sunny is a Generative AI Design Technologist with AWS Prototyping & Cloud Engineering (PACE), specializing in strategy, engineering, and customer experience. He collaborates with customers across diverse industries, leveraging Amazon’s customer-obsessed innovation mechanisms to rapidly conceptualize, validate, and prototype innovative products, services, and experiences.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Read More

Benchmarking Amazon Nova and GPT-4o models with FloTorch

Benchmarking Amazon Nova and GPT-4o models with FloTorch

Based on original post by Dr. Hemant Joshi, CTO, FloTorch.ai

A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI’s GPT-4o.

Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.

Using its enterprise software, FloTorch conducted an extensive comparison between Amazon Nova models and OpenAI’s GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch’s evaluation focused on three critical factors—latency, accuracy, and cost—across five diverse topics.

Key findings from the benchmark study:

  • GPT-4o demonstrated a slight advantage in accuracy over Amazon Nova Pro
  • Amazon Nova Pro outperformed GPT-4o in efficiency, operating 21.97% faster while being 65.26% more cost-effective
  • Amazon Nova Micro and Amazon Nova Lite outperformed GPT-4o-mini by 2 percentage points in accuracy
  • In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 10% and 56.59% cheaper than GPT-4o-mini, respectively
  • Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 48% and 26.60% improvements, respectively

In this post, we discuss the findings from this benchmarking in more detail.

The growing need for cost-effective AI models

The landscape of generative AI is rapidly evolving. OpenAI launched GPT-4o in May 2024, and Amazon introduced Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.

Large language models (LLMs) are generally proficient in responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.

One of the most critical applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is a crucial requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.

To better serve the enterprise customers, the evaluation aimed to answer three key questions:

  • How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
  • How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini in these same metrics?
  • How well do these models handle RAG use cases across different industry domains?

By addressing these questions, the evaluation provides enterprises with actionable insights into selecting the right AI models for their specific needs—whether optimizing for speed, accuracy, or cost-efficiency.

Overview of the CRAG benchmark dataset

The CRAG dataset was released by Meta for testing RAG systems with factual queries across five domains and eight question types, providing a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.

Domain Question Question Type
Sports Can you carry less than the maximum number of clubs during a round of golf? simple
Music Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)? simple_w_condition
Open Can i make cookies in an air fryer? simple
Finance Did meta have any mergers or acquisitions in 2022? simple_w_condition
Movie In 2016, which movie was distinguished for its visual effects at the oscars? simple_w_condition

The evaluation considered 200 queries from this dataset, representing the five domains and two question types: simple and simple_w_condition. Both types of questions are common from users, and a typical Google search for a query such as “Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)?” will not give you the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top five search result pages for each query. These five webpages act as a knowledge base (source data) to limit the RAG model’s response. The goal is to index these five webpages dynamically using a common embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.

Evaluation setup

The RAG evaluation pipeline consists of the several key components, as illustrated in the following diagram.

In this section, we explore each component in more detail.

Knowledge base

FloTorch used the top five HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. HTML pages were parsed to extract text for the embedding stage.

Chunking strategy

FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (four characters is usually around one token) and a 10% overlap between chunks. Further experiments with different chunking strategies, chunk sizes, and percentage overlaps will be done in the coming weeks, and this post will be updated with the results.
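
As an illustration of this strategy (not FloTorch’s exact implementation), a fixed-size chunker with 10% overlap can be approximated on character counts using the rough four-characters-per-token heuristic:

def chunk_text(text: str, chunk_tokens: int = 512, overlap_pct: float = 0.10, chars_per_token: int = 4):
    """Split text into fixed-size chunks with a percentage overlap (approximate, character-based)."""
    chunk_size = chunk_tokens * chars_per_token  # ~512 tokens is roughly 2,048 characters
    step = int(chunk_size * (1 - overlap_pct))   # each step leaves a 10% overlap with the previous chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("...parsed webpage text...")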

Embedding strategy

FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1024. With a maximum input token limit of 8,192 for the model, the system successfully embedded chunks from the knowledge base source data as well as short queries from the CRAG dataset efficiently. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.
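
The following minimal sketch shows how a chunk or query can be embedded with Amazon Titan Text Embeddings V2 through the Amazon Bedrock InvokeModel API. It is a simplified stand-in for the create_embeddings_with_titan_bedrock helper referenced in the retrieval code later in this post:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_text(text: str, dimensions: int = 1024, normalize: bool = True):
    """Return a Titan Text Embeddings V2 vector for the given text."""
    body = json.dumps({"inputText": text, "dimensions": dimensions, "normalize": normalize})
    response = bedrock_runtime.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    return json.loads(response["body"].read())["embedding"]

query_embedding = embed_text("Did meta have any mergers or acquisitions in 2022?")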

Vector database

FloTorch selected Amazon OpenSearch Service as the vector database for its high-performance metrics. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Each provisioned node was an r7g.4xlarge, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HNSW indexing in OpenSearch Service.
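
For reference, an HNSW-enabled k-NN index in OpenSearch Service can be created with a mapping similar to the following sketch. The index name, engine, and distance metric here are illustrative assumptions rather than FloTorch’s exact configuration:

from opensearchpy import OpenSearch

# Client construction (authentication details omitted)
client = OpenSearch(hosts=[{"host": "<your-opensearch-domain-endpoint>", "port": 443}], use_ssl=True)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # matches the Titan Text Embeddings V2 output size
                "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
            },
        }
    },
}

client.indices.create(index="crag-knowledge-base", body=index_body)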

Retrieval (and reranking) strategy

FloTorch used a retrieval strategy with a k-nearest neighbor (k-NN) of five for retrieved chunks. The experiments excluded reranking algorithms to make sure retrieved chunks remained consistent for both models when inferring the answer to the provided query. The following code snippet embeds the given query and passes the embeddings to the search function:

import os
import logging
from typing import List

logger = logging.getLogger(__name__)

# create_embeddings_with_titan_bedrock and search are helper functions from the evaluation code
def search_results(interaction_ids: List[str], queries: List[str], k: int):
    """Retrieve search results for queries."""
    results = []
    embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
    normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

    for interaction_id, query in zip(interaction_ids, queries):
        try:
            _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
            results.append(search(interaction_id + '_titan', embedding, k))
        except Exception as e:
            logger.error(f"Error processing query {query}: {e}")
            results.append(None)
    return results

Inferencing

FloTorch accessed the GPT-4o model through the OpenAI API and the Amazon Nova Pro model through the Amazon Bedrock Converse API. GPT-4o supports a context window of 128,000 tokens, compared to a 300,000-token context window for Amazon Nova Pro. The maximum output token limit of GPT-4o is 16,384, vs. 5,000 for Amazon Nova Pro. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls using the same function and to track token count and latency metrics uniformly. The inference function code is as follows:

from tqdm import tqdm

# load_data_in_batches, search_results, and send_batch_request are defined elsewhere in the evaluation code
def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                       max_tokens: int, search_k: int, system_prompt: str):
    """Generate responses for queries."""
    results = []

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
        interaction_ids = [item["interaction_id"] for item in batch]
        queries = [item["query"] for item in batch]
        search_results_list = search_results(interaction_ids, queries, search_k)

        for i, item in enumerate(batch):
            item["search_results"] = search_results_list[i]

        responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)

        for i, response in enumerate(responses):
            results.append({
                "interaction_id": interaction_ids[i],
                "query": queries[i],
                "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
                "response_time": response.get("response_time") if response else None,
                "response": response,
            })

    return results

Evaluation

Both models were evaluated by running batch queries. A batch of eight was selected to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:

import time
import logging
from typing import Dict, List

import requests

logger = logging.getLogger(__name__)

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                       system_prompt: str):
    """Send batch queries to the API."""
    headers = {"Authorization": auth_header, "Content-Type": "application/json"}
    responses = []

    for item in batch:
        query = item["query"]
        query_time = item["query_time"]  # retained from the dataset record (not used below)
        retrieval_results = item.get("search_results", [])

        # Build the prompt from the retrieved chunks
        references = "# References\n" + "\n".join(
            [f"Reference {_idx + 1}:\n{res['text']}\n" for _idx, res in enumerate(retrieval_results)])
        user_message = f"{references}\n------\n\nUsing only the references listed above, answer the following question:\nQuestion: {query}\n"

        payload = {
            "model": model_name,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "max_tokens": max_tokens,
        }

        try:
            start_time = time.time()
            response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
            response.raise_for_status()
            response_json = response.json()
            response_json['response_time'] = time.time() - start_time
            responses.append(response_json)
        except requests.RequestException as e:
            logger.error(f"API request failed for query: {query}. Error: {e}")
            responses.append(None)

    return responses

Benchmarking on the CRAG dataset

In this section, we discuss the latency, accuracy, and cost measurements of benchmarking on the CRAG dataset.

Latency

Latency measurements for each query response were calculated as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and a second timestamp when the entire response is received from the inference endpoint. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring rapid response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, the evaluation focused on measuring out-of-the-box latency performance for both models.

Accuracy

FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the evaluation script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications to the script enabled monitoring of input and output tokens and latency as described earlier.

Cost

Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o have separately published prices per million input and output tokens. The calculation methodology involved multiplying input tokens by the corresponding rate and applying the same process for output tokens. The total cost for running 200 queries was determined by combining input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the cost comparison focused solely on the inference level between the Amazon Nova Pro and GPT-4o LLMs.
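The following snippet is a minimal sketch of that calculation; the per-million-token rates and token counts shown are placeholders for illustration, not the actual published prices or measured values.

def inference_cost(input_tokens: int, output_tokens: int,
                   input_rate_per_million: float, output_rate_per_million: float) -> float:
    # Cost in dollars for a single query: tokens divided by one million, times the published rate
    return (input_tokens / 1_000_000) * input_rate_per_million + \
           (output_tokens / 1_000_000) * output_rate_per_million

# Example with placeholder values: 1,500 input tokens and 450 output tokens per query
per_query_cost = inference_cost(1500, 450, input_rate_per_million=0.80, output_rate_per_million=3.20)
total_cost_200_queries = 200 * per_query_cost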

Results

The following table summarizes the results.

Metric | Amazon Nova Pro | GPT-4o | Observation
Accuracy on subset of the CRAG dataset | 51.50% (103 correct responses out of 200) | 53.00% (106 correct responses out of 200) | GPT-4o outperforms Amazon Nova Pro by 1.5 percentage points on accuracy
Cost for running inference for 200 queries | $0.00030205 | $0.000869537 | Amazon Nova Pro saves 65.26% in costs compared to GPT-4o
Average latency (seconds) | 1.682539835 | 2.15615045 | Amazon Nova Pro is 21.97% faster than GPT-4o
Average of input and output tokens | 1946.621359 | 1782.707547 | Typical GPT-4o responses are shorter than Amazon Nova Pro responses

For simple queries, Amazon Nova Pro and GPT-4o have similar accuracies (55 and 56 correct responses, respectively) but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you are part of an organization running an AI assistant service that handles 1,000 questions per month from 10,000 users (10,000,000 queries per month). Amazon Nova Pro will save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.
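As a rough check of that figure, the following sketch reproduces the savings arithmetic, treating the cost values in the preceding table as average per-query inference costs (an assumption about how the published figure was derived).

# Assumption: the table's cost values behave as average per-query inference costs
nova_pro_cost_per_query = 0.00030205
gpt4o_cost_per_query = 0.000869537
queries_per_month = 10_000_000  # 1,000 questions per month x 10,000 users

monthly_savings = (gpt4o_cost_per_query - nova_pro_cost_per_query) * queries_per_month
annual_savings = monthly_savings * 12
print(f"Monthly savings: ${monthly_savings:,.2f}")  # roughly $5,674.87
print(f"Annual savings: ${annual_savings:,.2f}")    # roughly $68,098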

Let’s look at similar results for Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.

Metric | Amazon Nova Lite | Amazon Nova Micro | GPT-4o mini | Observation
Accuracy on subset of the CRAG dataset | 52.00% (104 correct responses out of 200) | 54.00% (108 correct responses out of 200) | 50.00% (100 correct responses out of 200) | Both Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini, by 2 and 4 percentage points, respectively
Cost for running inference for 200 queries | $0.00002247 (56.59% cheaper than GPT-4o mini) | $0.000013924 (73.10% cheaper than GPT-4o mini) | $0.000051768 | Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively
Average latency (seconds) | 1.553371465 (26.60% faster than GPT-4o mini) | 1.6828564 (20.48% faster than GPT-4o mini) | 2.116291895 | Amazon Nova models are at least 20% faster than GPT-4o mini
Average of input and output tokens | 1930.980769 | 1940.166667 | 1789.54 | GPT-4o mini returns shorter answers

Amazon Nova Micro is significantly faster and less expensive than GPT-4o mini while providing more accurate answers. If you are running a service that handles about 10 million queries each month, switching from GPT-4o mini will save on average 73% of your inference costs while also improving accuracy.

Conclusion

Based on these tests for RAG cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency compared to GPT-4o and GPT-4o mini models. FloTorch is continuing further experimentation with other relevant LLMs for comparison. Future research will include additional experiments with various query types such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About FloTorch

FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch’s mission is to help enterprises make data-driven decisions in the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version that enables scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version runs in a customer’s AWS account, so you can experiment in your own AWS account with your proprietary data. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with a model registry to define custom LLMs and a recommendation engine to suggest new LLMs and agent workflows. For more information, contact us at info@flotorch.ai.


About the author

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant is engaged with customers to implement state-of-the-art generative AI solutions and agentic workflows for enterprises.

Read More

Deploy DeepSeek-R1 distilled models on Amazon SageMaker using a Large Model Inference container

Deploy DeepSeek-R1 distilled models on Amazon SageMaker using a Large Model Inference container

DeepSeek-R1 is a large language model (LLM) developed by DeepSeek AI that uses reinforcement learning (RL) to enhance reasoning capabilities through a multi-stage training process from a DeepSeek-V3-Base foundation. A key distinguishing feature is its reinforcement learning step, which was used to refine the model’s responses beyond the standard pre-training and fine-tuning process. By incorporating RL, DeepSeek-R1 can adapt more effectively to user feedback and objectives, ultimately enhancing both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, meaning it’s equipped to break down complex queries and reason through them in a step-by-step manner. This guided reasoning process allows the model to produce more accurate, transparent, and detailed answers. This model combines RL-based fine-tuning with CoT capabilities, aiming to generate structured responses while focusing on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows such as agents, logical reasoning, and data interpretation tasks.

DeepSeek-R1 uses a Mixture of Experts (MoE) architecture and is 671 billion parameters in size. The MoE architecture allows activation of 37 billion parameters, enabling efficient inference by routing queries to the most relevant expert clusters. This approach allows the model to specialize in different problem domains while maintaining overall efficiency.

DeepSeek-R1 distilled models bring the reasoning capabilities of the main R1 model to more efficient architectures based on popular open models like Meta’s Llama (8B and 70B) and Alibaba Cloud’s Qwen (1.5B, 7B, 14B, and 32B). Distillation refers to a process of training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model, using it as a teacher model. For example, DeepSeek-R1-Distill-Llama-8B offers an excellent balance of performance and efficiency. By integrating this model with Amazon SageMaker AI, you can benefit from the AWS scalable infrastructure while maintaining high-quality language model capabilities.

In this post, we show how to use the distilled models in SageMaker AI, which offers several options to deploy the distilled versions of the R1 model.

Solution overview

You can use DeepSeek’s distilled models within the AWS managed machine learning (ML) infrastructure. We demonstrate how to deploy these models on SageMaker AI inference endpoints.

SageMaker AI offers a choice of which serving container to use for deployments:

  • LMI container – A Large Model Inference (LMI) container with different backends (vLLM, TensorRT-LLM, and Neuron). See the following GitHub repo for more details.
  • TGI container – A Hugging Face Text Generation Interface (TGI) container. You can find more details in the following GitHub repo.

In the following code snippets, we use the LMI container example. See the following GitHub repo for more deployment examples using TGI, TensorRT-LLM, and Neuron.

LMI containers

LMI containers are a set of high-performance Docker containers purpose built for LLM inference. With these containers, you can use high-performance open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on SageMaker endpoints. These containers bundle together a model server with open source inference libraries to deliver an all-in-one LLM serving solution.

LMI containers provide many features, including:

  • Optimized inference performance for popular model architectures like Meta Llama, Mistral, Falcon, and more
  • Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
  • Continuous batching for maximizing throughput at high concurrency
  • Token streaming
  • Quantization through AWQ, GPTQ, FP8, and more
  • Multi-GPU inference using tensor parallelism
  • Serving LoRA fine-tuned models
  • Text embedding to convert text data into numeric vectors
  • Speculative decoding support to decrease latency

LMI containers provide these features through integrations with popular inference libraries. A unified configuration format enables you to use the latest optimizations and technologies across libraries. To learn more about the LMI components, see Components of LMI.

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For details, refer to Create an AWS account.

If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host the DeepSeek-R1-Distill-Llama-8B model on an ml.g5.2xlarge SageMaker hosting instance.

Deploy DeepSeek-R1 for inference

The following is a step-by-step example that demonstrates how to programmatically deploy DeepSeek-R1-Distill-Llama-8B for inference. The code for deploying the model is provided in the GitHub repo. You can clone the repo and run the notebook from SageMaker AI Studio.

  1. Configure the SageMaker execution role and import the necessary libraries:
!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

import json
import boto3
import sagemaker

# Set up IAM Role
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

There are two ways to deploy an LLM like DeepSeek-R1 or its distilled variants on SageMaker:

  • Deploy uncompressed model weights from an Amazon S3 bucket – In this scenario, you need to set the HF_MODEL_ID variable to the Amazon Simple Storage Service (Amazon S3) prefix that has model artifacts. This method is generally much faster, with the model typically downloading in just a couple of minutes from Amazon S3.
  • Deploy directly from Hugging Face Hub (requires internet access) – To do this, set HF_MODEL_ID to the Hugging Face repository or model ID (for example, “deepseek-ai/DeepSeek-R1-Distill-Llama-8B”). However, this method tends to be slower and can take significantly longer to download the model compared to using Amazon S3. This approach will not work if enable_network_isolation is enabled, because it requires internet access to retrieve model artifacts from the Hugging Face Hub.
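For the first option, a minimal sketch of the corresponding configuration might look like the following; the S3 bucket and prefix are placeholders for your own artifact location.

# Hedged sketch: deploying from uncompressed weights stored in Amazon S3
# (the bucket and prefix below are placeholders)
vllm_config_s3 = {
    "HF_MODEL_ID": "s3://my-model-artifacts-bucket/DeepSeek-R1-Distill-Llama-8B/",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
}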
  2. In this example, we deploy the model directly from the Hugging Face Hub:
vllm_config = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
}

The OPTION_MAX_ROLLING_BATCH_SIZE parameter limits the number of concurrent requests that can be processed by the endpoint. We set it to 16 to limit GPU memory requirements; adjust it based on your latency and throughput requirements.

  3. Create and deploy the model:
# Create a Model object
lmi_model = sagemaker.Model(
    image_uri = inference_image_uri,
    env = vllm_config,
    role = role,
    name = model_name,
    enable_network_isolation=True, # Ensures model is isolated from the internet
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"]
    }
)
# Deploy to SageMaker
lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.g5.2xlarge",
    container_startup_health_check_timeout = 1600,
    endpoint_name = endpoint_name,
)
  4. Make inference requests:
sagemaker_client = boto3.client('sagemaker-runtime', region_name='us-east-1')
# endpoint_name was defined earlier and passed to lmi_model.deploy()

input_payload = {
    "inputs": "What is Amazon SageMaker? Answer concisely.",
    "parameters": {"max_new_tokens": 250, "temperature": 0.1}
}

serialized_payload = json.dumps(input_payload)

response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=serialized_payload
)
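The response body comes back as a stream; a minimal sketch of reading it might look like the following (the exact keys in the payload depend on the container configuration).

# Read and decode the streamed response body returned by invoke_endpoint
result = json.loads(response['Body'].read().decode('utf-8'))
print(result)  # for the LMI/vLLM configuration above, the generated text is typically under "generated_text"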

Performance and cost considerations

The ml.g5.2xlarge instance provides a good balance of performance and cost. For large-scale real-time inference, use larger batch sizes to optimize cost and performance. You can also use batch transform for offline, large-volume inference to reduce costs. Monitor endpoint usage to optimize costs.

Clean up

Clean up your resources when they’re no longer needed:

predictor = sagemaker.Predictor(endpoint_name=endpoint_name)
predictor.delete_endpoint()

Security

You can configure advanced security and infrastructure settings for the DeepSeek-R1 model, including virtual private cloud (VPC) networking, service role permissions, encryption settings, and EnableNetworkIsolation to restrict internet access. For production deployments, it’s essential to review these settings to maintain alignment with your organization’s security and compliance requirements.

By default, the model runs in a shared AWS managed VPC with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.

SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We do not share your data with model providers, unless you direct us to, providing you full control over your data. This applies to all models—both proprietary and publicly available, including DeepSeek-R1 on SageMaker.

For more details, see Configure security in Amazon SageMaker AI.

Logging and monitoring

You can monitor SageMaker AI using Amazon CloudWatch, which collects and processes raw data into readable, near real-time metrics. These metrics are retained for 15 months, allowing you to analyze historical trends and gain deeper insights into your application’s performance and health.

Additionally, you can configure alarms to monitor specific thresholds and trigger notifications or automated actions when those thresholds are met, helping you proactively manage your deployment.
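As an illustration, a minimal sketch of creating such an alarm with Boto3 might look like the following; the alarm name, endpoint name, and threshold are placeholders.

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Alarm when average model latency exceeds 2 seconds (ModelLatency is reported in microseconds)
# over three consecutive 5-minute periods. Names and thresholds are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName='deepseek-endpoint-high-latency',
    Namespace='AWS/SageMaker',
    MetricName='ModelLatency',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'my-deepseek-endpoint'},
        {'Name': 'VariantName', 'Value': 'AllTraffic'},
    ],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=2_000_000,
    ComparisonOperator='GreaterThanThreshold',
)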

For more details, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.

Best practices

It’s always recommended to deploy your LLM endpoints inside your VPC and behind a private subnet, without internet gateways, and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.

Always apply guardrails to make sure incoming and outgoing model responses are validated for safety, bias, and toxicity. You can guard your SageMaker endpoints model responses with Amazon Bedrock Guardrails. See DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart for more details.
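As one possible pattern, you can check a model response against a pre-created guardrail using the Amazon Bedrock ApplyGuardrail API before returning it to the user. The following is a hedged sketch; the guardrail ID, version, and response text are placeholders.

import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

# Placeholder for the text returned by your SageMaker endpoint
model_response_text = "..."

# Validate the response against an existing Amazon Bedrock guardrail
# (guardrailIdentifier and guardrailVersion are placeholders for your own guardrail)
guardrail_response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier='my-guardrail-id',
    guardrailVersion='1',
    source='OUTPUT',  # validate model output; use 'INPUT' for user prompts
    content=[{'text': {'text': model_response_text}}],
)

if guardrail_response['action'] == 'GUARDRAIL_INTERVENED':
    # Use the guardrail's sanitized output instead of the raw model response
    model_response_text = guardrail_response['outputs'][0]['text']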

Inference performance evaluation

In this section, we focus on inference performance of DeepSeek-R1 distilled variants on SageMaker AI. Evaluating the performance of LLMs in terms of end-to-end latency, throughput, and resource efficiency is crucial for providing responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly impacts user experience, system reliability, and deployment feasibility at scale. For this post, we test all DeepSeek-R1 distilled variants—1.5B, 7B, 8B, 14B, 32B, and 70B—across four performance metrics:

  • End-to-end latency (time between sending a request and receiving the response)
  • Token throughput (tokens per second)
  • Time to first token
  • Inter-token latency

The main purpose of this performance evaluation is to give you an indication of the relative performance of distilled R1 models on different hardware for generic traffic patterns. We didn’t try to optimize the performance for each model/hardware/use case combination. These results should not be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets, traffic patterns, and input/output sequence lengths.
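For reference, the following sketch shows how these four metrics can be derived from per-token arrival timestamps recorded while streaming a response; it is illustrative only and is not the harness used for these benchmarks.

import time

def measure_streaming_metrics(token_stream, request_start: float):
    # token_stream is assumed to yield generated tokens as they arrive;
    # request_start is the time.time() value captured just before sending the request.
    arrival_times = []
    for _token in token_stream:
        arrival_times.append(time.time())

    end_to_end_latency = arrival_times[-1] - request_start
    time_to_first_token = arrival_times[0] - request_start
    inter_token_latencies = [
        later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])
    ]
    avg_inter_token_latency = sum(inter_token_latencies) / max(len(inter_token_latencies), 1)
    token_throughput = len(arrival_times) / end_to_end_latency  # tokens per second

    return end_to_end_latency, time_to_first_token, avg_inter_token_latency, token_throughput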

Scenarios

We tested the following scenarios:

  • Container/model configuration – We used LMI container v14 with default parameters, except MAX_MODEL_LEN, which was set to 10000 (no chunked prefill and no prefix caching). On instances with multiple accelerators, we sharded the model across all available GPUs.
  • Tokens – We evaluated SageMaker endpoint hosted DeepSeek-R1 distilled variants on performance benchmarks using two sample input token lengths. We ran both tests 50 times each before measuring the average across the different metrics. Then we repeated the test with concurrency 10.
    • Short-length test – 512 input tokens and 256 output tokens.
    • Medium-length test – 3072 input tokens and 256 output tokens.
  • Hardware – We tested the distilled variants on a variety of instance types with 1, 4, or 8 GPUs per instance. In the following table, a green cell indicates that a model was tested on that particular instance type, and red indicates that a model wasn’t tested with that instance type, either because the instance was excessive for a given model size or too small to fit the model in memory.

Deployment options

Box plots

In the following sections, we use a box plot to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data, with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness, as illustrated in the following figure.

Box plot example
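To produce a similar visualization for your own measurements, a minimal matplotlib sketch might look like the following; the latency samples and instance names are made up for illustration.

import matplotlib.pyplot as plt

# Made-up end-to-end latency samples (seconds) for two hypothetical instance types
latency_samples = {
    "ml.g5.xlarge": [2.1, 2.3, 2.2, 2.5, 2.4, 3.1, 2.2],
    "ml.g6.xlarge": [2.6, 2.8, 2.7, 3.0, 2.9, 3.6, 2.8],
}

fig, ax = plt.subplots()
ax.boxplot(list(latency_samples.values()), labels=list(latency_samples.keys()))
ax.set_ylabel("End-to-end latency (s)")
ax.set_title("Latency distribution by instance type")
plt.show()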

DeepSeek-R1-Distill-Qwen-1.5B

This model can be deployed on a single GPU instance. The results indicate that the ml.g5.xlarge instance outperforms the ml.g6.xlarge instance across all measured performance criteria and concurrency settings.

The following figure illustrates testing with concurrency = 1.

Qwen-1.5-C1

The following figure illustrates testing with concurrency = 10.

Qwen-1.5-C10

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.2xlarge and ml.g6e.2xlarge. Of the two, ml.g6e.2xlarge demonstrated the higher performance.

The following figure illustrates testing with concurrency = 1.

Qwen-7-C1

The following figure illustrates testing with concurrency = 10.

Qwen-7-C10

DeepSeek-R1-Distill-Llama-8B

DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.2xlarge, ml.g5.12xlarge, ml.g6e.2xlarge, and ml.g6e.12xlarge, with ml.g6e.12xlarge demonstrating the highest performance among all instances.

The following figure illustrates testing with concurrency = 1.

Llama-8B-C1

The following figure illustrates testing with concurrency = 10.

Llama-8B-C10

DeepSeek-R1-Distill-Qwen-14B

We tested this model on ml.g6.12xlarge, ml.g5.12xlarge, ml.g6e.48xlarge, and ml.g6e.12xlarge. The instance with 8 GPUs (ml.g6e.48xlarge) showed the best results.

The following figure illustrates testing with concurrency = 1.

Qwen-14B-C1

The following figure illustrates testing with concurrency = 10.

Qwen-14B-C10

DeepSeek-R1-Distill-Qwen-32B

This is a fairly large model, and we only deployed it on multi-GPU instances: ml.g6.12xlarge, ml.g5.12xlarge, and ml.g6e.12xlarge. The latest generation (ml.g6e.12xlarge) showed the best performance across all concurrency settings.

The following figure illustrates testing with concurrency = 1.

Qwen-32B-C1

The following figure illustrates testing with concurrency = 10.

Qwen-32B-C10

DeepSeek-R1-Distill-Llama-70B

We tested this model on two different 8-GPU instances: ml.g6e.48xlarge and ml.p4d.24xlarge. The latter showed the best performance.

The following figure illustrates testing with concurrency = 1.

Llama-70B-C1

The following figure illustrates testing with concurrency = 10.

Llama-70B-C10

Conclusion

Deploying DeepSeek models on SageMaker AI provides a robust solution for organizations seeking to use state-of-the-art language models in their applications. The combination of DeepSeek’s powerful models and SageMaker AI managed infrastructure offers a scalable and efficient approach to natural language processing tasks.

The performance evaluation section presents a comprehensive performance evaluation of all DeepSeek-R1 distilled models across four key inference metrics, using 13 different NVIDIA accelerator instance types. This analysis offers valuable insights to assist in the selection of the optimal instance type for deploying the DeepSeek-R1 solution.

Check out the complete code in the following GitHub repos:

For additional resources, refer to:


About the Authors

Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

Read More

From fridge to table: Use Amazon Rekognition and Amazon Bedrock to generate recipes and combat food waste

From fridge to table: Use Amazon Rekognition and Amazon Bedrock to generate recipes and combat food waste

In today’s fast-paced world, time is of the essence, and even basic tasks like grocery shopping can feel rushed and challenging. Despite our best intentions to plan meals and shop accordingly, we often end up ordering takeout, leaving unused perishable items to spoil in the refrigerator. This seemingly small issue of wasted groceries, paired with the about-to-perish grocery supplies thrown away by grocery stores, contributes significantly to the global food waste problem. This post demonstrates how we can help solve this problem by harnessing the power of generative AI on AWS.

By using computer vision capabilities through Amazon Rekognition and the content generation capabilities offered by foundation models (FMs) available through Amazon Bedrock, we developed a solution that will recommend recipes based on what you already have in your refrigerator and an inventory of about-to-expire items in local supermarkets, making sure that both food in your home and food in grocery stores are used, saving money and reducing waste.

In this post, we walk through how to build the FoodSavr solution (fictitious name used for the purposes of this post) using Amazon Rekognition Custom Labels to detect the ingredients and generate personalized recipes using Anthropic’s Claude 3.0 on Amazon Bedrock. We demonstrate an end-to-end architecture where a user can upload an image of their fridge, and using the ingredients found there (detected by Amazon Rekognition), the solution will give them a list of recipes (generated by Amazon Bedrock). The architecture also recognizes missing ingredients and provides the user with a list of nearby grocery stores.

Solution overview

The following reference architecture shows how you can use Amazon Bedrock, Amazon Rekognition, and other AWS services to implement the FoodSavr solution.

As shown in the preceding figure, the architecture includes the following steps:

  1. For an end-to-end solution, we recommend having a frontend where your users can upload images of items that they want detected and labeled. To learn more about frontend deployment on AWS, see Front-end Web & Mobile on AWS.
  2. The picture taken by the user is stored in an Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket should be configured with a lifecycle policy that deletes the image after use. To learn more about S3 lifecycle policies, see Managing your storage lifecycle.
  3. This architecture uses different AWS Lambda functions. Lambda is a serverless AWS compute service that runs event-driven code and automatically manages the compute resources. The first Lambda function, DetectIngredients, harnesses the power of Amazon Rekognition by using the Boto3 Python API. Amazon Rekognition is a cutting-edge computer vision service that uses machine learning (ML) models to analyze the uploaded images.
  4. We use Rekognition Custom Labels to train a model with a dataset of ingredients. You can adopt this architecture to use Rekognition Custom Labels with your own use case. With the aid of custom labels trained to recognize various ingredients, Amazon Rekognition identifies the items present in the images.
  5. The detected ingredient names are then securely stored in an Amazon DynamoDB (a fully managed NoSQL database service) table for retrieval and modification. Users are presented with a list of the ingredients that have been detected, along with the option of adding other ingredients or deleting ingredients that they might not want or were misidentified.
  6. After the ingredient list is confirmed by the user through the web interface, they can initiate the recipe generation process with a click of a button. This action invokes another Lambda function called GenerateRecipes, which uses the advanced language capabilities of the Amazon Bedrock API (Anthropic’s Claude v3 in this post). This state-of-the-art FM analyzes the confirmed ingredient list retrieved from DynamoDB and generates relevant recipes tailored to those specific ingredients. Additionally, the model provides images to accompany each recipe, providing a visually appealing and inspiring culinary experience.
  7. Amazon Bedrock contains two key FMs that are used for this solution example: Anthropic’s Claude v3 (newer versions have been released since the writing of this post) and Stable Diffusion, used for recipe generation and image generation respectively. For this solution, you can use any combination of FMs that suit your use case. The generated content (recipes as text and recipe images, in this case) can then be displayed to the user on the frontend.
  8. For this use case, you can also set up an optional ordering pipeline, which allows a user to place orders for the ingredients described by the FMs. This would be fronted by a Lambda function, FindGroceryItems, that can look for the recommended grocery items in a database contributed to by local supermarkets. This database would consist of about-to-expire ingredients along with prices for those ingredients.

In the following sections, we dive into how you can set up this architecture on your own account. Step 8 is optional and therefore not covered in this post.

Using Amazon Rekognition to detect images

The image recognition is powered by Amazon Rekognition, which offers pre-trained and customizable computer vision capabilities to allow users to obtain information and insights from their images. For customizability, you can use Rekognition Custom Labels to identify scenes and objects in your images that are specific to your business needs. If your images are already labeled, you can begin training a model from the Amazon Rekognition console. Otherwise, you can label them directly from the Amazon Rekognition labeling interface, or use other services such as Amazon SageMaker Ground Truth. The following screenshot shows an example of what the bounding box process would look like on the Amazon Rekognition labeling interface.

To get started with labeling, see Using Amazon Rekognition Custom Labels and Amazon A2I for detecting pizza slices and augmenting predictions. For this architecture, we collected a dataset of up to 70 images of common food items typically found in refrigerators. We recommend that you gather your own relevant images and store them in an S3 bucket to use for training with Amazon Rekognition. You can then use Rekognition Custom Labels to create labels with food names, and assign bounding boxes on the images so the model knows where to look. To get started with training your own custom model, see Training an Amazon Rekognition Custom Labels model.

When model training is complete, you will see all your trained models under Projects on the AWS Management Console for Amazon Rekognition. Here, you can also look at the model performance, measured by the F1 score (shown in the following screenshot).

You can also iterate and modify your existing models to create newer versions. Before using your model, make sure it’s in the RUNNING state. To use the model, choose the model you want to use, and on the Use model tab, choose Start.

You also have the option to programmatically start and stop your model (the exact API call can be copied from the Amazon Rekognition console, but the following is provided as an example):
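# Example of starting your model; MODEL_ARN is a placeholder for your model version ARN
aws rekognition start-project-version \
  --project-version-arn "MODEL_ARN" \
  --min-inference-units 1 \
  --region us-east-1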

Use the following API call (which is present in the Lambda function) to detect groceries in an image using your custom labels and custom models:

aws rekognition detect-custom-labels \
  --project-version-arn "MODEL_ARN" \
  --image '{"S3Object": {"Bucket": "MY_BUCKET","Name": "PATH_TO_MY_IMAGE"}}' \
  --region us-east-1

To stop incurring costs, you can also stop your model when not in use:

aws rekognition stop-project-version \
  --project-version-arn "MODEL_ARN" \
  --region us-east-1

Because we’re using Python, the boto3 Python package is used to make all AWS API calls mentioned in this post. For more information about Boto3, see the Boto3 documentation.
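For reference, the equivalent Boto3 call for the detection step looks like the following sketch; the ARN, bucket, and object key are placeholders.

import boto3

rekognition = boto3.client('rekognition', region_name='us-east-1')

# Equivalent of the detect-custom-labels CLI call shown earlier (values are placeholders)
response = rekognition.detect_custom_labels(
    ProjectVersionArn='MODEL_ARN',
    Image={'S3Object': {'Bucket': 'MY_BUCKET', 'Name': 'PATH_TO_MY_IMAGE'}},
    MinConfidence=70,
)

for label in response['CustomLabels']:
    print(label['Name'], label['Confidence'])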

Starting a model might take a few minutes to complete. To check the current status of the model readiness, check the details page for the project or use DescribeProjectVersions. Wait for the model status to change to RUNNING.

In the meantime, you can explore the different statistics provided by Amazon Rekognition about your model. Some notable ones are the model performance (F1 score), precision, and recall. These statistics are gathered by Amazon Rekognition at both the model level (as seen in the earlier screenshot) and the individual custom label level (as shown in the following screenshot).

For more information on these statistics, see Metrics for evaluating your model.

Be aware that, while Anthropic’s Claude models offer impressive multi-modal capabilities for understanding and generating content based on text and images, we chose to use Amazon Rekognition Custom Labels for ingredient detection in this solution. Amazon Rekognition is a specialized computer vision service optimized for tasks such as object detection and image classification, using state-of-the-art models trained on massive datasets. Additionally, Rekognition Custom Labels allows us to train custom models tailored to recognize specific food items and ingredients, providing a level of customization that might not be as straightforward with a general-purpose language model. Furthermore, as a fully managed service, Amazon Rekognition can scale seamlessly to handle large volumes of images. While a hybrid approach combining Rekognition and Claude’s multi-modal capabilities could be explored, we chose Rekognition Custom Labels for its specialized computer vision capabilities, customizability, and to demonstrate combining FMs on Amazon Bedrock with other AWS services for this specific use case.

Using Amazon Bedrock FMs to generate recipes

To generate the recipes, we use Amazon Bedrock, a fully managed service that offers high-performing FMs. We use the Amazon Bedrock API to query Anthropic’s Claude v3 Sonnet model. We use the following prompt to provide context to the FM:

You are an expert chef, with expertise in diverse cuisines and recipes. 
I am currently a novice and I require you to write me recipes based on the ingredients provided below. 
The requirements for the recipes are as follows:
- I need 3 recipes from you
- These recipes can only use ingredients listed below, and nothing else
- For each of the recipes, provide detailed step by step methods for cooking. Format it like this:
1. Step 1: <instructions>
2. Step 2: <instructions>
...
n. Step n: <instructions>
Remember, you HAVE to use ONLY the ingredients that are provided to you. DO NOT use any other ingredient. 
This is crucial. For example, if you are given ingredients "Bread" and "Butter", you can ONLY use Bread and Butter, 
and no other ingredient can be added on. 
An example recipe with these two can be:
Recipe 1: Fried Bread
Ingredients:
- Bread
- Butter
1. Step 1: Heat up the pan until it reaches 40 degrees
2. Step 2: Drop in a knob of butter and melt it
3. Step 3: Once butter is melted, add a piece of bread onto pan
4. Step 4: Cook until the bread is browned and crispy
5. Step 5: Repeat on the other side
6. Step 6: You can repeat this for other breads, too

The following code is the body of the Amazon Bedrock API call:

# master_ingredients_str: Labels retrieved from DynamoDB table
# prompt: Prompt shown above
content = "Here is a list of ingredients that a person currently has." + user_ingredients_str + "nn And here are a list of ingredients at a local grocery store " + master_ingredients_str + prompt

body = json.dumps({
"max_tokens": 2047,
"messages": [{"role": "user", "content": content}],
"anthropic_version": "bedrock-2023-05-31"
})


modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock.invoke_model(body=body, modelId=modelId)

Using the combination of the prompt and API call, we generate three recipes using the ingredients retrieved from the DynamoDB table. You can add additional parameters to body, such as temperature, top_p, and top_k, to further control sampling behavior. For more information on getting responses from Anthropic’s Claude 3 models using the Amazon Bedrock API, see Anthropic Claude Messages API. We recommend setting the temperature to something low (such as 0.1 or 0.2) to help ensure deterministic and structured generation of recipes. We also recommend setting the top_p value (nucleus sampling) to something high (such as 0.9) to limit the FM’s predictions to the most probable tokens (in this case, the model considers the most probable tokens that make up 90% of the total probability mass for its next prediction). top_k is another sampling technique that limits the model’s predictions to the top_k most probable tokens; for example, if top_k = 10, the model only considers the 10 most probable tokens for its next prediction.

One of the key benefits of using Amazon Bedrock is the ability to use multiple FMs for different tasks within the same solution. In addition to generating textual recipes with Anthropic’s Claude 3, we can also dynamically generate visually appealing images to accompany those recipes. For this task, we chose to use the Stable Diffusion model available on Amazon Bedrock. Amazon Bedrock also offers other powerful image generation models, such as Amazon Titan, and we provide an example API call for that as well. Similar to using the Amazon Bedrock API to generate a response from Anthropic’s Claude 3, we use the following code:

modelId = "stability.stable-diffusion-xl-v0" 
accept = "application/json"
contentType = "application/json"

body = json.dumps({
"text_prompts": [
{
"text": recipe_name
}
], 
"cfg_scale": 10,
"seed": 20,
"steps": 50
})

response = brt.invoke_model(
body = body,
modelId = modelId,
accept = accept, 
contentType = contentType
)

For Titan, you might use something like:

modelId="amazon.titan-image-generator-v1",
accept="application/json", 
contentType="application/json"

body = json.dumps({
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text":prompt,   # Required
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,   # Range: 1 to 5 
        "quality": "premium",  # Options: standard or premium
        "height": 768,         # Supported height list in the docs 
        "width": 1280,         # Supported width list in the docs
        "cfgScale": 7.5,       # Range: 1.0 (exclusive) to 10.0
        "seed": 42             # Range: 0 to 214783647
    }
})

response = brt.invoke_model(
body = body, 
modelId = modelId,
accept = accept,
contentType = contentType
)

This returns a base64-encoded string that you need to decode in your frontend so that you can display it. For more information about other parameters that you can include in your API call, see Stability.ai Diffusion 1.0 text to image and Using Amazon Bedrock to generate images with Titan Image Generator models. In the following sections, we walk through the steps to deploy the solution in your AWS account.

Prerequisites

You need an AWS account to deploy this solution. If you don’t have an existing account, you can sign up for one. The instructions in this post use the us-east-1 AWS Region. Make sure you deploy your resources in a Region with AWS Machine Learning services available. For the Lambda functions to run successfully, Lambda requires an AWS Identity and Access Management (IAM) role and policy with the appropriate permissions. Complete the necessary steps from Defining Lambda function permissions with an execution role to create and attach a Lambda execution role for the Lambda functions to access all necessary actions for DynamoDB, Amazon Rekognition, and Amazon Bedrock.

Create the Lambda function to detect ingredients

Complete the following steps to create your first Lambda function (DetectIngredients):

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose Author from scratch.
  4. Name your function DetectIngredients, select Python 3.12 for Runtime, and choose Create function.
  5. For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
  6. Replace the text in the Lambda function code with the following sample code and choose Save:
import json
import boto3
import inference
import time

s3 = boto3.client('s3')

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TestDataTable')
table_name = 'TestDataTable'

def lambda_handler(event, context):
    # Clear any previously detected ingredients
    clearTable()

    # Run the Amazon Rekognition Custom Labels inference helper
    labels, label_count = inference.main()

    # The names list will contain all the grocery ingredients detected in the image
    names = []

    for label_dic in labels:
        name = label_dic['Name']
        # Getting rid of unnecessary parts of the label string
        if "Food" in name:
            # Remove "Food" from name
            name = name.replace("Food", "")
        if "In Fridge" in name:
            # Remove "In Fridge" from name
            name = name.replace("In Fridge", "")
        name = name.strip()
        names.append(name)

    # Loop through the list of grocery ingredients to construct a list called items.
    # The items list is used to batch write up to 25 items at a time when
    # batch_write_all is called (25 is the DynamoDB batch write limit).
    items = []
    for name in names:
        if len(items) < 25:
            items.append({
                'grocery_item': name
            })

    # Remove all duplicates from the list
    seen = set()
    unique_grocery_items = []
    for item in items:
        val = item['grocery_item'].lower().strip()
        if val not in seen:
            unique_grocery_items.append(item)
            seen.add(val)

    batch_write_all(unique_grocery_items)

    # Sentinel item indicating that detection is complete
    table.put_item(
        Item={
            'grocery_item': "DONE"
        })

def batch_write_all(items):
    batch_write_requests = [{
        'PutRequest': {
            'Item': item
        }
    } for item in items]

    response = dynamodb.batch_write_item(
        RequestItems={
            table_name: batch_write_requests
        }
    )

def clearTable():
    response = table.scan()
    with table.batch_writer() as batch:
        for each in response['Items']:
            batch.delete_item(
                Key={
                    'grocery_item': each['grocery_item']
                }
            )
Create a DynamoDB table to store ingredients

Complete the following steps to create your DynamoDB table.

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Choose Create table.
  3. For Table name, enter MasterGroceryDB.
  4. For Partition key, use grocery_item (string).
  5. Verify that all entries on the page are accurate, leave the rest of the settings as default, and choose Create.

Wait for the table creation to complete and for your table status to change to Active before proceeding to the next step.
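If you prefer to create the table programmatically, a minimal Boto3 sketch might look like the following; on-demand billing is an assumption, so adjust it to your needs.

import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

# Create the MasterGroceryDB table with grocery_item as the partition key
dynamodb.create_table(
    TableName='MasterGroceryDB',
    KeySchema=[{'AttributeName': 'grocery_item', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'grocery_item', 'AttributeType': 'S'}],
    BillingMode='PAY_PER_REQUEST',  # assumption; the console default may differ
)

# Wait until the table status is Active before writing to it
dynamodb.get_waiter('table_exists').wait(TableName='MasterGroceryDB')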

Create the Lambda function to call Amazon Bedrock

Complete the following steps to create another Lambda function that will call the Amazon Bedrock APIs to generate recipes:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose Author from scratch.
  4. Name your function GenerateRecipes, choose Python 3.12 for Runtime, and choose Create function.
  5. For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
  6. Replace the text in the Lambda function code with the following sample code and choose Save:
import json
import boto3
import re
import base64
import image_gen

dynamodb = boto3.resource('dynamodb')

bedrock = boto3.client(service_name='bedrock-runtime')

def get_ingredients(tableName):
    table = dynamodb.Table(tableName)
    response = table.scan()
    data = response['Items']

    # Support for pagination
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

    data = [g_i for g_i in data if g_i['grocery_item'] != 'DONE']
    return data


# Converts DynamoDB grocery items into a comma-separated string
def convertItemsToString(grocery_dict):
    ingredients_list = []
    for each in grocery_dict:
        ingredients_list.append(each['grocery_item'])
    ingredients_list_str = ", ".join(ingredients_list)
    return ingredients_list_str

def read_prompt():
    with open ('Prompt.md', 'r') as f:
        text = f.read() 
    return text

# Gets the names of all the recipes generated
def get_recipe_names(response_body):
    recipe_names = []
    for i in range(len(response_body)):
        # Recipe titles follow a blank line and start with "Recipe"
        if response_body[i] == '\n' and response_body[i + 1] == '\n' and response_body[i + 2] == 'R':
            recipe_str = ""
            while response_body[i + 2] != '\n':
                recipe_str += response_body[i + 2]
                i += 1
            recipe_str = recipe_str.replace("Recipe", '')
            recipe_str = recipe_str.replace(": ", '')
            recipe_str = re.sub(r" \d+", "", recipe_str)
            recipe_names.append(recipe_str)
    return recipe_names

def lambda_handler(event, context):
    # Write the ingredients to a .md file
    user_ingredients_dict = get_ingredients('TestDataTable')
    master_ingredients_dict = get_ingredients('MasterGroceryDB')

    # Get string values for ingredients in both databases
    user_ingredients_str = convertItemsToString(user_ingredients_dict)
    master_ingredients_str = convertItemsToString(master_ingredients_dict)

    # Convert the dictionaries into comma-separated string args to pass into the prompt

    # Read the prompt + ingredients file
    prompt = read_prompt()
    # Query for recipes using prompt + ingredients

    content = "Here is a list of ingredients that a person currently has." + user_ingredients_str + "nn And here are a list of ingredients at a local grocery store " + master_ingredients_str + prompt

    body = json.dumps({
        "max_tokens": 2047,
        "messages": [{"role": "user", "content": content}],
        "anthropic_version": "bedrock-2023-05-31"
    })


    modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

    
    response = bedrock.invoke_model(body=body, modelId=modelId)


    response_body = json.loads(response.get('body').read())
    response_body_content = response_body.get("content")
    response_body_completion = response_body_content[0]['text']

    recipe_names_list = get_recipe_names(response_body_completion)

    first_image_imgstr = image_gen.image_gen(recipe_names_list[0])
    second_image_imgstr = image_gen.image_gen(recipe_names_list[1])
    third_image_imgstr = image_gen.image_gen(recipe_names_list[2])

    return response_body_completion, first_image_imgstr, second_image_imgstr, third_image_imgstr

Create an S3 bucket to store the images

Lastly, you create an S3 bucket to store the images you upload, which automatically invokes the DetectIngredients Lambda function after each upload. Complete the following steps to create the bucket and configure the Lambda function:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter a unique bucket name, set the desired Region to us-east-1, and choose Create bucket.
  4. On the Lambda console, navigate to the DetectIngredients
  5. On the Configuration tab, choose Add trigger.
  6. Select the trigger type as S3 and choose the bucket you created.
  7. Set Event type to All object create events and choose Add.
  8. On the Amazon S3 console, navigate to the bucket you created.
  9. Under Properties and Event Notifications, choose Create event notification.
  10. Enter an event name (for example, Trigger DetectIngredients) and set the events to All object create events.
  11. For Destination, select Lambda Function and select the DetectIngredients Lambda function.
  12. Choose Save.

Conclusion

In this post, we explored the use of Amazon Rekognition and FMs on Amazon Bedrock with AWS services such as Lambda and DynamoDB to build a comprehensive solution that addresses food waste in the US. Using cutting-edge AWS services, including Rekognition Custom Labels and content generation with models on Amazon Bedrock, this application provides value and a proof of concept for AWS generative AI capabilities.

Stay on the lookout for a follow-up to this post, where we demonstrate using the multi-modal capabilities of FMs such as Anthropic’s Claude v3.1 on Amazon Bedrock to deploy this entire solution end-to-end.

Although we highlighted a food waste use case in this post, we urge you to apply your own use case to this solution. The flexibility of this architecture allows you to adapt these services to multiple scenarios, enabling you to solve a wide range of challenges.

Special thanks to Tommy Xie and Arnav Verma for their contributions to the blog.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Michael Lue is a Sr. Solution Architect at AWS Canada based out of Toronto. He works with Canadian enterprise customers to accelerate their business through optimization, innovation, and modernization. He is particularly passionate and curious about disruptive technologies like containers and AI/ML. In his spare time, he coaches and plays tennis and enjoys hanging at the beach with his French Bulldog, Marleé.

Vineet Kachhawaha is a Solutions Architect at AWS with expertise in machine learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.

Read More

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus’s journey with AWS

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus’s journey with AWS

Investment professionals face the mounting challenge of processing vast amounts of data to make timely, informed decisions. The traditional approach of manually sifting through countless research documents, industry reports, and financial statements is not only time-consuming but can also lead to missed opportunities and incomplete analysis. This challenge is particularly acute in credit markets, where the complexity of information and the need for quick, accurate insights directly impacts investment outcomes. Financial institutions need a solution that can not only aggregate and process large volumes of data but also deliver actionable intelligence in a conversational, user-friendly format. The intersection of AI and financial analysis presents a compelling opportunity to transform how investment professionals access and use credit intelligence, leading to more efficient decision-making processes and better risk management outcomes.

Founded in 2013, Octus, formerly Reorg, is the essential credit intelligence and data provider for the world’s leading buy side firms, investment banks, law firms and advisory firms. By surrounding unparalleled human expertise with proven technology, data and AI tools, Octus unlocks powerful truths that fuel decisive action across financial markets. Visit octus.com to learn how we deliver rigorously verified intelligence at speed and create a complete picture for professionals across the entire credit lifecycle. Follow Octus on LinkedIn and X.

Using advanced GenAI, CreditAI by Octus™ is a flagship conversational chatbot that supports natural language queries and real-time data access with source attribution, significantly reducing analysis time and streamlining research workflows. It gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles, helping financial institutions make informed credit decisions while effectively managing risk. Key features include chat history management, being able to ask questions that are targeted to a specific company or more broadly to a sector, and getting suggestions on follow-up questions.

In this post, we demonstrate how Octus migrated its flagship product, CreditAI, to Amazon Bedrock, transforming how investment professionals access and analyze credit intelligence. We walk through the journey Octus took from managing multiple cloud providers and costly GPU instances to implementing a streamlined, cost-effective solution using AWS services including Amazon Bedrock, AWS Fargate, and Amazon OpenSearch Service. We share detailed insights into the architecture decisions, implementation strategies, security best practices, and key learnings that enabled Octus to maintain zero downtime while significantly improving the application’s performance and scalability.

Opportunities for innovation

CreditAI by Octus™ version 1.x uses Retrieval Augmented Generation (RAG). It was built using a combination of in-house and external cloud services on Microsoft Azure for large language models (LLMs), Pinecone for vectorized databases, and Amazon Elastic Compute Cloud (Amazon EC2) for embeddings. Based on our operational experience, and as we started scaling up, we realized that there were several operational inefficiencies and opportunities for improvement:

  • Our in-house services for embeddings (deployed on EC2 instances) were not as scalable and reliable as needed. They also required more time on operational maintenance than our team could spare.
  • The overall solution was incurring high operational costs, especially due to the use of on-demand GPU instances. The real-time nature of our application meant that Spot Instances were not an option. Additionally, our investigation of lower-cost CPU-based instances revealed that they couldn’t meet our latency requirements.
  • The use of multiple external cloud providers complicated DevOps, support, and budgeting.

These operational inefficiencies meant that we had to revisit our solution architecture. It became apparent that a cost-effective solution for our generative AI needs was required. Enter Amazon Bedrock Knowledge Bases. With its support for knowledge bases that simplify RAG operations, vectorized search as part of its integration with OpenSearch Service, availability of multi-tenant embeddings, as well as Anthropic’s Claude suite of LLMs, it was a compelling choice for Octus to migrate its solution architecture. Along the way, it also simplified operations as Octus is an AWS shop more generally. However, we were still curious about how we would go about this migration, and whether there would be any downtime through the transition.

Strategic requirements

To help us move forward systematically, Octus identified the following key requirements to guide the migration to Amazon Bedrock:

  • Scalability – A crucial requirement was the need to scale operations from handling hundreds of thousands of documents to millions of documents. A significant challenge in the previous system was the slow (and relatively unreliable) process of embedding new documents into vector databases, which created bottlenecks in scaling operations.
  • Cost-efficiency and infrastructure optimization – CreditAI 1.x, though performant, was incurring high infrastructure costs due to the use of GPU-based, single-tenant services for embeddings and reranking. We needed multi-tenant alternatives that were much cheaper while enabling elasticity and scale.
  • Response performance and latency – The success of generative AI-based applications depends on response quality and speed. Given our user base, it’s important that our responses are accurate while respecting users’ time (low latency). This becomes challenging as data size and complexity grow. We want to balance spatial and temporal retrieval so that responses offer the best answer and context relevance, especially when large quantities of data are updated every day.
  • Zero downtime – CreditAI is in production and we could not afford any downtime during this migration.
  • Technological agility and innovation – In the rapidly evolving AI landscape, Octus recognized the importance of maintaining technological competitiveness. We wanted to move away from in-house development and feature maintenance such as embeddings services, rerankers, guardrails, and RAG evaluators. This would allow Octus to focus on product innovation and faster feature deployment.
  • Operational consolidation and reliability – Octus’s goal is to consolidate cloud providers, and to reduce support overheads and operational complexity.

Migration to Amazon Bedrock and addressing our requirements

Migrating to Amazon Bedrock addressed our aforementioned requirements in the following ways:

  • Scalability – The architecture of Amazon Bedrock, combined with AWS Fargate for Amazon ECS, Amazon Textract, and AWS Lambda, provided the elastic and scalable infrastructure necessary for this expansion while maintaining performance, data integrity, compliance, and security standards. The solution’s efficient document processing and embedding capabilities addressed the previous system’s limitations, enabling faster and more efficient knowledge base updates.
  • Cost-efficiency and infrastructure optimization – By migrating to Amazon Bedrock multi-tenant embedding, Octus achieved significant cost reduction while maintaining performance standards through Anthropic’s Claude Sonnet and improved embedding capabilities. This move alleviated the need for GPU-instance-based services in favor of more cost-effective and serverless Amazon ECS and Fargate solutions.
  • Response performance and latency – Octus verified the quality and latency of responses from Anthropic’s Claude Sonnet to confirm that response accuracy and latency were maintained (or even exceeded) as part of this migration. With this LLM, CreditAI is now able to respond better to broader, industry-wide queries than before.
  • Zero downtime – We were able to achieve zero downtime migration to Amazon Bedrock for our application using our in-house centralized infrastructure frameworks. Our frameworks comprise infrastructure as code (IaC) through Terraform, continuous integration and delivery (CI/CD), SOC2 security, monitoring, observability, and alerting for our infrastructure and applications.
  • Technological agility and innovation – Amazon Bedrock emerged as an ideal partner, offering solutions specifically designed for AI application development. Amazon Bedrock built-in features, such as embeddings services, reranking, guardrails, and the upcoming RAG evaluator, alleviated the need for in-house development of these components, allowing Octus to focus on product innovation and faster feature deployment.
  • Operational consolidation and reliability – The comprehensive suite of AWS services offers a streamlined framework that simplifies operations while providing high availability and reliability. This consolidation minimizes the complexity of managing multiple cloud providers and creates a more cohesive technological ecosystem. It also enables economies of scale with development velocity given that over 75 engineers at Octus already use AWS services for application development.

In addition, the Amazon Bedrock Knowledge Bases team worked closely with us to address several critical elements, including expanding embedding limits, managing the metadata limit (250 characters), testing different chunking methods, and improving sync throughput to the knowledge base.

In the following sections, we explore our solution and how we addressed the details around the migration to Amazon Bedrock and Fargate.

Solution overview

The following figure illustrates our system architecture for CreditAI on AWS, with two key paths: the document ingestion and content extraction workflow, and the Q&A workflow for live user query response.

Solution Architecture

In the following sections, we dive into crucial details within key components in our solution. In each case, we connect them to the requirements discussed earlier for readability.

The document ingestion workflow (numbered in blue in the preceding diagram) processes content through five distinct stages:

  1. Documents uploaded to Amazon Simple Storage Service (Amazon S3) automatically invoke Lambda functions through S3 Event Notifications. This event-driven architecture provides immediate processing of new documents.
  2. Lambda functions process the event payload containing document location, perform format validation, and prepare content for extraction. This includes file type verification, size validation, and metadata extraction before routing to Amazon Textract.
  3. Amazon Textract processes the documents to extract both text and structural information. This service handles various formats, including PDFs, images, and forms, while preserving document layout and relationships between content elements.
  4. The extracted content is stored in a dedicated S3 prefix, separate from the source documents, maintaining clear data lineage. Each processed document maintains references to its source file, extraction timestamp, and processing metadata.
  5. The extracted content flows into Amazon Bedrock Knowledge Bases, where our semantic chunking strategy is implemented to divide content into optimal segments. The system then generates embeddings for each chunk and stores these vectors in OpenSearch Service for efficient retrieval. Throughout this process, the system maintains comprehensive metadata to support downstream filtering and source attribution requirements.
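
As a concrete illustration of stages 1 through 3 above, the following is a minimal Python (boto3) sketch of an event-driven Lambda handler that validates an uploaded object and starts asynchronous extraction with Amazon Textract. The bucket layout, SNS topic, and IAM role ARNs are hypothetical placeholders rather than Octus’s production configuration.

import json
import urllib.parse

import boto3

textract = boto3.client("textract")

# Formats supported by asynchronous Textract text detection.
SUPPORTED_SUFFIXES = (".pdf", ".png", ".jpg", ".jpeg", ".tiff")


def lambda_handler(event, context):
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Stage 2: basic format validation before routing to Amazon Textract.
        if not key.lower().endswith(SUPPORTED_SUFFIXES):
            print(f"Skipping unsupported file type: {key}")
            continue

        # Stage 3: start asynchronous text detection; Textract publishes
        # completion events to the configured SNS topic (placeholder ARNs).
        response = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            NotificationChannel={
                "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:textract-complete",
                "RoleArn": "arn:aws:iam::111122223333:role/TextractPublishRole",
            },
        )
        jobs.append({"key": key, "jobId": response["JobId"]})

    return {"statusCode": 200, "body": json.dumps(jobs)}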

The Q&A workflow (numbered in yellow in the preceding diagram) processes user interactions through six integrated stages:

  1. The web application, hosted on AWS Fargate, handles user interactions and query inputs, managing initial request validation before routing queries to appropriate processing services.
  2. Amazon Managed Streaming for Kafka (Amazon MSK) serves as the streaming service, providing reliable inter-service communication while maintaining message ordering and high-throughput processing for query handling.
  3. The Q&A handler, running on AWS Fargate, orchestrates the complete query response cycle by coordinating between services and processing responses through the LLM pipeline.
  4. The pipeline integrates with Amazon Bedrock foundation models through these components:
    1. Cohere Embeddings model performs vector transformations of the input.
    2. Amazon OpenSearch Service manages vector embeddings and performs similarity searches.
    3. Amazon Bedrock Knowledge Bases provides efficient access to the document repository.
  5. Amazon Bedrock Guardrails implements content filtering and safety checks as part of the query processing pipeline.
  6. Anthropic Claude LLM performs the natural language processing, generating responses that are then returned to the web application.

This integrated workflow provides efficient query processing while maintaining response quality and system reliability.
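
To make stages 4 through 6 concrete, the following is a minimal sketch, assuming boto3 and placeholder IDs, of how a single RetrieveAndGenerate call against a knowledge base can combine embedding, vector retrieval, guardrail evaluation, and generation with Anthropic’s Claude. The knowledge base ID, guardrail ID and version, and model ARN are illustrative values, not Octus’s actual configuration.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")


def answer_question(question: str, kb_id: str) -> dict:
    # One call retrieves relevant chunks, applies the guardrail, and generates
    # a grounded answer with citations for source attribution.
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,  # placeholder knowledge base ID
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
                "generationConfiguration": {
                    "guardrailConfiguration": {
                        "guardrailId": "gr-EXAMPLE",  # placeholder guardrail
                        "guardrailVersion": "1",
                    }
                },
            },
        },
    )
    return {
        "answer": response["output"]["text"],
        "citations": response.get("citations", []),
    }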

For scalability: Using OpenSearch Service as our vector database

Amazon OpenSearch Serverless emerged as the optimal solution for CreditAI’s evolving requirements, offering advanced capabilities while maintaining seamless integration within the AWS ecosystem:

  • Vector search capabilities – OpenSearch Serverless provides robust built-in vector search capabilities essential for our needs. The service supports hybrid search, allowing us to combine vector embeddings with raw text search without modifying our embedding model. This capability proved crucial for enabling broader question support in CreditAI 2.x, enhancing its overall usability and flexibility.
  • Serverless architecture benefits – The serverless design alleviates the need to provision, configure, or tune infrastructure, significantly reducing operational complexities. This shift allows our team to focus more time and resources on feature development and application improvements rather than managing underlying infrastructure.
  • AWS integration advantages – The tight integration with other AWS services, particularly Amazon S3 and Amazon Bedrock, streamlines our content ingestion process. This built-in compatibility provides a cohesive and scalable landscape for future enhancements while maintaining optimal performance.

OpenSearch Serverless enabled us to scale our vector search capabilities efficiently while minimizing operational overhead and maintaining high performance standards.
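
The following is a hedged sketch of the kind of hybrid query this enables, combining a k-NN clause over chunk embeddings with a lexical match on the raw chunk text using the opensearch-py client. The index name, field names, and authentication setup (OpenSearch Serverless requires SigV4 signing, omitted here) are assumptions for illustration.

from opensearchpy import OpenSearch  # pip install opensearch-py


def hybrid_search(client: OpenSearch, query_text: str, query_vector: list, k: int = 5):
    # Combine vector similarity and lexical relevance in a single query;
    # field and index names are placeholders.
    body = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"knn": {"chunk_embedding": {"vector": query_vector, "k": k}}},
                    {"match": {"chunk_text": query_text}},
                ]
            }
        },
    }
    return client.search(index="creditai-chunks", body=body)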

For scalability and security: Splitting data across multiple vector databases with in-house support for intricate permissions

To enhance scalability and security, we implemented isolated knowledge bases (corresponding to vector databases) for each client’s data. Although this approach slightly increases costs, it delivers multiple significant benefits. Primarily, it maintains complete isolation of client data, providing enhanced privacy and security. Thanks to Amazon Bedrock Knowledge Bases, this solution doesn’t compromise on performance. Amazon Bedrock Knowledge Bases enables concurrent embedding and synchronization across multiple knowledge bases, allowing us to maintain real-time updates without delays, something unattainable with our previous GPU-based architecture.

Additionally, we introduced two in-house services within Octus to strengthen this system:

  • AuthZ access management service – This service enforces granular access control, making sure users and applications can only interact with the data they are authorized to access. We had to migrate our AuthZ backend from Airbyte to native SQL replication so that it could support access management in near real time at scale.
  • Global identifiers service – This service provides a unified framework to link identifiers across multiple domains, enabling seamless integration and cross-referencing of identifiers across multiple datasets.

Together, these enhancements create a robust, secure, and highly efficient environment for managing and accessing client data.
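
The following sketch, which uses hypothetical service names and IDs rather than Octus’s actual AuthZ or global identifiers services, illustrates the pattern: an authorization check gates the request, and each client maps to its own knowledge base so retrieval never crosses tenant boundaries.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Hypothetical mapping of client identifiers to their isolated knowledge bases.
CLIENT_KNOWLEDGE_BASES = {
    "client-a": "KBID1111AA",
    "client-b": "KBID2222BB",
}


def retrieve_for_client(authz, user_id: str, client_id: str, query: str):
    # Placeholder authorization check; the real AuthZ service enforces
    # granular, near real-time permissions.
    if not authz.is_authorized(user_id=user_id, resource=client_id, action="read"):
        raise PermissionError(f"{user_id} is not authorized for {client_id}")

    # Each client resolves to its own knowledge base, so retrieval is
    # physically isolated per tenant.
    kb_id = CLIENT_KNOWLEDGE_BASES[client_id]
    return bedrock_agent_runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )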

For cost efficiency: Adopting a multi-tenant embedding service

In our migration to Amazon Bedrock Knowledge Bases, Octus made a strategic shift from using an open-source embedding service on EC2 instances to using the managed embedding capabilities of Amazon Bedrock through Cohere’s multilingual model. This transition was carefully evaluated based on several key factors.

Our selection of Cohere’s multilingual model was driven by two primary advantages. First, it demonstrated superior retrieval performance in our comparative testing. Second, it offered robust multilingual support capabilities that were essential for our global operations.

The technical benefits of this migration manifested in two distinct areas: document embedding and message embedding. In document embedding, we transitioned from a CPU-based system to Amazon Bedrock Knowledge Bases, which enabled faster and higher throughput document processing through its multi-tenant architecture. For message embedding, we alleviated our dependency on dedicated GPU instances while maintaining optimal performance with 20–30 millisecond embedding times. The Amazon Bedrock Knowledge Bases API also simplified our operations by combining embedding and retrieval functionality into a single API call.

The migration to Amazon Bedrock Knowledge Bases managed embedding delivered two significant advantages: it eliminated the operational overhead of maintaining our own open-source solution while providing access to industry-leading embedding capabilities through Cohere’s model. This helped us achieve both our cost-efficiency and performance objectives without compromises.
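
For message (query) embedding, the call pattern looks roughly like the following sketch, which invokes Cohere’s multilingual embedding model through the Amazon Bedrock runtime. The model ID and request fields follow the Cohere-on-Bedrock format as we understand it; treat them as assumptions to verify against current documentation.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed_query(text: str) -> list:
    # "search_query" is used for user messages; chunks ingested into the
    # vector store would typically use "search_document".
    body = json.dumps({"texts": [text], "input_type": "search_query"})
    response = bedrock_runtime.invoke_model(
        modelId="cohere.embed-multilingual-v3",
        contentType="application/json",
        accept="application/json",
        body=body,
    )
    payload = json.loads(response["body"].read())
    return payload["embeddings"][0]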

For cost-efficiency and response performance: Choice of chunking strategy

Our primary goal was to improve three critical aspects of CreditAI’s responses: quality (accuracy of information), groundedness (ability to trace responses back to source documents), and relevance (providing information that directly answers user queries). To achieve this, we tested three different approaches to breaking down documents into smaller pieces (chunks):

  • Fixed chunking – Breaking text into fixed-length pieces
  • Semantic chunking – Breaking text based on natural semantic boundaries like paragraphs, sections, or complete thoughts
  • Hierarchical chunking – Creating a two-level structure with smaller child chunks for precise matching and larger parent chunks for contextual understanding

Our testing showed that both semantic and hierarchical chunking performed significantly better than fixed chunking in retrieving relevant information. However, each approach came with its own technical considerations.

Hierarchical chunking requires a larger chunk size to maintain comprehensive context during retrieval. This approach creates a two-level structure: smaller child chunks for precise matching and larger parent chunks for contextual understanding. During retrieval, the system first identifies relevant child chunks and then automatically includes their parent chunks to provide broader context. Although this method optimizes both search precision and context preservation, we couldn’t implement it with our preferred Cohere embeddings because they only support chunks up to 512 tokens, which is insufficient for the parent chunks needed to maintain effective hierarchical relationships.

Semantic chunking uses LLMs to intelligently divide text by analyzing both semantic similarity and natural language structures. Instead of arbitrary splits, the system identifies logical break points by calculating embedding-based similarity scores between sentences and paragraphs, making sure semantically related content stays together. The resulting chunks maintain context integrity by considering both linguistic features (like sentence and paragraph boundaries) and semantic coherence, though this precision comes at the cost of additional computational resources for LLM analysis and embedding calculations.

After evaluating our options, we chose semantic chunking despite two trade-offs:

  • It requires additional processing by our LLMs, which increases costs
  • It has a limit of 1,000,000 tokens per document processing batch

We made this choice because semantic chunking offered the best balance between implementation simplicity and retrieval performance. Although hierarchical chunking showed promise, it would have been more complex to implement and harder to scale. This decision helped us maintain high-quality, grounded, and relevant responses while keeping our system manageable and efficient.
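
In Amazon Bedrock Knowledge Bases, the chunking strategy is set when a data source is registered. The following is a hedged sketch of configuring semantic chunking on an S3 data source; the parameter values shown (token limit, buffer size, breakpoint threshold) are illustrative, not Octus’s production settings.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

response = bedrock_agent.create_data_source(
    knowledgeBaseId="KBID1111AA",  # placeholder knowledge base ID
    name="creditai-extracted-content",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::example-extracted-content"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": 300,                     # maximum tokens per chunk
                "bufferSize": 1,                      # sentences of surrounding context
                "breakpointPercentileThreshold": 90,  # similarity break point
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])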

For response performance and technical agility: Adopting Amazon Bedrock Guardrails with Amazon Bedrock Knowledge Bases

Our implementation of Amazon Bedrock Guardrails focused on three key objectives: enhancing response security, optimizing performance, and simplifying guardrail management. This service plays a crucial role in making sure our responses are both safe and efficient.

Amazon Bedrock Guardrails provides a comprehensive framework for content filtering and response moderation. The system works by evaluating content against predefined rules before the LLM processes it, helping prevent inappropriate content and maintaining response quality. Through the Amazon Bedrock Guardrails integration with Amazon Bedrock Knowledge Bases, we can configure, test, and iterate on our guardrails without writing complex code.

We achieved significant technical improvements in three areas:

  • Simplified moderation framework – Instead of managing multiple separate denied topics, we consolidated our content filtering into a unified guardrail service. This approach allows us to maintain a single source of truth for content moderation rules, with support for customizable sample phrases that help fine-tune our filtering accuracy.
  • Performance optimization – We improved system performance by integrating guardrail checks directly into our main prompts, rather than running them as separate operations. This optimization reduced our token usage and minimized unnecessary API calls, resulting in lower latency for each query.
  • Enhanced content control – The service provides configurable thresholds for filtering potentially harmful content and includes built-in capabilities for detecting hallucinations and assessing response relevance. This alleviated our dependency on external services like TruLens while maintaining robust content quality controls.

These improvements have helped us maintain high response quality while reducing both operational complexity and processing overhead. The integration with Amazon Bedrock has given us a more streamlined and efficient approach to content moderation.
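
As an illustration of these building blocks, the following sketch defines a guardrail with a consolidated denied topic (including sample phrases) and contextual grounding and relevance filters for hallucination and relevance checks. All names, example phrases, thresholds, and messages are placeholders.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="creditai-moderation-example",
    description="Example consolidated content-moderation guardrail",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "InvestmentAdvice",
                "definition": "Requests for personalized investment recommendations.",
                "examples": ["Which bond should I buy right now?"],
                "type": "DENY",
            }
        ]
    },
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},  # flags ungrounded (hallucinated) claims
            {"type": "RELEVANCE", "threshold": 0.75},  # flags answers unrelated to the query
        ]
    },
    blockedInputMessaging="This request falls outside what the assistant can help with.",
    blockedOutputsMessaging="The generated response was withheld by content policy.",
)
print(response["guardrailId"], response["version"])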

To achieve zero downtime: Infrastructure migration

Our migration to Amazon Bedrock required careful planning to provide uninterrupted service for CreditAI while significantly reducing infrastructure costs. We achieved this through our comprehensive infrastructure framework that addresses deployment, security, and monitoring needs:

  • IaC implementation – We used reusable Terraform modules to manage our infrastructure consistently across environments. These modules enabled us to share configurations efficiently between services and projects. Our approach supports multi-Region deployments with minimal configuration changes while maintaining infrastructure version control alongside application code.
  • Automated deployment strategy – Our GitOps-embedded framework streamlines the deployment process by implementing a clear branching strategy for different environments. This automation handles CreditAI component deployments through CI/CD pipelines, reducing human error through automated validation and testing. The system also enables rapid rollback capabilities if needed.
  • Security and compliance – To maintain SOC2 compliance and robust security, our framework incorporates comprehensive access management controls and data encryption at rest and in transit. We follow network security best practices, conduct regular security audits and monitoring, and run automated compliance checks in the deployment pipeline.

We maintained zero downtime during the entire migration process while reducing infrastructure costs by 70% by eliminating GPU instances. The successful transition from Amazon ECS on Amazon EC2 to Amazon ECS with Fargate has simplified our infrastructure management and monitoring.

Achieving excellence

CreditAI’s migration to Amazon Bedrock has yielded remarkable results for Octus:

  • Scalability – We have almost doubled the number of documents available for Q&A across three environments in days instead of weeks. Our use of Amazon ECS with Fargate with auto scaling rules and controls gives us elastic scalability for our services during peak usage hours.
  • Cost-efficiency and infrastructure optimization – By moving away from GPU-based clusters to Fargate, our monthly infrastructure costs are now 78.47% lower, and our per-question costs have reduced by 87.6%.
  • Response performance and latency – There has been no drop in latency, and we have seen a 27% increase in questions answered successfully. We have also seen a 250% boost in user engagement. Users especially love our support for broad, industry-wide questions enabled by Anthropic’s Claude Sonnet.
  • Zero downtime – We experienced zero downtime during migration and 99% uptime overall for the whole application.
  • Technological agility and innovation – We have been able to add new document sources in a quarter of the time it took pre-migration. In addition, we adopted enhanced guardrails support at no additional cost and no longer have to retrieve documents from the knowledge base and pass the chunks to Anthropic’s Claude Sonnet to trigger a guardrail.
  • Operational consolidation and reliability – Post-migration, our DevOps and SRE teams see 20% less maintenance burden and overheads. Supporting SOC2 compliance is also straightforward now that we’re using only one cloud provider.

Operational monitoring

We use Datadog to monitor both LLM latency and our document ingestion pipeline, providing real-time visibility into system performance. The following screenshot showcases how we use custom Datadog dashboards to provide a live view of the document ingestion pipeline. This visualization offers both a high-level overview and detailed insights into the ingestion process, helping us understand the volume, format, and status of the documents processed. The bottom half of the dashboard presents a time-series view of document processing volumes. The timeline tracks fluctuations in processing rates, identifies peak activity periods, and provides actionable insights to optimize throughput. This detailed monitoring system enables us to maintain efficiency, minimize failures, and provide scalability.

Observability Dashboard
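
Alongside the out-of-the-box views, the ingestion services can emit custom metrics that feed these dashboards. The following is a minimal sketch using the DogStatsD client from the datadog Python package; the metric names, tags, and agent address are hypothetical.

from datadog import initialize, statsd  # pip install datadog

# Placeholder agent address; in ECS/Fargate this is typically the sidecar agent.
initialize(statsd_host="127.0.0.1", statsd_port=8125)


def record_document_processed(doc_format: str, status: str, latency_ms: float) -> None:
    tags = [f"format:{doc_format}", f"status:{status}"]
    # Count of documents processed, broken down by format and outcome.
    statsd.increment("creditai.ingestion.documents_processed", tags=tags)
    # Per-document processing latency distribution for the time-series view.
    statsd.histogram("creditai.ingestion.latency_ms", latency_ms, tags=tags)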

Roadmap

Looking ahead, Octus plans to continue enhancing CreditAI by taking advantage of new capabilities released by Amazon Bedrock that continue to meet and exceed our requirements. Future developments will include:

  • Enhancing retrieval by testing and integrating reranking techniques, allowing the system to prioritize the most relevant search results for better user experience and accuracy.
  • Exploring the Amazon Bedrock RAG evaluator to capture detailed metrics on CreditAI’s performance. This will add to the existing mechanisms at Octus to track performance, which include tracking unanswered questions.
  • Expanding ingestion to large-scale structured data, making CreditAI capable of handling complex financial datasets. The integration of text-to-SQL will enable users to query structured databases using natural language, simplifying data access.
  • Exploring replacing our in-house content extraction service (ADE) with the Amazon Bedrock advanced parsing solution to potentially further reduce document ingestion costs.
  • Improving CreditAI’s disaster recovery and redundancy mechanisms, making sure that our services and infrastructure are more fault tolerant and can recover from outages faster.

These upgrades aim to boost the precision, reliability, and scalability of CreditAI.

Vishal Saxena, CTO at Octus, shares: “CreditAI is a first-of-its-kind generative AI application that focuses on the entire credit lifecycle. It is truly ’AI embedded’ software that combines cutting-edge AI technologies with an enterprise data architecture and a unified cloud strategy.”

Conclusion

CreditAI by Octus is the company’s flagship conversational chatbot that supports natural language queries and gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles. In this post, we described in detail our motivation, process, and results on Octus’s migration to Amazon Bedrock. Through this migration, Octus achieved remarkable results that included an over 75% reduction in operating costs as well as a 250% boost in engagement. Future steps include adopting new features such as reranking, RAG evaluator, and advanced parsing to further reduce costs and improve performance. We believe that the collaboration between Octus and AWS will continue to revolutionize financial analysis and research workflows.

To learn more about Amazon Bedrock, refer to the Amazon Bedrock User Guide.


About the Authors

Vaibhav Sabharwal is a Senior Solutions Architect with Amazon Web Services based out of New York. He is passionate about learning new cloud technologies and assisting customers in building cloud adoption strategies, designing innovative solutions, and driving operational excellence. As a member of the Financial Services Technical Field Community at AWS, he actively contributes to the collaborative efforts within the industry.

Yihnew Eshetu is a Senior Director of AI Engineering at Octus, leading the development of AI solutions at scale to address complex business problems. With seven years of experience in AI/ML, his expertise spans GenAI and NLP, specializing in designing and deploying agentic AI systems. He has played a key role in Octus’s AI initiatives, including leading AI Engineering for its flagship GenAI chatbot, CreditAI.

Harmandeep Sethi is a Senior Director of SRE Engineering and Infrastructure Frameworks at Octus, with nearly 10 years of experience leading high-performing teams in the design, implementation, and optimization of large-scale, highly available, and reliable systems. He has played a pivotal role in transforming and modernizing Credit AI infrastructure and services by driving best practices in observability, resilience engineering, and the automation of operational processes through Infrastructure Frameworks.

Rohan Acharya is an AI Engineer at Octus, specializing in building and optimizing AI-driven solutions at scale. With expertise in GenAI and NLP, he focuses on designing and deploying intelligent systems that enhance automation and decision-making. His work involves developing robust AI architectures and advancing Octus’s AI initiatives, including the evolution of CreditAI.

Hasan Hasibul is a Principal Architect at Octus leading the DevOps team, with nearly 12 years of experience in building scalable, complex architectures while following software development best practices. A true advocate of clean code, he thrives on solving complex problems and automating infrastructure. Passionate about DevOps, infrastructure automation, and the latest advancements in AI, he architected Octus’s initial CreditAI, pushing the boundaries of innovation.

Philipe Gutemberg is a Principal Software Engineer and AI Application Development Team Lead at Octus, passionate about leveraging technology for impactful solutions. An AWS Certified Solutions Architect – Associate (SAA), he has expertise in software architecture, cloud computing, and leadership. Philipe led both backend and frontend application development for CreditAI, ensuring a scalable system that integrates AI-driven insights into financial applications. A problem-solver at heart, he thrives in fast-paced environments, delivering innovative solutions for financial institutions while fostering mentorship, team development, and continuous learning.

Kishore Iyer is the VP of AI Application Development and Engineering at Octus. He leads teams that build, maintain and support Octus’s customer-facing GenAI applications, including CreditAI, our flagship AI offering. Prior to Octus, Kishore has 15+ years of experience in engineering leadership roles across large corporations, startups, research labs, and academia. He holds a Ph.D. in computer engineering from Rutgers University.

Kshitiz Agarwal is an Engineering Leader at Amazon Web Services (AWS), where he leads the development of Amazon Bedrock Knowledge Bases. With a decade of experience at Amazon, having joined in 2012, Kshitiz has gained deep insights into the cloud computing landscape. His passion lies in engaging with customers and understanding the innovative ways they leverage AWS to drive their business success. Through his work, Kshitiz aims to contribute to the continuous improvement of AWS services, enabling customers to unlock the full potential of the cloud.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Tim Ramos is a Senior Account Manager at AWS. He has 12 years of sales experience and 10 years of experience in cloud services, IT infrastructure, and SaaS. Tim is dedicated to helping customers develop and implement digital innovation strategies. His focus areas include business transformation, financial and operational optimization, and security. Tim holds a BA from Gonzaga University and is based in New York City.
