How Aetion is using generative AI and Amazon Bedrock to unlock hidden insights about patient populations

The real-world data collected and derived from patient journeys offers a wealth of insights into patient characteristics and outcomes and the effectiveness and safety of medical innovations. Researchers ask questions about patient populations in the form of structured queries; however, without the right choice of structured query and deep familiarity with complex real-world patient datasets, many trends and patterns can remain undiscovered.

Aetion is a leading provider of decision-grade real-world evidence software to biopharma, payors, and regulatory agencies. The company provides comprehensive solutions to healthcare and life science customers to transform real-world data into real-world evidence.

The use of unsupervised learning methods on semi-structured data along with generative AI has been transformative in unlocking hidden insights. With Aetion Discover, users can conduct rapid, exploratory analyses with real-world data while experiencing a structured approach to research questions. To help accelerate data exploration and hypothesis generation, Discover uses unsupervised learning methods to uncover Smart Subgroups. These subgroups of patients within a larger population display similar characteristics or profiles across a vast range of factors, including diagnoses, procedures, and therapies.

In this post, we review how Aetion’s Smart Subgroups Interpreter enables users to interact with Smart Subgroups using natural language queries. Powered by Amazon Bedrock and Anthropic’s Claude 3 large language models (LLMs), the interpreter responds to user questions expressed in conversational language about patient subgroups and provides insights to generate further hypotheses and evidence. Aetion chose to use Amazon Bedrock for working with LLMs due to its vast model selection from multiple providers, security posture, extensibility, and ease of use.

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI startups and Amazon through a unified API. It offers a wide range of FMs, allowing you to choose the model that best suits your specific use case.
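
To illustrate the unified API, the following minimal sketch calls a model through the Amazon Bedrock Converse API using boto3. The Region, model ID, and prompt are placeholders you would replace with your own choices.

import boto3

# Create an Amazon Bedrock runtime client (Region is a placeholder)
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# The Converse API uses a single request shape across supported models,
# so switching providers is mostly a matter of changing the model ID
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize the value of real-world evidence in two sentences."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])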

Aetion’s technology

Aetion uses the science of causal inference to generate real-world evidence on the safety, effectiveness, and value of medications and clinical interventions. Aetion has partnered with the majority of the top 20 biopharma companies, leading payors, and regulatory agencies.

Aetion brings deep scientific expertise and technology to life sciences, regulatory agencies (including FDA and EMA), payors, and health technology assessment (HTA) customers in the US, Canada, Europe, and Japan with analytics that can achieve the following:

  • Optimize clinical trials by identifying target populations, creating external control arms, and contextualizing settings and populations underrepresented in controlled settings
  • Expand industry access through label changes, pricing, coverage, and formulary decisions
  • Conduct safety and effectiveness studies for medications, treatments, and diagnostics

Aetion’s applications, including Discover and Aetion Substantiate, are powered by the Aetion Evidence Platform (AEP), a core longitudinal analytic engine capable of applying rigorous causal inference and statistical methods to hundreds of millions of patient journeys.

AetionAI is a set of generative AI capabilities embedded across the core environment and applications. Smart Subgroups Interpreter is an AetionAI feature in Discover.

The following figure illustrates the organization of Aetion’s services.

Smart Subgroups

For a user-specified patient population, the Smart Subgroups feature identifies clusters of patients with similar characteristics (for example, similar prevalence profiles of diagnoses, procedures, and therapies).

These subgroups are further classified and labeled by generative AI models based on each subgroup’s prevalent characteristics. For example, as shown in the following generated heat map, the first two Smart Subgroups within a population of patients who were prescribed GLP-1 agonists are labeled “Cataract and Retinal Disease” and “Inflammatory Skin Conditions,” respectively, to capture their defining characteristics.

After the subgroups are displayed, a user engages with AetionAI to probe further with questions expressed in natural language, such as “What are the most common characteristics for patients in the cataract disorders subgroup?” As shown in the following screenshot, AetionAI responds to the user in natural language, citing relevant subgroup statistics in its response.

A user might also ask AetionAI detailed questions such as “Compare the prevalence of cardiovascular diseases or conditions among the ‘Dulaglutide’ group vs the overall population.” The following screenshot shows AetionAI’s response.

In this example, the insights enable the user to hypothesize that Dulaglutide patients might experience fewer circulatory signs and symptoms. They can explore this further in Aetion Substantiate to produce decision-grade evidence with causal inference to assess the effectiveness of Dulaglutide use in cardiovascular disease outcomes.

Solution overview

Smart Subgroups Interpreter combines elements of unsupervised machine learning with generative AI to uncover hidden patterns in real-world data. The following diagram illustrates the workflow.

Let’s review each step in detail:

  • Create the patient population – Users define a patient population using the Aetion Measure Library (AML) features. The AML feature store standardizes variable definitions using scientifically validated algorithms. The user selects the AML features that define the patient population for analysis.
  • Generate features for the patient population – The AEP computes over 1,000 AML features for each patient across various categories, such as diagnoses, therapies, and procedures.
  • Build clusters and summarize cluster features – The Smart Subgroups component trains a topic model on the patient features to determine the optimal number of clusters and assign patients to clusters. The prevalences of the most distinctive features within each cluster, as determined by a trained classification model, are used to describe the cluster characteristics (an illustrative sketch of this step follows this list).
  • Generate cluster names and answer user queries – Prompt engineering with Anthropic’s Claude 3 Haiku on Amazon Bedrock generates descriptive cluster names and answers user queries. Amazon Bedrock provides access to LLMs from a variety of model providers; Anthropic’s Claude 3 Haiku was selected for its speed and sufficient reasoning capability for this task.
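
Aetion hasn’t published its implementation details, but the following toy sketch illustrates the general pattern described above under stated assumptions: scikit-learn’s LatentDirichletAllocation stands in for the topic model, a random forest stands in for the classification model, and the patient features are synthetic. Treat it as an illustration of the clustering-and-summarization idea, not Aetion’s actual code.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic patient-by-feature matrix: rows are patients, columns are
# binary AML-style features (diagnoses, procedures, therapies)
n_patients, n_features = 500, 40
X = rng.binomial(1, 0.15, size=(n_patients, n_features))
feature_names = [f"feature_{i}" for i in range(n_features)]

# Topic model: each "topic" acts as a candidate subgroup
n_subgroups = 4
lda = LatentDirichletAllocation(n_components=n_subgroups, random_state=0)
patient_topic_weights = lda.fit_transform(X)

# Assign each patient to their dominant topic (subgroup)
subgroup_labels = patient_topic_weights.argmax(axis=1)

# A classifier trained to predict subgroup membership; its feature
# importances indicate which features distinguish the subgroups
clf = RandomForestClassifier(random_state=0).fit(X, subgroup_labels)
top_features = np.argsort(clf.feature_importances_)[::-1][:5]

# Summarize the prevalence of the most distinctive features per subgroup
for s in range(n_subgroups):
    members = X[subgroup_labels == s]
    if len(members) == 0:
        continue
    summary = {feature_names[f]: round(float(members[:, f].mean()) * 100, 2) for f in top_features}
    print(f"Subgroup {s}: prevalence (%) of distinctive features -> {summary}")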

The solution uses Amazon Simple Storage Service (Amazon S3) and Amazon Aurora for data persistence and data exchange, and Amazon Bedrock with Anthropic’s Claude 3 Haiku models for cluster name generation. Discover and its transactional and batch applications are deployed and scaled on a Kubernetes cluster on AWS to optimize performance, user experience, and portability.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Users create Smart Subgroups for their patient population of interest.
  2. AEP uses real-world data and a custom query language to compute over 1,000 science-validated features for the user-selected population. The features are stored in Amazon S3 and encrypted with AWS Key Management Service (AWS KMS) for downstream use.
  3. The Smart Subgroups component trains the clustering algorithm and summarizes the most important features of each cluster. The cluster feature summaries are stored in Amazon S3 and displayed as a heat map to the user. Smart Subgroups is deployed as a Kubernetes job and is run on demand.
  4. Users interact with the Interpreter API microservice by asking questions expressed in natural language to retrieve descriptive subgroup names. Data transmitted to the service is encrypted in transit using Transport Layer Security 1.2 (TLS). The Interpreter API uses composite prompt engineering techniques with Anthropic’s Claude 3 Haiku to answer user queries:
    • Versioned prompt templates generate descriptive subgroup names and answer user queries.
    • AML features are added to the prompt template. For example, the description of the feature “Benign Ovarian Cyst” is expanded in a prompt to the LLM as “This measure covers different types of cysts that can form in or on a woman’s ovaries, including follicular cysts, corpus luteum cysts, endometriosis, and unspecified ovarian cysts.”
    • Lastly, the top feature prevalences of each subgroup are added to the prompt template. For example: “In Smart Subgroup 1 the relative prevalence of ‘Cornea and external disease (EYE001)’ is 30.32%. In Smart Subgroup 1 the relative prevalence of ‘Glaucoma (EYE003)’ is 9.94%…” (a sketch of this prompt assembly follows the list).
  5. Amazon Bedrock returns its response to the application, which displays it to the user alongside the heat map.
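
Aetion’s actual prompt templates aren’t shared in this post; the following sketch only illustrates the composite prompt idea from step 4, combining a template, expanded feature descriptions, and per-subgroup prevalence statements before calling Anthropic’s Claude 3 Haiku on Amazon Bedrock. The template wording, feature descriptions, statistics, and Region are illustrative placeholders.

import boto3

# Illustrative inputs: expanded feature descriptions and subgroup prevalence statements
FEATURE_DESCRIPTIONS = {
    "EYE001": "Cornea and external disease: disorders affecting the cornea and the outer eye.",
    "EYE003": "Glaucoma: conditions involving elevated intraocular pressure and optic nerve damage.",
}
SUBGROUP_STATS = [
    "In Smart Subgroup 1 the relative prevalence of 'Cornea and external disease (EYE001)' is 30.32%.",
    "In Smart Subgroup 1 the relative prevalence of 'Glaucoma (EYE003)' is 9.94%.",
]

PROMPT_TEMPLATE = """You are assisting with the interpretation of patient subgroups.

Feature definitions:
{feature_descriptions}

Subgroup statistics:
{subgroup_stats}

Question: {question}
Answer concisely, citing the relevant statistics."""

def ask_interpreter(question: str, region: str = "us-east-1") -> str:
    # Assemble the composite prompt from the template, feature descriptions, and statistics
    prompt = PROMPT_TEMPLATE.format(
        feature_descriptions="\n".join(f"- {code}: {text}" for code, text in FEATURE_DESCRIPTIONS.items()),
        subgroup_stats="\n".join(SUBGROUP_STATS),
        question=question,
    )
    bedrock_runtime = boto3.client("bedrock-runtime", region_name=region)
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 400, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

print(ask_interpreter("What are the most common characteristics for patients in Smart Subgroup 1?"))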

Outcomes

Smart Subgroups Interpreter enables users of the AEP who are unfamiliar with real-world data to discover patterns among patient populations using natural language queries. Users can now turn findings from such discoveries into hypotheses for further analyses across Aetion’s software to generate decision-grade evidence in a matter of minutes, as opposed to days, and without the need for support staff.

Conclusion

In this post, we demonstrated how Aetion uses Amazon Bedrock and other AWS services to help users uncover meaningful patterns within patient populations, even without prior expertise in real-world data. These discoveries lay the groundwork for deeper analysis within Aetion’s Evidence Platform, generating decision-grade evidence that drives smarter, data-informed outcomes.

As we continue expanding our generative AI capabilities, Aetion remains committed to enhancing user experiences and accelerating the journey from real-world data to real-world evidence.

With Amazon Bedrock, the future of innovation is at your fingertips. Explore Generative AI Application Builder on AWS to learn more about building generative AI capabilities to unlock new insights, build transformative solutions, and shape the future of healthcare today.


About the Authors

Javier Beltrán is a Senior Machine Learning Engineer at Aetion. His career has focused on natural language processing, and he has experience applying machine learning solutions to various domains, from healthcare to social media.

Ornela Xhelili is a Staff Machine Learning Architect at Aetion. Ornela specializes in natural language processing, predictive analytics, and MLOps, and holds a Master’s of Science in Statistics. Ornela has spent the past 8 years building AI/ML products for tech startups across various domains, including healthcare, finance, analytics, and ecommerce.

Prasidh Chhabri is a Product Manager at Aetion, leading the Aetion Evidence Platform, core analytics, and AI/ML capabilities. He has extensive experience building quantitative and statistical methods to solve problems in human health.

Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare life sciences customers and specializes in data analytics services. Mikhail has more than 20 years of industry experience covering a wide range of technologies and sectors.

Read More

Deploy DeepSeek-R1 Distilled Llama models in Amazon Bedrock

Open foundation models (FMs) have become a cornerstone of generative AI innovation, enabling organizations to build and customize AI applications while maintaining control over their costs and deployment strategies. By providing high-quality, openly available models, the AI community fosters rapid iteration, knowledge sharing, and cost-effective solutions that benefit both developers and end-users. DeepSeek AI, a research company focused on advancing AI technology, has emerged as a significant contributor to this ecosystem. Their DeepSeek-R1 models represent a family of large language models (LLMs) designed to handle a wide range of tasks, from code generation to general reasoning, while maintaining competitive performance and efficiency.

Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing FMs through a single serverless, unified API. You can access your imported custom models on-demand and without the need to manage underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Bedrock tools and features like Knowledge Bases, Guardrails, and Agents.

In this post, we explore how to deploy distilled versions of DeepSeek-R1 with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the secure and scalable AWS infrastructure at an effective cost.

DeepSeek-R1 distilled variations

From the foundation of DeepSeek-R1, DeepSeek AI has created a series of distilled models based on both Meta’s Llama and Qwen architectures, ranging from 1.5–70 billion parameters. The distillation process involves training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model by using it as a teacher—essentially transferring the knowledge and capabilities of the 671 billion parameter model into more compact architectures. The resulting distilled models, such as DeepSeek-R1-Distill-Llama-8B (from base model Llama-3.1-8B) and DeepSeek-R1-Distill-Llama-70B (from base model Llama-3.3-70B-Instruct), offer different trade-offs between performance and resource requirements. Although distilled models might show some reduction in reasoning capabilities compared to the original 671B model, they significantly improve inference speed and reduce computational costs. For instance, smaller distilled models like the 8B version can process requests much faster and consume fewer resources, making them more cost-effective for production deployments, whereas larger distilled versions like the 70B model maintain closer performance to the original while still offering meaningful efficiency gains.

Solution overview

In this post, we demonstrate how to deploy distilled versions of DeepSeek-R1 models using Amazon Bedrock Custom Model Import. We focus on importing the currently supported variants, DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, which offer an optimal balance between performance and resource efficiency. You can import these models from Amazon Simple Storage Service (Amazon S3) or an Amazon SageMaker AI model repository, and deploy them in a fully managed and serverless environment through Amazon Bedrock. The following diagram illustrates the end-to-end flow.

In this workflow, model artifacts stored in Amazon S3 are imported into Amazon Bedrock, which then handles the deployment and scaling of the model automatically. This serverless approach eliminates the need for infrastructure management while providing enterprise-grade security and scalability.

You can deploy using the Amazon Bedrock console by following the instructions in this post, or alternatively use the following notebook to deploy programmatically with the Amazon Bedrock SDK.

Prerequisites

You should have the following prerequisites:

Prepare the model package

Complete the following steps to prepare the model package:

  1. Download the DeepSeek-R1-Distill-Llama model artifacts from Hugging Face, from one of the following links, depending on the model you want to deploy:
    1. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tree/main
    2. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B/tree/main

For more information, you can follow Hugging Face’s Downloading models or Download files from the hub instructions.

You typically need the following files:

    • Model configuration file: config.json
    • Tokenizer files: tokenizer.json, tokenizer_config.json, and tokenizer.model
    • Model weights files in .safetensors format
  2. Upload these files to a folder in your S3 bucket, in the same AWS Region where you plan to use Amazon Bedrock. Take note of the S3 path you’re using (see the upload sketch below).
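
The upload can be done with the AWS CLI or an SDK. The following minimal sketch uses boto3; the bucket name, prefix, and local folder are placeholders for your own values.

import boto3
from pathlib import Path

s3 = boto3.client("s3")

# Placeholders: your bucket, target prefix, and the local folder that holds
# the downloaded config, tokenizer, and .safetensors files
bucket = "your-bucket"
prefix = "deepseek-r1-distill-llama-8b/"
local_dir = Path("./DeepSeek-R1-Distill-Llama-8B")

for path in local_dir.iterdir():
    if path.is_file():
        s3.upload_file(str(path), bucket, prefix + path.name)
        print(f"Uploaded s3://{bucket}/{prefix}{path.name}")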

Import the model

Complete the following steps to import the model:

  1. On the Amazon Bedrock console, choose Imported models under Foundation models in the navigation pane.
  2. Choose Import model.
  3. For Model name, enter a name for your model (it’s recommended to use a versioning scheme in your name, for tracking your imported model).
  4. For Import job name, enter a name for your import job.
  5. For Model import settings, select Amazon S3 bucket as your import source, and enter the S3 path you noted earlier (provide the full path in the form s3://<your-bucket>/folder-with-model-artifacts/).
  6. For Encryption, optionally choose to customize your encryption settings.
  7. For Service access role, choose to either create a new IAM role or provide your own.
  8. Choose Import model.

Importing the model will take several minutes depending on the model being imported (for example, the Distill-Llama-8B model could take 5–20 minutes to complete).
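
If you prefer the programmatic path mentioned earlier, the following sketch submits an equivalent import job with the Amazon Bedrock SDK (boto3). The job name, model name, role ARN, and S3 URI are placeholders, and the parameters should be verified against the current Boto3 documentation for create_model_import_job.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholders: your own names, service role ARN, and the S3 path you noted earlier
response = bedrock.create_model_import_job(
    jobName="deepseek-r1-distill-llama-8b-import-v1",
    importedModelName="deepseek-r1-distill-llama-8b-v1",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",
    modelDataSource={"s3DataSource": {"s3Uri": "s3://your-bucket/deepseek-r1-distill-llama-8b/"}},
)

# Check the job status until the import completes
job_arn = response["jobArn"]
status = bedrock.get_model_import_job(jobIdentifier=job_arn)["status"]
print(job_arn, status)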

Watch this video demo for a step-by-step guide.

Test the imported model

After you import the model, you can test it by using the Amazon Bedrock Playground or directly through the Amazon Bedrock invocation APIs. To use the Playground, complete the following steps:

  1. On the Amazon Bedrock console, choose Chat / Text under Playgrounds in the navigation pane.
  2. From the model selector, choose your imported model name.
  3. Adjust the inference parameters as needed and write your test prompt. For example:
    <|begin▁of▁sentence|><|User|>Given the following financial data: - Company A's revenue grew from $10M to $15M in 2023 - Operating costs increased by 20% - Initial operating costs were $7M Calculate the company's operating margin for 2023. Please reason step by step, and put your final answer within \boxed{}<|Assistant|>

Because we’re using an imported model in the playground, we must include the begin-of-sentence and user/assistant tags to properly format the context for DeepSeek models; these tags help the model understand the structure of the conversation and produce more accurate responses. If you follow the programmatic approach in the following notebook, this is taken care of automatically when the model is configured.

  4. Review the model response and metrics provided.

Note: When you invoke the model for the first time, if you encounter a ModelNotReadyException error, the SDK automatically retries the request with exponential backoff. The restoration time varies depending on the on-demand fleet size and model size. You can customize the retry behavior using the AWS SDK for Python (Boto3) Config object. For more information, see Handling ModelNotReadyException.
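
For the programmatic route, the following sketch invokes the imported model with the prompt format shown earlier and a retry configuration that absorbs ModelNotReadyException on a cold start. The model ARN is a placeholder, and the request body assumes the Llama-style schema used by Llama-architecture imports; verify the schema for your model in the documentation.

import json
import boto3
from botocore.config import Config

# Retries help when the first invocation hits ModelNotReadyException
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1", config=config)

# Placeholder ARN of your imported model
imported_model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abcd1234"

prompt = (
    "<|begin▁of▁sentence|><|User|>What is 15% of 240? "
    "Please reason step by step.<|Assistant|>"
)

# Assumption: Llama-style request body; adjust if your imported model expects a different schema
body = {"prompt": prompt, "max_gen_len": 512, "temperature": 0.5, "top_p": 0.9}

response = bedrock_runtime.invoke_model(modelId=imported_model_arn, body=json.dumps(body))
print(json.loads(response["body"].read()))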

Once you are ready to import the model, use this step-by-step video demo to help you get started.

Pricing

Custom Model Import enables you to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. There is no charge for the import itself; you are charged for inference based on two factors: the number of active model copies and their duration of activity.

Billing occurs in 5-minute windows, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The Custom Model Units required for hosting depend on the model’s architecture, parameter count, and context length, with examples ranging from 2 Units for a Llama 3.1 8B 128K model to 8 Units for a Llama 3.1 70B 128K model.

Amazon Bedrock automatically manages scaling, maintaining zero to three model copies by default (adjustable through Service Quotas) based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this may involve cold-start latency of tens of seconds. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.

Consider the following pricing example: An application developer imports a customized Llama 3.1 type model with 8B parameters and a 128K sequence length in the us-east-1 Region and deletes the model after 1 month. This requires 2 Custom Model Units, so the price per minute will be $0.1570 and the model storage cost will be $3.90 for the month.
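
To make the arithmetic concrete, the following back-of-the-envelope sketch combines the per-minute and storage figures from this example with an assumed usage pattern; the active-minutes figure is purely an assumption you would replace with your own estimate.

# Figures from the example above: 2 Custom Model Units for a Llama 3.1 8B 128K model
price_per_copy_minute = 0.1570   # USD per minute while a model copy is active
storage_per_month = 3.90         # USD per month for model storage

# Assumption for illustration only: one model copy active 8 hours/day, 22 days/month
active_minutes = 8 * 60 * 22

inference_cost = price_per_copy_minute * active_minutes
total_monthly_cost = inference_cost + storage_per_month
print(f"Inference: ${inference_cost:,.2f}, Storage: ${storage_per_month:.2f}, Total: ${total_monthly_cost:,.2f}")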

For more information, see Amazon Bedrock pricing.

Benchmarks

DeepSeek has published benchmarks comparing their distilled models against the original DeepSeek-R1 and base Llama models, available in the model repositories. The benchmarks show that, depending on the task, DeepSeek-R1-Distill-Llama-70B maintains between 80% and 90% of the original model’s reasoning capabilities, while the 8B version achieves between 59% and 92% with significantly reduced resource requirements. Both distilled versions demonstrate improvements over their corresponding base Llama models in specific reasoning tasks.

Other considerations

When deploying DeepSeek models in Amazon Bedrock, consider the following aspects:

  • Model versioning is essential. Because Custom Model Import creates unique models for each import, implement a clear versioning strategy in your model names to track different versions and variations.
  • The current supported model formats focus on Llama-based architectures. Although DeepSeek-R1 distilled versions offer excellent performance, the AI ecosystem continues evolving rapidly. Keep an eye on the Amazon Bedrock model catalog as new architectures and larger models become available through the platform.
  • Evaluate your use case requirements carefully. Although larger models like DeepSeek-R1-Distill-Llama-70B provide better performance, the 8B version might offer sufficient capability for many applications at a lower cost.
  • Consider implementing monitoring and observability. Amazon CloudWatch provides metrics for your imported models, helping you track usage patterns and performance. You can monitor costs with AWS Cost Explorer.
  • Start with a lower concurrency quota and scale up based on actual usage patterns. The default limit of three concurrent model copies per account is suitable for most initial deployments.

Conclusion

Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like DeepSeek-R1 distilled versions, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of DeepSeek’s innovative distillation approach and the Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.

The ability to choose between proprietary and open FMs in Amazon Bedrock gives organizations the flexibility to optimize for their specific needs. Open models enable cost-effective deployment with full control over the model artifacts, making them ideal for scenarios where customization, cost optimization, or model transparency are crucial. This flexibility, combined with the Amazon Bedrock unified API and enterprise-grade infrastructure, allows organizations to build resilient AI strategies that can adapt as their requirements evolve.

For more information, refer to the Amazon Bedrock User Guide.


About the Authors

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Morgan Rankey is a Solutions Architect based in New York City, specializing in Hedge Funds. He excels in assisting customers to build resilient workloads within the AWS ecosystem. Prior to joining AWS, Morgan led the Sales Engineering team at Riskified through its IPO. He began his career by focusing on AI/ML solutions for machine asset management, serving some of the largest automotive companies globally.

Harsh Patel is an AWS Solutions Architect supporting 200+ SMB customers across the United States to drive digital transformation through cloud-native solutions. As an AI&ML Specialist, he focuses on Generative AI, Computer Vision, Reinforcement Learning and Anomaly Detection. Outside the tech world, he recharges by hitting the golf course and embarking on scenic hikes with his dog.

Read More

Generative AI operating models in enterprise organizations with Amazon Bedrock

Generative AI can revolutionize organizations by enabling the creation of innovative applications that offer enhanced customer and employee experiences. Intelligent document processing, translation and summarization, flexible and insightful responses for customer support agents, personalized marketing content, and image and code generation are a few use cases using generative AI that organizations are rolling out in production.

Large organizations often have many business units with multiple lines of business (LOBs), with a central governing entity, and typically use AWS Organizations with an Amazon Web Services (AWS) multi-account strategy. They implement landing zones to automate secure account creation and streamline management across accounts, including logging, monitoring, and auditing. Although LOBs operate their own accounts and workloads, a central team, such as the Cloud Center of Excellence (CCoE), manages identity, guardrails, and access policies.

As generative AI adoption grows, organizations should establish a generative AI operating model. An operating model defines the organizational design, core processes, technologies, roles and responsibilities, governance structures, and financial models that drive a business’s operations.

In this post, we evaluate different generative AI operating model architectures that could be adopted.

Operating model patterns

Organizations can adopt different operating models for generative AI, depending on their priorities around agility, governance, and centralized control. Governance in the context of generative AI refers to the frameworks, policies, and processes that streamline the responsible development, deployment, and use of these technologies. It encompasses a range of measures aimed at mitigating risks, promoting accountability, and aligning generative AI systems with ethical principles and organizational objectives. Three common operating model patterns are decentralized, centralized, and federated, as shown in the following diagram.

Decentralized model

In a decentralized approach, generative AI development and deployment are initiated and managed by the individual LOBs themselves. LOBs have autonomy over their AI workflows, models, and data within their respective AWS accounts.

This enables faster time-to-market and agility because LOBs can rapidly experiment and roll out generative AI solutions tailored to their needs. However, even in a decentralized model, LOBs often must align with central governance controls and obtain approvals from the CCoE team for production deployment, adhering to global enterprise standards in areas such as access policies, model risk management, data privacy, and compliance posture, which can introduce governance complexities.

Centralized model

In a centralized operating model, all generative AI activities go through a central generative artificial intelligence and machine learning (AI/ML) team that provisions and manages end-to-end AI workflows, models, and data across the enterprise.

LOBs interact with the central team for their AI needs, trading off agility and potentially increased time-to-market for stronger top-down governance. A centralized model may introduce bottlenecks that slow down time-to-market, so organizations need to adequately resource the team with sufficient personnel and automated processes to meet the demand from various LOBs efficiently. Failure to scale the team can negate the governance benefits of a centralized approach.

Federated model

A federated model strikes a balance by having key activities of the generative AI processes managed by a central generative AI/ML platform team.

While LOBs drive their AI use cases, the central team governs guardrails, model risk management, data privacy, and compliance posture. This enables agile LOB innovation while providing centralized oversight on governance areas.

Generative AI architecture components

Before diving deeper into the common operating model patterns, this section provides a brief overview of a few components and AWS services used in the featured architectures.

Large language models

Large language models (LLMs) are large-scale ML models that contain billions of parameters and are pre-trained on vast amounts of data. LLMs may hallucinate, which means a model can provide a confident but factually incorrect response. Furthermore, the data that the model was trained on might be out of date, which leads to providing inaccurate responses. One way to mitigate the risk of LLMs providing incorrect information is to use a technique known as Retrieval Augmented Generation (RAG). RAG is an advanced natural language processing technique that combines knowledge retrieval with generative text models. RAG combines the powers of pre-trained language models with a retrieval-based approach to generate more informed and accurate responses. To set up RAG, you need to have a vector database to provide your model with related source documents. Using RAG, the relevant document segments or other texts are retrieved and shared with LLMs to generate targeted responses with enhanced content quality and relevance.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon using a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Amazon SageMaker JumpStart provides access to proprietary FMs from third-party providers such as AI21 Labs, Cohere, and LightOn. In addition, Amazon SageMaker JumpStart onboards and maintains open source FMs from third-party sources such as Hugging Face.

Data sources, embeddings, and vector store

Organizations’ domain-specific data, which provides context and relevance, typically resides in internal databases, data lakes, unstructured data repositories, or document stores, collectively referred to as organizational data sources or proprietary data stores.

A vector store is a system you can use to store and query vectors at scale, with efficient nearest neighbor query algorithms and appropriate indexes to improve data retrieval. It includes not only the embeddings of an organization’s data (mathematical representation of data in the form of vectors) but also raw text from the data in chunks. These vectors are generated by specialized embedding LLMs, which process the organization’s text chunks to create numerical representations (vectors) that are stored with the text chunks in the vector store. For a comprehensive read about vector stores and embeddings, you can refer to The role of vector databases in generative AI applications.

With Amazon Bedrock Knowledge Bases, you securely connect FMs in Amazon Bedrock to your company data for RAG. Amazon Bedrock Knowledge Bases facilitates data ingestion from various supported data sources; manages data chunking, parsing, and embeddings; and populates the vector store with the embeddings. With all that provided as a service, you can think of Amazon Bedrock Knowledge Bases as a fully managed and serverless option to build powerful conversational AI systems using RAG.
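
As a brief illustration of that managed RAG flow, the following sketch queries an existing knowledge base with the RetrieveAndGenerate API; the knowledge base ID, model ARN, question, and Region are placeholders for resources and values you would supply.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholders: an existing knowledge base ID and the ARN of the FM used for generation
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our parental leave policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBID1234",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])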

Guardrails

Content filtering mechanisms are implemented as safeguards to control user-AI interactions, aligning with application requirements and responsible AI policies by minimizing undesirable and harmful content. Guardrails can check user inputs and FM outputs and filter or deny topics that are unsafe, redact personally identifiable information (PII), and enhance content safety and privacy in generative AI applications.

Amazon Bedrock Guardrails is a feature of Amazon Bedrock that you can use to put safeguards in place. You determine what qualifies based on your company policies. These safeguards are FM agnostic. You can create multiple guardrails with different configurations tailored to specific use cases. For a review of Amazon Bedrock Guardrails, you can refer to these blog posts: Guardrails for Amazon Bedrock helps implement safeguards customized to your use cases and responsible AI policies and Guardrails for Amazon Bedrock now available with new safety filters and privacy controls.
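
As an example of applying a guardrail at inference time, the following sketch attaches an existing guardrail to a Converse API call; the guardrail identifier, version, model ID, and prompt are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholders: an existing guardrail ID and version, plus any supported model ID
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Tell me about our refund policy."}]}],
    guardrailConfig={"guardrailIdentifier": "your-guardrail-id", "guardrailVersion": "1"},
)

# stopReason indicates whether the guardrail intervened; otherwise read the model output
print(response.get("stopReason"))
print(response["output"]["message"]["content"][0]["text"])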

Operating model architectures

This section provides an overview of the three kinds of operating models.

Decentralized operating model

In a decentralized operating model, LOB teams maintain control and ownership of their AWS accounts. Each LOB configures and orchestrates generative AI components, common functionalities, applications, and Amazon Bedrock configurations within their respective AWS accounts. This model empowers LOBs to tailor their generative AI solutions according to their specific requirements, while taking advantage of the power of Amazon Bedrock.

With this model, the LOBs configure the core components, such as LLMs and guardrails, and the Amazon Bedrock service account manages the hosting, execution, and provisioning of interface endpoints. These endpoints enable LOBs to access and interact with the Amazon Bedrock services they’ve configured.

Each LOB performs monitoring and auditing of their configured Amazon Bedrock services within their account, using Amazon CloudWatch Logs and AWS CloudTrail for log capture, analysis, and auditing tailored to their needs. Amazon Bedrock cost and usage will be recorded in each LOB’s AWS accounts. By adopting this decentralized model, LOBs retain control over their generative AI solutions through a decentralized configuration, while benefiting from the scalability, reliability, and security of Amazon Bedrock.

The following diagram shows the architecture of the decentralized operating model.

Centralized operating model

The centralized AWS account serves as the primary hub for configuring and managing the core generative AI functionalities, including reusable agents, prompt flows, and shared libraries. LOB teams contribute their business-specific requirements and use cases to the centralized team, which then integrates and orchestrates the appropriate generative AI components within the centralized account.

Although the orchestration and configuration of generative AI solutions reside in the centralized account, they often require interaction with LOB-specific resources and services. To facilitate this, the centralized account uses API gateways or other integration points provided by the LOBs’ AWS accounts. These integration points enable secure and controlled communication between the centralized generative AI orchestration and the LOBs’ business-specific applications, data sources, or services. This centralized operating model promotes consistency, governance, and scalability of generative AI solutions across the organization.

The centralized team maintains adherence to common standards, best practices, and organizational policies, while also enabling efficient sharing and reuse of generative AI components. Furthermore, the core components of Amazon Bedrock, such as LLMs and guardrails, continue to be hosted and executed by AWS in the Amazon Bedrock service account, promoting secure, scalable, and high-performance execution environments for these critical components. In this centralized model, monitoring and auditing of Amazon Bedrock can be achieved within the centralized account, allowing for comprehensive monitoring, auditing, and analysis of all generative AI activities and configurations. Amazon CloudWatch Logs provides a unified view of generative AI operations across the organization.

By consolidating the orchestration and configuration of generative AI solutions in a centralized account while enabling secure integration with LOB-specific resources, this operating model promotes standardization, governance, and centralized control over generative AI operations. It uses the scalability, reliability, security, and centralized monitoring capabilities of AWS managed infrastructure and services, while still allowing for integration with LOB-specific requirements and use cases.

The following is the architecture for a centralized operating model.

Federated operating model

In a federated model, Amazon Bedrock enables a collaborative approach where LOB teams can develop and contribute common generative AI functionalities within their respective AWS accounts. These common functionalities, such as reusable agents, prompt flows, or shared libraries, can then be migrated to a centralized AWS account managed by a dedicated team or CCoE.

The centralized AWS account acts as a hub for integrating and orchestrating these common generative AI components, providing a unified platform for action groups and prompt flows. Although the orchestration and configuration of generative AI solutions remain within the LOBs’ AWS accounts, they can use the centralized Amazon Bedrock agents, prompt flows, and other shared components defined in the centralized account.

This federated model allows LOBs to retain control over their generative AI solutions, tailoring them to specific business requirements while benefiting from the reusable and centrally managed components. The centralized account maintains consistency, governance, and scalability of these shared generative AI components, promoting collaboration and standardization across the organization.

Organizations frequently prefer storing sensitive data, including Payment Card Industry (PCI), PII, General Data Protection Regulation (GDPR), and Health Insurance Portability and Accountability Act (HIPAA) information, within their respective LOB AWS accounts. This approach makes sure that LOBs maintain control over their sensitive business data in the vector store while preventing centralized teams from accessing it without proper governance and security measures.

A federated model combines decentralized development, centralized integration, and centralized monitoring. This operating model fosters collaboration, reusability, and standardization while empowering LOBs to retain control over their generative AI solutions. It uses the scalability, reliability, security, and centralized monitoring capabilities of AWS managed infrastructure and services, promoting a harmonious balance between autonomy and governance.

The following is the architecture for a federated operating model.

Cost management

Organizations may want to analyze Amazon Bedrock usage and costs per LOB. To track the cost and usage of FMs across LOBs’ AWS accounts, solutions that record model invocations per LOB can be implemented.

Amazon Bedrock now supports model invocation resources that use inference profiles. Inference profiles can be defined to track Amazon Bedrock usage metrics, monitor model invocation requests, or route model invocation requests to multiple AWS Regions for increased throughput.

There are two types of inference profiles: cross-Region inference profiles, which are predefined in Amazon Bedrock and include multiple AWS Regions to which requests for a model can be routed, and application inference profiles, which are user created to track cost and model usage when submitting on-demand model invocation requests. You can attach custom tags, such as cost allocation tags, to your application inference profiles. When submitting a prompt, you can include an inference profile ID or its Amazon Resource Name (ARN). This capability enables organizations to track and monitor costs for various LOBs, cost centers, or applications. For a detailed explanation of application inference profiles, refer to this post: Track, allocate, and manage your generative AI cost and usage with Amazon Bedrock.
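
The following sketch shows one way to create an application inference profile tagged for cost allocation and then reference it when invoking a model. The model ARN, tag values, and Region are placeholders, and the parameter names should be checked against the current Boto3 documentation for create_inference_profile.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Create an application inference profile for an LOB, tagged for cost allocation
profile = bedrock.create_inference_profile(
    inferenceProfileName="claims-lob-claude-haiku",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
    },
    tags=[{"key": "CostCenter", "value": "claims-lob"}],
)
profile_arn = profile["inferenceProfileArn"]

# Submit requests through the profile ARN so usage and cost roll up to the LOB
response = bedrock_runtime.converse(
    modelId=profile_arn,
    messages=[{"role": "user", "content": [{"text": "Summarize this claim note."}]}],
)
print(response["output"]["message"]["content"][0]["text"])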

Conclusion

Although enterprises often begin with a centralized operating model, the rapid pace of development in generative AI technologies, the need for agility, and the desire to quickly capture value often lead organizations to converge on a federated operating model.

In a federated operating model, lines of business have the freedom to innovate and experiment with generative AI solutions, taking advantage of their domain expertise and proximity to business problems. Key aspects of the AI workflow, such as data access policies, model risk management, and compliance monitoring, are managed by a central cloud governance team. Successful generative AI solutions developed by a line of business can be promoted and productionized by the central team for enterprise-wide re-use.

This federated model fosters innovation from the lines of business closest to domain problems. Simultaneously, it allows the central team to curate, harden, and scale those solutions adherent to organizational policies, then redeploy them efficiently to other relevant areas of the business.

To sustain this operating model, enterprises often establish a dedicated product team with a business owner that works in partnership with lines of business. This team is responsible for continually evolving the operating model, refactoring and enhancing the generative AI services to help meet the changing needs of the lines of business and keep up with the rapid advancements in LLMs and other generative AI technologies.

Federated operating models strike a balance, mitigating the risks of fully decentralized initiatives while minimizing bottlenecks from overly centralized approaches. By empowering business agility with curation by a central team, enterprises can accelerate compliant, high-quality generative AI capabilities aligned with their innovation goals, risk tolerances, and need for rapid value delivery in the evolving AI landscape.

As enterprises look to capitalize on the generative AI revolution, Amazon Bedrock provides the ideal foundation to establish a flexible operating model tailored to their organization’s needs. Whether you’re starting with a centralized, decentralized, or federated approach, AWS offers a comprehensive suite of services to support the full generative AI lifecycle.

Try Amazon Bedrock and let us know your feedback on how you’re planning to implement the operating model that suits your organization.


About the Authors

Martin Tunstall is a Principal Solutions Architect at AWS. With over three decades of experience in the finance sector, he helps global finance and insurance customers unlock the full potential of Amazon Web Services (AWS).

Yashar Araghi is a Senior Solutions Architect at AWS. He has over 20 years of experience designing and building infrastructure and application security solutions. He has worked with customers across various industries such as government, education, finance, energy, and utilities. In the last 6 years at AWS, Yashar has helped customers design, build, and operate their cloud solutions that are secure, reliable, performant and cost optimized in the AWS Cloud.

Read More

Develop a RAG-based application using Amazon Aurora with Amazon Kendra

Generative AI and large language models (LLMs) are helping organizations across diverse sectors enhance customer experience in ways that would traditionally have taken years of effort. Every organization has data stored in data stores, either on premises or with cloud providers.

You can embrace generative AI and enhance customer experience by converting your existing data into an index on which generative AI can search. When you ask a question to an open source LLM, you get publicly available information as a response. Although this is helpful, generative AI can help you understand your data along with additional context from LLMs. This is achieved through Retrieval Augmented Generation (RAG).

RAG retrieves data from a preexisting knowledge base (your data), combines it with the LLM’s knowledge, and generates responses with more human-like language. However, in order for generative AI to understand your data, some amount of data preparation is required, which involves a big learning curve.

Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases.

In this post, we walk you through how to convert your existing Aurora data into an Amazon Kendra index, without additional data preparation, so that Amazon Kendra can perform data search, and how to implement RAG that combines your data with LLM knowledge to produce accurate responses.

Solution overview

In this solution, you use your existing data in Aurora as a data source, create an intelligent search service by connecting and syncing your data source to an Amazon Kendra index, and perform a generative AI data search, which uses RAG to produce accurate responses by combining your data with the LLM’s knowledge. For this post, we use Anthropic’s Claude on Amazon Bedrock as our LLM.

The following are the high-level steps for the solution:

  1. Create an Aurora PostgreSQL cluster and ingest your data.
  2. Create an Amazon Kendra index.
  3. Set up the Amazon Kendra Aurora PostgreSQL connector and sync the data source.
  4. Invoke the RAG application, which combines Amazon Kendra search results with LLM knowledge to answer questions.

The following diagram illustrates the solution architecture.

Prerequisites

To follow this post, the following prerequisites are required:

Create an Aurora PostgreSQL cluster

Run the following AWS CLI commands to create an Aurora PostgreSQL Serverless v2 cluster:

aws rds create-db-cluster \
--engine aurora-postgresql \
--engine-version 15.4 \
--db-cluster-identifier genai-kendra-ragdb \
--master-username postgres \
--master-user-password XXXXX \
--db-subnet-group-name dbsubnet \
--vpc-security-group-ids "sg-XXXXX" \
--serverless-v2-scaling-configuration "MinCapacity=2,MaxCapacity=64" \
--enable-http-endpoint \
--region us-east-2

aws rds create-db-instance \
--db-cluster-identifier genai-kendra-ragdb \
--db-instance-identifier genai-kendra-ragdb-instance \
--db-instance-class db.serverless \
--engine aurora-postgresql

The following screenshot shows the created instance.

Ingest data to Aurora PostgreSQL-Compatible

Connect to the Aurora instance using the pgAdmin tool. Refer to Connecting to a DB instance running the PostgreSQL database engine for more information. To ingest your data, complete the following steps:

  1. Run the following PostgreSQL statements in pgAdmin to create the database, schema, and table:
    CREATE DATABASE genai;
    CREATE SCHEMA employees;
    SET SCHEMA 'employees';
    
    CREATE TABLE employees.amazon_review(
    pk int GENERATED ALWAYS AS IDENTITY NOT NULL,
    id varchar(50) NOT NULL,
    name varchar(300) NULL,
    asins Text NULL,
    brand Text NULL,
    categories Text NULL,
    keys Text NULL,
    manufacturer Text NULL,
    reviews_date Text NULL,
    reviews_dateAdded Text NULL,
    reviews_dateSeen Text NULL,
    reviews_didPurchase Text NULL,
    reviews_doRecommend varchar(100) NULL,
    reviews_id varchar(150) NULL,
    reviews_numHelpful varchar(150) NULL,
    reviews_rating varchar(150) NULL,
    reviews_sourceURLs Text NULL,
    reviews_text Text NULL,
    reviews_title Text NULL,
    reviews_userCity varchar(100) NULL,
    reviews_userProvince varchar(100) NULL,
    reviews_username Text NULL,
    PRIMARY KEY (pk)
    );

  2. In your pgAdmin Aurora PostgreSQL connection, navigate to Databases, genai, Schemas, employees, Tables.
  3. Choose (right-click) Tables and choose PSQL Tool to open a PSQL client connection.
  4. Place the csv file under your pgAdmin location and run the following command:
    \copy employees.amazon_review (id, name, asins, brand, categories, keys, manufacturer, reviews_date, reviews_dateadded, reviews_dateseen, reviews_didpurchase, reviews_dorecommend, reviews_id, reviews_numhelpful, reviews_rating, reviews_sourceurls, reviews_text, reviews_title, reviews_usercity, reviews_userprovince, reviews_username) FROM 'C:\Program Files\pgAdmin 4\runtime\amazon_review.csv' DELIMITER ',' CSV HEADER ENCODING 'utf8';

  5. Run the following PSQL query to verify the number of records copied:
    Select count (*) from employees.amazon_review;

Create an Amazon Kendra index

The Amazon Kendra index holds the contents of your documents and is structured in a way to make the documents searchable. It has three index types:

  • Generative AI Enterprise Edition index – Offers the highest accuracy for the Retrieve API operation and for RAG use cases (recommended)
  • Enterprise Edition index – Provides semantic search capabilities and offers a high-availability service that is suitable for production workloads
  • Developer Edition index – Provides semantic search capabilities for you to test your use cases

To create an Amazon Kendra index, complete the following steps:

  1. On the Amazon Kendra console, choose Indexes in the navigation pane.
  2. Choose Create an index.
  3. On the Specify index details page, provide the following information:
    • For Index name, enter a name (for example, genai-kendra-index).
    • For IAM role, choose Create a new role (Recommended).
    • For Role name, enter an IAM role name (for example, genai-kendra). Your role name will be prefixed with AmazonKendra-<region>- (for example, AmazonKendra-us-east-2-genai-kendra).
  4. Choose Next.
  5. On the Add additional capacity page, select Developer edition (for this demo) and choose Next.
  6. On the Configure user access control page, provide the following information:
    • Under Access control settings, select No.
    • Under User-group expansion, select None.
  7. Choose Next.
  8. On the Review and create page, verify the details and choose Create.

It might take some time for the index to be created. Check the list of indexes to watch the progress of index creation. When the status of the index is ACTIVE, your index is ready to use.

Set up the Amazon Kendra Aurora PostgreSQL connector

Complete the following steps to set up your data source connector:

  1. On the Amazon Kendra console, choose Data sources in the navigation pane.
  2. Choose Add data source.
  3. Choose Aurora PostgreSQL connector as the data source type.
  4. On the Specify data source details page, provide the following information:
    • For Data source name, enter a name (for example, data_source_genai_kendra_postgresql).
    • For Default language, choose English (en).
    • Choose Next.
  5. On the Define access and security page, under Source, provide the following information:
    • For Host, enter the host name of the PostgreSQL instance (cvgupdj47zsh.us-east-2.rds.amazonaws.com).
    • For Port, enter the port number of the PostgreSQL instance (5432).
    • For Instance, enter the database name of the PostgreSQL instance (genai).
  6. Under Authentication, if you already have credentials stored in AWS Secrets Manager, choose them on the dropdown menu. Otherwise, choose Create and add new secret.
  7. In the Create an AWS Secrets Manager secret pop-up window, provide the following information:
    • For Secret name, enter a name (for example, AmazonKendra-Aurora-PostgreSQL-genai-kendra-secret).
    • For Database user name, enter the name of your database user.
    • For Password, enter the user password.
  8. Choose Add Secret.
  9. Under Configure VPC and security group, provide the following information:
    • For Virtual Private Cloud, choose your virtual private cloud (VPC).
    • For Subnet, choose your subnet.
    • For VPC security groups, choose the VPC security group to allow access to your data source.
  10. Under IAM role, if you have an existing role, choose it on the dropdown menu. Otherwise, choose Create a new role.
  11. On the Configure sync settings page, under Sync scope, provide the following information:
    • For SQL query, enter the SQL query and column values as follows: select * from employees.amazon_review.
    • For Primary key, enter the primary key column (pk).
    • For Title, enter the title column that provides the name of the document title within your database table (reviews_title).
    • For Body, enter the body column on which your Amazon Kendra search will happen (reviews_text).
  12. Under Sync mode, select Full sync to convert the entire table data into a searchable index.

After the sync completes successfully, your Amazon Kendra index will contain the data from the specified Aurora PostgreSQL table. You can then use this index for intelligent search and RAG applications.

  13. Under Sync run schedule, choose Run on demand.
  14. Choose Next.
  15. On the Set field mappings page, leave the default settings and choose Next.
  16. Review your settings and choose Add data source.

Your data source will appear on the Data sources page after the data source has been created successfully.

Invoke the RAG application

The Amazon Kendra index sync can take minutes to hours depending on the volume of your data. When the sync completes without error, you are ready to develop your RAG solution in your preferred IDE. Complete the following steps:

  1. Configure your AWS credentials to allow Boto3 to interact with AWS services. You can do this by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or by using the ~/.aws/credentials file:
    # Install the required packages first, for example:
    # pip install boto3 langchain langchain-community
    import boto3

    # Create a Boto3 session (alternatively, rely on environment variables
    # or the ~/.aws/credentials file instead of hard-coding keys)
    session = boto3.Session(
        aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
        aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY',
        region_name='YOUR_AWS_REGION'
    )

  2. Import LangChain and the necessary components:
    from langchain_community.llms import Bedrock
    from langchain_community.retrievers import AmazonKendraRetriever
    from langchain.chains import RetrievalQA

  3. Create an instance of the LLM (Anthropic’s Claude):
    llm = Bedrock(
        region_name="bedrock_region_name",
        model_id="anthropic.claude-v2",
        model_kwargs={
            "max_tokens_to_sample": 300,
            "temperature": 1,
            "top_k": 250,
            "top_p": 0.999,
            "anthropic_version": "bedrock-2023-05-31",
        },
    )

  4. Create your prompt template, which provides instructions for the LLM:
    from langchain_core.prompts import PromptTemplate
    
    prompt_template = """
    You are a <persona>Product Review Specialist</persona>, and you provide detailed product review insights.
    You have access to the product reviews in the <context> XML tags below and nothing else.
    
    <context>
    {context}
    </context>
    
    <question>
    {question}
    </question>
    """
    
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

  5. Initialize the AmazonKendraRetriever with the Amazon Kendra client and the index ID you created earlier (replace Kendra_Index_ID and Kendra_region_name with your own values):
    session = boto3.Session(region_name='Kendra_region_name')
    kendra_client = session.client('kendra')
    
    # Create an instance of AmazonKendraRetriever
    kendra_retriever = AmazonKendraRetriever(
        kendra_client=kendra_client,
        index_id="Kendra_Index_ID"
    )

  6. Combine Anthropic’s Claude and the Amazon Kendra retriever into a RetrievalQA chain:
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=kendra_retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )

  7. Invoke the chain with your own query:
    query = "What are some products that has bad quality reviews, summarize the reviews"
    result_ = qa.invoke(
    query
    )
    result_
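
With return_source_documents=True, the RetrievalQA chain returns a dictionary that typically exposes the generated answer under the result key and the retrieved passages under source_documents. The following snippet is a minimal sketch, assuming those standard LangChain keys and the Amazon Kendra document metadata fields, that prints the answer along with its sources:

# Print the generated answer
print(result_["result"])

# Print the Amazon Kendra documents that grounded the answer
for doc in result_["source_documents"]:
    # "title" and "source" are metadata fields populated by AmazonKendraRetriever
    print(doc.metadata.get("title"), "-", doc.metadata.get("source"))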


Clean up

To avoid incurring future charges, delete the resources you created as part of this post:

  1. Delete the Aurora DB cluster and DB instance.
  2. Delete the Amazon Kendra index.
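
If you prefer to script the cleanup, the following boto3 sketch shows the corresponding API calls; the instance, cluster, and index identifiers are placeholders that you need to replace with your own values.

import boto3

rds = boto3.client("rds")
kendra = boto3.client("kendra")

# Delete the Aurora DB instance first, then the cluster (identifiers are placeholders)
rds.delete_db_instance(DBInstanceIdentifier="genai-instance", SkipFinalSnapshot=True)
rds.delete_db_cluster(DBClusterIdentifier="genai-cluster", SkipFinalSnapshot=True)

# Delete the Amazon Kendra index by its index ID (placeholder)
kendra.delete_index(Id="Kendra_Index_ID")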

Conclusion

In this post, we discussed how to convert your existing Aurora data into an Amazon Kendra index and implement a RAG-based solution for data search. This solution drastically reduces the data preparation needed for Amazon Kendra search. It also speeds up generative AI application development by reducing the learning curve behind data preparation.

Try out the solution, and if you have any comments or questions, leave them in the comments section.


About the Authors

Aravind Hariharaputran is a Data Consultant with the Professional Services team at Amazon Web Services. He is passionate about data and AI/ML in general, with extensive experience managing database technologies. He helps customers transform legacy databases and applications to modern data platforms and generative AI applications. He enjoys spending time with family and playing cricket.

Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Read More

Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference

Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference

In production generative AI applications, responsiveness is just as important as the intelligence behind the model. Whether it’s customer service teams handling time-sensitive inquiries or developers needing instant code suggestions, every second of delay, known as latency, can have a significant impact. As businesses increasingly use large language models (LLMs) for these critical tasks and processes, they face a fundamental challenge: how to maintain the quick, responsive performance users expect while delivering the high-quality outputs these sophisticated models promise.

The impact of latency on user experience extends beyond mere inconvenience. In interactive AI applications, delayed responses can break the natural flow of conversation, diminish user engagement, and ultimately affect the adoption of AI-powered solutions. This challenge is compounded by the increasing complexity of modern LLM applications, where multiple LLM calls are often needed to solve a single problem, significantly increasing total processing times.

During re:Invent 2024, we launched latency-optimized inference for foundation models (FMs) in Amazon Bedrock. This new inference feature provides reduced latency for Anthropic’s Claude 3.5 Haiku model and Meta’s Llama 3.1 405B and 70B models compared to their standard versions. This feature is especially helpful for time-sensitive workloads where rapid response is business critical.

In this post, we explore how Amazon Bedrock latency-optimized inference can help address the challenges of maintaining responsiveness in LLM applications. We’ll dive deep into strategies for optimizing application performance and improving user experience. Whether you’re building a new AI application or optimizing an existing one, you’ll find practical guidance on both the technical aspects of latency optimization and real-world implementation approaches. We begin by explaining latency in LLM applications.

Understanding latency in LLM applications

Latency in LLM applications is a multifaceted concept that goes beyond simple response times. When you interact with an LLM, you can receive responses in one of two ways: streaming or nonstreaming mode. In nonstreaming mode, you wait for the complete response before receiving any output—like waiting for someone to finish writing a letter. In streaming mode, you receive the response as it’s being generated—like watching someone type in real time.

To effectively optimize AI applications for responsiveness, we need to understand the key metrics that define latency and how they impact user experience. These metrics differ between streaming and nonstreaming modes and understanding them is crucial for building responsive AI applications.

Time to first token (TTFT) represents how quickly your streaming application starts responding. It’s the amount of time from when a user submits a request until they receive the beginning of a response (the first word, token, or chunk). Think of it as the initial reaction time of your AI application.

TTFT is affected by several factors:

  • Length of your input prompt (longer prompts generally mean higher TTFT)
  • Network conditions and geographic location (if the prompt is getting processed in a different region, it will take longer)

Calculation: TTFT = Time of first chunk/token – Time of request submission
Interpretation: Lower is better

Output tokens per second (OTPS) indicates how quickly your model generates new tokens after it starts responding. This metric is crucial for understanding the actual throughput of your model and how it maintains its response speed throughout longer generations.

OTPS is influenced by:

  • Model size and complexity
  • Length of the generated response
  • Complexity of the task and prompt
  • System load and resource availability

Calculation: OTPS = Total number of output tokens / Total generation time
Interpretation: Higher is better

End-to-end latency (E2E) measures the total time from request to complete response. As illustrated in the figure above, this encompasses the entire interaction.

Key factors affecting this metric include:

  • Input prompt length
  • Requested output length
  • Model processing speed
  • Network conditions
  • Complexity of the task and prompt
  • Postprocessing requirements (for example, using Amazon Bedrock Guardrails or other quality checks)

Calculation: E2E latency = Time at completion of the response – Time of request submission
Interpretation: Lower is better
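
To make these definitions concrete, the following sketch measures TTFT, OTPS, and end-to-end latency for a single streaming call with the Amazon Bedrock Converse API. The model ID and prompt are examples only, and a real benchmark should average over many requests.

import time
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-2")

start = time.time()
first_token_time = None
output_tokens = 0

response = bedrock_runtime.converse_stream(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Summarize why low latency matters."}]}],
)

for event in response["stream"]:
    if "contentBlockDelta" in event and first_token_time is None:
        first_token_time = time.time()  # first chunk received
    if "metadata" in event:
        output_tokens = event["metadata"]["usage"]["outputTokens"]

end = time.time()
ttft = first_token_time - start                            # lower is better
e2e_latency = end - start                                  # lower is better
otps = output_tokens / max(end - first_token_time, 1e-6)   # higher is better
print(f"TTFT: {ttft:.2f}s, E2E: {e2e_latency:.2f}s, OTPS: {otps:.1f} tokens/s")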

Although these metrics provide a solid foundation for understanding latency, there are additional factors and considerations that can impact the perceived performance of LLM applications. These metrics are shown in the following diagram.

The role of tokenization

An often-overlooked aspect of latency is how different models tokenize text differently. Each model’s tokenization strategy is defined by its provider during training and can’t be modified. For example, a prompt that generates 100 tokens in one model might generate 150 tokens in another. When comparing model performance, remember that these inherent tokenization differences can affect perceived response times, even when the models are equally efficient. Awareness of this variation can help you better interpret latency differences between models and make more informed decisions when selecting models for your applications.
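
One practical way to observe these differences is to compare the token usage reported by the Converse API for the same prompt across models. The following sketch assumes both example model IDs are enabled in your account and Region.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-2")
prompt = "Explain retrieval augmented generation in two sentences."

# Example model IDs; substitute models available to you
for model_id in ["us.anthropic.claude-3-5-haiku-20241022-v1:0", "us.meta.llama3-1-70b-instruct-v1:0"]:
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    usage = response["usage"]  # token counts as seen by each model's tokenizer
    print(model_id, usage["inputTokens"], "input tokens,", usage["outputTokens"], "output tokens")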

Understanding user experience

The psychology of waiting in AI applications reveals interesting patterns about user expectations and satisfaction. Users tend to perceive response times differently based on the context and complexity of their requests. A slight delay in generating a complex analysis might be acceptable, whereas even a small lag in a conversational exchange can feel disruptive. This understanding helps us set appropriate optimization priorities for different types of applications.

Consistency over speed

Consistent response times, even if slightly slower, often lead to better user satisfaction than highly variable response times with occasional quick replies. This is crucial for streaming responses and implementing optimization strategies.

Keeping users engaged

When processing times are longer, simple indicators such as a “Processing your request” message or a loading animation help keep users engaged, especially during the initial response time. In such scenarios, you want to optimize for TTFT.

Balancing speed, quality, and cost

Output quality often matters more than speed. Users prefer accurate responses over quick but less reliable ones. Consider benchmarking your user experience to find the best latency for your use case, considering that most humans can’t read faster than 225 words per minute and therefore an extremely fast response can even hinder the user experience.

By understanding these nuances, you can make more informed decisions to optimize your AI applications for better user experience.

Latency-optimized inference: A deep dive

Amazon Bedrock latency-optimized inference capabilities are designed to provide higher OTPS and quicker TTFT, enabling applications to handle workloads more reliably. This optimization is available in the US East (Ohio) AWS Region for select FMs, including Anthropic’s Claude 3.5 Haiku and Meta’s Llama 3.1 models (both 405B and 70B versions). The optimization provides the following improvements:

  • Higher OTPS – Faster token generation after the model starts responding
  • Quicker TTFT – Faster initial response time

Implementation

To enable latency optimization, you need to set the latency parameter to optimized in your API calls:

# Using converse api without streaming
response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-3-5-haiku-20241022-v1:0',
    messages=[{
            'role': 'user',
            'content': [{
                'text':'Write a story about music generating AI models'
                }]
              }],
    performanceConfig={'latency': 'optimized'}
)

For streaming responses:

# using converse API with streaming
response = bedrock_runtime.converse_stream(
       modelId='us.anthropic.claude-3-5-haiku-20241022-v1:0',
       messages=[{
            'role': 'user',
            'content': [{
                'text':'Write a story about music generating AI models'
                }]
              }],
        performanceConfig={'latency': 'optimized'}
    )
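
These snippets assume a bedrock_runtime client has already been created. The following minimal sketch creates the client and prints the streamed text as it arrives; the Region and model ID are examples.

import boto3

# Client assumed by the preceding examples; US East (Ohio) hosts the latency-optimized profiles
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-2")

response = bedrock_runtime.converse_stream(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    messages=[{"role": "user", "content": [{"text": "Write a story about music generating AI models"}]}],
    performanceConfig={"latency": "optimized"},
)

# Print the generated text chunk by chunk
for event in response["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)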

Benchmarking methodology and results

To understand the performance gains for both TTFT and OTPS, we conducted an offline experiment with around 1,600 API calls spread across various hours of the day and across multiple days. We used a dummy dataset comprising different task types: sequence-counting, story-writing, summarization, and translation. Input prompts ranged from 100 to 100,000 tokens, and output lengths ranged from 100 to 1,000 tokens. These tasks were chosen to represent varying complexity levels and various model output lengths. Our test setup was hosted in the US West (Oregon) us-west-2 Region, and both the optimized and standard models were hosted in the US East (Ohio) us-east-2 Region. This cross-Region setup introduced realistic network variability, helping us measure performance under conditions similar to real-world applications.

When analyzing the results, we focused on the key latency metrics discussed earlier: TTFT and OTPS. As a quick recap, lower TTFT values indicate faster initial response times, and higher OTPS values represent faster token generation speeds. We also looked at the 50th percentile (P50) and 90th percentile (P90) values to understand both typical performance and performance boundaries under challenging or upper bound conditions. Following the central limit theorem, we observed that, with sufficient samples, our results converged toward consistent values, providing reliable performance indicators.

It’s important to note that these results are from our specific test environment and datasets. Your actual results may vary based on your specific use case, prompt length, expected model response length, network conditions, client location, and other implementation components. When conducting your own benchmarks, make sure your test dataset represents your actual production workload characteristics, including typical input lengths and expected output patterns.

Benchmark results

Our experiments with the latency-optimized models revealed substantial performance improvements across both TTFT and OTPS metrics. The results in the following table show the comparison between standard and optimized versions of Anthropic’s Claude 3.5 Haiku and Meta’s Llama 3.1 70B models. For each model, we ran multiple iterations of our test scenarios to promote reliable performance. The improvements were particularly notable in high-percentile measurements, suggesting more consistent performance even under challenging conditions.

| Model | Inference profile | TTFT P50 (in seconds) | TTFT P90 (in seconds) | OTPS P50 | OTPS P90 |
|---|---|---|---|---|---|
| us.anthropic.claude-3-5-haiku-20241022-v1:0 | Optimized | 0.6 | 1.4 | 85.9 | 152.0 |
| us.anthropic.claude-3-5-haiku-20241022-v1:0 | Standard | 1.1 | 2.9 | 48.4 | 67.4 |
| | Improvement | -42.20% | -51.70% | 77.34% | 125.50% |
| us.meta.llama3-1-70b-instruct-v1:0 | Optimized | 0.4 | 1.2 | 137.0 | 203.7 |
| us.meta.llama3-1-70b-instruct-v1:0 | Standard | 0.9 | 42.8 | 30.2 | 32.4 |
| | Improvement | -51.65% | -97.10% | 353.84% | 529.33% |

These results demonstrate significant improvements across all metrics for both models. For Anthropic’s Claude 3.5 Haiku model, the optimized version achieved up to 42.20% reduction in TTFT P50 and up to 51.70% reduction in TTFT P90, indicating more consistent initial response times. Additionally, the OTPS saw improvements of up to 77.34% at the P50 level and up to 125.50% at the P90 level, enabling faster token generation.

The gains for Meta’s Llama 3.1 70B model are even more impressive, with the optimized version achieving up to 51.65% reduction in TTFT P50 and up to 97.10% reduction in TTFT P90, providing consistently rapid initial responses. Furthermore, the OTPS saw a massive boost, with improvements of up to 353.84% at the P50 level and up to 529.33% at the P90 level, enabling up to 5x faster token generation in some scenarios.

Although these benchmark results show the powerful impact of latency-optimized inference, they represent just one piece of the optimization puzzle. To make best use of these performance improvements and achieve the best possible response times for your specific use case, you’ll need to consider additional optimization strategies beyond merely enabling the feature.

Comprehensive guide to LLM latency optimization

Even though Amazon Bedrock latency-optimized inference offers great improvements from the start, getting the best performance requires a well-rounded approach to designing and implementing your application. In the next section, we explore some other strategies and considerations to make your application as responsive as possible.

Prompt engineering for latency optimization

When optimizing LLM applications for latency, the way you craft your prompts affects both input processing and output generation.

To optimize your input prompts, follow these recommendations:

  • Keep prompts concise – Long input prompts take more time to process and increase TTFT. Create short, focused prompts that prioritize necessary context and information.
  • Break down complex tasks – Instead of handling large tasks in a single request, break them into smaller, manageable chunks. This approach helps maintain responsiveness regardless of task complexity.
  • Smart context management – For interactive applications such as chatbots, include only relevant context instead of entire conversation history.
  • Token management – Different models tokenize text differently, meaning the same input can result in different numbers of tokens. Monitor and optimize token usage to keep performance consistent. Use token budgeting to balance context preservation with performance needs.

To engineer for brief outputs, follow these recommendations:

  • Engineer for brevity – Include explicit length constraints in your prompts (for example, “respond in 50 words or less”)
  • Use system messages – Set response length constraints through system messages
  • Balance quality and length – Make sure response constraints don’t compromise output quality

One of the best ways to make your AI application feel faster is to use streaming. Instead of waiting for the complete response, streaming shows the response as it’s being generated—like watching someone type in real time. Streaming the response is one of the most effective ways to improve perceived performance in LLM applications while maintaining user engagement.

These techniques can significantly reduce token usage and generation time, improving both latency and cost-efficiency.
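
As an illustration of these recommendations, the following sketch combines a concise prompt, a system-level length constraint, and an explicit output cap through inferenceConfig. The specific values and model ID are examples, not tuned recommendations.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-2")

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # example model ID
    system=[{"text": "You are a concise assistant. Respond in 50 words or less."}],
    messages=[{"role": "user", "content": [{"text": "Summarize why streaming improves perceived latency."}]}],
    inferenceConfig={"maxTokens": 150, "temperature": 0.2},  # cap output length
    performanceConfig={"latency": "optimized"},
)
print(response["output"]["message"]["content"][0]["text"])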

Building production-ready AI applications

Although individual optimizations are important, production applications require a holistic approach to latency management. In this section, we explore how different system components and architectural decisions impact overall application responsiveness.

System architecture and end-to-end latency considerations

In production environments, overall system latency extends far beyond model inference time. Each component in your AI application stack contributes to the total latency experienced by users. For instance, when implementing responsible AI practices through Amazon Bedrock Guardrails, you might notice a small additional latency overhead. Similar considerations apply when integrating content filtering, user authentication, or input validation layers. Although each component serves a crucial purpose, their cumulative impact on latency requires careful consideration during system design.

Geographic distribution plays a significant role in application performance. Model invocation latency can vary considerably depending on whether calls originate from different Regions, local machines, or different cloud providers. This variation stems from data travel time across networks and geographic distances. When designing your application architecture, consider factors such as the physical distance between your application and model endpoints, cross-Region data transfer times, and network reliability in different Regions. Data residency requirements might also influence these architectural choices, potentially necessitating specific Regional deployments.

Integration patterns significantly impact how users perceive application performance. Synchronous processing, although simpler to implement, might not always provide the best user experience. Consider implementing asynchronous patterns where appropriate, such as pre-fetching likely responses based on user behavior patterns or processing noncritical components in the background. Request batching for bulk operations can also help optimize overall system throughput, though it requires careful balance with response time requirements.

As applications scale, additional infrastructure components become necessary but can impact latency. Load balancers, queue systems, cache layers, and monitoring systems all contribute to the overall latency budget. Understanding these components’ impact helps in making informed decisions about infrastructure design and optimization strategies.

Complex tasks often require orchestrating multiple model calls or breaking down problems into subtasks. Consider a content generation system that first uses a fast model to generate an outline, then processes different sections in parallel, and finally uses another model for coherence checking and refinement. This orchestration approach requires careful attention to cumulative latency impact while maintaining output quality. Each step needs appropriate timeouts and fallback mechanisms to provide reliable performance under various conditions.

Prompt caching for enhanced performance

Although our focus is on latency-optimized inference, it’s worth noting that Amazon Bedrock also offers prompt caching (in preview) to optimize for both cost and latency. This feature is particularly valuable for applications that frequently reuse context, such as document-based chat assistants or applications with repetitive query patterns. When combined with latency-optimized inference, prompt caching can provide additional performance benefits by reducing the processing overhead for frequently used contexts.

Prompt routing for intelligent model selection

Similar to prompt caching, Amazon Bedrock Intelligent Prompt Routing (in preview) is another powerful optimization feature. This capability automatically directs requests to different models within the same model family based on the complexity of each prompt. For example, simple queries can be routed to faster, more cost-effective models, and complex requests that require deeper understanding are directed to more sophisticated models. This automatic routing helps optimize both performance and cost without requiring manual intervention.

Architectural considerations and caching

Application architecture plays a crucial role in overall latency optimization. Consider implementing a multitiered caching strategy that includes response caching for frequently requested information and smart context management for historical information. This isn’t only about storing exact matches—consider implementing semantic caching that can identify and serve responses to similar queries.
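
The following minimal sketch illustrates the semantic caching idea with an in-memory store and the Amazon Titan Text Embeddings V2 model; the model choice, similarity threshold, and data structure are assumptions, and a production system would use a vector database with an eviction policy.

import json
import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-2")
cache = []  # list of (embedding, response) pairs, in memory for illustration only

def embed(text):
    # Titan Text Embeddings V2 is an example; any embedding model works
    body = json.dumps({"inputText": text})
    resp = bedrock_runtime.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    return np.array(json.loads(resp["body"].read())["embedding"])

def cached_answer(query, generate_fn, threshold=0.9):
    q_emb = embed(query)
    for emb, answer in cache:
        similarity = float(np.dot(q_emb, emb) / (np.linalg.norm(q_emb) * np.linalg.norm(emb)))
        if similarity >= threshold:
            return answer  # serve a semantically similar cached response
    answer = generate_fn(query)  # cache miss, call the LLM
    cache.append((q_emb, answer))
    return answer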

Balancing model sophistication, latency, and cost

In AI applications, there’s a constant balancing act between model sophistication, latency, and cost, as illustrated in the diagram. Although more advanced models often provide higher quality outputs, they might not always meet strict latency requirements. In such cases, using a less sophisticated but faster model might be the better choice. For instance, in applications requiring near-instantaneous responses, opting for a smaller, more efficient model could be necessary to meet latency goals, even if it means a slight trade-off in output quality. This approach aligns with the broader need to optimize the interplay between cost, speed, and quality in AI systems.

Features such as Amazon Bedrock Intelligent Prompt Routing help manage this balance effectively. By automatically handling model selection based on request complexity, you can optimize for all three factors—quality, speed, and cost—without requiring developers to commit to a single model for all requests.

As we’ve explored throughout this post, optimizing LLM application latency involves multiple strategies, from using latency-optimized inference and prompt caching to implementing intelligent routing and careful prompt engineering. The key is to combine these approaches in a way that best suits your specific use case and requirements.

Conclusion

Making your AI application fast and responsive isn’t a one-time task; it’s an ongoing process of testing and improvement. Amazon Bedrock latency-optimized inference gives you a great starting point, and you’ll notice significant improvements when you combine it with the strategies we’ve discussed.

Ready to get started? Here’s what to do next:

  1. Try our sample notebook to benchmark latency for your specific use case
  2. Enable latency-optimized inference in your application code
  3. Set up Amazon CloudWatch metrics to monitor your application’s performance

Remember, in today’s AI applications, being smart isn’t enough; being responsive is just as important. Start implementing these optimization strategies today and watch your application’s performance improve.


About the Authors

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Vivek Singh is a Senior Manager, Product Management at AWS AI Language Services team. He leads the Amazon Transcribe product team. Prior to joining AWS, he held product management roles across various other Amazon organizations such as consumer payments and retail. Vivek lives in Seattle, WA and enjoys running, and hiking.

Ankur Desai is a Principal Product Manager within the AWS AI Services team.

Read More

Track LLM model evaluation using Amazon SageMaker managed MLflow and FMEval

Track LLM model evaluation using Amazon SageMaker managed MLflow and FMEval

Evaluating large language models (LLMs) is crucial as LLM-based systems become increasingly powerful and relevant in our society. Rigorous testing allows us to understand an LLM’s capabilities, limitations, and potential biases, and provide actionable feedback to identify and mitigate risk. Furthermore, evaluation processes are important not only for LLMs, but are becoming essential for assessing prompt template quality, input data quality, and ultimately, the entire application stack. As LLMs take on more significant roles in areas like healthcare, education, and decision support, robust evaluation frameworks are vital for building trust and realizing the technology’s potential while mitigating risks.

Developers interested in using LLMs should prioritize a comprehensive evaluation process for several reasons. First, it assesses the model’s suitability for specific use cases, because performance can vary significantly across different tasks and domains. Evaluations are also a fundamental tool during application development to validate the quality of prompt templates. This process makes sure that solutions align with the company’s quality standards and policy guidelines before deploying them to production. Regular interval evaluation also allows organizations to stay informed about the latest advancements, making informed decisions about upgrading or switching models. Moreover, a thorough evaluation framework helps companies address potential risks when using LLMs, such as data privacy concerns, regulatory compliance issues, and reputational risk from inappropriate outputs. By investing in robust evaluation practices, companies can maximize the benefits of LLMs while maintaining responsible AI implementation and minimizing potential drawbacks.

To support robust generative AI application development, it’s essential to keep track of models, prompt templates, and datasets used throughout the process. This record-keeping allows developers and researchers to maintain consistency, reproduce results, and iterate on their work effectively. By documenting the specific model versions, fine-tuning parameters, and prompt engineering techniques employed, teams can better understand the factors contributing to their AI system’s performance. Similarly, maintaining detailed information about the datasets used for training and evaluation helps identify potential biases and limitations in the model’s knowledge base. This comprehensive approach to tracking key components not only facilitates collaboration among team members but also enables more accurate comparisons between different iterations of the AI application. Ultimately, this systematic approach to managing models, prompts, and datasets contributes to the development of more reliable and transparent generative AI applications.

In this post, we show how to use FMEval and Amazon SageMaker to programmatically evaluate LLMs. FMEval is an open source LLM evaluation library, designed to provide data scientists and machine learning (ML) engineers with a code-first experience to evaluate LLMs for various aspects, including accuracy, toxicity, fairness, robustness, and efficiency. In this post, we only focus on the quality and responsible aspects of model evaluation, but the same approach can be extended by using other libraries for evaluating performance and cost, such as LLMeter and FMBench, or richer quality evaluation capabilities like those provided by Amazon Bedrock Evaluations.

SageMaker is a data, analytics, and AI/ML platform, which we will use in conjunction with FMEval to streamline the evaluation process. We specifically focus on SageMaker with MLflow. MLflow is an open source platform for managing the end-to-end ML lifecycle, including experimentation, reproducibility, and deployment. The managed MLflow in SageMaker simplifies the deployment and operation of tracking servers, and offers seamless integration with other AWS services, making it straightforward to track experiments, package code into reproducible runs, and share and deploy models.

By combining FMEval’s evaluation capabilities with SageMaker with MLflow, you can create a robust, scalable, and reproducible workflow for assessing LLM performance. This approach can enable you to systematically evaluate models, track results, and make data-driven decisions in your generative AI development process.

Using FMEval for model evaluation

FMEval is an open-source library for evaluating foundation models (FMs). It consists of three main components:

  • Data config – Specifies the dataset location and its structure.
  • Model runner – Composes the input, invokes your model, and extracts its output. Thanks to this construct, you can evaluate any LLM by configuring the model runner according to your model.
  • Evaluation algorithm – Computes evaluation metrics from model outputs. Different algorithms have different metrics to be specified.

You can use pre-built components, because FMEval provides native implementations for both Amazon Bedrock and Amazon SageMaker JumpStart, or create custom ones by inheriting from the base core components. The library supports various evaluation scenarios, including pre-computed model outputs and on-the-fly inference. FMEval offers flexibility in dataset handling, model integration, and algorithm implementation. Refer to Evaluate large language models for quality and responsibility or the Evaluating Large Language Models with fmeval paper to dive deeper into FMEval, or see the official GitHub repository.

Using SageMaker with MLflow to track experiments

The fully managed MLflow capability on SageMaker is built around three core components:

  • MLflow tracking server – This component can be quickly set up through the Amazon SageMaker Studio interface or using the API for more granular configurations. It functions as a standalone HTTP server that provides various REST API endpoints for monitoring, recording, and visualizing experiment runs. This allows you to keep track of your ML experiments.
  • MLflow metadata backend – This crucial part of the tracking server is responsible for storing all the essential information about your experiments. It keeps records of experiment names, run identifiers, parameter settings, performance metrics, tags, and locations of artifacts. This comprehensive data storage makes sure that you can effectively manage and analyze your ML projects.
  • MLflow artifact repository – This component serves as a storage space for all the files and objects generated during your ML experiments. These can include trained models, datasets, log files, and visualizations. The repository uses an Amazon Simple Storage Service (Amazon S3) bucket within your AWS account, making sure that your artifacts are stored securely and remain under your control.

The following diagram depicts the different components and where they run within AWS.
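
Before any experiment can be logged from the code walkthrough that follows, MLflow needs to be pointed at the managed tracking server. A short sketch, assuming the sagemaker-mlflow plugin is installed and using a placeholder tracking server ARN, looks like this:

# pip install mlflow sagemaker-mlflow
import mlflow

# ARN of your SageMaker managed MLflow tracking server (placeholder)
tracking_server_arn = "arn:aws:sagemaker:us-east-1:111122223333:mlflow-tracking-server/my-tracking-server"
mlflow.set_tracking_uri(tracking_server_arn)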

Code walkthrough

You can follow the full sample code from the GitHub repository.

Prerequisites

You must have the following prerequisites:

Refer to the documentation for best practices regarding AWS Identity and Access Management (IAM) policies for SageMaker, MLflow, and Amazon Bedrock to learn how to set up permissions for the SageMaker execution role. Remember to always follow the least privilege access principle.

Evaluate a model and log to MLflow

We provide two sample notebooks to evaluate models hosted in Amazon Bedrock (Bedrock.ipynb) and models deployed to SageMaker Hosting using SageMaker JumpStart (JumpStart.ipynb). The workflow implemented in these two notebooks is essentially the same, although a few differences are noteworthy:

  • Models hosted in Amazon Bedrock can be consumed directly using an API without any setup, providing a “serverless” experience, whereas models in SageMaker JumpStart require the user first to deploy the models. Although deploying models through SageMaker JumpStart is a straightforward operation, the user is responsible for managing the lifecycle of the endpoint.
  • ModelRunners implementations differ. FMEval provides native implementations for both Amazon Bedrock, using the BedrockModelRunner class, and SageMaker JumpStart, using the JumpStartModelRunner class. We discuss the main differences in the following section.

ModelRunner definition

For BedrockModelRunner, we need to find the model content_template. We can find this information conveniently on the Amazon Bedrock console in the API request sample section by looking at the value of body. The following example is the content template for Anthropic’s Claude 3 Haiku:

output_jmespath = "content[0].text"
content_template = """{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 512,
  "temperature": 0.5,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": $prompt
        }
      ]
    }
  ]
}"""

model_runner = BedrockModelRunner(
    model_id=model_id,
    output=output_jmespath,
    content_template=content_template,
)

For JumpStartModelRunner, we need to find the model and model_version. This information can be retrieved directly using the get_model_info_from_endpoint(endpoint_name=endpoint_name) utility provided by the SageMaker Python SDK, where endpoint_name is the name of the SageMaker endpoint where the SageMaker JumpStart model is hosted. See the following code example:

from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint

model_id, model_version, _, _, _ = get_model_info_from_endpoint(endpoint_name=endpoint_name)

model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
)

DataConfig definition

For each model runner, we want to evaluate three categories: Summarization, Factual Knowledge, and Toxicity. For each of these categories, we prepare a DataConfig object for the appropriate dataset. The following example shows only the data for the Summarization category:

dataset_path = Path("datasets")

dataset_uri_summarization = dataset_path / "gigaword_sample.jsonl"
if not dataset_uri_summarization.is_file():
    print("ERROR - please make sure the file, gigaword_sample.jsonl, exists.")

data_config_summarization = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri=dataset_uri_summarization.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)

Evaluation sets definition

We can now create an evaluation set for each algorithm we want to use in our test. For the Summarization evaluation set, replace the prompt with your own according to the input signature identified earlier. fmeval uses $model_input as a placeholder to get the input from your evaluation dataset. See the following code:

summarization_prompt = "Summarize the following text in one sentence: $model_input"

summarization_accuracy = SummarizationAccuracy()

evaluation_set_summarization = EvaluationSet(
  data_config_summarization,
  summarization_accuracy,
  summarization_prompt,
)
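
The factual knowledge and toxicity evaluation sets referenced below follow the same pattern. The following sketch assumes that data_config_factual and data_config_toxicity have been prepared like the summarization DataConfig and that the prompts match your datasets; the FactualKnowledge and Toxicity algorithms are part of FMEval.

from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

factual_prompt = "Complete the following statement: $model_input"  # example prompt
toxicity_prompt = "$model_input"

evaluation_set_factual = EvaluationSet(
    data_config_factual,
    FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>")),
    factual_prompt,
)

evaluation_set_toxicity = EvaluationSet(
    data_config_toxicity,
    Toxicity(ToxicityConfig()),
    toxicity_prompt,
)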

We are ready now to group the evaluation sets:

evaluation_list = [
    evaluation_set_summarization,
    evaluation_set_factual,
    evaluation_set_toxicity,
]

Evaluate and log to MLflow

We set up the MLflow experiment used to track the evaluations. We then create a new run for each model, and run all the evaluations for that model within that run, so that the metrics will all appear together. We use the model_id as the run name to make it straightforward to identify this run as part of a larger experiment, and run the evaluation using the run_evaluation_sets() function defined in utils.py. See the following code:

run_name = f"{model_id}"

experiment_name = "fmeval-mlflow-simple-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name) as run:
    run_evaluation_sets(model_runner, evaluation_list)

It is up to the user to decide how to best organize the results in MLflow. In fact, a second possible approach is to use nested runs. The sample notebooks implement both approaches to help you decide which one best fits your needs.

experiment_name = "fmeval-mlflow-nested-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name, nested=True) as run:
    run_evaluation_sets_nested(model_runner, evaluation_list)

Run evaluations

Tracking the evaluation process involves storing information about three aspects:

  • The input dataset
  • The parameters of the model being evaluated
  • The scores for each evaluation

We provide a helper library (fmeval_mlflow) to abstract the logging of these aspects to MLflow, streamlining the interaction with the tracking server. For the information we want to store, we can refer to the following three functions:

  • log_input_dataset(data_config: DataConfig | list[DataConfig]) – Log one or more input datasets to MLflow for evaluation purposes
  • log_runner_parameters(model_runner: ModelRunner, custom_parameters_map: dict | None = None, model_id: str | None = None,) – Log the parameters associated with a given ModelRunner instance to MLflow
  • log_metrics(eval_output: list[EvalOutput], log_eval_output_artifact: bool = False) – Log metrics and artifacts for a list of SingleEvalOutput instances to MLflow.
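
Inside each run, these helpers are combined with FMEval’s evaluate() call. The following simplified sketch shows roughly what run_evaluation_sets() could look like; it assumes each EvaluationSet unpacks into its data config, algorithm, and prompt template, and it is not the exact implementation in utils.py.

from fmeval_mlflow import log_input_dataset, log_runner_parameters, log_metrics

def run_evaluation_sets_sketch(model_runner, evaluation_list):
    for data_config, eval_algo, prompt_template in evaluation_list:
        log_input_dataset(data_config)
        log_runner_parameters(model_runner, model_id=model_id)  # model_id defined earlier
        eval_output = eval_algo.evaluate(
            model=model_runner,
            dataset_config=data_config,
            prompt_template=prompt_template,
        )
        log_metrics(eval_output, log_eval_output_artifact=True)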

When the evaluations are complete, we can analyze the results directly in the MLflow UI for a first visual assessment.

In the following screenshots, we show the visualization differences between logging using simple runs or nested runs.

You might want to create your own custom visualizations. For example, spider plots are often used to make visual comparison across multiple metrics. In the notebook compare_models.ipynb, we provide an example on how to use metrics stored in MLflow to generate such plots, which ultimately can also be stored in MLflow as part of your experiments. The following screenshots show some example visualizations.

Clean up

Once created, an MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop the tracking servers when they are not in use to save costs or delete them using the API or SageMaker Studio UI. For more details on pricing, see Amazon SageMaker pricing.

Similarly, if you deployed a model using SageMaker, endpoints are priced by deployed infrastructure time rather than by requests. You can avoid unnecessary charges by deleting your endpoints when you’re done with the evaluation.
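
If you want to script the cleanup, the following boto3 sketch shows the relevant calls; the tracking server name and endpoint name are placeholders.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Stop the managed MLflow tracking server when idle, or delete it entirely (name is a placeholder)
sagemaker_client.stop_mlflow_tracking_server(TrackingServerName="my-tracking-server")
# sagemaker_client.delete_mlflow_tracking_server(TrackingServerName="my-tracking-server")

# Delete the SageMaker endpoint used for evaluation, if you deployed one (name is a placeholder)
sagemaker_client.delete_endpoint(EndpointName="your-jumpstart-endpoint-name")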

Conclusion

In this post, we demonstrated how to create an evaluation framework for LLMs by combining SageMaker managed MLflow with FMEval. This integration provides a comprehensive solution for tracking and evaluating LLM performance across different aspects including accuracy, toxicity, and factual knowledge.

To enhance your evaluation journey, you can explore the following:

  • Get started with FMEval and SageMaker managed MLflow by following our code examples in the provided GitHub repository
  • Implement systematic evaluation practices in your LLM development workflow using the demonstrated approach
  • Use MLflow’s tracking capabilities to maintain detailed records of your evaluations, making your LLM development process more transparent and reproducible
  • Explore different evaluation metrics and datasets available in FMEval to comprehensively assess your LLM applications

By adopting these practices, you can build more reliable and trustworthy LLM applications while maintaining a clear record of your evaluation process and results.


About the authors

Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Dr. Alessandro Cerè is a GenAI Evaluation Specialist and Solutions Architect at AWS. He assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations. Bringing a unique perspective to the field of AI, Alessandro has a background in quantum physics and research experience in quantum communications and quantum memories. In his spare time, he pursues his passion for landscape and underwater photography.

Read More

Create a SageMaker inference endpoint with custom model & extended container

Create a SageMaker inference endpoint with custom model & extended container

Amazon SageMaker provides a seamless experience for building, training, and deploying machine learning (ML) models at scale. Although SageMaker offers a wide range of built-in algorithms and pre-trained models through Amazon SageMaker JumpStart, there are scenarios where you might need to bring your own custom model or use specific software dependencies not available in SageMaker managed container images. Examples for this could include use cases like geospatial analysis, bioinformatics research, or quantum machine learning. In such cases, SageMaker allows you to extend its functionality by creating custom container images and defining custom model definitions. This approach enables you to package your model artifacts, dependencies, and inference code into a container image, which you can deploy as a SageMaker endpoint for real-time inference.

This post walks you through the end-to-end process of deploying a single custom model on SageMaker using NASA’s Prithvi model. The Prithvi model is a first-of-its-kind temporal Vision transformer pre-trained by the IBM and NASA team on contiguous US Harmonised Landsat Sentinel 2 (HLS) data. It can be fine-tuned for image segmentation using the mmsegmentation library for use cases like burn scars detection, flood mapping, and multi-temporal crop classification. Due to its unique architecture and fine-tuning dependency on the MMCV library, it is an effective example of how to deploy complex custom models to SageMaker.

We demonstrate how to use the flexibility of SageMaker to deploy your own custom model, tailored to your specific use case and requirements. Whether you’re working with unique model architectures, specialized libraries, or specific software versions, this approach empowers you to harness the scalability and management capabilities of SageMaker while maintaining control over your model’s environment and dependencies.

Solution overview

To run a custom model that needs unique packages as a SageMaker endpoint, you need to follow these steps:

  1. If your model requires additional packages or package versions unavailable from the SageMaker managed container images, you will need to extend one of the container images. By extending a SageMaker managed container vs. creating one from scratch, you can focus on your specific use case and model development instead of the container infrastructure.
  2. Write a Python model definition using the SageMaker inference.py file format.
  3. Define your model artifacts and inference file within a specific file structure, archive your model files as a tar.gz file, and upload your files to Amazon Simple Storage Service (Amazon S3).
  4. With your model code and an extended SageMaker container, use Amazon SageMaker Studio to create a model, endpoint configuration, and endpoint.
  5. Query the inference endpoint to confirm your model is running correctly.

The following diagram illustrates the solution architecture and workflow:

A picture of the Architecture for the post. It includes CodeBuild, Amazon Deep Learning Docker Containers, Amazon ECS, Amazon Sagemaker, and Amazon S3

Prerequisites

You need the following prerequisites before you can proceed. For this post, we use the us-east-1 AWS Region:

  1. Have access to a POSIX based (Mac/Linux) system or SageMaker notebooks. This post doesn’t cover setting up SageMaker access and assumes a notebook accessible to the internet. However, this is not a security best practice and should not be done in production. To learn how to create a SageMaker notebook within a virtual private cloud (VPC), see Connect to SageMaker AI Within your VPC.
  2. Make sure you have AWS Identity and Access Management (IAM) permissions for SageMaker access; S3 bucket create, read, and PutObject access; AWS CodeBuild access; Amazon Elastic Container Registry (Amazon ECR) repository access; and the ability to create IAM roles.
  3. Download the Prithvi model artifacts and flood data fine-tuning:
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash 
    sudo apt-get install git-lfs 
    git lfs install 
    git clone https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M 
    git clone https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M-sen1floods11

Extend a SageMaker container image for your model

Although AWS provides pre-built container images optimized for deep learning on the AWS Deep Learning Containers (DLCs) GitHub for PyTorch and TensorFlow use cases, there are scenarios where models require additional libraries not included in these containers. The installation of these dependencies can take minutes or hours, so it’s more efficient to pre-build these dependencies into a custom container image. For this example, we deploy the Prithvi model, which is dependent on the MMCV library for advanced computer vision techniques. This library is not available within any of the SageMaker DLCs, so you will have to create an extended container to add it. Both MMCV and Prithvi are third-party models which have not undergone AWS security reviews, so please review these models yourself or use at your own risk. This post uses CodeBuild and a Docker Dockerfile to build the extended container.

Complete the following steps:

  1. CodeBuild requires a source location containing the source code. Create an S3 bucket to serve as this source location using the following commands:
    # generate a unique postfix 
    BUCKET_POSTFIX=$(python3 -S -c "import uuid; print(str(uuid.uuid4().hex)[:10])") 
    echo "export BUCKET_POSTFIX=${BUCKET_POSTFIX}" &gt;&gt; ~/.bashrc 
    echo "Your bucket name will be customsagemakercontainer-codebuildsource-${BUCKET_POSTFIX}" 
    # make your bucket 
    aws s3 mb s3://customsagemakercontainer-codebuildsource-${BUCKET_POSTFIX}

  2. Create an ECR repository to store the custom container image produced by the CodeBuild project. Record the repository URI as an environment variable.
    CONTAINER_REPOSITORY_NAME="prithvi" 
    aws ecr create-repository --repository-name ${CONTAINER_REPOSITORY_NAME} 
    export REPOSITORY_URI=$(aws ecr describe-repositories --repository-names ${CONTAINER_REPOSITORY_NAME} --query 'repositories[0].repositoryUri' --output text)

  3. Create a Dockerfile for the custom container. You use an AWS Deep Learning SageMaker framework container as the base image because it includes required dependencies such as SageMaker libraries, PyTorch, and CUDA.

This Docker container installs the Prithvi model and MMCV v1.6.2. These models are third-party models not produced by AWS and therefore may have security vulnerabilities. Use at your own risk.

cat > Dockerfile << EOF 
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
WORKDIR /root
RUN DEBIAN_FRONTEND=noninteractive apt-get update -y
RUN DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
RUN git clone https://github.com/NASA-IMPACT/hls-foundation-os.git
RUN wget https://github.com/open-mmlab/mmcv/archive/refs/tags/v1.6.2.tar.gz
RUN tar xvzf v1.6.2.tar.gz
WORKDIR /root/hls-foundation-os
RUN pip install -e . 
RUN pip install -U openmim 
WORKDIR /root/mmcv-1.6.2
RUN MMCV_WITH_OPS=1 pip install -e . -v

EOF
  1. Create a buildspec file to define the build process for the CodeBuild project. This buildspec file will instruct CodeBuild to install the nvidia-container-toolkit to make sure the Docker container has GPU access, run the Dockerfile build, and push the built container image to your ECR repository.
    cat > buildspec.yml << EOF
    version: 0.2
    
    phases:
        pre_build:
            commands:
            - echo Logging in to Amazon ECR...
            - IMAGE_TAG=sagemaker-gpu
            - curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
            - sudo yum install -y nvidia-container-toolkit
            - sudo nvidia-ctk runtime configure --runtime=docker
            - | 
              cat > /etc/docker/daemon.json << EOF
              {
                  "runtimes": {
                      "nvidia": {
                          "args": [],
                          "path": "nvidia-container-runtime"
                      }
                  },
                  "default-runtime": "nvidia"
              }
              EOF
            - kill $(ps aux | grep -v grep | grep "/usr/local/bin/dockerd --host" | awk '{print $2}')
            - sleep 10
            - nohup /usr/local/bin/dockerd --host=unix:///var/run/docker.sock --host=tcp://127.0.0.1:2375 --storage-driver=overlay2 &
            - sleep 10
            - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REPOSITORY_URI_AWSDL
            - docker pull $REPOSITORY_URI_AWSDL/pytorch-inference:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
    
        build:
            commands:
            - echo Build started at $(date)
            - echo Building the Docker image...
            - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REPOSITORY_URI
            - DOCKER_BUILDKIT=0 docker build -t $REPOSITORY_URI:latest .
            - docker tag $REPOSITORY_URI:latest $REPOSITORY_URI:$IMAGE_TAG
    
        post_build:
            commands:
            - echo Build completed at $(date)
            - echo Pushing the Docker images...
            - docker push $REPOSITORY_URI:latest
            - docker push $REPOSITORY_URI:$IMAGE_TAG
    
    EOF

  2. Zip and upload the Dockerfile and buildspec.yml files to the S3 bucket. This zip file will serve as the source code for the CodeBuild project.
    • To install zip on a SageMaker notebook, run the following command:
      sudo apt install zip

    • With zip installed, run the following command:
      zip prithvi_container_source.zip Dockerfile buildspec.yml
      aws s3 cp prithvi_container_source.zip s3://customsagemakercontainer-codebuildsource-${BUCKET_POSTFIX}/

  3. Create a CodeBuild service role so CodeBuild can access the required AWS services for the build.
    • First, create a file defining the role’s trust policy:
      cat > create-role.json << EOF
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "codebuild.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
      EOF

    • Create a file defining the service role’s permissions. This role has a few wildcard permissions (/* or *). These can give more permissions than needed and break the rule of least privilege. For more information about defining least privilege for production use cases, see Grant least privilege.
      cat > put-role-policy.json << EOF
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "CloudWatchLogsPolicy",
            "Effect": "Allow",
            "Action": [
              "logs:CreateLogGroup",
              "logs:CreateLogStream",
              "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:us-east-1:*:log-group:/aws/codebuild/*:*"
          },
          {
            "Sid": "S3GetObjectPolicy",
            "Effect": "Allow",
            "Action": [
              "s3:GetObject",
              "s3:GetObjectVersion"
            ],
            "Resource": [
              "arn:aws:s3:::customsagemakercontainer-codebuildsource-${BUCKET_POSTFIX}",
              "arn:aws:s3:::customsagemakercontainer-codebuildsource-${BUCKET_POSTFIX}/*"
              ]
          },
          {
            "Sid": "S3BucketIdentity",
            "Effect": "Allow",
            "Action": [
              "s3:GetBucketAcl",
              "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::customsagemakercontainer-codebuildsource-${BUCKET_POSTFIX}"
          },
          {
            "Sid": "ECRAccess",
            "Effect": "Allow",
            "Action": [
              "ecr:GetImage",
              "ecr:BatchGetImage",
              "ecr:BatchCheckLayerAvailability",
              "ecr:CompleteLayerUpload",
              "ecr:UploadLayerPart",
              "ecr:GetDownloadUrlForLayer",
              "ecr:InitiateLayerUpload",
              "ecr:PutImage",
              "ecr:ListImages",
              "ecr:DescribeRepositories",
              "ecr:DescribeImages",
              "ecr:DescribeRegistry",
              "ecr:TagResource"
            ],
            "Resource": ["arn:aws:ecr:*:*:repository/prithvi",
                          "arn:aws:ecr:*:763104351884:repository/*"]
          },
          {
            "Sid": "ECRAuthToken",
            "Effect": "Allow",
            "Action": [
              "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
          }
        ]
      }
      EOF

    • Create the CodeBuild service role:
      aws iam create-role --role-name CodeBuildServiceRole --assume-role-policy-document file://create-role.json

  4. Capture the role Amazon Resource Name (ARN) from the CLI command response and record it as an environment variable:
    export CODEBUILD_SERVICE_ROLE_ARN=$(aws iam get-role --role-name CodeBuildServiceRole --query 'Role.Arn' --output text)

  5. Attach the permission policies to the service role:
    aws iam put-role-policy --role-name CodeBuildServiceRole --policy-name CodeBuildServiceRolePolicy --policy-document file://put-role-policy.json

  6. Define the configurations for the CodeBuild build project using the build project JSON specification:
    cat > codebuild-project.json << EOF
    {
        "name": "prithvi-container-build",
        "description": "build process to modify the AWS SageMaker Deep Learning Container to include HLS/Prithvi",
        "source": {
          "type": "S3",
          "location": "customsagemakercontainer-codebuildsource-${BUCKET_POSTFIX}/prithvi_container_source.zip",
          "buildspec": "buildspec.yml",
          "insecureSsl": false
        },
        "artifacts": {
          "type": "NO_ARTIFACTS"
        },
        "cache": {
          "type": "NO_CACHE"
        },
        "environment": {
          "type": "LINUX_GPU_CONTAINER",
          "image": "aws/codebuild/amazonlinux2-x86_64-standard:5.0",
          "computeType": "BUILD_GENERAL1_SMALL",
          "environmentVariables": [
            {
              "name": "REPOSITORY_URI_AWSDL",
              "value": "763104351884.dkr.ecr.us-east-1.amazonaws.com",
              "type": "PLAINTEXT"
            },
            {
              "name": "AWS_REGION",
              "value": "us-east-1",
              "type": "PLAINTEXT"
            },
            {
                "name": "REPOSITORY_URI",
                "value": "$REPOSITORY_URI",
                "type": "PLAINTEXT"
              }
          ],
          "imagePullCredentialsType": "CODEBUILD",
          "privilegedMode": true
        },
        "serviceRole": "$CODEBUILD_SERVICE_ROLE_ARN",
        "timeoutInMinutes": 60,
        "queuedTimeoutInMinutes": 480,
        "logsConfig": {
          "cloudWatchLogs": {
            "status": "ENABLED"
          }
      }
    }
    EOF

  7. Create the CodeBuild project using the codebuild-project.json specification defined in the previous step:
    aws codebuild create-project --cli-input-json file://codebuild-project.json

  8. Run a build for the CodeBuild project:
    aws codebuild start-build --project-name prithvi-container-build

The build will take approximately 30 minutes to complete and cost approximately $1.50 to run. The CodeBuild compute instance type gpu1.small costs $0.05 per minute.

After you run the preceding command, you can press Ctrl+C to return to your shell and run subsequent commands. The build is already running on AWS and will not be canceled by exiting the command.

  9. Monitor the status of the build using the following command and wait until you observe buildStatus=SUCCEEDED before proceeding to the next step:
    export BUILD_ID=$(aws codebuild list-builds-for-project --project-name prithvi-container-build --query 'ids[0]' --output text)
    aws codebuild batch-get-builds --ids ${BUILD_ID} | grep buildStatus

After your CodeBuild project has completed, make sure that you do not close your terminal. The environment variables set here will be used again later.

Build your inference.py file

To run a custom model for inference on AWS, you need to build out an inference.py file that initializes your model, defines the input and output structure, and produces your inference results. In this file, you must define four functions:

  • model_fn – Initializes your model
  • input_fn – Defines how the incoming request data is deserialized into a format the model can use
  • predict_fn – Runs the model on the input data and returns the prediction
  • output_fn – Serializes the prediction into the format returned by the API call

We use the following completed inference.py file for the SageMaker endpoint in this post. Download this inference.py to continue, because it includes the helper functions to process the TIFF files needed for this model's input. The following code is contained within inference.py and is shown only to explain what is being done in the file.

model_fn

The model_fn function builds your model, which is then called and used within the predict_fn function. The function loads the model weights into a torch model checkpoint, opens the model config, defines global variables, instantiates the model, loads the checkpoint into the model, and returns the model.

def model_fn(model_dir):
    # implement custom code to load the model
    # load weights
    weights_path = "./code/prithvi/Prithvi_100M.pt"
    checkpoint = torch.load(weights_path, map_location="cpu")

    # read model config
    model_cfg_path = "./code/prithvi/Prithvi_100M_config.yaml"
    with open(model_cfg_path) as f:
        model_config = yaml.safe_load(f)

    model_args, train_args = model_config["model_args"], model_config["train_params"]

    global means
    global stds
    means = np.array(train_args["data_mean"]).reshape(-1, 1, 1)
    stds = np.array(train_args["data_std"]).reshape(-1, 1, 1)
    # let us use only 1 frame for now (the model was trained on 3 frames)
    model_args["num_frames"] = 1

    # instantiate model
    model = MaskedAutoencoderViT(**model_args)
    model.eval()

    # load weights into model
    # strict=false because we are loading with only 1 frame, but the warning is expected
    del checkpoint['pos_embed']
    del checkpoint['decoder_pos_embed']
    _ = model.load_state_dict(checkpoint, strict=False)
    
    return model

input_fn

This function defines the expected input for the model and how to load the input for use in predict_fn. The endpoint expects a string URL path that links to a TIFF file, such as the examples available online from the Prithvi demo on Hugging Face. This function also defines the content type of the request sent in the body (such as application/json or image/tiff).

def input_fn(input_data, content_type):
    # input_data is a URL string that points to a TIFF file
    # load the raster and crop it to the 224 x 224 input size the model expects
    raster_data = load_raster(input_data, crop=(224, 224))
    return raster_data

predict_fn

In predict_fn, you create the prediction from the given input. In this case, creating the prediction image uses two helper functions specific to this endpoint (preprocess_image and enhance_raster_for_visualization). You can find both functions here. The preprocess_image function normalizes the image, and then the function uses torch.no_grad to disable gradient calculations for the model. This is useful during inference to decrease inference time and reduce memory usage. Next, the function collects the prediction from the instantiated model. The mask ratio determines the number of pixels in the image that are zeroed out during inference. The two unpatchify calls convert the smaller patchified results produced by the model back to the original image space. The call normalized.clone() copies the normalized image, and the masked regions are then replaced with the corresponding regions from pred_img. Finally, the function reshapes the image back into raster form, removes the normalization, and returns the image in a format that is valuable for visualization. The result is an image that can be converted to bytes for the user and then visualized on the user's screen.

def predict_fn(data, model):
    normalized = preprocess_image(data)
    with torch.no_grad():
        mask_ratio = 0.5
        _, pred, mask = model(normalized, mask_ratio=mask_ratio)
        mask_img = model.unpatchify(mask.unsqueeze(-1).repeat(1, 1, pred.shape[-1])).detach().cpu()
        pred_img = model.unpatchify(pred).detach().cpu()
        rec_img = normalized.clone()
        rec_img[mask_img == 1] = pred_img[mask_img == 1]
        rec_img_np = (rec_img.numpy().reshape(6, 224, 224) * stds) + means
        print(rec_img_np.shape)
        return enhance_raster_for_visualization(rec_img_np, ref_img=data)

output_fn

output_fn returns the image array received from predict_fn as an array of bytes.

def output_fn(prediction, accept):
    print(prediction.shape)
    return prediction.tobytes()

Test your inference.py file

Now that you have downloaded the complete inference.py file, there are two options to test your model before compressing the files and uploading them to Amazon S3:

  • Test the inference.py functions on your custom container within an Amazon Elastic Compute Cloud (Amazon EC2) instance
  • Test your endpoint on a local mode SageMaker endpoint (requires a GPU or GPU-based workspace for this model); see the sketch after this list
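
For the second option, the following is a minimal sketch of deploying the container in SageMaker local mode with the SageMaker Python SDK. It assumes Docker and a GPU are available on your machine, that you have already packaged model.tar.gz as described in the next section, and that the image URI and role ARN shown are placeholders for the values you created earlier.

from sagemaker.local import LocalSession
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

local_session = LocalSession()
local_session.config = {"local": {"local_code": True}}

# Placeholders: substitute your ECR image URI, local model.tar.gz path, and execution role ARN
model = Model(
    image_uri="<account-id>.dkr.ecr.us-east-1.amazonaws.com/prithvi:latest",
    model_data="file://./model.tar.gz",
    role="arn:aws:iam::<account-id>:role/SageMakerInferenceRole",
    predictor_cls=Predictor,
    sagemaker_session=local_session,
)

# "local_gpu" runs the container on this machine's GPU instead of a hosted instance
predictor = model.deploy(initial_instance_count=1, instance_type="local_gpu")
predictor.serializer = JSONSerializer()

tif_url = "https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-1.0-100M/resolve/main/examples/HLS.L30.T13REN.2018013T172747.v2.0.B02.B03.B04.B05.B06.B07_cropped.tif"
print(len(predictor.predict(tif_url)))  # raw bytes returned by output_fn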

Model file structure, tar.gz compressing, and S3 upload

Before you start this step, download the Prithvi model artifacts and the Prithvi flood fine-tuning of the model. The first repository provides all of the model data for the base Prithvi model, and the flood fine-tuning builds on that model to perform floodplain detection on satellite images. Install git-lfs (using Homebrew on macOS, the installer from https://git-lfs.com/ on Windows, or apt on Linux, as shown in the following commands) so you can pull the repositories' large files.

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

git clone https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M
cd Prithvi-100M
git lfs pull
git checkout c2435efd8f92329d3ca79fc8a7a70e21e675a650
git clone https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M-sen1floods11
cd Prithvi-100M-sen1floods11
git lfs pull
git checkout d48937bf588cd506dd73bc4deca543446ca5530d

To create a SageMaker model on the SageMaker console, you must store your model data within Amazon S3, because your SageMaker endpoint will pull your model artifacts directly from Amazon S3 as a tar.gz file. Within your tar.gz file, the data must follow a specific structure defined by SageMaker. The following is the file structure for the Prithvi foundation model (our requirements are installed on the container, so requirements.txt has been left intentionally blank):

./model
./model/code/inference.py
./model/code/sen1floods11_Prithvi_100M.py (extended model config)
./model/code/sen1floods11_Prithvi_100M.pth (extended model weights)
./model/code/requirements.txt
./model/code/prithvi/Prithvi_100M.pt (extended model weights)
./model/code/prithvi/Prithvi_100M_config.yaml (model config)
./model/code/prithvi/Prithvi.py (model)

This folder structure remains true for other models as well. The /code folder must hold the inference.py file and any files used within inference.py. These additional files are generally model artifacts (configs, weights, and so on); in our case, this is the whole Prithvi base model folder as well as the weights and configs for the fine-tuned version we use. Because we have already installed the required packages in our container, requirements.txt is not used; however, the file must still exist, otherwise your endpoint will fail to build. All other files belong in the root folder.

With the preceding file structure in place, open your terminal and route into the model folder.

  1. Run the following command in your terminal:
    tar -czvf model.tar.gz ./

    The command will create a compressed version of your model files called model.tar.gz from the files in your current directory. You can now upload this file into an S3 bucket.

  2. If you're working in a SageMaker environment (such as a SageMaker Studio terminal), install uuid-runtime so the uuidgen command used in the next step is available:
    sudo apt-get install uuid-runtime

  3. Now create a new S3 bucket. The following CLI commands create an S3 bucket and upload your model.tar.gz file:
    # generate a unique postfix
    
    BUCKET_POSTFIX=$(uuidgen --random | cut -d'-' -f1)
    echo "export BUCKET_POSTFIX=${BUCKET_POSTFIX}" >> ~/.bashrc
    echo "Your bucket name will be customsagemakercontainer-model-${BUCKET_POSTFIX}"
    
    # make your bucket
    aws s3 mb s3://customsagemakercontainer-model-${BUCKET_POSTFIX}
    
    # upload to your bucket
    aws s3 cp model.tar.gz s3://customsagemakercontainer-model-${BUCKET_POSTFIX}/model.tar.gz 

The file you uploaded will be used in the next step to define the model to be created in the endpoint.

Create SageMaker model, SageMaker endpoint configuration, and SageMaker endpoint

You now create a SageMaker inference endpoint using the CLI. There are three steps to creating a SageMaker endpoint: create a model, create an endpoint configuration, and create an endpoint.

In this post, you will create a public SageMaker endpoint because this will simplify running and testing the endpoint. For details about how to limit access to SageMaker endpoints, refer to Deploy models with SageMaker Studio.

Complete the following steps:

  1. Get the ECR repository’s ARN:
    export REPOSITORY_ARN=$(aws ecr describe-repositories --repository-names ${CONTAINER_REPOSITORY_NAME} --query 'repositories[0].repositoryArn' --output text)

  2. Create a role for the SageMaker service to assume. Create a file defining the role’s trust policy.
    cat > create-sagemaker-role.json << EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "sagemaker.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    EOF

  3. Create a file defining the service role’s permissions:
    cat > put-sagemaker-role-policy.json << EOF
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket"
                ],
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:s3:::customsagemakercontainer-model-${BUCKET_POSTFIX}"
                ]
            },
            {
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:s3:::customsagemakercontainer-model-${BUCKET_POSTFIX}/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "sagemaker:BatchPutMetrics",
                    "ecr:GetAuthorizationToken",
                    "ecr:ListImages"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage"
                ],
                "Resource": [
                    "${REPOSITORY_ARN}"
                ]
            },
            {
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",
                "Condition": {
                    "StringLike": {
                        "cloudwatch:namespace": [
                            "*SageMaker*",
                            "*Sagemaker*",
                            "*sagemaker*"
                        ]
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "logs:CreateLogGroup",
                    "logs:DescribeLogStreams"
                ],
                "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/*"
            }
        ]
    }
    EOF

  4. Create the SageMaker service role:
    aws iam create-role --role-name SageMakerInferenceRole --assume-role-policy-document file://create-sagemaker-role.json
    
    export SAGEMAKER_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name SageMakerInferenceRole --query 'Role.Arn' --output text)

  5. Attach the permission policies to the service role:
    aws iam put-role-policy --role-name SageMakerInferenceRole --policy-name SageMakerInferenceServiceRolePolicy --policy-document file://put-sagemaker-role-policy.json

    The model definition will include the role you created, the ECR container image, and the Amazon S3 location of the model.tar.gz file that you created previously.

  6. Create a JSON file that defines the model and run the create-model command:
    cat > create_model.json << EOF
    {
        "ModelName": "prithvi",
        "PrimaryContainer": {
            "Image": "$REPOSITORY_URI:latest",
            "ModelDataUrl": "s3://customsagemakercontainer-model-${BUCKET_POSTFIX}/model.tar.gz"
        },
        "ExecutionRoleArn": "$SAGEMAKER_INFERENCE_ROLE_ARN"
    }
    EOF

    aws sagemaker create-model --cli-input-json file://create_model.json

    A SageMaker endpoint configuration specifies the infrastructure that the model will be hosted on. The model will be hosted on a ml.g4dn.xlarge instance for GPU-based acceleration.

  7. Create the endpoint configuration JSON file and create the SageMaker endpoint configuration:
    cat > create_endpoint_config.json << EOF
    {
        "EndpointConfigName": "prithvi-endpoint-config",
        "ProductionVariants": [
            {
                "VariantName": "default-variant",
                "ModelName": "prithvi",
                "InitialInstanceCount": 1,
                "InstanceType": "ml.g4dn.xlarge",
                "InitialVariantWeight": 1.0
            }
        ]
    }
    EOF
    
    aws sagemaker create-endpoint-config --cli-input-json file://create_endpoint_config.json

  8. Create the SageMaker endpoint by referencing the endpoint configuration created in the previous step:
    aws sagemaker create-endpoint --endpoint-name prithvi-endpoint --endpoint-config-name prithvi-endpoint-config

    The ml.g4dn.xlarge inference endpoint will cost $0.736 per hour while running. It will take several minutes for the endpoint to finish deploying.

  9. Check the status using the following command, waiting for it to return InService:
    aws sagemaker describe-endpoint --endpoint-name prithvi-endpoint --query "EndpointStatus" --output text

When the endpoint’s status is InService, proceed to the next section.

Test your custom SageMaker inference endpoint

To test your SageMaker endpoint, you will query your endpoint with an image and display it. The following code sends a URL that references a TIFF image to the SageMaker endpoint; the model sends back a byte array, and the code reforms the byte array into an image. Open a notebook locally or in SageMaker Studio JupyterLab, because the following code must run outside of the command line to view the image.

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

payload = "https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-1.0-100M/resolve/main/examples/HLS.L30.T13REN.2018013T172747.v2.0.B02.B03.B04.B05.B06.B07_cropped.tif"

predictor = Predictor(endpoint_name="prithvi-endpoint")
predictor.serializer = JSONSerializer()

predictions = predictor.predict(payload)

This Python code creates a predictor object for your endpoint and sets the predictor's serializer to JSON so the payload is serialized correctly for the endpoint. It queries the predictor object using a payload of a URL pointing to a TIFF image. You use a helper function to display the image and enhance the raster. You can find that helper function here. After you add the helper function, display the image:

import numpy as np
import matplotlib.pyplot as plt

NO_DATA_FLOAT = 0.0001
PERCENTILES = (0.1, 99.9)
NO_DATA = -9999

nppred = np.frombuffer(predictions).reshape((224, 224, 3))

def enhance_raster_for_visualization(raster, ref_img=None):
    if ref_img is None:
        ref_img = raster
    channels = []
    for channel in range(raster.shape[0]):
        valid_mask = np.ones_like(ref_img[channel], dtype=bool)
        valid_mask[ref_img[channel] == NO_DATA_FLOAT] = False
        mins, maxs = np.percentile(ref_img[channel][valid_mask], PERCENTILES)
        normalized_raster = (raster[channel] - mins) / (maxs - mins)
        normalized_raster[~valid_mask] = 0
        clipped = np.clip(normalized_raster, 0, 1)
        channels.append(clipped)
    clipped = np.stack(channels)
    channels_last = np.moveaxis(clipped, 0, -1)[..., :3]
    rgb = channels_last[..., ::-1]
    return rgb
    
raster_for_visualization = enhance_raster_for_visualization(nppred)
plt.imshow(nppred)
plt.show()

You should observe an image that has been taken from a satellite.

Clean up

To clean up the resources from this post and avoid incurring costs, follow these steps (a minimal programmatic sketch follows the list):

  1. Delete the SageMaker endpoint, endpoint configuration, and model.
  2. Delete the ECR image and repository.
  3. Delete the model.tar.gz file in the S3 bucket that was created.
  4. Delete the customsagemakercontainer-model and customsagemakercontainer-codebuildsource S3 buckets.
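
If you prefer to script the cleanup, the following is a minimal boto3 sketch of these four steps; the endpoint, endpoint configuration, model, and ECR repository names match the ones created in this post, and the bucket name postfixes are placeholders you should replace with your own.

import boto3

region = "us-east-1"

# 1. Delete the SageMaker endpoint, endpoint configuration, and model
sm = boto3.client("sagemaker", region_name=region)
sm.delete_endpoint(EndpointName="prithvi-endpoint")
sm.delete_endpoint_config(EndpointConfigName="prithvi-endpoint-config")
sm.delete_model(ModelName="prithvi")

# 2. Delete the ECR images and repository (force also deletes the images)
ecr = boto3.client("ecr", region_name=region)
ecr.delete_repository(repositoryName="prithvi", force=True)

# 3 and 4. Empty and delete the S3 buckets created for this post (placeholders for your postfix)
s3 = boto3.resource("s3", region_name=region)
for name in ["customsagemakercontainer-model-<postfix>",
             "customsagemakercontainer-codebuildsource-<postfix>"]:
    bucket = s3.Bucket(name)
    bucket.objects.all().delete()
    bucket.delete()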

Conclusion

In this post, we extended a SageMaker container to include custom dependencies, wrote a Python script to run a custom ML model, and deployed that model on the SageMaker container within a SageMaker endpoint for real-time inference. This solution produces a running GPU-enabled endpoint for inference queries. You can use this same process to create custom model SageMaker endpoints by extending other SageMaker containers and writing an inference.py file for new custom models. Furthermore, with adjustments, you could create a multi-model SageMaker endpoint for custom models or run a batch processing endpoint for scenarios where you run large batches of queries at once. These solutions enable you to go beyond the most popular models used today and customize models to fit your own unique use case.


About the Authors

Aidan Ricci is a solutions architect supporting US federal government health customers. He assists customers by developing technical architectures and providing best practices on the Amazon Web Services (AWS) Cloud, with a focus on AI/ML services. In his free time, Aidan enjoys traveling, lifting, and cooking.

Nate Haynes is a solutions architect supporting US federal government sciences customers. He assists customers in developing technical architectures on Amazon Web Services (AWS), with a focus on data analytics and high performance computing. In his free time, he enjoys skiing and golfing.

Charlotte Fondren is a solutions architect on the aerospace and satellite team at Amazon Web Services (AWS), where she helps customers achieve their mission objectives through innovative cloud solutions. Charlotte specializes in machine learning with a focus on generative AI. In her free time, she enjoys traveling, painting, and running.


Image and video prompt engineering for Amazon Nova Canvas and Amazon Nova Reel

Image and video prompt engineering for Amazon Nova Canvas and Amazon Nova Reel

Amazon has introduced two new creative content generation models on Amazon Bedrock: Amazon Nova Canvas for image generation and Amazon Nova Reel for video creation. These models transform text and image inputs into custom visuals, opening up creative opportunities for both professional and personal projects. Nova Canvas, a state-of-the-art image generation model, creates professional-grade images from text and image inputs, ideal for applications in advertising, marketing, and entertainment. Similarly, Nova Reel, a cutting-edge video generation model, supports the creation of short videos from text and image inputs, complete with camera motion controls using natural language. Although these models are powerful tools for creative expression, their effectiveness relies heavily on how well users can communicate their vision through prompts.

This post dives deep into prompt engineering for both Nova Canvas and Nova Reel. We share proven techniques for describing subjects, environments, lighting, and artistic style in ways the models can understand. For Nova Reel, we explore how to effectively convey camera movements and transitions through natural language. Whether you’re creating marketing content, prototyping designs, or exploring creative ideas, these prompting strategies can help you unlock the full potential of the visual generation capabilities of Amazon Nova.

Solution overview

To get started with Nova Canvas and Nova Reel, you can either use the Image/Video Playground on the Amazon Bedrock console or access the models through APIs. For detailed setup instructions, including account requirements, model access, and necessary permissions, refer to Creative content generation with Amazon Nova.

Generate images with Nova Canvas

Prompting for image generation models like Amazon Nova Canvas is both an art and a science. Unlike large language models (LLMs), Nova Canvas doesn’t reason or interpret command-based instructions explicitly. Instead, it transforms prompts into images based on how well the prompt captures the essence of the desired image. Writing effective prompts is key to unlocking the full potential of the model for various use cases. In this post, we explore how to craft effective prompts, examine essential elements of good prompts, and dive into typical use cases, each with compelling example prompts.

Essential elements of good prompts

A good prompt serves as a descriptive image caption rather than command-based instructions. It should provide enough detail to clearly describe the desired outcome while maintaining brevity (limited to 1,024 characters). Instead of giving command-based instructions like “Make a beautiful sunset,” you can achieve better results by describing the scene as if you’re looking at it: “A vibrant sunset over mountains, with golden rays streaming through pink clouds, captured from a low angle.” Think of it as painting a vivid picture with words to guide the model effectively.

Effective prompts start by clearly defining the main subject and action:

  • Subject – Clearly define the main subject of the image. Example: “A blue sports car parked in front of a grand villa.”
  • Action or pose – Specify what the subject is doing or how it is positioned. Example: “The car is angled slightly towards the camera, its doors open, showcasing its sleek interior.”

Add further context:

  • Environment – Describe the setting or background. Example: “A grand villa overlooking Lake Como, surrounded by manicured gardens and sparkling lake waters.”

After you define the main focus of the image, you can refine the prompt further by specifying additional attributes such as visual style, framing, lighting, and technical parameters. For instance:

  • Lighting – Include lighting details to set the mood. Example: “Soft, diffused lighting from a cloudy sky highlights the car’s glossy surface and the villa’s stone facade.”
  • Camera position and framing – Provide information about perspective and composition. Example: “A wide-angle shot capturing the car in the foreground and the villa’s grandeur in the background, with Lake Como visible beyond.”
  • Style – Mention the visual style or medium. Example: “Rendered in a cinematic style with vivid, high-contrast details.”

Finally, you can use a negative prompt to exclude specific elements or artifacts from your composition, aligning it more closely to your vision:

  • Negative prompt – Use the negativeText parameter to exclude unwanted elements from your composition. For instance, to remove any birds or people from the image, rather than adding "no birds or people" to the prompt, add "birds, people" to the negative prompt instead. Avoid using negation words like "no," "not," or "without" in either the prompt or the negative prompt, because they might lead to unintended consequences.

Let’s bring all these elements together with an example prompt with a basic and enhanced version that shows these techniques in action.

Basic: A car in front of a house
Enhanced: A blue luxury sports car parked in front of a grand villa overlooking Lake Como. The setting features meticulously manicured gardens, with the lake’s sparkling waters and distant mountains in the background. The car’s sleek, polished surface reflects the surrounding elegance, enhanced by soft, diffused lighting from a cloudy sky. A wide-angle shot capturing the car, villa, and lake in harmony, rendered in a cinematic style with vivid, high-contrast details.
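
If you're calling the model through the API rather than the Amazon Bedrock console, the following is a minimal boto3 sketch of generating an image from a prompt like the enhanced version above. It assumes the amazon.nova-canvas-v1:0 model ID and the request fields (taskType, textToImageParams, imageGenerationConfig) described in Request and response structure for image generation, so verify them against the current documentation.

import base64
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": "A blue luxury sports car parked in front of a grand villa overlooking Lake Como. "
                "Soft, diffused lighting from a cloudy sky. Wide-angle shot, cinematic style.",
        "negativeText": "birds, people",
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,
        "width": 1280,
        "height": 720,
        "cfgScale": 6.5,
        "seed": 0,
    },
}

response = client.invoke_model(modelId="amazon.nova-canvas-v1:0", body=json.dumps(body))
result = json.loads(response["body"].read())

# The response contains base64-encoded images
with open("villa.png", "wb") as f:
    f.write(base64.b64decode(result["images"][0]))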

Example image generation prompts

By mastering these techniques, you can create stunning visuals for a wide range of applications using Amazon Nova Canvas. The following images have been generated using Nova Canvas at a 1280×720 pixel resolution with a CFG scale of 6.5 and seed of 0 for reproducibility (see Request and response structure for image generation for detailed explanations of these parameters). This resolution also matches the image dimensions expected by Nova Reel, allowing seamless integration for video experimentation. Let’s examine some example prompts and their resulting images to see these techniques in action.

  • Aerial view of sparse arctic tundra landscape, expansive white terrain with meandering frozen rivers and scattered rock formations. High-contrast black and white composition showcasing intricate patterns of ice and snow, emphasizing texture and geological diversity. Bird’s-eye perspective capturing the abstract beauty of the arctic wilderness.
  • An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting accentuates the curves and edges, casting subtle shadows that highlight the product’s premium build quality.
  • An angled view of a premium matte metal water bottle with bamboo accents, showcasing its sleek profile. The background features a soft blur of a serene mountain lake. Golden hour sunlight casts a warm glow on the bottle’s surface, highlighting its texture. Captured with a shallow depth of field for product emphasis.
  • Watercolor scene of a cute baby dragon with pearlescent mint-green scales crouched at the edge of a garden puddle, tiny wings raised. Soft pastel flowers and foliage frame the composition. Loose, wet-on-wet technique for a dreamy atmosphere, with sunlight glinting off ripples in the puddle.
  • Abstract figures emerging from digital screens, glitch art aesthetic with RGB color shifts, fragmented pixel clusters, high contrast scanlines, deep shadows cast by volumetric lighting.
  • An intimate portrait of a seasoned fisherman, his face filling the frame. His thick gray beard is flecked with sea spray, and his knit cap is pulled low over his brow. The warm glow of sunset bathes his weathered features in golden light, softening the lines of his face while still preserving the character earned through years at sea. His eyes reflect the calm waters of the harbor behind him.

Generate video with Nova Reel

Video generation models work best with descriptive prompts rather than command-based instructions. When crafting prompts, focus on what you want to see rather than telling the model what to do. Think of writing a detailed caption or scene description like you’re explaining a video that already exists. For example, describing elements like the main subjects, what they’re doing, the setting around them, how the scene is lit, the overall artistic style, and any camera movements can help create more accurate results. The key is to paint a complete picture through description rather than giving step-by-step directions. This means instead of saying “create a dramatic scene,” you could describe “a stormy beach at sunset with crashing waves and dark clouds, filmed with a slow aerial shot moving over the coastline.” The more specific and descriptive you can be about the visual elements you want, the better the output will be. You might want to include details about the subject, action, environment, lighting, style, and camera motion.

When writing a video generation prompt for Nova Reel, be mindful of the following requirements and best practices:

  • Prompts must be no longer than 512 characters.
  • For best results, place camera movement descriptions at the start or end of your prompt.
  • Specify what you want to include rather than what to exclude. For example, instead of “fruit basket with no bananas,” say “fruit basket with apples, oranges, and pears.”

When describing camera movements in your video prompts, be specific about the type of motion you want—whether it’s a smooth dolly shot (moving forward/backward), a pan (sweeping left/right), or a tilt (moving up/down). For more dynamic effects, you can request aerial shots, orbit movements, or specialized techniques like dolly zooms. You can also specify the speed of the movement; refer to camera controls for more ideas.
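
Programmatic video generation with Nova Reel runs asynchronously and writes its output to Amazon S3. The following is a minimal boto3 sketch, assuming the start_async_invoke and get_async_invoke operations, the amazon.nova-reel-v1:0 model ID, and the request fields documented for Nova Reel; the output bucket is a placeholder, so verify the field names against the current documentation.

import time

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

prompt = ("Dolly in shot of peaceful deer drinking from forest stream. "
          "Sunlight filtering through and bokeh of other deer and forest plants. 4k cinematic.")

# Video generation runs asynchronously; results are written to the S3 bucket you specify
job = client.start_async_invoke(
    modelId="amazon.nova-reel-v1:0",
    modelInput={
        "taskType": "TEXT_VIDEO",
        "textToVideoParams": {"text": prompt},
        "videoGenerationConfig": {"durationSeconds": 6, "fps": 24, "dimension": "1280x720", "seed": 0},
    },
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://<your-output-bucket>"}},  # placeholder
)

# Poll until the job completes, then download the MP4 from the S3 location
while True:
    status = client.get_async_invoke(invocationArn=job["invocationArn"])["status"]
    if status != "InProgress":
        print(status)
        break
    time.sleep(15)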

Example video generation prompts

Let’s look at some of the outputs from Amazon Nova Reel.

  • Track right shot of single red balloon floating through empty subway tunnel. Balloon glows from within, casting soft red light on concrete walls. Cinematic 4k, moody lighting.
  • Dolly in shot of peaceful deer drinking from forest stream. Sunlight filtering through and bokeh of other deer and forest plants. 4k cinematic.
  • Pedestal down in a pan in modern kitchen penne pasta with heavy cream white sauce on top, mushrooms and garlic, steam coming out.
  • Orbit shot of crystal light bulb centered on polished marble surface, gentle floating gears spinning inside with golden glow. Premium lighting. 4k cinematic.

Generate image-based video using Nova Reel

In addition to the fundamental text-to-video, Amazon Nova Reel also supports an image-to-video feature that allows you to use an input reference image to guide video generation. Using image-based prompts for video generation offers significant advantages: it speeds up your creative iteration process and provides precise control over the final output. Instead of relying solely on text descriptions, you can visually define exactly how you want your video to start. These images can come from Nova Canvas or from other sources where you have appropriate rights to use them. This approach has two main strategies:

  • Simple camera motion – Start with your reference image to establish the visual elements and style, then add minimal prompts focused purely on camera movement, such as “dolly forward.” This approach keeps the scene largely static while creating dynamic motion through camera direction.
  • Dynamic transformation – This strategy involves describing specific actions and temporal changes in your scene. Detail how elements should evolve over time, framing your descriptions as summaries of the desired transformation rather than step-by-step commands. This allows for more complex scene evolution while maintaining the visual foundation established by your reference image.

This method streamlines your creative workflow by using your image as the visual foundation. Rather than spending time refining text prompts to achieve the right look, you can quickly create an image in Nova Canvas (or source it elsewhere) and use it as your starting point in Nova Reel. This approach leads to faster iterations and more predictable results compared to pure text-to-video generation.

Let’s take some images we created earlier with Amazon Nova Canvas and use them as reference frames to generate videos with Amazon Nova Reel.

  • Slow dolly forward
  • A 3d animated film: A mint-green baby dragon is talking dramatically. Emotive, high quality animation.
  • Track right of a premium matte metal water bottle with bamboo accents with background serene mountain lake ripples moving
  • Orbit shot of premium over-ear headphones on a reflective surface. Dramatic side lighting accentuates the curves and edges, casting subtle shadows that highlight the product’s premium build quality.

Tying it all together

You can transform your storytelling by combining multiple generated video clips into a captivating narrative. Although Nova Reel handles the video generation, you can further enhance the content using preferred video editing software to add thoughtful transitions, background music, and narration. This combination creates an immersive journey that transports viewers into compelling storytelling. The following example showcases Nova Reel’s capabilities in creating engaging visual narratives. Each clip is crafted with professional lighting and cinematic quality.

Best practices

Consider the following best practices:

  • Refine iteratively – Start simple and refine your prompt based on the output.
  • Be specific – Provide detailed descriptions for better results.
  • Diversify adjectives – Avoid overusing generic descriptors like “beautiful” or “amazing”; opt for specific terms like “serene” or “ornate.”
  • Refine prompts with AI – Use multimodal understanding models like Amazon Nova (Pro, Lite, or Micro) to help you convert your high-level idea into a refined prompt based on best practices. Generative AI can also help you quickly generate different variations of your prompt to help you experiment with different combinations of style, framing, and more, as well as other creative explorations. Experiment with community-built tools like the Amazon Nova Canvas Prompt Refiner in PartyRock, the Amazon Nova Canvas Prompting Assistant, and the Nova Reel Prompt Optimizer.
  • Maintain a templatized prompt library – Keep a catalog of effective prompts and their results to refine and adapt over time. Create reusable templates for common scenarios to save time and maintain consistency.
  • Learn from others – Explore community resources and tools to see effective prompts and adapt them.
  • Follow trends – Monitor updates or new model features, because prompt behavior might change with new capabilities.

Fundamentals of prompt engineering for Amazon Nova

Effective prompt engineering for Amazon Nova Canvas and Nova Reel follows an iterative process, designed to refine your input and achieve the desired output. This process can be visualized as a logical flow chart, as illustrated in the following diagram.

The process consists of the following steps:

  1. Begin by crafting your initial prompt, focusing on descriptive elements such as subject, action, environment, lighting, style, and camera position or motion.
  2. After generating the first output, assess whether it aligns with your vision. If the result is close to what you want, proceed to the next step. If not, return to Step 1 and refine your prompt.
  3. When you’ve found a promising direction, maintain the same seed value. This maintains consistency in subsequent generations while you make minor adjustments.
  4. Make small, targeted changes to your prompt. These could involve adjusting descriptors, adding or removing elements, and more.
  5. Generate a new output with your refined prompt, keeping the seed value constant.
  6. Evaluate the new output. If it’s satisfactory, move to the next step. If not, return to Step 4 and continue refining.
  7. When you’ve achieved a satisfactory result, experiment with different seed values to create variations of your successful prompt. This allows you to explore subtle differences while maintaining the core elements of your desired output.
  8. Select the best variations from your generated options as your final output.

This iterative approach allows for systematic improvement of your prompts, leading to more accurate and visually appealing results from Nova Canvas and Nova Reel models. Remember, the key to success lies in making incremental changes, maintaining consistency where needed, and being open to exploring variations after you’ve found a winning formula.

Conclusion

Understanding effective prompt engineering for Nova Canvas and Nova Reel unlocks exciting possibilities for creating stunning images and compelling videos. By following the best practices outlined in this guide—from crafting descriptive prompts to iterative refinement—you can transform your ideas into production-ready visual assets.

Ready to start creating? Visit the Amazon Bedrock console today to experiment with Nova Canvas and Nova Reel in the Amazon Bedrock Playground or using the APIs. For detailed specifications, supported features, and additional examples, refer to the following resources:


About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world class design. In his role as Senior Product Manager, Kris helps design and build AWS services to power Media & Entertainment, Gaming, and Spatial Computing.

Sanju Sunny is a Digital Innovation Specialist with Amazon ProServe. He engages with customers in a variety of industries around Amazon’s distinctive customer-obsessed innovation mechanisms in order to rapidly conceive, validate and prototype new products, services and experiences.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.


Security best practices to consider while fine-tuning models in Amazon Bedrock

Security best practices to consider while fine-tuning models in Amazon Bedrock

Amazon Bedrock has emerged as the preferred choice for tens of thousands of customers seeking to build their generative AI strategy. It offers a straightforward, fast, and secure way to develop advanced generative AI applications and experiences to drive innovation.

With the comprehensive capabilities of Amazon Bedrock, you have access to a diverse range of high-performing foundation models (FMs), empowering you to select the most suitable option for your specific needs, customize the model privately with your own data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and create managed agents that run complex business tasks.

Fine-tuning pre-trained language models allows organizations to customize and optimize the models for their specific use cases, providing better performance and more accurate outputs tailored to their unique data and requirements. By using fine-tuning capabilities, businesses can unlock the full potential of generative AI while maintaining control over the model’s behavior and aligning it with their goals and values.

In this post, we delve into the essential security best practices that organizations should consider when fine-tuning generative AI models.

Security in Amazon Bedrock

Cloud security at AWS is the highest priority. Amazon Bedrock prioritizes security through a comprehensive approach to protect customer data and AI workloads.

Amazon Bedrock is built with security at its core, offering several features to protect your data and models. The main aspects of its security framework include:

  • Access control – This includes features such as fine-grained access management through AWS Identity and Access Management (IAM) identity-based policies
  • Data encryption – Amazon Bedrock encrypts data in transit and at rest, including support for customer managed AWS KMS keys
  • Network security – Amazon Bedrock offers several security options, including:
    • Support for AWS PrivateLink to establish private connectivity between your virtual private cloud (VPC) and Amazon Bedrock
    • VPC endpoints for secure communication within your AWS environment
  • Compliance – Amazon Bedrock aligns with various industry standards and regulations, including HIPAA, SOC, and PCI DSS

Solution overview

Model customization is the process of providing training data to a model to improve its performance for specific use cases. Amazon Bedrock currently offers the following customization methods:

  • Continued pre-training – Enables tailoring an FM’s capabilities to specific domains by fine-tuning its parameters with unlabeled, proprietary data, allowing continuous improvement as more relevant data becomes available.
  • Fine-tuning – Involves providing labeled data to train a model on specific tasks, enabling it to learn the appropriate outputs for given inputs. This process adjusts the model’s parameters, enhancing its performance on the tasks represented by the labeled training dataset.
  • Distillation – The process of transferring knowledge from a larger, more capable model (known as the teacher) to a smaller, faster, more cost-efficient model (known as the student).

Model customization in Amazon Bedrock involves the following actions:

  1. Create training and validation datasets.
  2. Set up IAM permissions for data access.
  3. Configure a KMS key and VPC.
  4. Create a fine-tuning or pre-training job with hyperparameter tuning.
  5. Analyze results through metrics and evaluation.
  6. Purchase provisioned throughput for the custom model.
  7. Use the custom model for tasks like inference.

In this post, we explain these steps in relation to fine-tuning. However, you can apply the same concepts for continued pre-training as well.
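
For reference, the following is a minimal boto3 sketch of what such a fine-tuning job submission can look like through the API. The role ARN, bucket names, KMS key ARN, and subnet and security group IDs are placeholders for the resources created in the following sections, and the base model identifier and hyperparameter values are illustrative only, so confirm them against the Amazon Bedrock documentation for your model.

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

response = bedrock.create_model_customization_job(
    jobName="llama-3-1-8b-fine-tune-job",
    customModelName="llama-3-1-8b-fine-tuned",
    customizationType="FINE_TUNING",
    roleArn="arn:aws:iam::<account-id>:role/<model-customization-role>",  # placeholder
    baseModelIdentifier="meta.llama3-1-8b-instruct-v1:0",                 # confirm for your model
    trainingDataConfig={"s3Uri": "s3://<training-bucket>/fine-tuning-datasets/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://<training-bucket>/outputs/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.0001"},  # illustrative
    # Encrypt the resulting custom model with your customer managed KMS key
    customModelKmsKeyId="arn:aws:kms:us-west-2:<account-id>:key/<key-id>",
    # Run the job against your VPC private subnets and security group
    vpcConfig={"subnetIds": ["subnet-<id1>", "subnet-<id2>"],
               "securityGroupIds": ["sg-<id>"]},
)
print(response["jobArn"])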

The following architecture diagram explains the workflow of Amazon Bedrock model fine-tuning.

The workflow steps are as follows:

  1. The user submits an Amazon Bedrock fine-tuning job within their AWS account, using IAM for resource access.
  2. The fine-tuning job initiates a training job in the model deployment accounts.
  3. To access training data in your Amazon Simple Storage Service (Amazon S3) bucket, the job uses the AWS Security Token Service (AWS STS) to assume role permissions for authentication and authorization.
  4. Network access to S3 data is facilitated through a VPC network interface, using the VPC and subnet details provided during job submission.
  5. The VPC is equipped with private endpoints for Amazon S3 and AWS KMS access, enhancing overall security.
  6. The fine-tuning process generates model artifacts, which are stored in the model provider AWS account and encrypted using the customer-provided KMS key.

This workflow provides secure data handling across multiple AWS accounts while maintaining customer control over sensitive information using customer managed encryption keys.

The customer is in control of the data; model providers don’t have access to a customer’s inference data or customization training datasets, so your data is not available to model providers to improve their base models. Your data is also unavailable to the Amazon Bedrock service team.

In the following sections, we go through the steps of fine-tuning and deploying the Meta Llama 3.1 8B Instruct model in Amazon Bedrock using the Amazon Bedrock console.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account
  • An IAM federation role with access to do the following:
    • Create, edit, view, and delete VPC network and security resources
    • Create, edit, view, and delete KMS keys
    • Create, edit, view, and delete IAM roles and policies for model customization
    • Create, upload, view, and delete S3 buckets to access training and validation data and permission to write output data to Amazon S3
    • List FMs to choose the base model that will be used for fine-tuning
    • Create a custom training job for the Amazon Bedrock FM
    • Purchase Provisioned Throughput for the fine-tuned model
    • List custom models and invoke model permissions on the fine-tuned model
  • Model access, which you can request through the Amazon Bedrock console

For this post, we use the us-west-2 AWS Region. For instructions on assigning permissions to the IAM role, refer to Identity-based policy examples for Amazon Bedrock and How Amazon Bedrock works with IAM.

Prepare your data

To fine-tune a text-to-text model like Meta Llama 3.1 8B Instruct, prepare a training and optional validation dataset by creating a JSONL file with multiple JSON lines.

Each JSON line is a sample containing a prompt and completion field. The format is as follows:

{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}

The following is an example from a sample dataset used as one-line input for fine-tuning Meta Llama 3.1 8B Instruct in Amazon Bedrock. In JSONL format, each record is one text line.

{"prompt": "consumer complaints and resolutions for financial products", "completion": "{'Date received': '01/01/24', 'Product': 'Credit card', 'Sub-product': 'Store credit card', 'Issue': 'Other features, terms, or problems', 'Sub-issue': 'Other problem', 'Consumer complaint narrative': None, 'Company public response': None, 'Company': 'Bread Financial Holdings, Inc.', 'State': 'MD', 'ZIP code': '21060', 'Tags': 'Servicemember', 'Consumer consent provided?': 'Consent not provided', 'Submitted via': 'Web', 'Date sent to company': '01/01/24', 'Company response to consumer': 'Closed with non-monetary relief', 'Timely response?': 'Yes', 'Consumer disputed?': None, 'Complaint ID': 8087806}"}

Create a KMS symmetric key

When uploading your training data to Amazon S3, you can use server-side encryption with AWS KMS. You can create KMS keys using the AWS Management Console, the AWS Command Line Interface (AWS CLI) or SDKs, or an AWS CloudFormation template. Complete the following steps to create a KMS key in the console:

  1. On the AWS KMS console, choose Customer managed keys in the navigation pane.
  2. Choose Create key.
  3. Create a symmetric key. For instructions, see Create a KMS key.
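
If you prefer to create the key programmatically, the following is a minimal boto3 sketch; the key description and alias name are illustrative.

import boto3

kms = boto3.client("kms", region_name="us-west-2")

# Create a symmetric key for encrypting the training data and model artifacts
key = kms.create_key(
    Description="Key for Amazon Bedrock fine-tuning data and custom model artifacts",
    KeySpec="SYMMETRIC_DEFAULT",
    KeyUsage="ENCRYPT_DECRYPT",
)
key_arn = key["KeyMetadata"]["Arn"]

# Optional friendly alias (name is illustrative)
kms.create_alias(AliasName="alias/bedrock-fine-tuning-key", TargetKeyId=key_arn)
print(key_arn)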

Create an S3 bucket and configure encryption

Complete the following steps to create an S3 bucket and configure encryption:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique name for your bucket.

  4. For Encryption type, select Server-side encryption with AWS Key Management Service keys.
  5. For AWS KMS key, select Choose from your AWS KMS keys and choose the key you created.

  6. Complete the bucket creation with default settings or customize as needed.
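
The same configuration can be applied through the API; the following is a minimal boto3 sketch, where the bucket name and KMS key ARN are placeholders.

import boto3

s3 = boto3.client("s3", region_name="us-west-2")
bucket_name = "<your-unique-bucket-name>"  # placeholder

# Buckets outside us-east-1 require a LocationConstraint
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Default server-side encryption with the customer managed KMS key created earlier
s3.put_bucket_encryption(
    Bucket=bucket_name,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "<your-kms-key-arn>",  # placeholder
            },
            "BucketKeyEnabled": True,
        }]
    },
)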

Upload the training data

Complete the following steps to upload the training data:

  1. On the Amazon S3 console, navigate to your bucket.
  2. Create the folders fine-tuning-datasets and outputs and keep the bucket encryption settings as server-side encryption.
  3. Choose Upload and upload your training data file.

Create a VPC

To create a VPC using Amazon Virtual Private Cloud (Amazon VPC), complete the following steps:

  1. On the Amazon VPC console, choose Create VPC.
  2. Create a VPC with private subnets in all Availability Zones.

Create an Amazon S3 VPC gateway endpoint

You can further secure your VPC by setting up an Amazon S3 VPC endpoint and using resource-based IAM policies to restrict access to the S3 bucket containing the model customization data.

Let’s create an Amazon S3 gateway endpoint and attach it to the VPC with a custom IAM resource-based policy to more tightly control access to your Amazon S3 files.

The following code is a sample resource policy. Use the name of the bucket you created earlier.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "RestrictAccessToTrainingBucket",
			"Effect": "Allow",
			"Principal": "*",
			"Action": [
				"s3:GetObject",
				"s3:PutObject",
				"s3:ListBucket"
			],
			"Resource": [
				"arn:aws:s3:::$your-bucket",
				"arn:aws:s3:::$your-bucket/*"
			]
		}
	]
}
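
The following is a minimal boto3 sketch of creating the gateway endpoint with this policy attached; the VPC ID, route table ID, and bucket name are placeholders for your own resources.

import json

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RestrictAccessToTrainingBucket",
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/*"],  # placeholders
    }],
}

# Gateway endpoints attach to route tables rather than subnets
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-<id>",                      # placeholder
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-<id>"],            # placeholder
    PolicyDocument=json.dumps(policy),
)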

Create a security group for the AWS KMS VPC interface endpoint

A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. This VPC endpoint security group only allows traffic originating from the security group attached to your VPC private subnets, adding a layer of protection. Complete the following steps to create the security group:

  1. On the Amazon VPC console, choose Security groups in the navigation pane.
  2. Choose Create security group.
  3. For Security group name, enter a name (for example, bedrock-kms-interface-sg).
  4. For Description, enter a description.
  5. For VPC, choose your VPC.

  6. Add an inbound rule to allow HTTPS traffic from the VPC CIDR block.

Create a security group for the Amazon Bedrock custom fine-tuning job

Now you can create a security group to establish rules for controlling Amazon Bedrock custom fine-tuning job access to the VPC resources. You use this security group later during model customization job creation. Complete the following steps:

  1. On the Amazon VPC console, choose Security groups in the navigation pane.
  2. Choose Create security group.
  3. For Security group name, enter a name (for example, bedrock-fine-tuning-custom-job-sg).
  4. For Description, enter a description.
  5. For VPC, choose your VPC.

  6. Add an inbound rule to allow traffic from the security group.

Create an AWS KMS VPC interface endpoint

Now you can create an interface VPC endpoint (PrivateLink) to establish a private connection between the VPC and AWS KMS.

For the security group, use the one you created in the previous step.

Attach a VPC endpoint policy that controls the access to resources through the VPC endpoint. The following code is a sample resource policy. Use the Amazon Resource Name (ARN) of the KMS key you created earlier.

{
	"Statement": [
		{
			"Sid": "AllowDecryptAndView",
			"Principal": {
				"AWS": "*"
			},
			"Effect": "Allow",
			"Action": [
				"kms:Decrypt",
				"kms:DescribeKey",
				"kms:ListAliases",
				"kms:ListKeys"
			],
			"Resource": "$Your-KMS-KEY-ARN"
		}
	]
}
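
You can create the AWS KMS interface endpoint and attach the preceding endpoint policy in a similar way; the following is a minimal boto3 sketch, where the VPC, subnet, security group, and key ARN values are placeholders for your own resources.

import json

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

kms_endpoint_policy = {
    "Statement": [{
        "Sid": "AllowDecryptAndView",
        "Principal": {"AWS": "*"},
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:ListAliases", "kms:ListKeys"],
        "Resource": "<your-kms-key-arn>",   # placeholder
    }]
}

# Interface endpoints (PrivateLink) live in your subnets and use a security group
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-<id>",                                    # placeholder
    ServiceName="com.amazonaws.us-west-2.kms",
    SubnetIds=["subnet-<id1>", "subnet-<id2>"],          # placeholders
    SecurityGroupIds=["sg-<bedrock-kms-interface-sg>"],  # security group from the earlier step
    PrivateDnsEnabled=True,
    PolicyDocument=json.dumps(kms_endpoint_policy),
)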

Now you have successfully created the endpoints needed for private communication.

Create a service role for model customization

Let’s create a service role for model customization with the following permissions:

  • A trust relationship for Amazon Bedrock to assume and carry out the model customization job
  • Permissions to access your training and validation data in Amazon S3 and to write your output data to Amazon S3
  • If you encrypt any of the following resources with a KMS key, permissions to decrypt the key (see Encryption of model customization jobs and artifacts):
    • A model customization job or the resulting custom model
    • The training, validation, or output data for the model customization job
  • Permission to access the VPC

Let’s first create the required IAM policies:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Under Specify permissions, use the following JSON to provide access to the S3 buckets, VPC, and KMS key. Provide your account, bucket name, and VPC settings.

You can use the following IAM permissions policy as a template for VPC permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeVpcs",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups"
            ],
            "Resource": "*"
        }, 
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
            ],
            "Resource":[
               "arn:aws:ec2:${{region}}:${{account-id}}:network-interface/*"
            ],
            "Condition": {
               "StringEquals": { 
                   "aws:RequestTag/BedrockManaged": ["true"]
                },
                "ArnEquals": {
                   "aws:RequestTag/BedrockModelCustomizationJobArn": ["arn:aws:bedrock:${{region}}:${{account-id}}:model-customization-job/*"]
               }
            }
        }, 
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
            ],
            "Resource":[
               "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id}}",
               "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id2}}",
               "arn:aws:ec2:${{region}}:${{account-id}}:security-group/security-group-id"
            ]
        }, 
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
            ],
            "Resource": "*",
            "Condition": {
               "ArnEquals": {
                   "ec2:Subnet": [
                       "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id}}",
                       "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id2}}"
                   ],
                   "ec2:ResourceTag/BedrockModelCustomizationJobArn": ["arn:aws:bedrock:${{region}}:${{account-id}}:model-customization-job/*"]
               },
               "StringEquals": { 
                   "ec2:ResourceTag/BedrockManaged": "true"
               }
            }
        }, 
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:${{region}}:${{account-id}}:network-interface/*",
            "Condition": {
                "StringEquals": {
                    "ec2:CreateAction": [
                        "CreateNetworkInterface"
                    ]    
                },
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": [
                        "BedrockManaged",
                        "BedrockModelCustomizationJobArn"
                    ]
                }
            }
        }
    ]
}

You can use the following IAM permissions policy as a template for Amazon S3 permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::training-bucket",
                "arn:aws:s3:::training-bucket/*",
                "arn:aws:s3:::validation-bucket",
                "arn:aws:s3:::validation-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::output-bucket",
                "arn:aws:s3:::output-bucket/*"
            ]
        }
    ]
}

Now let’s create the IAM role.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Create a role with the following trust policy (provide your AWS account ID):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "account-id"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:us-west-2:account-id:model-customization-job/*"
                }
            }
        }
    ] 
}
  4. Assign your custom VPC and S3 bucket access policies.
  5. Give a name to your role and choose Create role.
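
If you would rather script this step, the following boto3 sketch creates an equivalent role. The role name and the ARNs of the customer managed VPC and S3 policies are placeholders; adjust them to match the policies you created earlier.

import json
import boto3

iam = boto3.client("iam")

ACCOUNT_ID = "111122223333"                      # replace with your AWS account ID
ROLE_NAME = "BedrockModelCustomizationRole"      # hypothetical role name

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": ACCOUNT_ID},
                "ArnEquals": {
                    "aws:SourceArn": f"arn:aws:bedrock:us-west-2:{ACCOUNT_ID}:model-customization-job/*"
                },
            },
        }
    ],
}

role = iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Service role for Amazon Bedrock model customization",
)

# Attach the customer managed VPC and S3 policies created earlier (hypothetical policy ARNs).
for policy_arn in [
    f"arn:aws:iam::{ACCOUNT_ID}:policy/BedrockCustomizationVpcAccess",
    f"arn:aws:iam::{ACCOUNT_ID}:policy/BedrockCustomizationS3Access",
]:
    iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=policy_arn)

print(role["Role"]["Arn"])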

Update the KMS key policy with the IAM role

Update the key policy of the KMS key you created in the previous steps to include the ARN of the IAM role. The following code is a sample key policy:

{
    "Version": "2012-10-17",
    "Id": "key-consolepolicy-3",
    "Statement": [
        {
            "Sid": "BedrockFineTuneJobPermissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "$IAM Role ARN"
            },
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey",
                "kms:Encrypt",
                "kms:DescribeKey",
                "kms:CreateGrant",
                "kms:RevokeGrant"
            ],
            "Resource": "$ARN of the KMS key"
        }
     ]
}
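
As a reference, the following boto3 sketch applies an equivalent statement with put_key_policy. The key ID, Region, account ID, and role ARN are placeholders. Because put_key_policy replaces the entire key policy, in practice you would first retrieve the existing policy with get_key_policy and append this statement to it so you keep the default administrator statement.

import json
import boto3

kms = boto3.client("kms")

KMS_KEY_ID = "your-key-id"  # placeholder key ID
ROLE_ARN = "arn:aws:iam::111122223333:role/BedrockModelCustomizationRole"
KMS_KEY_ARN = f"arn:aws:kms:us-west-2:111122223333:key/{KMS_KEY_ID}"

key_policy = {
    "Version": "2012-10-17",
    "Id": "key-consolepolicy-3",
    "Statement": [
        {
            "Sid": "BedrockFineTuneJobPermissions",
            "Effect": "Allow",
            "Principal": {"AWS": ROLE_ARN},
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey",
                "kms:Encrypt",
                "kms:DescribeKey",
                "kms:CreateGrant",
                "kms:RevokeGrant",
            ],
            "Resource": KMS_KEY_ARN,
        }
    ],
}

# Note: this call replaces the full key policy; merge with the existing policy in practice.
kms.put_key_policy(KeyId=KMS_KEY_ID, PolicyName="default", Policy=json.dumps(key_policy))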

For more details, refer to Encryption of model customization jobs and artifacts.

Initiate the fine-tuning job

Complete the following steps to set up your fine-tuning job:

  1. On the Amazon Bedrock console, choose Custom models in the navigation pane.
  2. In the Models section, choose Customize model, then choose Create fine-tuning job.
  3. Under Model details, choose Select model.
  4. Choose Llama 3.1 8B Instruct as the base model and choose Apply.
  5. For Fine-tuned model name, enter a name for your custom model.
  6. Select Model encryption to add a KMS key and choose the KMS key you created earlier.
  7. For Job name, enter a name for the training job.
  8. Optionally, expand the Tags section to add tags for tracking.
  9. Under VPC Settings, choose the VPC, subnets, and security group you created in the previous steps.

When you specify the VPC subnets and security groups for a job, Amazon Bedrock creates elastic network interfaces (ENIs) that are associated with your security groups in one of the subnets. ENIs allow the Amazon Bedrock job to connect to resources in your VPC.

We recommend that you provide at least one subnet in each Availability Zone.

  10. Under Input data, specify the S3 locations for your training and validation datasets.
  11. Under Hyperparameters, set the values for Epochs, Batch size, Learning rate, and Learning rate warm up steps for your fine-tuning job.

Refer to Custom model hyperparameters for additional details.

  12. Under Output data, for S3 location, enter the S3 path for the bucket storing fine-tuning metrics.
  13. Under Service access, select a method to authorize Amazon Bedrock. You can select Use an existing service role and use the role you created earlier.
  14. Choose Create Fine-tuning job.
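
If you want to automate this step instead of using the console, the following boto3 sketch starts a comparable fine-tuning job with create_model_customization_job. The job name, model name, role ARN, KMS key ARN, S3 URIs, subnet IDs, and security group ID are placeholders, and hyperparameter names can vary by base model (see Custom model hyperparameters).

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

# All names, ARNs, and S3 URIs below are placeholders for illustration.
response = bedrock.create_model_customization_job(
    jobName="llama31-8b-finetune-job",
    customModelName="llama31-8b-finetuned",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelCustomizationRole",
    baseModelIdentifier="meta.llama3-1-8b-instruct-v1:0",
    customizationType="FINE_TUNING",
    customModelKmsKeyId="arn:aws:kms:us-west-2:111122223333:key/your-key-id",
    trainingDataConfig={"s3Uri": "s3://training-bucket/train.jsonl"},
    validationDataConfig={
        "validators": [{"s3Uri": "s3://validation-bucket/validation.jsonl"}]
    },
    outputDataConfig={"s3Uri": "s3://output-bucket/metrics/"},
    hyperParameters={
        "epochCount": "2",
        "batchSize": "1",
        "learningRate": "0.00001",
    },
    vpcConfig={
        "subnetIds": ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)
print(response["jobArn"])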

Monitor the job

On the Amazon Bedrock console, choose Custom models in the navigation pane and locate your job.

You can monitor the job on the job details page.
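
You can also poll the job status programmatically, as in the following sketch (the job ARN is a placeholder):

import time
import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")
JOB_ARN = "arn:aws:bedrock:us-west-2:111122223333:model-customization-job/abcd1234"  # placeholder

while True:
    job = bedrock.get_model_customization_job(jobIdentifier=JOB_ARN)
    status = job["status"]  # InProgress | Completed | Failed | Stopping | Stopped
    print(f"Job status: {status}")
    if status not in ("InProgress", "Stopping"):
        break
    time.sleep(300)  # poll every 5 minutes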

Purchase provisioned throughput

After fine-tuning is complete (as shown in the following screenshot), you can use the custom model for inference. However, before you can use a customized model, you need to purchase provisioned throughput for it.

Complete the following steps:

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Custom models.
  2. On the Models tab, select your model and choose Purchase provisioned throughput.
  3. For Provisioned throughput name, enter a name.
  4. Under Select model, make sure the model is the same as the custom model you selected earlier.
  5. Under Commitment term & model units, configure your commitment term and model units. Refer to Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock for additional insights. For this post, we choose No commitment and use 1 model unit.
  6. Under Estimated purchase summary, review the estimated cost and choose Purchase provisioned throughput.
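
The same purchase can be made with the AWS SDK; in the following sketch, the provisioned throughput name and custom model ARN are placeholders:

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

# modelId is the ARN of the custom model created by the fine-tuning job (placeholder below).
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="llama31-8b-finetuned-pt",
    modelId="arn:aws:bedrock:us-west-2:111122223333:custom-model/meta.llama3-1-8b-instruct-v1:0/abcd1234",
    modelUnits=1,
    # Omitting commitmentDuration purchases no-commitment throughput, billed hourly.
)
print(response["provisionedModelArn"])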

After the provisioned throughput is in service, you can use the model for inference.

Use the model

Now you’re ready to use your model for inference.

  1. On the Amazon Bedrock console, under Playgrounds in the navigation pane, choose Chat/text.
  2. Choose Select model.
  3. For Category, choose Custom models under Custom & self-hosted models.
  4. For Model, choose the model you just trained.
  5. For Throughput, choose the provisioned throughput you just purchased.
  6. Choose Apply.

Now you can ask sample questions, as shown in the following screenshot.
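
You can also invoke the fine-tuned model programmatically by passing the provisioned throughput ARN as the model ID. The following sketch assumes the Meta Llama request body format (prompt, max_gen_len, temperature, top_p); the ARN and prompt are placeholders.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

# Use the provisioned throughput ARN as the model ID (placeholder below).
PROVISIONED_MODEL_ARN = "arn:aws:bedrock:us-west-2:111122223333:provisioned-model/abcd1234"

body = {
    "prompt": "Summarize the key security controls for fine-tuning jobs in Amazon Bedrock.",
    "max_gen_len": 512,
    "temperature": 0.2,
    "top_p": 0.9,
}

response = bedrock_runtime.invoke_model(
    modelId=PROVISIONED_MODEL_ARN,
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["generation"])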

Implementing these procedures allows you to follow security best practices when you deploy and use your fine-tuned model within Amazon Bedrock for inference tasks.

When developing a generative AI application that requires access to this fine-tuned model, you have the option to configure it within a VPC. By employing a VPC interface endpoint, you can make sure communication between your VPC and the Amazon Bedrock API endpoint occurs through a PrivateLink connection, rather than through the public internet.

This approach further enhances security and privacy. For more information on this setup, refer to Use interface VPC endpoints (AWS PrivateLink) to create a private connection between your VPC and Amazon Bedrock.

Clean up

Delete the following AWS resources created for this demonstration to avoid incurring future charges:

  • Amazon Bedrock model provisioned throughput
  • VPC endpoints
  • VPC and associated security groups
  • KMS key
  • IAM roles and policies
  • S3 bucket and objects

Conclusion

In this post, we implemented secure fine-tuning jobs in Amazon Bedrock, which is crucial for protecting sensitive data and maintaining the integrity of your AI models.

By following the best practices outlined in this post, including proper IAM role configuration, encryption at rest and in transit, and network isolation, you can significantly enhance the security posture of your fine-tuning processes.

By prioritizing security in your Amazon Bedrock workflows, you not only safeguard your data and models, but also build trust with your stakeholders and end-users, enabling responsible and secure AI development.

As a next step, try the solution out in your account and share your feedback.


About the Authors

Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core areas of focus include generative AI and machine learning. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Sumeet Tripathi is an Enterprise Support Lead (TAM) at AWS in North Carolina. He has over 17 years of experience in technology across various roles. He is passionate about helping customers reduce operational challenges and friction. His focus areas are AI/ML and the Energy & Utilities segment. Outside work, he enjoys traveling with family, watching cricket, and movies.


Secure a generative AI assistant with OWASP Top 10 mitigation

Secure a generative AI assistant with OWASP Top 10 mitigation

A common generative AI use case that we often see customers evaluate for production is a generative AI-powered assistant. However, before it can be deployed, there is the typical production readiness assessment, which includes concerns such as understanding the security posture, monitoring and logging, cost tracking, resilience, and more. The highest priority of these production readiness assessments is usually security. If there are security risks that can’t be clearly identified, then they can’t be addressed, and that can halt the production deployment of the generative AI application.

In this post, we show you an example of a generative AI assistant application and demonstrate how to assess its security posture using the OWASP Top 10 for Large Language Model Applications, as well as how to apply mitigations for common threats.

Generative AI scoping framework

Start by understanding where your generative AI application fits within the spectrum of managed vs. custom. Use the AWS generative AI scoping framework to understand the specific mix of the shared responsibility for the security controls applicable to your application. For example, Scope 1 “Consumer Apps” like PartyRock or ChatGPT are usually publicly facing applications, where most of the application internal security is owned and controlled by the provider, and your responsibility for security is on the consumption side. Contrast that with Scope 4/5 applications, where not only do you build and secure the generative AI application yourself, but you are also responsible for fine-tuning and training the underlying large language model (LLM). The security controls in scope for Scope 4/5 applications will range more broadly from the frontend to LLM model security. This post will focus on the Scope 3 generative AI assistant application, which is one of the more frequent use cases seen in the field.

The following figure of the AWS Generative AI Security Scoping Matrix summarizes the types of models for each scope.

AWS GenAI scoping matrix

OWASP Top 10 for LLMs

Using the OWASP Top 10 for understanding threats and mitigations to an application is one of the most common ways application security is assessed. The OWASP Top 10 for LLMs takes a tried and tested framework and applies it to generative AI applications to help us discover, understand, and mitigate the novel threats for generative AI.

OWASP Top 10 for GenAI apps

Solution overview

Let’s start with a logical architecture of a typical generative AI assistant application overlying the OWASP Top 10 for LLM threats, as illustrated in the following diagram.

Logical solution diagram

In this architecture, the end-user request usually goes through the following components:

  • Authentication layer – This layer validates that the user connecting to the application is who they say they are. This is typically done through some sort of an identity provider (IdP) capability like Okta, AWS IAM Identity Center, or Amazon Cognito.
  • Application controller – This layer contains most of the application business logic and determines how to process the incoming user request by generating the LLM prompts and processing LLM responses before they are sent back to the user.
  • LLM and LLM agent – The LLM provides the core generative AI capability to the assistant. The LLM agent is an orchestrator of a set of steps that might be necessary to complete the desired request. These steps might involve both the use of an LLM and external data sources and APIs.
  • Agent plugin controller – This component is responsible for the API integration to external data sources and APIs. This component also holds the mapping between the logical name of an external component, which the LLM agent might refer to, and the physical name.
  • RAG data store – The Retrieval Augmented Generation (RAG) data store delivers up-to-date, precise, and access-controlled knowledge from various data sources such as data warehouses, databases, and other software as a service (SaaS) applications through data connectors.

The OWASP Top 10 for LLM risks map to various layers of the application stack, highlighting vulnerabilities from UIs to backend systems. In the following sections, we discuss risks at each layer and provide an application design pattern for a generative AI assistant application in AWS that mitigates these risks.

The following diagram illustrates the assistant architecture on AWS.

Authentication layer (Amazon Cognito)

Common security threats such as brute force attacks, session hijacking, and denial of service (DoS) attacks can occur. To mitigate these risks, implement best practices like multi-factor authentication (MFA), rate limiting, secure session management, automatic session timeouts, and regular token rotation. Additionally, deploying edge security measures such as AWS WAF and distributed denial of service (DDoS) mitigation helps block common web exploits and maintain service availability during attacks.

In the preceding architecture diagram, AWS WAF is integrated with Amazon API Gateway to filter incoming traffic, blocking unintended requests and protecting applications from threats like SQL injection, cross-site scripting (XSS), and DoS attacks. AWS WAF Bot Control further enhances security by providing visibility and control over bot traffic, allowing administrators to block or rate-limit unwanted bots. This feature can be centrally managed across multiple accounts using AWS Firewall Manager, providing a consistent and robust approach to application protection.

Amazon Cognito complements these defenses by enabling user authentication and data synchronization. It supports both user pools and identity pools, enabling seamless management of user identities across devices and integration with third-party identity providers. Amazon Cognito offers security features, including MFA, OAuth 2.0, OpenID Connect, secure session management, and risk-based adaptive authentication, to help protect against unauthorized access by evaluating sign-in requests for suspicious activity and responding with additional security measures like MFA or blocking sign-ins. Amazon Cognito also enforces password reuse prevention, further protecting against compromised credentials.

AWS Shield Advanced adds an extra layer of defense by providing enhanced protection against sophisticated DDoS attacks. Integrated with AWS WAF, Shield Advanced delivers comprehensive perimeter protection, using tailored detection and health-based assessments to enhance response to attacks. It also offers round-the-clock support from the AWS Shield Response Team and includes DDoS cost protection, helping applications remain secure and cost-effective. Together, Shield Advanced and AWS WAF create a security framework that protects applications against a wide range of threats while maintaining availability.

This comprehensive security setup addresses LLM10:2025 Unbounded Consumption and LLM02:2025 Sensitive Information Disclosure, making sure that applications remain both resilient and secure.

Application controller layer (LLM orchestrator Lambda function)

The application controller layer is usually vulnerable to risks such as LLM01:2025 Prompt Injection, LLM05:2025 Improper Output Handling, and LLM02:2025 Sensitive Information Disclosure. Outside parties might frequently attempt to exploit this layer by crafting unintended inputs to manipulate the LLM, potentially causing it to reveal sensitive information or compromise downstream systems.

In the physical architecture diagram, the application controller is the LLM orchestrator AWS Lambda function. It performs strict input validation by extracting the event payload from API Gateway and conducting both syntactic and semantic validation. By sanitizing inputs, applying keyword allowlists and denylists, and validating inputs against predefined formats or patterns, the Lambda function helps prevent LLM01:2025 Prompt Injection attacks. Additionally, by passing the user_id downstream, it enables the downstream application components to mitigate the risk of sensitive information disclosure, addressing concerns related to LLM02:2025 Sensitive Information Disclosure.
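
The following is a minimal sketch of the kind of input validation the orchestrator Lambda function can perform; the allowlist pattern and denylist phrases are illustrative only and would be tuned to your application.

import re

# Hypothetical denylist of phrases commonly seen in prompt injection attempts.
DENYLIST = [
    "ignore previous instructions",
    "reveal your system prompt",
    "disregard all prior rules",
]

# Syntactic allowlist: permitted characters and a maximum length.
ALLOWED_PATTERN = re.compile(r"^[\w\s.,?!'\-:;()]{1,2000}$")

def validate_user_input(user_id: str, text: str) -> str:
    """Syntactic and semantic validation before the text is used to build an LLM prompt."""
    if not user_id or not text:
        raise ValueError("Missing user_id or input text")
    if not ALLOWED_PATTERN.match(text):
        raise ValueError("Input contains unsupported characters or is too long")
    lowered = text.lower()
    if any(phrase in lowered for phrase in DENYLIST):
        raise ValueError("Input rejected by denylist")
    return text.strip()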

Amazon Bedrock Guardrails provides an additional layer of protection by filtering and blocking sensitive content, such as personally identifiable information (PII) and other custom sensitive data defined by regex patterns. Guardrails can also be configured to detect and block offensive language, competitor names, or other undesirable terms, making sure that both inputs and outputs are safe. You can also use guardrails to prevent LLM01:2025 Prompt Injection attacks by detecting and filtering out harmful or manipulative prompts before they reach the LLM, thereby maintaining the integrity of the prompt.
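
As an illustration, the following boto3 sketch creates a guardrail with a prompt attack filter, PII handling, a custom regex, and a denied term. The guardrail name, regex pattern, blocked messages, and denied word are placeholders.

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

# Names, patterns, and messages below are illustrative only.
guardrail = bedrock.create_guardrail(
    name="assistant-guardrail",
    description="Blocks prompt attacks, PII, and undesirable terms",
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
        ],
        "regexesConfig": [
            {"name": "internal-ticket-id", "pattern": r"TICKET-\d{6}", "action": "ANONYMIZE"}
        ],
    },
    wordPolicyConfig={"wordsConfig": [{"text": "ExampleCompetitor"}]},
    blockedInputMessaging="Sorry, I can't process that request.",
    blockedOutputsMessaging="Sorry, I can't share that information.",
)
print(guardrail["guardrailId"])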

Another critical aspect of security is managing LLM outputs. Because the LLM might generate content that includes executable code, such as JavaScript or Markdown, there is a risk of XSS attacks if this content is not properly handled. To mitigate this risk, apply output encoding techniques, such as HTML entity encoding or JavaScript escaping, to neutralize any potentially harmful content before it is presented to users. This approach addresses the risk of LLM05:2025 Improper Output Handling.
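
A minimal sketch of this encoding step in Python:

import html

def encode_llm_output(raw_output: str) -> str:
    """HTML-entity encode LLM output so embedded markup or script cannot execute in the browser."""
    return html.escape(raw_output, quote=True)

# Example: "<script>alert('xss')</script>" becomes
# "&lt;script&gt;alert(&#x27;xss&#x27;)&lt;/script&gt;"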

Implementing Amazon Bedrock prompt management and versioning allows for continuous improvement of the user experience while maintaining the overall security of the application. By carefully managing changes to prompts and their handling, you can enhance functionality without introducing new vulnerabilities, while also mitigating LLM01:2025 Prompt Injection attacks.

Treating the LLM as an untrusted user and applying human-in-the-loop processes over certain actions are strategies to lower the likelihood of unauthorized or unintended operations.

LLM and LLM agent layer (Amazon Bedrock LLMs)

The LLM and LLM agent layer frequently handles interactions with the LLM and faces risks such as LLM10:2025 Unbounded Consumption, LLM05:2025 Improper Output Handling, and LLM02:2025 Sensitive Information Disclosure.

DoS attacks can overwhelm the LLM with multiple resource-intensive requests, degrading overall service quality while increasing costs. When interacting with Amazon Bedrock hosted LLMs, setting request parameters such as the maximum length of the input request will minimize the risk of LLM resource exhaustion. Additionally, there is a hard limit on the maximum number of queued actions and total actions an Amazon Bedrock agent can take to fulfill a customer’s intent, which limits the number of actions in a system reacting to LLM responses, avoiding unnecessary loops or intensive tasks that could exhaust the LLM’s resources.
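
The following sketch illustrates this pattern with the Amazon Bedrock Converse API: an application-level cap on input length plus a maxTokens limit on the response. The character limit and inference settings are illustrative.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

MAX_INPUT_CHARS = 4000  # illustrative application-level cap on request size

def ask_model(model_id: str, user_text: str) -> str:
    if len(user_text) > MAX_INPUT_CHARS:
        raise ValueError("Request exceeds maximum allowed input length")
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},  # cap output size
    )
    return response["output"]["message"]["content"][0]["text"]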

Improper output handling leads to vulnerabilities such as remote code execution, cross-site scripting, server-side request forgery (SSRF), and privilege escalation. Inadequate validation and management of LLM-generated outputs before they are sent downstream can grant indirect access to additional functionality, effectively enabling these vulnerabilities. To mitigate this risk, treat the model as any other user and validate the LLM-generated responses. Amazon Bedrock Guardrails facilitates this process with content filters that have configurable thresholds to filter harmful content and safeguard against prompt attacks before outputs are processed further downstream by other backend systems. Guardrails automatically evaluate both user input and model responses to detect and help prevent content that falls into restricted categories.

Amazon Bedrock Agents execute multi-step tasks and securely integrate with AWS native and third-party services to reduce the risk of insecure output handling, excessive agency, and sensitive information disclosure. In the architecture diagram, the action group Lambda function under the agents is used to encode all the output text, making it automatically non-executable by JavaScript or Markdown. Additionally, the action group Lambda function parses each output from the LLM at every step executed by the agents and controls the processing of the outputs accordingly, making sure they are safe before further processing.

Sensitive information disclosure is a risk with LLMs because malicious prompt engineering can cause LLMs to accidentally reveal unintended details in their responses. This can lead to privacy and confidentiality violations. To mitigate the issue, implement data sanitization practices through content filters in Amazon Bedrock Guardrails.

Additionally, implement custom data filtering policies based on user_id and strict user access policies. Amazon Bedrock Guardrails helps filter content deemed sensitive, and Amazon Bedrock Agents further reduces the risk of sensitive information disclosure by allowing you to implement custom logic in the preprocessing and postprocessing templates to strip any unexpected information. If you have enabled model invocation logging for the LLM or implemented custom logging logic in your application to record the input and output of the LLM in Amazon CloudWatch, measures such as CloudWatch Log data protection are important in masking sensitive information identified in the CloudWatch logs, further mitigating the risk of sensitive information disclosure.

Agent plugin controller layer (action group Lambda function)

The agent plugin controller frequently integrates with internal and external services and applies custom authorization to internal and external data sources and third-party APIs. At this layer, the risks of LLM08:2025 Vector & Embedding Weaknesses and LLM06:2025 Excessive Agency are in effect. Untrusted or unverified third-party plugins could introduce backdoors or vulnerabilities in the form of unexpected code.

Apply least privilege access to the AWS Identity and Access Management (IAM) roles of the action group Lambda function, which interacts with plugin integrations to external systems to help mitigate the risk of LLM06:2025 Excessive Agency and LLM08:2025 Vector & Embedding Weaknesses. This is demonstrated in the physical architecture diagram; the agent plugin layer Lambda function is associated with a least privilege IAM role for secure access and interface with other internal AWS services.

Additionally, after the user identity is determined, restrict the data plane by applying user-level access control by passing the user_id to downstream layers like the agent plugin layer. Although this user_id parameter can be used in the agent plugin controller Lambda function for custom authorization logic, its primary purpose is to enable fine-grained access control for third-party plugins. The responsibility lies with the application owner to implement custom authorization logic within the action group Lambda function, where the user_id parameter can be used in combination with predefined rules to apply the appropriate level of access to third-party APIs and plugins. This approach wraps deterministic access controls around a non-deterministic LLM and enables granular access control over which users can access and execute specific third-party plugins.
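
The following simplified sketch illustrates this pattern inside an action group Lambda function. The permission mapping, event fields, and response shape are illustrative only; a real Amazon Bedrock Agents action group uses a more detailed event and response schema.

# Hypothetical mapping of users to the third-party plugins they may invoke.
PLUGIN_PERMISSIONS = {
    "user-123": {"weather_api", "order_status_api"},
    "user-456": {"order_status_api"},
}

def lambda_handler(event, context):
    """Action group Lambda: apply user-level authorization before calling third-party plugins."""
    user_id = event.get("sessionAttributes", {}).get("user_id")
    api_path = event.get("apiPath", "")
    plugin_name = api_path.strip("/").split("/")[0] if api_path else ""

    allowed = PLUGIN_PERMISSIONS.get(user_id, set())
    if plugin_name not in allowed:
        # Deterministic access control wrapped around the non-deterministic LLM.
        return {"statusCode": 403, "body": "User is not authorized to call this plugin"}

    # ... call the external API with least-privilege credentials and return its response ...
    return {"statusCode": 200, "body": f"{plugin_name} invoked for {user_id}"}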

Combining user_id-based authorization on data and IAM roles with least privilege on the action group Lambda function will generally minimize the risk of LLM08:2025 Vector & Embedding Weaknesses and LLM06:2025 Excessive Agency.

RAG data store layer

The RAG data store is responsible for securely retrieving up-to-date, precise, and user access-controlled knowledge from various first-party and third-party data sources. By default, Amazon Bedrock encrypts all knowledge base-related data using an AWS managed key. Alternatively, you can choose to use a customer managed key. When setting up a data ingestion job for your knowledge base, you can also encrypt the job using a custom AWS Key Management Service (AWS KMS) key.

If you decide to use the vector store in Amazon OpenSearch Service for your knowledge base, Amazon Bedrock can pass a KMS key of your choice to it for encryption. Additionally, you can encrypt the sessions in which you generate responses from querying a knowledge base with a KMS key. To facilitate secure communication, Amazon Bedrock Knowledge Bases uses TLS encryption when interacting with third-party vector stores, provided that the service supports and permits TLS encryption in transit.

Regarding user access control, Amazon Bedrock Knowledge Bases uses filters to manage permissions. You can build a segmented access solution on top of a knowledge base using the metadata filtering feature. During runtime, your application must authenticate and authorize the user, and include this user information in the query to maintain accurate access controls. To keep the access controls current, you should periodically resync the data to reflect any changes in permissions. Additionally, groups can be stored as a filterable attribute, further refining access control.
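
For example, the following boto3 sketch queries a knowledge base with the Retrieve API and a metadata filter derived from the authenticated user’s group; the knowledge base ID and metadata attribute name are placeholders.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

# Knowledge base ID and metadata attribute names are placeholders.
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="KB12345678",
    retrievalQuery={"text": "What is the reimbursement policy for travel?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            # Only return chunks whose metadata matches the caller's authorized group.
            "filter": {"equals": {"key": "allowed_group", "value": "finance"}},
        }
    },
)
for result in response["retrievalResults"]:
    print(result["content"]["text"])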

This approach helps mitigate the risk of LLM02:2025 Sensitive Information Disclosure and LLM08:2025 Vector & Embedding Weaknesses, helping make sure that only authorized users can access the relevant data.

Summary

In this post, we discussed how to classify your generative AI application from a security shared responsibility perspective using the AWS Generative AI Security Scoping Matrix. We reviewed a common generative AI assistant application architecture and assessed its security posture using the OWASP Top 10 for LLMs framework, and showed how to apply the OWASP Top 10 for LLMs threat mitigations using AWS service controls and services to strengthen the architecture of your generative AI assistant application. Learn more about building generative AI applications with AWS Workshops for Bedrock.


About the Authors

Syed Jaffry is a Principal Solutions Architect with AWS. He advises software companies on AI and helps them build modern, robust and secure application architectures on AWS.

Amit Kumar Agrawal is a Senior Solutions Architect at AWS where he has spent over 5 years working with large ISV customers. He helps organizations build and operate cost-efficient and scalable solutions in the cloud, driving their business and technical outcomes.

Tej Nagabhatla is a Senior Solutions Architect at AWS, where he works with a diverse portfolio of clients ranging from ISVs to large enterprises. He specializes in providing architectural guidance across a wide range of topics around AI/ML, security, storage, containers, and serverless technologies. He helps organizations build and operate cost-efficient, scalable cloud applications. In his free time, Tej enjoys music, playing basketball, and traveling.
