Asure’s approach to enhancing their call center experience using generative AI and Amazon Q in QuickSight

Asure, a company of over 600 employees, is a leading provider of cloud-based workforce management solutions designed to help small and midsized businesses streamline payroll and human resources (HR) operations and ensure compliance. Their offerings include a comprehensive suite of human capital management (HCM) solutions for payroll and tax, HR compliance services, time tracking, 401(k) plans, and more.

Asure anticipated that generative AI could aid contact center leaders to understand their team’s support performance, identify gaps and pain points in their products, and recognize the most effective strategies for training customer support representatives using call transcripts. The Asure team was manually analyzing thousands of call transcripts to uncover themes and trends, a process that lacked scalability. The overarching goal of this engagement was to improve upon this manual approach. Failing to adopt a more automated approach could have potentially led to decreased customer satisfaction scores and, consequently, a loss in future revenue. Therefore, it was valuable to provide Asure a post-call analytics pipeline capable of providing beneficial insights, thereby enhancing the overall customer support experience and driving business growth.

Asure recognized the potential of generative AI to further enhance the user experience and better understand the needs of the customer and wanted to find a partner to help realize it.

“We are thrilled to partner with AWS on this groundbreaking generative AI project. The robust AWS infrastructure and advanced AI capabilities provide the perfect foundation for us to innovate and push the boundaries of what’s possible. This collaboration will enable us to deliver cutting-edge solutions that not only meet but exceed our customers’ expectations. Together, we are poised to transform the landscape of AI-driven technology and create unprecedented value for our clients.”

—Yasmine Rodriguez, CTO of Asure.

“As we embarked on our journey at Asure to integrate generative AI into our solutions, finding the right partner was crucial. Being able to partner with the Gen AI Innovation Center at AWS brings not only technical expertise with AI but the experience of developing solutions at scale. This collaboration confirms that our AI solutions are not just innovative but also resilient. Together, we believe that we can harness the power of AI to drive efficiency, enhance customer experiences, and stay ahead in a rapidly evolving market.”

—John Canada, VP of Engineering at Asure.

In this post, we explore how Asure used the Amazon Web Services (AWS) post-call analytics (PCA) pipeline, augmented with generative AI-powered services such as Amazon Bedrock and Amazon Q in QuickSight, to generate insights across call centers at scale. Asure chose this approach because it provided in-depth consumer analytics, categorized call transcripts around common themes, and empowered contact center leaders to answer queries in natural language. This ultimately allowed Asure to provide its customers with improvements in product and customer experiences.

Solution Overview

At a high level, the solution consists of first converting audio into transcripts using Amazon Transcribe and generating and evaluating summary fields for each transcript using Amazon Bedrock. In addition, Q&A can be done at a single call level using Amazon Bedrock or for many calls using Amazon Q in QuickSight. In the rest of this section, we describe these components and the services used in greater detail.

We built on the existing PCA solution by adding two services, described in the sections that follow: Amazon Bedrock and Amazon Q in QuickSight.

Customer service and call center operations are highly dynamic, with evolving customer expectations, market trends, and technological advancements reshaping the industry at a rapid pace. Staying ahead in this competitive landscape demands agile, scalable, and intelligent solutions that can adapt to changing demands.

In this context, Amazon Bedrock emerges as an exceptional choice for developing a generative AI-powered solution to analyze customer service call transcripts. This fully managed service provides access to cutting-edge foundation models (FMs) from leading AI providers, enabling the seamless integration of state-of-the-art language models tailored for text analysis tasks. Amazon Bedrock offers fine-tuning capabilities that allow you to customize these pre-trained models using proprietary call transcript data, facilitating high accuracy and relevance without the need for extensive machine learning (ML) expertise. Moreover, Amazon Bedrock offers integration with other AWS services like Amazon SageMaker, which streamlines the deployment process, and its scalable architecture makes sure the solution can adapt to increasing call volumes effortlessly.

With robust security measures, data privacy safeguards, and a cost-effective pay-as-you-go model, Amazon Bedrock offers a secure, flexible, and cost-efficient service to harness generative AI’s potential in enhancing customer service analytics, ultimately leading to improved customer experiences and operational efficiencies.

Furthermore, by integrating a knowledge base containing organizational data, policies, and domain-specific information, the generative AI models can deliver more contextual, accurate, and relevant insights from the call transcripts. This knowledge base allows the models to understand and respond based on the company’s unique terminology, products, and processes, enabling deeper analysis and more actionable intelligence from customer interactions.

In this use case, Amazon Bedrock is used for both generation of summary fields for sample call transcripts and evaluation of these summary fields against a ground truth dataset. Its value comes from its simple integration into existing pipelines and various evaluation frameworks. Amazon Bedrock also allows you to choose various models for different use cases, making it an obvious choice for the solution due to its flexibility. Using Amazon Bedrock allows for iteration of the solution using knowledge bases for simple storage and access of call transcripts as well as guardrails for building responsible AI applications.

Amazon Bedrock

Amazon Bedrock is a fully managed service that makes FMs available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using AWS tools without having to manage the infrastructure.
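
For readers who want to see what this looks like in practice, the following is a minimal sketch (not Asure’s production code) of summarizing a call transcript with an FM on Amazon Bedrock through the Converse API. The model ID, prompt, and sample transcript are illustrative assumptions:

import boto3

# Minimal sketch: summarize a call transcript with an FM on Amazon Bedrock via the Converse API.
# The model ID, prompt, and sample transcript are illustrative assumptions.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

transcript = "Agent: Thanks for calling support. Customer: My payroll run failed yesterday. ..."

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": f"Summarize this customer support call in two sentences:\n{transcript}"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0},
)

print(response["output"]["message"]["content"][0]["text"])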

Amazon Q in QuickSight

Amazon Q in QuickSight is a generative AI assistant that accelerates decision-making and enhances business productivity with generative business intelligence (BI) capabilities.

The solution consisted of the following components:

  • Call metadata generation – After the file ingestion step, when transcripts are generated for each call using Amazon Transcribe, Anthropic’s Claude 3 Haiku in Amazon Bedrock is used to generate call-related metadata. This includes a summary, the category, the root cause, and other high-level fields generated from a call transcript. This is orchestrated using AWS Step Functions.
  • Individual call Q&A – For questions about a specific call, such as “How did the customer react in call ID X?”, Anthropic’s Claude 3 Haiku is used to power a Q&A assistant hosted in a CloudFront application. This is powered by the web app portion of the architecture diagram (provided in the next section).
  • Aggregate call Q&A – To answer questions that span multiple calls, such as “What are the most common issues detected?”, Amazon Q in QuickSight is used to enhance the Agent Assist interface. This step is shown by business analysts interacting with QuickSight in the storage and visualization step through natural language.

To learn more about the architectural components of the PCA solution, including file ingestion, insight extraction, storage and visualization, and web application components, refer to Post call analytics for your contact center with Amazon language AI services.

Architecture

The following diagram illustrates the solution architecture. The evaluation framework, call metadata generation, and Amazon Q in QuickSight were new components added on top of the original PCA solution.

Architecture Diagram for Asure

Ragas and a human-in-the-loop UI (as described in the customer blog post with Tealium) were used to evaluate the metadata generation and individual call Q&A portions. Ragas is an open source evaluation framework that helps evaluate FM-generated text.

The high-level takeaways from this work are the following:

  • Anthropic’s Claude 3 Haiku successfully took in a call transcript and determined its summary, root cause, whether the issue was resolved, whether the call was a callback, and the next steps for the customer and agent (the generative AI-powered fields). Using Anthropic’s Claude 3 Haiku instead of Anthropic’s Claude Instant reduced latency. With chain-of-thought reasoning, there was an increase in overall quality (which covers how factual, understandable, and relevant responses are on a 1–5 scale, described in more detail later in this post) as measured by subject matter experts (SMEs). With Amazon Bedrock, various models can be chosen based on different use cases, illustrating its flexibility in this application.
  • Amazon Q in QuickSight proved to be a powerful analytical tool in understanding and generating relevant insights from data through intuitive chart and table visualizations. It can perform simple calculations whenever necessary while also facilitating deep dives into issues and exploring data from multiple perspectives, demonstrating great value in insight generation.
  • The human-in-the-loop UI plus Ragas metrics proved effective for evaluating the outputs of FMs used throughout the pipeline. In particular, answer correctness, answer relevance, faithfulness, and summarization metrics (alignment and coverage score) were used to evaluate the call metadata generation and individual call Q&A components using Amazon Bedrock. Its flexibility across FMs allowed many types of models to be tested for generating evaluation metrics, including Anthropic’s Claude 3.5 Sonnet and Anthropic’s Claude 3 Haiku.

Call metadata generation

The call metadata generation pipeline consisted of converting an audio file to a call transcript in a JSON format using Amazon Transcribe and then generating key information for each transcript using Amazon Bedrock and Amazon Comprehend. The following diagram shows a subset of the preceding architecture diagram that demonstrates this.

Mini Arch Diagram

The original PCA post linked previously shows how Amazon Transcribe and Amazon Comprehend are used in the metadata generation pipeline.
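
As a point of reference, the following is a hedged sketch of how an ingested call recording might be submitted to Amazon Transcribe with boto3; the bucket names, job name, and settings are illustrative assumptions rather than the solution’s actual configuration:

import boto3

# Illustrative sketch: start a transcription job for an ingested call recording.
# Bucket names, keys, and settings are assumptions for demonstration only.
transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="call-12345",
    Media={"MediaFileUri": "s3://example-ingest-bucket/calls/call-12345.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
    OutputBucketName="example-transcripts-bucket",
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
)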

The call transcript input that was outputted from the Amazon Transcribe step of the Step Functions workflow followed the format in the following code example:

{
    "call_id": "<call id>",
    "agent_id": "<agent id>",
    "customer_id": "<customer id>",
    "transcript": """
        Agent: <Agent message>
        Customer: <Customer message>
        Agent: <Agent message>
        Customer: <Customer message>
        Agent: <Agent message>
        Customer: <Customer message>
        ...
    """
}

Metadata was generated using Amazon Bedrock. Specifically, it extracted the summary, root cause, topic, and next steps, and answered key questions such as whether the call was a callback and whether the issue was ultimately resolved.

Prompts were stored in Amazon DynamoDB, allowing Asure to quickly modify prompts or add new generative AI-powered fields based on future enhancements. The following screenshot shows how prompts can be modified through DynamoDB.

Full DynamoDB Prompts
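
To illustrate the pattern, the following is a hedged sketch of how a prompt template might be read from DynamoDB at runtime, filled with a transcript, and sent to Amazon Bedrock. The table name, key schema, attribute names, and model ID are assumptions for demonstration only:

import boto3

# Illustrative sketch: fetch a prompt template from DynamoDB and fill in the transcript.
# The table name, key, attribute names, and model ID are assumed for demonstration.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("pca-prompt-templates")

transcript_text = "Agent: Thanks for calling. Customer: My payroll run failed. ..."

item = table.get_item(Key={"prompt_id": "call_summary"})["Item"]
prompt = item["template"].format(transcript=transcript_text)

bedrock = boto3.client("bedrock-runtime")
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(response["output"]["message"]["content"][0]["text"])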

Individual call Q&A

The chat assistant powered by Anthropic’s Claude 3 Haiku was used to answer natural language queries on a single transcript. This assistant, the call metadata values generated in the previous section, and sentiments generated by Amazon Comprehend were displayed in an application hosted on CloudFront.

The user of the final chat assistant can modify the prompt in DynamoDB. The following screenshot shows the general prompt for an individual call Q&A.

DynamoDB Prompt for Chat

The UI hosted by CloudFront allows an agent or supervisor to analyze a specific call to extract additional details. The following screenshot shows the insights Asure gleaned for a sample customer service call.

Img of UI with Call Stats

The following screenshot shows the chat assistant, which exists in the same webpage.

Evaluation Framework

This section outlines the components of the evaluation framework. The framework allows Asure to highlight the metrics that matter most for their use case and provides visibility into the generative AI application’s strengths and weaknesses. Evaluation was done using automated quantitative metrics provided by Ragas and DeepEval, traditional ML metrics, and human-in-the-loop evaluation by SMEs.

Quantitative Metrics

The results of the generated call metadata and individual call Q&A were evaluated using quantitative metrics provided by Ragas (answer correctness, answer relevance, and faithfulness) and DeepEval (alignment and coverage), both powered by FMs from Amazon Bedrock. The simple integration of Amazon Bedrock with external libraries allowed it to be plugged into these existing evaluation libraries. In addition, traditional ML metrics were used for “Yes/No” answers. The following are the metrics used for different components of the solution:

  • Call metadata generation – This included the following:
    • Summary – Alignment and coverage (find a description of these metrics in the DeepEval repository) and answer correctness
    • Issue resolved, callback – F1-score and accuracy
    • Topic, next steps, root cause – Answer correctness, answer relevance, and faithfulness
  • Individual call Q&A – Answer correctness, answer relevance, and faithfulness
  • Human in the loop – Both individual call Q&A and call metadata generation used human-in-the-loop metrics

For a description of answer correctness, answer relevance, and faithfulness, refer to the customer blog post with Tealium.

The use of Amazon Bedrock in the evaluation framework allowed for flexibility in choosing different models for different use cases. For example, Anthropic’s Claude 3.5 Sonnet was used to generate DeepEval metrics, whereas Anthropic’s Claude 3 Haiku (with its low latency) was ideal for Ragas.
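
To illustrate how Amazon Bedrock can power these evaluation libraries, the following is a minimal sketch of scoring a single Q&A pair with Ragas using Bedrock-hosted judge and embedding models. The exact wrapper classes and argument names vary across Ragas and LangChain versions, and the model IDs, sample data, and column names are illustrative assumptions:

from datasets import Dataset
from langchain_aws import ChatBedrock, BedrockEmbeddings
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, faithfulness

# Bedrock-hosted judge and embedding models (illustrative model IDs).
judge = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0")
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

# A single illustrative Q&A sample; in practice these rows come from the PCA pipeline.
samples = Dataset.from_dict({
    "question": ["Was the customer's payroll issue resolved on this call?"],
    "answer": ["Yes, the agent corrected the tax filing and the customer confirmed resolution."],
    "contexts": [["Agent: I have resubmitted the corrected filing. Customer: Great, that fixes it."]],
    "ground_truth": ["Yes, the issue was resolved during the call."],
})

scores = evaluate(
    samples,
    metrics=[answer_correctness, answer_relevancy, faithfulness],
    llm=judge,
    embeddings=embeddings,
)
print(scores)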

Human in the Loop

The human-in-the-loop UI is described in the Human-in-the-Loop section of the customer blog post with Tealium. To use it to evaluate this solution, the following changes were made:

  • The user can choose to analyze either one of the generated metadata fields for a call (such as a summary) or a specific Q&A pair.
  • The user can bring in two model outputs for comparison. These can be outputs from the same FM using different prompts, outputs from different FMs using the same prompt, or outputs from different FMs using different prompts.
  • Additional checks for fluency, coherence, creativity, toxicity, relevance, completeness, and overall quality were added, where the user rates the model output for each metric on a scale of 0–4.

The following screenshots show the UI.

Human in the Loop UI Home Screen

Human in the Loop UI Metrics

The human-in-the-loop system establishes a feedback loop between domain experts and Amazon Bedrock outputs. This in turn leads to improved generative AI applications and, ultimately, higher customer trust in such systems.

To demo the human-in-the-loop UI, follow the instructions in the GitHub repo.

Natural Language Q&A using Amazon Q in QuickSight

QuickSight, integrated with Amazon Q, enabled Asure to use natural language queries for comprehensive customer analytics. By interpreting queries on sentiments, call volumes, issue resolutions, and agent performance, the service delivered data-driven visualizations. This empowered Asure to quickly identify pain points, optimize operations, and deliver exceptional customer experiences through a streamlined, scalable analytics solution tailored for call center operations.

Integrate Amazon Q in QuickSight with the PCA solution

The Amazon Q in QuickSight integration was done by following three high-level steps:

  1. Create a dataset on QuickSight.
  2. Create a topic on QuickSight from the dataset.
  3. Query using natural language.

Create a dataset on QuickSight

We used Athena as the data source, which queries data from Amazon S3. QuickSight can be configured through multiple data sources (for more information, refer to Supported data sources). For this use case, we used the data generated from the PCA pipeline as the data source for further analytics and natural language queries in Amazon Q in QuickSight. The PCA pipeline stores data in Amazon S3, which can be queried in Athena, an interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL.
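
Before connecting QuickSight, it can help to sanity check the Athena table with a quick query. The following is a hedged sketch; the database, table, column names, and results location are assumptions rather than the pipeline’s actual schema:

import boto3

# Illustrative sketch: query PCA output in Amazon S3 through Athena.
# Database, table, column names, and the results bucket are assumptions.
athena = boto3.client("athena")

query = """
    SELECT call_id, summary, root_cause, issue_resolved
    FROM pca_results.call_metadata
    LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "pca_results"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(execution["QueryExecutionId"])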

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose Create new.
  3. Choose Athena as the data source and input the particular catalog, database, and table that Amazon Q in QuickSight will reference.

Confirm the dataset was created successfully and proceed to the next step.

Quicksight Add Dataset

Create a topic on Amazon QuickSight from the dataset

Users can use topics in QuickSight, powered by Amazon Q integration, to perform natural language queries on their data. This feature allows for intuitive data exploration and analysis by posing questions in plain language, alleviating the need for complex SQL queries or specialized technical skills. Before setting up a topic, make sure that the users have Pro level access. To set up a topic, follow these steps:

  1. On the QuickSight console, choose Topics in the navigation pane.
  2. Choose New topic.
  3. Enter a name for the topic and choose the data source created.
  4. Choose the created topic and then choose Open Q&A to start querying in natural language.

Query using natural language

We performed intuitive natural language queries to gain actionable insights into customer analytics. This capability allows users to analyze sentiments, call volumes, issue resolutions, and agent performance through conversational queries, enabling data-driven decision-making, operational optimization, and enhanced customer experiences within a scalable, call center-tailored analytics solution. Examples of the simple natural language queries “Which customer had positive sentiments and a complex query?” and “What are the most common issues and which agents dealt with them?” are shown in the following screenshots.

Quicksight Dashboard

Quicksight Dashboard and Statistics

These capabilities are helpful when business leaders want to dive deep on a particular issue, empowering them to make informed decisions.

Success metrics

The primary success metric of this solution is boosting employee productivity, achieved by quickly understanding customer interactions from calls to uncover themes and trends while also identifying gaps and pain points in Asure’s products. Before the engagement, analysts were taking 14 days to manually go through each call transcript to retrieve insights. After the engagement, Asure observed how Amazon Bedrock and Amazon Q in QuickSight could reduce this time to minutes, even seconds, to obtain both insights queried directly from all stored call transcripts and visualizations that can be used for report generation.

In the pipeline, Anthropic’s Claude 3 Haiku was used to obtain the initial call metadata fields (such as summary, root cause, next steps, and sentiments), which were stored in Amazon S3 and made queryable through Athena. This allowed each call transcript to be queried using natural language from Amazon Q in QuickSight, letting business analysts answer high-level questions about issues, themes, and customer and agent insights in seconds.

Pat Goepel, chairman and CEO of Asure, shares,

“In collaboration with the AWS Generative AI Innovation Center, we have improved upon a post-call analytics solution to help us identify and prioritize features that will be the most impactful for our customers. We are utilizing Amazon Bedrock, Amazon Comprehend, and Amazon Q in QuickSight to understand trends in our own customer interactions, prioritize items for product development, and detect issues sooner so that we can be even more proactive in our support for our customers. Our partnership with AWS and our commitment to be early adopters of innovative technologies like Amazon Bedrock underscore our dedication to making advanced HCM technology accessible for businesses of any size.”

Takeaways

We had the following takeaways:

  • Enabling chain-of-thought reasoning and tailoring the assistant prompt for each generated field in the call metadata generation component, then invoking Anthropic’s Claude 3 Haiku, improved metadata generation for each transcript. The flexibility of Amazon Bedrock in supporting various FMs allowed full experimentation with many types of models with minimal changes, making it a natural choice for this application.
  • Ragas metrics, particularly faithfulness, answer correctness, and answer relevance, were used to evaluate call metadata generation and individual Q&A. However, summarization required different metrics, alignment and coverage, which don’t require ground truth summaries, so DeepEval was used to calculate the summarization metrics. Overall, the ease of integrating Amazon Bedrock allowed it to power the calculation of quantitative metrics with minimal changes to the evaluation libraries. This also allowed the use of different types of models for different evaluation libraries.
  • The human-in-the-loop approach can be used by SMEs to further evaluate Amazon Bedrock outputs. There is an opportunity to improve upon an Amazon Bedrock FM based on this feedback, but this was not worked on in this engagement.
  • The post-call analytics workflow, with the use of Amazon Bedrock, can be iterated upon in the future using features such as Amazon Bedrock Knowledge Bases to perform Q&A over a specific number of call transcripts as well as Amazon Bedrock Guardrails to detect harmful and hallucinated responses while also creating more responsible AI applications.
  • Amazon Q in QuickSight was able to answer natural language questions on customer analytics, root cause, and agent analytics, but some questions required reframing to get meaningful responses.
  • Data fields within Amazon Q in QuickSight needed to be defined properly and synonyms needed to be added to make Amazon Q more robust with natural language queries.

Security best practices

We recommend reviewing and applying the security best practices for each AWS service used in this solution, including Amazon Bedrock, Amazon QuickSight, Amazon Transcribe, Amazon Comprehend, AWS Step Functions, Amazon DynamoDB, Amazon S3, and Amazon CloudFront, such as granting least-privilege IAM permissions and encrypting data in transit and at rest.

Conclusion

In this post, we showcased how Asure used the PCA solution powered by Amazon Bedrock and Amazon Q in QuickSight to generate consumer and agent insights both at individual and aggregate levels. Specific insights included those centered around a common theme or issue. With these services, Asure was able to improve employee productivity to generate these insights in minutes instead of weeks.

This is one of the many ways builders can deliver great solutions using Amazon Bedrock and Amazon Q in QuickSight. To learn more, refer to Amazon Bedrock and Amazon Q in QuickSight.


About the Authors

Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using large language models, primarily through Amazon Bedrock and other AWS Cloud services.

Avinash Yadav is a Deep Learning Architect at the Generative AI Innovation Center, where he designs and implements cutting-edge GenAI solutions for diverse enterprise needs. He specializes in building ML pipelines using large language models, with expertise in cloud architecture, Infrastructure as Code (IaC), and automation. His focus lies in creating scalable, end-to-end applications that leverage the power of deep learning and cloud technologies.

John Canada is the VP of Engineering at Asure Software, where he leverages his experience in building innovative, reliable, and performant solutions and his passion for AI/ML to lead a talented team dedicated to using Machine Learning to enhance the capabilities of Asure’s software and meet the evolving needs of businesses.

Yasmine Rodriguez Wakim is the Chief Technology Officer at Asure Software. She is an innovative Software Architect & Product Leader with deep expertise in creating payroll, tax, and workforce software development. As a results-driven tech strategist, she builds and leads technology vision to deliver efficient, reliable, and customer-centric software that optimizes business operations through automation.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More

Unleashing the multimodal power of Amazon Bedrock Data Automation to transform unstructured data into actionable insights

Gartner predicts that “by 2027, 40% of generative AI solutions will be multimodal (text, image, audio and video), up from 1% in 2023.”

The McKinsey 2023 State of AI Report identifies data management as a major obstacle to AI adoption and scaling. Enterprises generate massive volumes of unstructured data, from legal contracts to customer interactions, yet extracting meaningful insights remains a challenge. Traditionally, transforming raw data into actionable intelligence has demanded significant engineering effort. It often requires managing multiple machine learning (ML) models, designing complex workflows, and integrating diverse data sources into production-ready formats.

The result is expensive, brittle workflows that demand constant maintenance and engineering resources. In a world where—according to Gartner—over 80% of enterprise data is unstructured, enterprises need a better way to extract meaningful information to fuel innovation.

Today, we’re excited to announce the general availability of Amazon Bedrock Data Automation, a powerful, fully managed feature within Amazon Bedrock that automates the generation of useful insights from unstructured multimodal content such as documents, images, audio, and video for your AI-powered applications. It enables organizations to extract valuable information from multimodal content, unlocking the full potential of their data without requiring deep AI expertise or managing complex multimodal ML pipelines. With Amazon Bedrock Data Automation, enterprises can accelerate AI adoption and develop solutions that are secure, scalable, and responsible.

The benefits of using Amazon Bedrock Data Automation

Amazon Bedrock Data Automation provides a single, unified API that automates the processing of unstructured multimodal content, minimizing the complexity of orchestrating multiple models, fine-tuning prompts, and stitching outputs together. It delivers high accuracy while significantly lowering processing costs.

Built with responsible AI, Amazon Bedrock Data Automation enhances transparency with visual grounding and confidence scores, allowing outputs to be validated before integration into mission-critical workflows. It adheres to enterprise-grade security and compliance standards, enabling you to deploy AI solutions with confidence. It also enables you to define when data should be extracted as-is and when it should be inferred, giving complete control over the process.

Cross-Region inference enables seamless management of unplanned traffic bursts by using compute across different AWS Regions. Amazon Bedrock Data Automation optimizes for available AWS Regional capacity by automatically routing requests across Regions within the same geographic area to maximize throughput at no additional cost. For example, a request made in the US stays within Regions in the US. Amazon Bedrock Data Automation is currently available in the US West (Oregon) and US East (N. Virginia) AWS Regions, helping to ensure seamless request routing and enhanced reliability. Amazon Bedrock Data Automation is expanding to additional Regions, so be sure to check the documentation for the latest updates.

Amazon Bedrock Data Automation offers transparent and predictable pricing based on the modality of the processed content and the type of output used (standard vs. custom output). You pay according to the number of pages, quantity of images, and duration of audio and video files. This straightforward pricing model makes cost calculation easier compared to a token-based pricing model.

Use cases for Amazon Bedrock Data Automation

Key use cases such as intelligent document processing, media asset analysis and monetization, speech analytics, search and discovery, and agent-driven operations highlight how Amazon Bedrock Data Automation enhances innovation, efficiency, and data-driven decision-making across industries.

Intelligent document processing

According to Fortune Business Insights, the intelligent document processing (IDP) industry is projected to grow from USD 10.57 billion in 2025 to USD 66.68 billion by 2032, at a CAGR of 30.1%. IDP is powering critical workflows across industries and enabling businesses to scale with speed and accuracy. Financial institutions use IDP to automate tax forms and fraud detection, while healthcare providers streamline claims processing and medical record digitization. Legal teams accelerate contract analysis and compliance reviews, and in oil and gas, IDP enhances safety reporting. Manufacturers and retailers optimize supply chain and invoice processing, helping to ensure seamless operations. In the public sector, IDP improves citizen services, legislative document management, and compliance tracking. As businesses strive for greater automation, IDP is no longer an option, it’s a necessity for cost reduction, operational efficiency, and data-driven decision-making.

Let’s explore a real-world use case showcasing how Amazon Bedrock Data Automation enhances efficiency in loan processing.

Loan processing is a complex, multi-step process that involves document verification, credit assessments, policy compliance checks, and approval workflows, requiring precision and efficiency at every stage. Loan processing with traditional AWS AI services is shown in the following figure.

As shown in the preceding figure, loan processing is a multi-step workflow that involves handling diverse document types, managing model outputs, and stitching results across multiple services. Traditionally, documents from portals, email, or scans are stored in Amazon Simple Storage Service (Amazon S3), requiring custom logic to split multi-document packages. Next, Amazon Comprehend or custom classifiers categorize them into types such as W2s, bank statements, and closing disclosures, while Amazon Textract extracts key details. Additional processing is needed to standardize formats, manage JSON outputs, and align data fields, often requiring manual integration and multiple API calls. In some cases, foundation models (FMs) generate document summaries, adding further complexity. Additionally, human-in-the-loop verification may be required for low-threshold outputs.

With Amazon Bedrock Data Automation, this entire process is now simplified into a single unified API call. It automates document classification, data extraction, validation, and structuring, removing the need for manual stitching, API orchestration, and custom integration efforts, significantly reducing complexity and accelerating loan processing workflows as shown in the following figure.

As shown in the preceding figure, when using Amazon Bedrock Data Automation, loan packages from third-party systems, portals, email, or scanned documents are stored in Amazon S3, where Amazon Bedrock Data Automation automates document splitting and processing, removing the need for custom logic. After the loan packages are ingested, Amazon Bedrock Data Automation classifies documents such as W2s, bank statements, and closing disclosures in a single step, alleviating the need for separate classifier model calls. Amazon Bedrock Data Automation then extracts key information based on the customer requirements, capturing critical details such as employer information from W2s, transaction history from bank statements, and loan terms from closing disclosures.

Unlike traditional workflows that require manual data normalization, Amazon Bedrock Data Automation automatically standardizes extracted data, helping to ensure consistent date formats, currency values, and field names without additional processing based on the customer provided output schema. Moreover, Amazon Bedrock Data Automation enhances compliance and accuracy by providing summarized outputs, bounding boxes for extracted fields, and confidence scores, delivering structured, validated, and ready-to-use data for downstream applications with minimal effort.

In summary, Amazon Bedrock Data Automation enables financial institutions to seamlessly process loan documents from ingestion to final output through a single unified API call, eliminating the need for multiple independent steps.
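
To make this concrete, the following is a hedged sketch of what such an invocation might look like with boto3. The project ARN, profile ARN, and bucket locations are placeholders, and the exact client, operation, and required parameters should be verified against the current API reference for your SDK version:

import boto3

# Hedged sketch: invoke Amazon Bedrock Data Automation on a loan package stored in S3.
# ARNs and bucket locations are placeholders; required parameters may vary by SDK version.
bda = boto3.client("bedrock-data-automation-runtime")

response = bda.invoke_data_automation_async(
    inputConfiguration={"s3Uri": "s3://example-loans-bucket/packages/loan-001.pdf"},
    outputConfiguration={"s3Uri": "s3://example-loans-bucket/bda-output/"},
    dataAutomationConfiguration={
        "dataAutomationProjectArn": "arn:aws:bedrock:us-west-2:111122223333:data-automation-project/EXAMPLE",
        "stage": "LIVE",
    },
    dataAutomationProfileArn="arn:aws:bedrock:us-west-2:111122223333:data-automation-profile/us.data-automation-v1",
)

# The call is asynchronous; poll get_data_automation_status() with the returned ARN
# and read the structured results from the configured S3 output location.
print(response["invocationArn"])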

While this example highlights financial services, the same principles apply across industries to streamline complex document processing workflows. Built for scale, security, and transparency, Amazon Bedrock Data Automation adheres to enterprise-grade compliance standards, providing robust data protection. With visual grounding, confidence scores, and seamless integration into knowledge bases, it powers Retrieval Augmented Generation (RAG)-driven document retrieval and enables production-ready AI workflows to be deployed in days, not months.

It also offers flexibility in data extraction by supporting both explicit and implicit extractions. Explicit extraction is used for clearly stated information, such as names, dates, or specific values, while implicit extraction infers insights that aren’t directly stated but can be derived through context and reasoning. This ability to toggle between extraction types enables more comprehensive and nuanced data processing across various document types.

This is achieved through responsible AI, with Amazon Bedrock Data Automation passing every process through a responsible AI model to help ensure fairness, accuracy, and compliance in document automation.

By automating document classification, extraction, and normalization, it not only accelerates document processing, it also enhances downstream applications, such as knowledge management and intelligent search. With structured, validated data readily available, organizations can unlock deeper insights and improve decision-making.

This seamless integration extends to efficient document search and retrieval, transforming business operations by enabling quick access to critical information across vast repositories. By converting unstructured document collections into searchable knowledge bases, organizations can seamlessly find, analyze, and use their data. This is particularly valuable for industries handling large document volumes, where rapid access to specific information is crucial. Legal teams can efficiently search through case files, healthcare providers can retrieve patient histories and research papers, and government agencies can manage legislative records and policy documents. Powered by Amazon Bedrock Data Automation and Amazon Bedrock Knowledge Bases, this integration streamlines investment research, regulatory filings, clinical protocols, and public sector record management, significantly improving efficiency across industries.

The following figure shows how Amazon Bedrock Data Automation seamlessly integrates with Amazon Bedrock Knowledge Bases to extract insights from unstructured datasets and ingest them into a vector database for efficient retrieval. This integration enables organizations to unlock valuable knowledge from their data, making it accessible for downstream applications. By using these structured insights, businesses can build generative AI applications, such as assistants that dynamically answer questions and provide context-aware responses based on the extracted information. This approach enhances knowledge retrieval, accelerates decision-making, and enables more intelligent, AI-driven interactions.

The preceding architecture diagram showcases a pipeline for processing and retrieving insights from multimodal content using Amazon Bedrock Data Automation and Amazon Bedrock Knowledge Bases. Unstructured data, such as documents, images, videos, and audio, is first ingested into an Amazon S3 bucket. Amazon Bedrock Data Automation then processes this content, extracting key insights and transforming it for further use. The processed data is stored in Amazon Bedrock Knowledge Bases, where an embedding model converts it into vector representations, which are then stored in a vector database for efficient semantic search. Amazon API Gateway (WebSocket API) facilitates real-time interactions, enabling users to query the knowledge base dynamically via a chatbot or other interfaces. This architecture enhances automated data processing, efficient retrieval, and seamless real-time access to insights.
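
To give a sense of the retrieval side, the following is a hedged sketch of querying an Amazon Bedrock knowledge base that has been populated with extracted insights; the knowledge base ID, model ARN, and question are placeholders:

import boto3

# Hedged sketch: ask a natural language question against a knowledge base
# populated with insights extracted by Amazon Bedrock Data Automation.
# The knowledge base ID and model ARN are placeholders.
agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "Summarize the key terms of the example master service agreement."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "EXAMPLEKBID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])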

Beyond intelligent search and retrieval, Amazon Bedrock Data Automation enables organizations to automate complex decision-making processes, providing greater accuracy and compliance in document-driven workflows. By using structured data, businesses can move beyond simple document processing to intelligent, policy-aware automation.

Amazon Bedrock Data Automation can also be used with Amazon Bedrock Agents to take the next step in automation. Going beyond traditional IDP, this approach enables autonomous workflows that assist knowledge workers and streamline decision-making. For example, in insurance claims processing, agents validate claims against policy documents; while in loan processing, they assess mortgage applications against underwriting policies. With multi-agent workflows, policy validation, automated decision support, and document generation, this approach enhances efficiency, accuracy, and compliance across industries.

Similarly, Amazon Bedrock Data Automation is simplifying media and entertainment use cases, seamlessly integrating workflows through its unified API. Let’s take a closer look at how it’s driving this transformation.

Media asset analysis and monetization

Companies in media and entertainment (M&E), advertising, gaming, and education own vast digital assets, such as videos, images, and audio files, and require efficient ways to analyze them. Gaining insights from these assets enables better indexing, deeper analysis, and supports monetization and compliance efforts.

The image and video modalities of Amazon Bedrock Data Automation provide advanced features for efficient extraction and analysis.

  • Image modality: Supports image summarization, IAB taxonomy, and content moderation. It also includes text detection and logo detection with bounding boxes and confidence scores. Additionally, it enables customizable analysis via blueprints for use cases like scene classification.
  • Video modality: Automates video analysis workflows, chapter segmentation, and both visual and audio processing. It generates full video summaries, chapter summaries, IAB taxonomy, text detection, visual and audio moderation, logo detection, and audio transcripts.

The customized approach to extracting and analyzing video content involves a sophisticated process that gathers information from both the visual and audio components of the video, making it complex to build and manage.

As shown in the preceding figure, a customized video analysis pipeline involves sampling image frames from the visual portion of the video and applying both specialized and FMs to extract information, which is then aggregated at the shot level. It also transcribes the audio into text and combines both visual and audio data for chapter level analysis. Additionally, large language model (LLM)-based analysis is applied to derive further insights, such as video summaries and classifications. Finally, the data is stored in a database for downstream applications to consume.

Media video analysis with Amazon Bedrock Data Automation now simplifies this workflow into a single unified API call, minimizing complexity and reducing integration effort, as shown in the following figure.

Customers can use Amazon Bedrock Data Automation to support popular media analysis use cases such as:

  • Digital asset management: In the M&E industry, digital asset management (DAM) refers to the organized storage, retrieval, and management of digital content such as videos, images, audio files, and metadata. With growing content libraries, media companies need efficient ways to categorize, search, and repurpose assets for production, distribution, and monetization.

Amazon Bedrock Data Automation automates video, image, and audio analysis, making DAM more scalable, efficient and intelligent.

  • Contextual ad placement: Contextual advertising enhances digital marketing by aligning ads with content, but implementing it for video on demand (VOD) is challenging. Traditional methods rely on manual tagging, making the process slow and unscalable.

Amazon Bedrock Data Automation automates content analysis across video, audio, and images, eliminating complex workflows. It extracts scene summaries, audio segments, and IAB taxonomies to power video ad solutions, improving contextual ad placement and ad campaign performance.

  • Compliance and moderation: Media compliance and moderation make sure that digital content adheres to legal, ethical, and environment-specific guidelines to protect users and maintain brand integrity. This is especially important in industries such as M&E, gaming, advertising, and social media, where large volumes of content need to be reviewed for harmful content, copyright violations, brand safety and regulatory compliance.

Amazon Bedrock Data Automation streamlines compliance by using AI-driven content moderation to analyze both the visual and audio components of media. This enables users to define and apply customized policies to evaluate content against their specific compliance requirements.

Intelligent speech analytics

Amazon Bedrock Data Automation is used in intelligent speech analytics to derive insights from audio data across multiple industries with speed and accuracy. Financial institutions rely on intelligent speech analytics to monitor call centers for compliance and detect potential fraud, while healthcare providers use it to capture patient interactions and optimize telehealth communications. In retail and hospitality, speech analytics drives customer engagement by uncovering insights from live feedback and recorded interactions. With the exponential growth of voice data, intelligent speech analytics is no longer a luxury—it’s a vital tool for reducing costs, improving efficiency, and driving smarter decision-making.

Customer service – AI-driven call analytics for better customer experience

Businesses can analyze call recordings at scale to gain actionable insights into customer sentiment, compliance, and service quality. Contact centers can use Amazon Bedrock Data Automation to:

  • Transcribe and summarize thousands of calls daily with speaker separation and key moment detection.
  • Extract sentiment insights and categorize customer complaints for proactive issue resolution.
  • Improve agent coaching by detecting compliance gaps and training needs.

A traditional call analytics approach is shown in the following figure.

Processing customer service call recordings involves multiple steps, from audio capture to advanced AI-driven analysis as highlighted below:

  • Audio capture and storage: Call recordings from customer service interactions are collected and stored across disparate systems (for example, multiple S3 buckets and call center service provider outputs). Each file might require custom handling because of varying formats and quality.
  • Multi-step processing: Multiple, separate AI and machine learning (AI/ML) services and models are needed for each processing stage:
    1. Transcription: Audio files are sent to a speech-to-text ML model, such as Amazon Transcribe, to generate transcripts of the audio segments.
    2. Call summary: A summary of the call, with the main issue description, action items, and outcomes, is generated using either Amazon Transcribe Call Analytics or other generative AI FMs.
    3. Speaker diarization and identification: Determining who spoke when involves Amazon Transcribe or similar third-party tools.
    4. Compliance analysis: Separate ML models must be orchestrated to detect compliance issues (such as identifying profanity or escalated emotions), implement personally identifiable information (PII) redaction, and flag critical moments. These analytics are implemented with either Amazon Comprehend or separate prompt engineering with FMs.
    5. Entity detection: Discovers entities referenced in the call using Amazon Comprehend, custom entity detection models, or configurable string matching.
    6. Audio metadata extraction: Extraction of file properties such as format, duration, and bit rate is handled by either Amazon Transcribe Call Analytics or another call center solution.
  • Fragmented workflows: The disparate nature of these processes leads to increased latency, higher integration complexity, and a greater risk of errors. Stitching of outputs is required to form a comprehensive view, complicating dashboard integration and decision-making.

Unified, API-driven speech analytics with Amazon Bedrock Data Automation

The following figure shows customer service call analytics using Amazon Bedrock Data Automation-powered intelligent speech analytics.

Optimizing customer service call analysis requires a seamless, automated pipeline that efficiently ingests, processes, and extracts insights from audio recordings as mentioned below:

  • Streamlined data capture and processing: A single, unified API call ingests call recordings directly from storage—regardless of the file format or source—automatically handling any necessary file splitting or pre-processing.
  • End-to-end automation: Intelligent speech analytics with Amazon Bedrock Data Automation now encapsulates the entire call analysis workflow:
    1. Comprehensive transcription: Generates turn-by-turn transcripts with speaker identification, providing a clear record of every interaction.
    2. Detailed call summary: Created using the generative AI capability of Amazon Bedrock Data Automation, the detailed call summary enables an operator to quickly gain insights from the files.
    3. Automated speaker diarization and identification: Seamlessly distinguishes between multiple speakers, accurately mapping out who spoke when.
    4. Compliance scoring: In one step, the system flags key compliance indicators (such as profanity, violence, or other content moderation metrics) to help ensure regulatory adherence.
    5. Rich audio metadata: Amazon Bedrock Data Automation automatically extracts detailed metadata—including format, duration, sample rate, channels, and bit rate—supporting further analytics and quality assurance.

By consolidating multiple steps into a single API call, customer service centers benefit from faster processing, reduced error rates, and significantly lower integration complexity. This streamlined approach enables real-time monitoring and proactive agent coaching, ultimately driving improved customer experience and operational agility.

Before the availability of Amazon Bedrock Data Automation for intelligent speech analytics, customer service call analysis was a fragmented, multi-step process that required juggling various tools and models. Now, with the unified API of Amazon Bedrock Data Automation, organizations can quickly transform raw voice data into actionable insights—cutting through complexity, reducing costs, and empowering teams to enhance service quality and compliance.

When to choose Amazon Bedrock Data Automation instead of traditional AI/ML services

You should choose Amazon Bedrock Data Automation when you need a simple, API-driven solution for multi-modal content processing without the complexity of managing and orchestrating across multiple models or prompt engineering. With a single API call, Amazon Bedrock Data Automation seamlessly handles asset splitting, classification, information extraction, visual grounding, and confidence scoring, eliminating the need for manual orchestration.

On the other hand, the core capabilities of Amazon Bedrock are ideal if you require full control over models and workflows to tailor solutions to your organization’s specific business needs. Developers can use Amazon Bedrock to select FMs based on price-performance, fine-tune prompt engineering for data extraction, train custom classification models, implement responsible AI guardrails, and build an orchestration pipeline to provide consistent output.

Amazon Bedrock Data Automation streamlines multi-modal processing, while Amazon Bedrock offers building blocks for deeper customization and control.

Conclusion

Amazon Bedrock Data Automation provides enterprises with scalability, security, and transparency, enabling seamless processing of unstructured data with confidence. Designed for rapid deployment, it helps developers transition from prototype to production in days, accelerating time-to-value while maintaining cost efficiency. Start using Amazon Bedrock Data Automation today and unlock the full potential of your unstructured data. For solution guidance, see Guidance for Multimodal Data Processing with Bedrock Data Automation.


About the Authors

Wrick Talukdar is a Tech Lead – Generative AI Specialist focused on Intelligent Document Processing. He leads machine learning initiatives and projects across business domains, leveraging multimodal AI, generative models, computer vision, and natural language processing. He speaks at conferences such as AWS re:Invent, IEEE, Consumer Technology Society (CTSoc), YouTube webinars, and other industry conferences like CERAWEEK and ADIPEC. In his free time, he enjoys writing and birding photography.

Lana Zhang is a Senior Solutions Architect on the AWS World Wide Specialist Organization AI Services team, specializing in AI and generative AI with a focus on use cases including content moderation and media analysis. With her expertise, she is dedicated to promoting AWS AI and generative AI solutions, demonstrating how generative AI can transform classic use cases with advanced business value. She assists customers in transforming their business solutions across diverse industries, including social media, gaming, e-commerce, media, advertising, and marketing.

Julia Hu is a Specialist Solutions Architect who helps AWS customers and partners build generative AI solutions using Amazon Q Business on AWS. Julia has over 4 years of experience developing solutions for customers adopting AWS services on the forefront of cloud technology.

Keith Mascarenhas leads worldwide GTM strategy for Generative AI at AWS, developing enterprise use cases and adoption frameworks for Amazon Bedrock. Prior to this, he drove AI/ML solutions and product growth at AWS, and held key roles in Business Development, Solution Consulting and Architecture across Analytics, CX and Information Security.

Read More

Tool choice with Amazon Nova models

In many generative AI applications, a large language model (LLM) like Amazon Nova is used to respond to a user query based on the model’s own knowledge or context that it is provided. However, as use cases have matured, the ability for a model to have access to tools or structures that would be inherently outside of the model’s frame of reference has become paramount. This could be APIs, code functions, or schemas and structures required by your end application. This capability has developed into what is referred to as tool use or function calling.

To add fine-grained control to how tools are used, we have released a feature for tool choice for Amazon Nova models. Instead of relying on prompt engineering, tool choice forces the model to adhere to the settings in place.

In this post, we discuss tool use and the new tool choice feature, with example use cases.

Tool use with Amazon Nova

To illustrate the concept of tool use, we can imagine a situation where we provide Amazon Nova access to a few different tools, such as a calculator or a weather API. Based on the user’s query, Amazon Nova will select the appropriate tool and tell you how to use it. For example, if a user asks “What is the weather in Seattle?” Amazon Nova will use the weather tool.

The following diagram illustrates an example workflow between an Amazon Nova model, its available tools, and related external resources.

At its core, tool use is the selection of a tool and its parameters; the responsibility to execute the external functionality is left to the application or developer. After the tool is executed by the application, you can return the results to the model to generate the final response.
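
To make this loop concrete, the following is a minimal sketch using the Amazon Bedrock Converse API; the get_weather function, tool schema, and model ID are illustrative assumptions rather than a prescribed implementation:

import boto3

# Minimal sketch of the tool-use loop with the Amazon Bedrock Converse API.
# The get_weather function, tool schema, and model ID are illustrative assumptions.
client = boto3.client("bedrock-runtime")
MODEL_ID = "us.amazon.nova-lite-v1:0"

def get_weather(city: str) -> dict:
    # Stub standing in for a real weather API call.
    return {"city": city, "forecast": "rainy", "temperature_f": 52}

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            }},
        }
    }]
}

messages = [{"role": "user", "content": [{"text": "What is the weather in Seattle?"}]}]
response = client.converse(modelId=MODEL_ID, messages=messages, toolConfig=tool_config)
messages.append(response["output"]["message"])

# If the model requested the tool, run it and send the result back for a final answer.
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        tool_use = block["toolUse"]
        result = get_weather(**tool_use["input"])
        messages.append({
            "role": "user",
            "content": [{"toolResult": {"toolUseId": tool_use["toolUseId"], "content": [{"json": result}]}}],
        })
        final = client.converse(modelId=MODEL_ID, messages=messages, toolConfig=tool_config)
        print(final["output"]["message"]["content"][0]["text"])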

Let’s explore some examples in more detail. The following diagram illustrates the workflow of an Amazon Nova model using a function call to access a weather API, and returning the response to the user.

The following diagram illustrates the workflow of an Amazon Nova model using a function call to access a calculator tool.

Tool choice with Amazon Nova

The toolChoice API parameter allows you to control when a tool is called. There are three supported options for this parameter:

  • Any – With tool choice Any, the model will select at least one of the available tools each time:
    {
       "toolChoice": {
            "any": {}
        }
    }

  • Tool – With tool choice Tool, the model will always use the requested tool:
    {
       "toolChoice": {
            "tool": {
                "name": "name_of_tool"
            }
        }
    }

  • Auto – Tool choice Auto is the default behavior and will leave the tool selection completely up to the model:
    {
       "toolChoice": {
            "auto": {}
        }
    }

A popular tactic to improve the reasoning capabilities of a model is to use chain of thought. When using the tool choice of auto, Amazon Nova will use chain of thought and the response of the model will include both the reasoning and the tool that was selected.

This behavior will differ depending on the use case. When tool or any is selected as the tool choice, Amazon Nova will output only the tool selection and not the chain of thought.

Use cases

In this section, we explore different use cases for tool choice.

Structured output/JSON mode

In certain scenarios, you might want Amazon Nova to use a specific tool to answer the user’s question, even if Amazon Nova believes it can provide a response without the use of a tool. A common use case for this approach is enforcing structured output/JSON mode. It’s often critical to have LLMs return structured output, because this enables downstream use cases to more effectively consume and process the generated outputs. In these instances, the tools employed don’t necessarily need to be client-side functions—they can be used whenever the model is required to return JSON output adhering to a predefined schema, thereby compelling Amazon Nova to use the specified tool.

When using tools for enforcing structured output, you provide a single tool with a descriptive JSON inputSchema. You specify the tool with {"tool" : {"name" : "Your tool name"}}. The model will pass the input to the tool, so the name of the tool and its description should be from the model’s perspective.

For example, consider a food website. When provided with a dish description, the website can extract the recipe details, such as cooking time, ingredients, dish name, and difficulty level, in order to facilitate user search and filtering capabilities. See the following example code:

import boto3
import json

tool_config = {
    "toolChoice": {
        "name": { "tool" : "extract_recipe"}
    },
    "tools": [
        {
            "toolSpec": {
                "name": "extract_recipe",
                "description": "Extract recipe for cooking instructions",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "recipe": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string",
                                        "description": "Name of the recipe"
                                    },
                                    "description": {
                                        "type": "string",
                                        "description": "Brief description of the dish"
                                    },
                                    "prep_time": {
                                        "type": "integer",
                                        "description": "Preparation time in minutes"
                                    },
                                    "cook_time": {
                                        "type": "integer",
                                        "description": "Cooking time in minutes"
                                    },
                                    "servings": {
                                        "type": "integer",
                                        "description": "Number of servings"
                                    },
                                    "difficulty": {
                                        "type": "string",
                                        "enum": ["easy", "medium", "hard"],
                                        "description": "Difficulty level of the recipe"
                                    },
                                    "ingredients": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "item": {
                                                    "type": "string",
                                                    "description": "Name of ingredient"
                                                },
                                                "amount": {
                                                    "type": "number",
                                                    "description": "Quantity of ingredient"
                                                },
                                                "unit": {
                                                    "type": "string",
                                                    "description": "Unit of measurement"
                                                }
                                            },
                                            "required": ["item", "amount", "unit"]
                                        }
                                    },
                                    "instructions": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Step-by-step cooking instructions"
                                        }
                                    },
                                    "tags": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Categories or labels for the recipe"
                                        }
                                    }
                                },
                                "required": ["name", "ingredients", "instructions"]
                            }
                        },
                        "required": ["recipe"]
                    }
                }
            }
        }
    ]
}

messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)
print(json.dumps(response['output']['message']['content'][0][], indent=2))

We can provide a detailed description of a dish as text input:

Legend has it that this decadent chocolate lava cake was born out of a baking mistake in New York's Any Kitchen back in 1987, when chef John Doe pulled a chocolate sponge cake out of the oven too early, only to discover that the dessert world would never be the same. Today I'm sharing my foolproof version, refined over countless dinner parties. Picture a delicate chocolate cake that, when pierced with a fork, releases a stream of warm, velvety chocolate sauce – it's pure theater at the table. While it looks like a restaurant-worthy masterpiece, the beauty lies in its simplicity: just six ingredients (good quality dark chocolate, unsalted butter, eggs, sugar, flour, and a pinch of salt) transform into individual cakes in under 15 minutes. The secret? Precise timing is everything. Pull them from the oven a minute too late, and you'll miss that magical molten center; too early, and they'll be raw. But hit that sweet spot at exactly 12 minutes, when the edges are set but the center still wobbles slightly, and you've achieved dessert perfection. I love serving these straight from the oven, dusted with powdered sugar and topped with a small scoop of vanilla bean ice cream that slowly melts into the warm cake. The contrast of temperatures and textures – warm and cold, crisp and gooey – makes this simple dessert absolutely unforgettable.

We can force Amazon Nova to use the tool extract_recipe, which will generate a structured JSON output that adheres to the predefined schema provided as the tool input schema:

 {
  "toolUseId": "tooluse_4YT_DYwGQlicsNYMbWFGPA",
  "name": "extract_recipe",
  "input": {
    "recipe": {
      "name": "Decadent Chocolate Lava Cake",
      "description": "A delicate chocolate cake that releases a stream of warm, velvety chocolate sauce when pierced with a fork. It's pure theater at the table.",
      "difficulty": "medium",
      "ingredients": [
        {
          "item": "good quality dark chocolate",
          "amount": 125,
          "unit": "g"
        },
        {
          "item": "unsalted butter",
          "amount": 125,
          "unit": "g"
        },
        {
          "item": "eggs",
          "amount": 4,
          "unit": ""
        },
        {
          "item": "sugar",
          "amount": 100,
          "unit": "g"
        },
        {
          "item": "flour",
          "amount": 50,
          "unit": "g"
        },
        {
          "item": "salt",
          "amount": 0.5,
          "unit": "pinch"
        }
      ],
      "instructions": [
        "Preheat the oven to 200u00b0C (400u00b0F).",
        "Melt the chocolate and butter together in a heatproof bowl over a saucepan of simmering water.",
        "In a separate bowl, whisk the eggs and sugar until pale and creamy.",
        "Fold the melted chocolate mixture into the egg and sugar mixture.",
        "Sift the flour and salt into the mixture and gently fold until just combined.",
        "Divide the mixture among six ramekins and bake for 12 minutes.",
        "Serve straight from the oven, dusted with powdered sugar and topped with a small scoop of vanilla bean ice cream."
      ],
      "prep_time": 10,
      "cook_time": 12,
      "servings": 6,
      "tags": [
        "dessert",
        "chocolate",
        "cake"
      ]
    }
  }
}

API generation

Another common scenario is to require Amazon Nova to select a tool from the available options no matter the context of the user query. One example of this is with API endpoint selection. In this situation, we don’t know the specific tool to use, and we allow the model to choose between the ones available.

With the tool choice of any, you can make sure that the model will always use at least one of the available tools. Because of this, we provide a tool that can be used for when an API is not relevant. Another example would be to provide a tool that allows clarifying questions.

In this example, we provide the model with two different APIs, and an unsupported API tool that it will select based on the user query:

import boto3
import json

tool_config = {
    "toolChoice": {
        "any": {}
    },
    "tools": [
         {
            "toolSpec": {
                "name": "get_all_products",
                "description": "API to retrieve multiple products with filtering and pagination options",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "sort_by": {
                                "type": "string",
                                "description": "Field to sort results by. One of: price, name, created_date, popularity",
                                "default": "created_date"
                            },
                            "sort_order": {
                                "type": "string",
                                "description": "Order of sorting (ascending or descending). One of: asc, desc",
                                "default": "desc"
                            },
                        },
                        "required": []
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "get_products_by_id",
                "description": "API to retrieve retail products based on search criteria",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "product_id": {
                                "type": "string",
                                "description": "Unique identifier of the product"
                            },
                        },
                        "required": ["product_id"]
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "unsupported_api",
                "description": "API to use when the user query does not relate to the other available APIs",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "reasoning": {
                                "type": "string",
                                "description": "The reasoning for why the user query did not have a valid API available"
                            },
                        },
                        "required": ["reasoning"]
                    }
                }
            }
        }
    ]
}


messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)

print(json.dumps(response['output']['message']['content'][0], indent=2))

A user input of “Can you get all of the available products?” would output the following:

{
  "toolUse": {
    "toolUseId": "tooluse_YCNbT0GwSAyjIYOuWnDhkw",
    "name": "get_all_products",
    "input": {}
  }
}

Whereas “Can you get my most recent orders?” would output the following:

{
  "toolUse": {
    "toolUseId": "tooluse_jpiZnrVcQDS1sAa-qPwIQw",
    "name": "unsupported_api",
    "input": {
      "reasoning": "The available tools do not support retrieving user orders. The user's request is for personal order information, which is not covered by the provided APIs."
    }
  }
}

Chat with search

The final option for tool choice is auto. This is the default behavior, so it is consistent with providing no tool choice at all.

Using this tool choice will allow the option of tool use or just text output. If the model selects a tool, there will be a tool block and text block. If the model responds with no tool, only a text block is returned. In the following example, we want to allow the model to respond to the user or call a tool if necessary:

import boto3
import json

tool_config = {
    "toolChoice": {
        "auto": {}
    },
    "tools": [
         {
            "toolSpec": {
                "name": "search",
                "description": "API that provides access to the internet",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "query": {
                                "type": "string",
                                "description": "Query to search by",
                            },
                        },
                        "required": ["query"]
                    }
                }
            }
        }
    ]
}

messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

system = [{
    "text": "ou are a helpful chatbot. You can use a tool if necessary or respond to the user query"
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)


if (response["stopReason"] == "tool_use"):
    tool_use = next(
        block["toolUse"]
        for block in response["output"]["message"]["content"]
            if "toolUse" in block
    )
   print(json.dumps(tool_use, indent=2))
 else:
    pattern = r'<thinking>.*?</thinking>\n\n|<thinking>.*?</thinking>'
    text_response = response["output"]["message"]["content"][0]["text"]
    stripped_text = re.sub(pattern, '', text_response, flags=re.DOTALL)
    
    print(stripped_text)

A user input of “What is the weather in San Francisco?” would result in a tool call:

{
  "toolUseId": "tooluse_IwtBnbuuSoynn1qFiGtmHA",
  "name": "search",
  "input": {
    "query": "what is the weather in san francisco"
  }
}

Whereas asking the model a direct question like “How many months are in a year?” would respond with a text response to the user:

There are 12 months in a year.

Considerations

There are a few best practices that are required for tool calling with Nova models. The first is to use greedy decoding parameters. With Amazon Nova models, that requires setting a temperature, top p, and top k of 1. You can refer to the previous code examples for how to set these. Using greedy decoding parameters forces the models to produce deterministic responses and improves the success rate of tool calling.

The second consideration is the JSON schema you are using for the tool consideration. At the time of writing, Amazon Nova models support a limited subset of JSON schemas, so they might not be picked up as expected by the model. Common fields would be $def and $ref fields. Make sure that your schema has the following top-level fields set: type (must be object), properties, and required.

Lastly, for the most impact on the success of tool calling, you should optimize your tool configurations. Descriptions and names should be very clear. If there are nuances to when one tool should be called over the other, make sure to have that concisely included in the tool descriptions.

Conclusion

Using tool choice in tool calling workflows is a scalable way to control how a model invokes tools. Instead of relying on prompt engineering, tool choice forces the model to adhere to the settings in place. However, there are complexities to tool calling; for more information, refer to Tool use (function calling) with Amazon Nova, Tool calling systems, and Troubleshooting tool calls.

Explore how Amazon Nova models can enhance your generative AI use cases today.


About the Authors

Jean Farmer is a Generative AI Solutions Architect on the Amazon Artificial General Intelligence (AGI) team, specializing in agentic applications. Based in Seattle, Washington, she works at the intersection of autonomous AI systems and practical business solutions, helping to shape the future of AGI at Amazon.

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Lulu Wong is an AI UX designer on the Amazon Artificial General Intelligence (AGI) team. With a background in computer science, learning design, and user experience, she bridges the technical and user experience domains by shaping how AI systems interact with humans, refining model input-output behaviors, and creating resources to make AI products more accessible to users.

Read More

Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Generative AI is rapidly transforming the modern workplace, offering unprecedented capabilities that augment how we interact with text and data. At Amazon Web Services (AWS), we recognize that many of our customers rely on the familiar Microsoft Office suite of applications, including Word, Excel, and Outlook, as the backbone of their daily workflows. In this blog post, we showcase a powerful solution that seamlessly integrates AWS generative AI capabilities in the form of large language models (LLMs) based on Amazon Bedrock into the Office experience. By harnessing the latest advancements in generative AI, we empower employees to unlock new levels of efficiency and creativity within the tools they already use every day. Whether it’s drafting compelling text, analyzing complex datasets, or gaining more in-depth insights from information, integrating generative AI with Office suite transforms the way teams approach their essential work. Join us as we explore how your organization can leverage this transformative technology to drive innovation and boost employee productivity.

Solution overview


Figure 1: Solution architecture overview

The solution architecture in Figure 1 shows how Office applications interact with a serverless backend hosted on the AWS Cloud through an Add-In. This architecture allows users to leverage Amazon Bedrock’s generative AI capabilities directly from within the Office suite, enabling enhanced productivity and insights within their existing workflows.

Components deep-dive

Office Add-ins

Office Add-ins allow extending Office products with custom extensions built on standard web technologies. Using AWS, organizations can host and serve Office Add-ins for users worldwide with minimal infrastructure overhead.

An Office Add-in is composed of two elements:

The code snippet below demonstrates part of a function that could run whenever a user invokes the plugin, performing the following actions:

  1. Initiate a request to the generative AI backend, providing the user prompt and available context in the request body
  2. Integrate the results from the backend response into the Word document using Microsoft’s Office JavaScript APIs. Note that these APIs use objects as namespaces, alleviating the need for explicit imports. Instead, we use the globally available namespaces, such as Word, to directly access relevant APIs, as shown in following example snippet.
// Initiate backend request (optional context)
const response = await sendPrompt({ user_message: prompt, context: selectedContext });

// Modify Word content with responses from the Backend
await Word.run(async (context) => {
  let documentBody;

  // Target for the document modifications
  if (response.location === 'Replace') {
    documentBody = context.document.getSelection(); // active text selection
  } else {
    documentBody = context.document.body; // entire document body
  }

  // Markdown support for preserving original content layout
  // Dependencies used: React markdown
  const content = renderToString(<Markdown>{ response.content } < /Markdown>);
  const operation = documentBody.insertHtml(content, response.location);

  // set properties for the output content (font, size, color, etc.)
  operation.font.set({ name: 'Arial' });

  // flush changes to the Word document
  await context.sync();
});

Generative AI backend infrastructure

The AWS Cloud backend consists of three components:

  1. Amazon API Gateway acts as an entry point, receiving requests from the Office applications’ Add-in. API Gateway supports multiple mechanisms for controlling and managing access to an API.
  2. AWS Lambda handles the REST API integration, processing the requests and invoking the appropriate AWS services.
  3. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available via an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With Bedrock’s serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using the AWS tools without having to manage infrastructure.

LLM prompting

Amazon Bedrock allows you to choose from a wide selection of foundation models for prompting. Here, we use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock for completions. The system prompt we used in this example is as follows:

You are an office assistant helping humans to write text for their documents.

[When preparing the answer, take into account the following text: <text>{context}</text>]
Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
Then, develop your answer in the user’s language within the <response></response> tags.

In the prompt, we first give the LLM a persona, indicating that it is an office assistant helping humans. The second, optional line contains text that has been selected by the user in the document and is provided as context to the LLM. We specifically instruct the LLM to first mimic a step-by-step thought process for arriving at the answer (chain-of-thought reasoning), an effective measure of prompt-engineering to improve the output quality. Next, we instruct it to detect the user’s language from their question so we can later refer to it. Finally, we instruct the LLM to develop its answer using the previously detected user language within response tags, which are used as the final response. While here, we use the default configuration for inference parameters such as temperature, that can quickly be configured with every LLM prompt. The user input is then added as a user message to the prompt and sent via the Amazon Bedrock Messages API to the LLM.

Implementation details and demo setup in an AWS account

As a prerequisite, we need to make sure that we are working in an AWS Region with Amazon Bedrock support for the foundation model (here, we use Anthropic’s Claude 3.5 Sonnet). Also, access to the required relevant Amazon Bedrock foundation models needs to be added. For this demo setup, we describe the manual steps taken in the AWS console. If required, this setup can also be defined in Infrastructure as Code.

To set up the integration, follow these steps:

  1. Create an AWS Lambda function with Python runtime and below code to be the backend for the API. Make sure that we have Powertools for AWS Lambda (Python) available in our runtime, for example, by attaching aLambda layer to our function. Make sure that the Lambda function’s IAM role provides access to the required FM, for example:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": [
                    "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
                ]
            }
        ]
    }
    

    The following code block shows a sample implementation for the REST API Lambda integration based on a Powertools for AWS Lambda (Python) REST API event handler:

    import json
    import re
    from typing import Optional
    
    import boto3
    from aws_lambda_powertools import Logger
    from aws_lambda_powertools.event_handler import APIGatewayRestResolver, CORSConfig
    from aws_lambda_powertools.logging import correlation_paths
    from aws_lambda_powertools.utilities.typing import LambdaContext
    from pydantic import BaseModel
    
    logger = Logger()
    app = APIGatewayRestResolver(
        enable_validation=True,
        cors=CORSConfig(allow_origin="http://localhost:3000"),  # for testing purposes
    )
    
    bedrock_runtime_client = boto3.client("bedrock-runtime")
    
    
    SYSTEM_PROMPT = """
    You are an office assistant helping humans to write text for their documents.
    
    {context}
    Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
    Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
    Then, develop your answer in the user's language in markdown format within the <response></response> tags.
    """
    
    class Query(BaseModel):
        user_message: str  # required
        context: Optional[str] = None  # optional
        max_tokens: int = 1000  # default value
        model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # default value
    
    def wrap_context(context: Optional[str]) -> str:
        if context is None:
            return ""
        else:
            return f"When preparing the answer take into account the following text: <text>{context}</text>"
    
    def parse_completion(completion: str) -> dict:
        response = {"completion": completion}
        try:
            tags = ["thinking", "user_language", "response"]
            tag_matches = re.finditer(
                f"<(?P<tag>{'|'.join(tags)})>(?P<content>.*?)</(?P=tag)>",
                completion,
                re.MULTILINE | re.DOTALL,
            )
            for match in tag_matches:
                response[match.group("tag")] = match.group("content").strip()
        except Exception:
            logger.exception("Unable to parse LLM response")
            response["response"] = completion
    
        return response
    
    
    @app.post("/query")
    def query(query: Query):
        bedrock_response = bedrock_runtime_client.invoke_model(
            modelId=query.model_id,
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": query.max_tokens,
                    "system": SYSTEM_PROMPT.format(context=wrap_context(query.context)),
                    "messages": [{"role": "user", "content": query.user_message}],
                }
            ),
        )
        response_body = json.loads(bedrock_response.get("body").read())
        logger.info("Received LLM response", response_body=response_body)
        response_text = response_body.get("content", [{}])[0].get(
            "text", "LLM did not respond with text"
        )
        return parse_completion(response_text)
    
    @logger.inject_lambda_context(correlation_id_path=correlation_paths.API_GATEWAY_REST)
    def lambda_handler(event: dict, context: LambdaContext) -> dict:
        return app.resolve(event, context)
    

  2. Create an API Gateway REST API with a Lambda proxy integration to expose the Lambda function via a REST API. You can follow this tutorial for creating a REST API for the Lambda function by using the API Gateway console. By creating a Lambda proxy integration with a proxy resource, we can route requests to the resources to the Lambda function. Follow the tutorial to deploy the API and take note of the API’s invoke URL. Make sure to configure adequate access control for the REST API.

We can now invoke and test our function via the API’s invoke URL. The following example uses curl to send a request (make sure to replace all placeholders in curly braces as required), and the response generated by the LLM:

$ curl --header "Authorization: {token}" 
     --header "Content-Type: application/json" 
     --request POST 
     --data '{"user_message": "Write a 2 sentence summary about AWS."}' 
     https://{restapi_id}.execute-api.{region}.amazonaws.com/{stage_name}/query | jq .
{
 "completion": "<thinking>nTo summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.n</thinking>nn<user_language>en</user_language>nn<response>nnAWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management.nn</response>",
 "thinking": "To summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.",
 "user_language": "en",
 "response": "AWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management."
} 

If required, the created resources can be cleaned up by 1) deleting the API Gateway REST API, and 2) deleting the REST API Lambda function and associated IAM role.

Example use cases

To create an interactive experience, the Office Add-in integrates with the cloud back-end that implements conversational capabilities with support for additional context retrieved from the Office JavaScript API.

Next, we demonstrate two different use cases supported by the proposed solution, text generation and text refinement.

Text generation


Figure 2: Text generation use-case demo

In the demo in Figure 2, we show how the plug-in is prompting the LLM to produce a text from scratch. The user enters their query with some context into the Add-In text input area. Upon sending, the backend will prompt the LLM to generate respective text, and return it back to the frontend. From the Add-in, it is inserted into the Word document at the cursor position using the Office JavaScript API.

Text refinement


Figure 3: Text refinement use-case demo

In Figure 3, the user highlighted a text segment in the work area and entered a prompt into the Add-In text input area to rephrase the text segment. Again, the user input and highlighted text are processed by the backend and returned to the Add-In, thereby replacing the previously highlighted text.

Conclusion

This blog post showcases how the transformative power of generative AI can be incorporated into Office processes. We described an end-to-end sample of integrating Office products with an Add-in for text generation and manipulation with the power of LLMs. In our example, we used managed LLMs on Amazon Bedrock for text generation. The backend is hosted as a fully serverless application on the AWS cloud.

Text generation with LLMs in Office supports employees by streamlining their writing process and boosting productivity. Employees can leverage the power of generative AI to generate and edit high-quality content quickly, freeing up time for other tasks. Additionally, the integration with a familiar tool like Word provides a seamless user experience, minimizing disruptions to existing workflows.

To learn more about boosting productivity, building differentiated experiences, and innovating faster with AWS visit the Generative AI on AWS page.


About the Authors

Martin Maritsch is a Generative AI Architect at AWS ProServe focusing on Generative AI and MLOps. He helps enterprise customers to achieve business outcomes by unlocking the full potential of AI/ML services on the AWS Cloud.

Miguel Pestana is a Cloud Application Architect in the AWS Professional Services team with over 4 years of experience in the automotive industry delivering cloud native solutions. Outside of work Miguel enjoys spending its days at the beach or with a padel racket in one hand and a glass of sangria on the other.

Carlos Antonio Perea Gomez is a Builder with AWS Professional Services. He enables customers to become AWSome during their journey to the cloud. When not up in the cloud he enjoys scuba diving deep in the waters.

Read More

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

As we gather for NVIDIA GTC, organizations of all sizes are at a pivotal moment in their AI journey. The question is no longer whether to adopt generative AI, but how to move from promising pilots to production-ready systems that deliver real business value. The organizations that figure this out first will have a significant competitive advantage—and we’re already seeing compelling examples of what’s possible.

Consider Hippocratic AI’s work to develop AI-powered clinical assistants to support healthcare teams as doctors, nurses, and other clinicians face unprecedented levels of burnout. During a recent hurricane in Florida, their system called 100,000 patients in a day to check on medications and provide preventative healthcare guidance–the kind of coordinated outreach that would be nearly impossible to achieve manually. They aren’t just building another chatbot; they are reimagining healthcare delivery at scale.

Production-ready AI like this requires more than just cutting-edge models or powerful GPUs. In my decade working with customers’ data journeys, I’ve seen that an organization’s most valuable asset is its domain-specific data and expertise. And now leading our data and AI go-to-market, I hear customers consistently emphasize what they need to transform their domain advantage into AI success: infrastructure and services they can trust—with performance, cost-efficiency, security, and flexibility—all delivered at scale. When the stakes are high, success requires not just cutting-edge technology, but the ability to operationalize it at scale—a challenge that AWS has consistently solved for customers. As the world’s most comprehensive and broadly adopted cloud, our partnership with NVIDIA’s pioneering accelerated computing platform for generative AI amplifies this capability. It’s inspiring to see how, together, we’re enabling customers across industries to confidently move AI into production.

In this post, I will share some of these customers’ remarkable journeys, offering practical insights for any organization looking to harness the power of generative AI.

Transforming content creation with generative AI

Content creation represents one of the most visible and immediate applications of generative AI today. Adobe, a pioneer that has shaped creative workflows for over four decades, has moved with remarkable speed to integrate generative AI across its flagship products, helping millions of creators work in entirely new ways.

Adobe’s approach to generative AI infrastructure exemplifies what their VP of Generative AI, Alexandru Costin, calls an “AI superhighway”—a sophisticated technical foundation that enables rapid iteration of AI models and seamless integration into their creative applications. The success of their Firefly family of generative AI models, integrated across flagship products like Photoshop, demonstrates the power of this approach. For their AI training and inference workloads, Adobe uses NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) P5en (NVIDIA H200 GPUs), P5 (NVIDIA H100 GPUs), P4de (NVIDIA A100 GPUs), and G5 (NVIDIA A10G GPUs) instances. They also use NVIDIA software such as NVIDIA TensorRT and NVIDIA Triton Inference Server for faster, scalable inference. Adobe needed maximum flexibility to build their AI infrastructure, and AWS provided the complete stack of services needed—from Amazon FSx for Lustre for high-performance storage, to Amazon Elastic Kubernetes Service (Amazon EKS) for container orchestration, to Elastic Fabric Adapter (EFA) for high-throughput networking—to create a production environment that could reliably serve millions of creative professionals.

Key takeaway

If you’re building and managing your own AI pipelines, Adobe’s success highlights a critical insight: although GPU-accelerated compute often gets the spotlight in AI infrastructure discussions, what’s equally important is the NVIDIA software stack along with the foundation of orchestration, storage, and networking services that enable production-ready AI. Their results speak for themselves—Adobe achieved a 20-fold scale-up in model training while maintaining the enterprise-grade performance and reliability their customers expect.

Pioneering new AI applications from the ground up

Throughout my career, I’ve been particularly energized by startups that take on audacious challenges—those that aren’t just building incremental improvements but are fundamentally reimagining how things work. Perplexity exemplifies this spirit. They’ve taken on a technology most of us now take for granted: search. It’s the kind of ambitious mission that excites me, not just because of its bold vision, but because of the incredible technical challenges it presents. When you’re processing 340 million queries monthly and serving over 1,500 organizations, transforming search isn’t just about having great ideas—it’s about building robust, scalable systems that can deliver consistent performance in production.

Perplexity’s innovative approach earned them membership in both AWS Activate and NVIDIA Inception—flagship programs designed to accelerate startup innovation and success. These programs provided them with the resources, technical guidance, and support needed to build at scale. They were one of the early adopters of Amazon SageMaker HyperPod, and continue to use its distributed training capabilities to accelerate model training time by up to 40%. They use a highly optimized inference stack built with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server to serve both their search application and pplx-api, their public API service that gives developers access to their proprietary models. The results speak for themselves—their inference stack achieves up to 3.1 times lower latency compared to other platforms. Both their training and inference workloads run on NVIDIA GPU-accelerated EC2 P5 instances, delivering the performance and reliability needed to operate at scale. To give their users even more flexibility, Perplexity complements their own models with services such as Amazon Bedrock, and provides access to additional state-of-the-art models in their API. Amazon Bedrock offers ease of use and reliability, which are crucial for their team—as they note, it allows them to effectively maintain the reliability and latency their product demands.

What I find particularly compelling about Perplexity’s journey is their commitment to technical excellence, exemplified by their work optimizing GPU memory transfer with EFA networking. The team achieved 97.1% of the theoretical maximum bandwidth of 3200 Gbps and open sourced their innovations, enabling other organizations to benefit from their learnings.

For those interested in the technical details, I encourage you to read their fascinating post Journey to 3200 Gbps: High-Performance GPU Memory Transfer on AWS Sagemaker Hyperpod.

Key takeaway

For organizations with complex AI workloads and specific performance requirements, Perplexity’s approach offers a valuable lesson. Sometimes, the path to production-ready AI isn’t about choosing between self-hosted infrastructure and managed services—it’s about strategically combining both. This hybrid strategy can deliver both exceptional performance (evidenced by Perplexity’s 3.1 times lower latency) and the flexibility to evolve.

Transforming enterprise workflows with AI

Enterprise workflows represent the backbone of business operations—and they’re a crucial proving ground for AI’s ability to deliver immediate business value. ServiceNow, which terms itself the AI platform for business transformation, is rapidly integrating AI to reimagine core business processes at scale.

ServiceNow’s innovative AI solutions showcase their vision for enterprise-specific AI optimization. As Srinivas Sunkara, ServiceNow’s Vice President, explains, their approach focuses on deep AI integration with technology workflows, core business processes, and CRM systems—areas where traditional large language models (LLMs) often lack domain-specific knowledge. To train generative AI models at enterprise scale, ServiceNow uses NVIDIA DGX Cloud on AWS. Their architecture combines high-performance FSx for Lustre storage with NVIDIA GPU clusters for training, and NVIDIA Triton Inference Server handles production deployment. This robust technology platform allows ServiceNow to focus on domain-specific AI development and customer value rather than infrastructure management.

Key takeaway

ServiceNow offers an important lesson about enterprise AI adoption: while foundation models (FMs) provide powerful general capabilities, the greatest business value often comes from optimizing models for specific enterprise use cases and workflows. In many cases, it’s precisely this deliberate specialization that transforms AI from an interesting technology into a true business accelerator.

Scaling AI across enterprise applications

Cisco’s Webex team’s journey with generative AI exemplifies how large organizations can methodically transform their applications while maintaining enterprise standards for reliability and efficiency. With a comprehensive suite of telecommunications applications serving customers globally, they needed an approach that would allow them to incorporate LLMs across their portfolio—from AI assistants to speech recognition—without compromising performance or increasing operational complexity.

The Webex team’s key insight was to separate their models from their applications. Previously, they had embedded AI models into the container images for applications running on Amazon EKS, but as their models grew in sophistication and size, this approach became increasingly inefficient. By migrating their LLMs to Amazon SageMaker AI and using NVIDIA Triton Inference Server, they created a clean architectural break between their relatively lean applications and the underlying models, which require more substantial compute resources. This separation allows applications and models to scale independently, significantly reducing development cycle time and increasing resource utilization. The team deployed dozens of models on SageMaker AI endpoints, using Triton Inference Server’s model concurrency capabilities to scale globally across AWS data centers.

The results validate Cisco’s methodical approach to AI transformation. By separating applications from models, their development teams can now fix bugs, perform tests, and add features to applications much faster, without having to manage large models in their workstation memory. The architecture also enables significant cost optimization—applications remain available during off-peak hours for reliability, and model endpoints can scale down when not needed, all without impacting application performance. Looking ahead, the team is evaluating Amazon Bedrock to further improve their price-performance, demonstrating how thoughtful architecture decisions create a foundation for continuous optimization.

Key takeaway

For enterprises with large application portfolios looking to integrate AI at scale, Cisco’s methodical approach offers an important lesson: separating LLMs from applications creates a cleaner architectural boundary that improves both development velocity and cost optimization. By treating models and applications as independent components, Cisco significantly improved development cycle time while reducing costs through more efficient resource utilization.

Building mission-critical AI for healthcare

Earlier, we highlighted how Hippocratic AI reached 100,000 patients during a crisis. Behind this achievement lies a story of rigorous engineering for safety and reliability—essential in healthcare where stakes are extraordinarily high.

Hippocratic AI’s approach to this challenge is both innovative and rigorous. They’ve developed what they call a “constellation architecture”—a sophisticated system of over 20 specialized models working in concert, each focused on specific safety aspects like prescription adherence, lab analysis, and over-the-counter medication guidance. This distributed approach to safety means they have to train multiple models, requiring management of significant computational resources. That’s why they use SageMaker HyperPod for their training infrastructure, using Amazon FSx and Amazon Simple Storage Service (Amazon S3) for high-speed storage access to NVIDIA GPUs, while Grafana and Prometheus provide the comprehensive monitoring needed to provide optimal GPU utilization. They build upon NVIDIA’s low-latency inference stack, and are enhancing conversational AI capabilities using NVIDIA Riva models for speech recognition and text-to-speech translation, and are also using NVIDIA NIM microservices to deploy these models. Given the sensitive nature of healthcare data and HIPAA compliance requirements, they’ve implemented a sophisticated multi-account, multi-cluster strategy on AWS—running production inference workloads with patient data on completely separate accounts and clusters from their development and training environments. This careful attention to both security and performance allows them to handle thousands of patient interactions while maintaining precise control over clinical safety and accuracy.

The impact of Hippocratic AI’s work extends far beyond technical achievements. Their AI-powered clinical assistants address critical healthcare workforce burnout by handling burdensome administrative tasks—from pre-operative preparation to post-discharge follow-ups. For example, during weather emergencies, their system can rapidly assess heat risks and coordinate transport for vulnerable patients—the kind of comprehensive care that would be too burdensome and resource-intensive to coordinate manually at scale.

Key takeaway

For organizations building AI solutions for complex, regulated, and high-stakes environments, Hippocratic AI’s constellation architecture reinforces what we’ve consistently emphasized: there’s rarely a one-size-fits-all model for every use case. Just as Amazon Bedrock offers a choice of models to meet diverse needs, Hippocratic AI’s approach of combining over 20 specialized models—each focused on specific safety aspects—demonstrates how a thoughtfully designed ensemble can achieve both precision and scale.

Conclusion

As the technology partners enabling these and countless other customer innovations, AWS and NVIDIA’s long-standing collaboration continues to evolve to meet the demands of the generative AI era. Our partnership, which began over 14 years ago with the world’s first GPU cloud instance, has grown to offer the industry’s widest range of NVIDIA accelerated computing solutions and software services for optimizing AI deployments. Through initiatives like Project Ceiba—one of the world’s fastest AI supercomputers hosted exclusively on AWS using NVIDIA DGX Cloud for NVIDIA’s own research and development use—we continue to push the boundaries of what’s possible.

As all the examples we’ve covered illustrate, it isn’t just about the technology we build together—it’s how organizations of all sizes are using these capabilities to transform their industries and create new possibilities. These stories ultimately reveal something more fundamental: when we make powerful AI capabilities accessible and reliable, people find remarkable ways to use them to solve meaningful problems. That’s the true promise of our partnership with NVIDIA—enabling innovators to create positive change at scale. I’m excited to continue inventing and partnering with NVIDIA and can’t wait to see what our mutual customers are going to do next.

Resources

Check out the following resources to learn more about our partnership with NVIDIA and generative AI on AWS:


About the Author

Rahul Pathak is Vice President Data and AI GTM at AWS, where he leads the global go-to-market and specialist teams who are helping customers create differentiated value with AWS’s AI and capabilities such as Amazon Bedrock, Amazon Q, Amazon SageMaker, and Amazon EC2 and Data Services such as Amaqzon S3, AWS Glue and Amazon Redshift. Rahul believes that generative AI will transform virtually every single customer experience and that data is a key differentiator for customers as they build AI applications. Prior to his current role, he was Vice President, Relational Database Engines where he led Amazon Aurora, Redshift, and DSQL . During his 13+ years at AWS, Rahul has been focused on launching, building, and growing managed database and analytics services, all aimed at making it easy for customers to get value from their data. Rahul has over twenty years of experience in technology and has co-founded two companies, one focused on analytics and the other on IP-geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from the University of Washington.

Read More

Amazon Q Business now available in Europe (Ireland) AWS Region

Amazon Q Business now available in Europe (Ireland) AWS Region

Today, we are excited to announce that Amazon Q Business—a fully managed generative-AI powered assistant that you can configure to answer questions, provide summaries and generate content based on your enterprise data—is now generally available in the Europe (Ireland) AWS Region.

Since its launch, Amazon Q Business has been helping customers find information, gain insight, and take action at work. The general availability of Amazon Q Business in the Europe (Ireland) Region will support customers across Ireland and the EU to transform how their employees work and access information, while maintaining data security and privacy requirements.

AWS customers and partners innovate using Amazon Q Business in Europe

Organizations across the EU are using Amazon Q Business for a wide variety of use cases, including answering questions about company data, summarizing documents, and providing business insights.

Katya Dunets, the AWS Lead Sales Engineer for Adastra noted,

Adastra stands at the forefront of technological innovation, specializing in artificial intelligence, data, cloud, digital, and governance services. Our team was facing the daunting challenge of sifting through hundreds of documents on SharePoint, searching for content and information critical for market research and RFP generation. This process was not only time-consuming but also impeded our agility and responsiveness. Recognizing the need for a transformative solution, we turned to Amazon Q Business for its prowess in answering queries, summarizing documents, generating content, and executing tasks, coupled with its direct SharePoint integration. Amazon Q Business became the catalyst for unprecedented efficiency within Adastra, dramatically streamlining document retrieval, enhancing cross-team collaboration through shared insights from past projects, and accelerating our RFP development process by 70%. Amazon Q Business has not only facilitated a smoother exchange of knowledge within our teams but has also empowered us to maintain our competitive edge by focusing on innovation rather than manual tasks. Adastra’s journey with Amazon Q exemplifies our commitment to harnessing cutting-edge technology to better serve both our clients and their customers.

AllCloud is a cloud solutions provider specializing in cloud stack, infrastructure, platform, and Software-as-a-Service. Their CTO, Peter Nebel stated,

“AllCloud faces the common challenge of information sprawl. Critical knowledge for sales and delivery teams is scattered across various tools—Salesforce for customer and marketing data, Google Drive for documents, Bamboo for HR and internal information, and Confluence for internal wikis. This fragmented approach wastes valuable time as employees hunt and peck for the information they need, hindering productivity and potentially impacting client satisfaction. Amazon Q Business provides AllCloud a solution to increase productivity by streamlining information access. By leveraging Amazon Q’s natural language search capabilities, AllCloud can empower its personnel with a central hub to find answers to their questions across all their existing information sources. This drives efficiency and accuracy by eliminating the need for time-consuming searches across multiple platforms and ensures all teams have access to the most up-to-date information. Amazon Q will significantly accelerate productivity, across all lines of business, allowing AllCloud’s teams to focus on delivering exceptional service to their clients.”

Lars Ritter, Senior Manager at Woodmark Consulting noted,

“Amazon Bedrock and Amazon Q Business have been game-changers for Woodmark. Employees struggled with time-consuming searches across various siloed systems, leading to reduced productivity and slower operations. To solve for the inefficient retrieval of corporate knowledge from unstructured data sources we turned to Amazon Bedrock and Amazon Q Business for help. With this innovative solution, Woodmark has been able to revolutionize data accessibility, empowering our teams to effortlessly retrieve insights using simple natural language queries, and to make informed decisions without relying on specialized data teams, which was not feasible before. These solutions have dramatically increased efficiency, fostered a data-driven culture, and positioned us for scalable growth, driving our organization toward unparalleled success.”

Scott Kumono, Product Manager for Kinectus at Siemens Healthineers adds,

“Amazon Q Business has enhanced the delivery of service and clinical support for our ultrasound customers. Previously, finding specific information meant sifting through a 1,000-page manual or waiting for customer support to respond. Now, customers have instant access to answers and specifications right at their fingertips, using Kinectus Remote Service. With Amazon Q Business we were able to significantly reduce manual work and wait times to find the right information, allowing our customers to focus on what really matters – patient care.”

Till Gloger, Head of Digital Production Platform Region Americas at Volkswagen Group of America states,

“Volkswagen innovates not only on its products, but also on how to boost employee productivity and increase production throughput. Volkswagen is testing the use of Amazon Q to streamline employee workflows by potentially integrating it with existing processes. This integration has the possibility to help employees save time during the assembly process, reducing some processes from minutes to seconds, ultimately leading to more throughput.”

Pricing

With Amazon Q Business, enterprise customers pay for user subscriptions and index capacity. For more details, see Amazon Q Business pricing.

Get started with Amazon Q Business today

To get started with Amazon Q Business, users first need to configure an application environment and create a knowledge base using over 40 data source connectors that index documents (for example, text, PDF, images, and tables). Organizations then set up user authentication through AWS IAM Identity Center or other SAML-based identity providers such as Okta, Ping Identity, and Microsoft Entra ID. After configuring access permissions, application users can navigate to their organization’s Amazon Q Business web interface using their credentials to begin interacting with Amazon Q Business and the data they have access to. Amazon Q Business enables natural language interactions where users can ask questions and receive answers based on their indexed documents, uploaded content, and world knowledge; this can include getting details, generating content, or surfacing insights. Users can access Amazon Q Business through multiple channels, including web applications, Slack, Microsoft Teams, Microsoft 365 for Word and Outlook, or through browser extensions for generative AI assistance directly where they work. Additionally, customers can securely share their data with verified independent software vendors (ISVs) such as Asana, Miro, PagerDuty, and Zoom using the data accessors feature, which maintains security and compliance while respecting user-level permissions.
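
The following is a minimal boto3 sketch of the first of these steps, creating an application environment and an index. The display names and the IAM Identity Center instance ARN are placeholders, and the parameter names shown assume the current qbusiness API shape, so consult the Amazon Q Business documentation before adapting it.

import boto3

qbusiness = boto3.client("qbusiness")

# Create the Amazon Q Business application environment.
# The IAM Identity Center instance ARN below is a placeholder.
app = qbusiness.create_application(
    displayName="example-q-business-app",
    identityCenterInstanceArn="arn:aws:sso:::instance/ssoins-EXAMPLE",
)

# Create an index that the data source connectors will populate.
index = qbusiness.create_index(
    applicationId=app["applicationId"],
    displayName="example-index",
)

print(app["applicationId"], index["indexId"])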

Learn more about how to get started with Amazon Q Business here. Read about other Amazon Q Business customers’ success stories here. Certain Amazon Q Business features already available in US East (N. Virginia) and US West (Oregon), including Q Apps, Q Actions, and audio/video file support, will become available in Europe (Ireland) soon.


About the Authors

Jose Navarro is an AI/ML Specialist Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production.

Morgan Dutton is a Senior Technical Program Manager at AWS, Amazon Q Business based in Seattle.

Eva Pagneux is a Principal Product Manager at AWS, Amazon Q Business, based in San Francisco.

Wesleigh Roeca is a Senior Worldwide Gen AI/ML Specialist at AWS, Amazon Q Business, based in Santa Monica.

Read More

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

This post is cowritten with Abdullahi Olaoye, Akshit Arora and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with:

  • Comprehensive development tools: A complete ecosystem of tools, scripts, and proven recipes that guide users through every phase of the LLM lifecycle, from initial data preparation to final deployment.
  • Advanced customization: Flexible customization options that teams can use to tailor models to their specific use cases while maintaining peak performance.
  • Optimized infrastructure: Sophisticated multi-GPU and multi-node configurations that maximize computational efficiency for both language and image applications.
  • Enterprise-grade features with built-in capabilities including:
    • Advanced parallelism techniques
    • Memory optimization strategies
    • Distributed checkpointing
    • Streamlined deployment pipelines

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that integrates smoothly into each developer’s workflow. The framework provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.
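
As a minimal illustration of how NeMo-Run structures work, the following sketch wraps a plain Python function as a task and runs it locally. It assumes the nemo_run package’s Partial, LocalExecutor, and run helpers; the Slurm-based setup used on the HyperPod cluster appears later in this post.

import nemo_run as run

def add(a: int, b: int) -> int:
    return a + b

if __name__ == "__main__":
    # Wrap the function as a configurable task and execute it on the local machine.
    task = run.Partial(add, a=1, b=2)
    executor = run.LocalExecutor()
    run.run(task, executor=executor)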

The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:

  • Data curation: NeMo Curator is a Python library that includes a suite of modules for data-mining and synthetic data generation. They are scalable and optimized for GPUs, making them ideal for curating natural language data to train or fine-tune LLMs. With NeMo Curator, you can efficiently extract high-quality text from extensive raw web data sources.
  • Training and customization: NeMo Framework provides tools for efficient training and customization of LLMs and multimodal models. It includes default configurations for compute cluster setup, data downloading, and model hyperparameter autotuning, which can be adjusted to train on new datasets and models. In addition to pre-training, NeMo supports both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques such as LoRA, P-tuning, and more.
  • Alignment: NeMo Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, reinforcement learning from human feedback (RLHF), and much more. By using these algorithms, you can align language models to be safer, less harmful, and more helpful.

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include:

  • Setting up SageMaker HyperPod prerequisites: Configuring networking, storage, and permissions management (AWS Identity and Access Management (IAM) roles).
  • Launching the SageMaker HyperPod cluster: Using lifecycle scripts and a predefined cluster configuration to deploy compute resources.
  • Configuring the environment: Setting up NeMo Framework and installing the required dependencies.
  • Building a custom container: Creating a Docker image that packages NeMo Framework and installs the required AWS networking dependencies.
  • Running NeMo model training: Using NeMo-Run with a Slurm-based execution setup to train an example LLaMA (180M) model efficiently.

Architecture diagram

The architecture, shown in the preceding diagram, consists of an Amazon SageMaker HyperPod cluster.

Prerequisites

Before running the training job, you deploy a SageMaker HyperPod cluster. To deploy the cluster, you first need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

  1. Sign in to the AWS Management Console using the AWS account you want to deploy the SageMaker HyperPod cluster in. You will create a VPC, subnets, an FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role as prerequisites, so make sure that your IAM role or user for console access has permissions to create these resources.
  2. Use the CloudFormation template link to open the AWS CloudFormation console and launch the solution template.
  3. Template parameters:
    • Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region.
    • All other parameters can be left as default or changed as needed for your use case.
  4. Select the acknowledgement box in the Capabilities section and create the stack.

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack deployment for the prerequisite infrastructure components.
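
If you prefer to script the prerequisite deployment instead of using the console, the following boto3 sketch launches the same kind of stack. The template URL and the Availability Zone parameter key are placeholders for the template linked in step 2, and the stack name matches the one used in the cleanup section.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# StackName mirrors the cleanup step; TemplateURL and ParameterKey are placeholders.
cfn.create_stack(
    StackName="sagemaker-hyperpod",
    TemplateURL="https://example-bucket.s3.amazonaws.com/hyperpod-prerequisites.yaml",
    Parameters=[{"ParameterKey": "AvailabilityZoneId", "ParameterValue": "use1-az2"}],
    Capabilities=["CAPABILITY_IAM"],  # mirrors the console acknowledgement
)

# Wait until the prerequisite resources are created (about 10 minutes).
cfn.get_waiter("stack_create_complete").wait(StackName="sagemaker-hyperpod")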

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. For the model training job, you will use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

  1. Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
  2. Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
  3. Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle scripts manage include setting up Slurm and mounting the FSx for Lustre file system.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
  4. Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
    1. Cluster name
    2. Three instance groups:
      1. Login-group: Acts as the entry point for users and administrators. Typically used for managing jobs, monitoring, and debugging.
      2. Controller-machine: The head node for the HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within nodes.
      3. Worker-group: The group of nodes that executes the actual model training workload.
    3. VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json 
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/\$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/\$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/\$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/\$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
  5. Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  6. Create the SageMaker HyperPod cluster.
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
  7. Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.
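
If you prefer to monitor the status from Python rather than the AWS CLI or the console, the following boto3 sketch polls the DescribeCluster API until the cluster leaves the Creating state. It assumes the cluster name ml-cluster used elsewhere in this post.

import time
import boto3

sm = boto3.client("sagemaker")

# Poll until the HyperPod cluster leaves the Creating state.
while True:
    status = sm.describe_cluster(ClusterName="ml-cluster")["ClusterStatus"]
    print("Cluster status:", status)
    if status != "Creating":
        break
    time.sleep(60)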

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

  1. Install the AWS SSM Session Manager Plugin.
  2. Create a local key pair that can be added to the cluster by the helper script for easier SSH access and run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the command is properly configured by running several commands. See Get to know your Cluster for more information.

  1. View the existing partition and nodes per partition
$ sinfo
  2. List the jobs that are in the queue or running.
$ squeue
  3. SSH to the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster

#SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)

#Exit to the head node
$ exit

#Exit again to cancel the srun job above
$ exit
  4. Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, OFI plugin, update NCCL, and NCCL tests) to the NeMo Framework container from NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following commands to build the Docker image from the Dockerfile and create a SquashFS file from it using Enroot.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

  1. NeMo-Run requires Python 3.10; verify that it is installed on the head node before proceeding.
  2. You can use the following steps to set up the NeMo-Run dependencies using a virtual environment. The steps create and activate a virtual environment, then execute the venv.sh script to install the dependencies. The dependencies installed include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
  3. To prepare for the pre-training of the LLaMA model in offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

  1. After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .

  2. After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following TensorBoard screenshot is from a training job; the reduced_train_loss curve shows the loss decreasing over the training steps.
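
If you prefer to inspect the metric programmatically, the following sketch reads the downloaded event file with TensorBoard’s EventAccumulator. It assumes the event file sits in the current directory (as in the rsync step above) and that the run logged a reduced_train_loss scalar; the tag names in your run may differ.

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Load the event file(s) downloaded from the cluster into the current directory.
ea = EventAccumulator(".")
ea.Reload()

# Print the last few recorded values of the training loss.
for event in ea.Scalars("reduced_train_loss")[-5:]:
    print(f"step={event.step} loss={event.value:.4f}")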

Troubleshooting

  • If some of the nodes appear as down or down* (as shown in the following output, where two nodes are in the down* state):

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following output, the two nodes then return to an idle state.

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

  1. Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster

  2. Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.


About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon AI MLOps, DevOps, Scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Read More

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the NeMo Retriever Llama3.2 Text Embedding and Reranking NVIDIA NIM microservices are available in Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s optimized reranking and embedding models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on SageMaker JumpStart.

About NVIDIA NIM on AWS

NVIDIA NIM microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise available in AWS Marketplace, NIM is a set of user-friendly microservices designed to streamline and accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open source community models to NVIDIA AI foundation models (FMs) and custom models. NIM microservices provide straightforward integration into generative AI applications using industry-standard APIs and can be deployed with just a few lines of code, or with a few clicks on the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM helps you deploy your generative AI applications.

Overview of NVIDIA NeMo Retriever NIM microservices

In this section, we provide an overview of the NVIDIA NeMo Retriever NIM microservices discussed in this post.

NeMo Retriever text embedding NIM

The NVIDIA NeMo Retriever Llama3.2 embedding NIM is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by 35-fold through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.
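
To illustrate the idea behind dynamic embedding sizes (Matryoshka Embeddings), the following sketch truncates a full-length embedding vector to a smaller dimension and re-normalizes it. The 384-dimension target and the stand-in vector are arbitrary examples; in practice you would apply this to vectors returned by the embedding endpoint, or control the output dimension through the model’s own parameters where supported.

import numpy as np

def truncate_embedding(embedding, target_dim=384):
    """Keep the leading target_dim components of a Matryoshka-style embedding
    and re-normalize so that cosine similarity still behaves as expected."""
    vec = np.asarray(embedding, dtype=np.float32)[:target_dim]
    return vec / np.linalg.norm(vec)

# Stand-in vector for illustration; replace with a vector from the endpoint response.
full_embedding = np.random.rand(2048).tolist()
small = truncate_embedding(full_embedding, target_dim=384)
print(small.shape)  # (384,)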

NeMo Retriever text reranking NIM

The NVIDIA NeMo Retriever Llama3.2 reranking NIM is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens). This model was evaluated on the same 26 languages mentioned earlier.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art FMs for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

Solution overview

You can now discover and deploy the NeMo Retriever text embedding and reranking NIM microservices in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your virtual private cloud (VPC), helping to support data security for enterprise security needs.
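
As a quick illustration of the programmatic path, the following sketch uses the SageMaker Python SDK’s JumpStartModel class. The model_id value is a placeholder, so look up the exact JumpStart identifier for the microservice you want to deploy; the Boto3-based deployment used in the rest of this post appears in the following sections.

from sagemaker.jumpstart.model import JumpStartModel

# model_id is a placeholder; use the JumpStart identifier of the
# NeMo Retriever embedding or reranking microservice you want to deploy.
model = JumpStartModel(model_id="nvidia-nim-llama-3-2-nv-embedqa-1b-v2")
predictor = model.deploy(accept_eula=True)

print(predictor.endpoint_name)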

In the following sections, we demonstrate how to deploy these microservices and run real-time and batch inference.

Make sure your SageMaker AWS Identity and Access Management (IAM) service role has the AmazonSageMakerFullAccess permission policy attached.

To deploy NeMo Retriever Llama3.2 embedding and reranking microservices successfully, confirm one of the following:

  • Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account has a subscription to the model. If so, you can skip the following deployment instructions and start at the Subscribe to the model package section.

Deploy NeMo Retriever microservices on SageMaker JumpStart

For those new to SageMaker JumpStart, we demonstrate using SageMaker Studio to access models on SageMaker JumpStart. The following screenshot shows the NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

Deployment starts when you choose the Deploy option. You might be prompted to subscribe to this model through AWS Marketplace. If you are already subscribed, then you can move forward with choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Deploy the NeMo Retriever microservice.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Depending on the model you want to deploy, open the model package listing page for Llama-3.2-NV-EmbedQA-1B-v2 or Llama-3.2-NV-RerankQA-1B-v2.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region.

A product Amazon Resource Name (ARN) will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Deploy NeMo Retriever microservices using the SageMaker SDK

In this section, we walk through deploying the NeMo Retriever text embedding NIM through the SageMaker SDK. A similar process can be followed for deploying the NeMo Retriever text reranking NIM as well.

Define the SageMaker model using the model package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

import boto3
import sagemaker

# SageMaker client and execution role (assumed to be set up earlier in your notebook)
sm = boto3.client("sagemaker")
role = sagemaker.get_execution_role()

# Define the model details
model_package_arn = "Specify the model package ARN here"
sm_model_name = "nim-llama-3-2-nv-embedqa-1b-v2"

# Create the SageMaker model
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True
)
print("Model Arn: " + create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create an endpoint configuration specifying instance type; in this case, we are using an ml.g5.2xlarge instance type accelerated by NVIDIA A10G GPUs. Make sure you have the account-level service limit for using ml.g5.2xlarge for endpoint usage as one or more instances. To request a service quota increase, refer to AWS service quotas. For further performance improvements, you can use NVIDIA Hopper GPUs (P5 instances) on SageMaker.

# Create the endpoint configuration
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.2xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600,  # Specify the model download timeout in seconds
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,  # Specify the health check timeout in seconds
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService after the deployment is successful.

# Create the endpoint
endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the NIM microservice

Monitor the deployment of the NIM microservice with the following code, which polls the endpoint status until it reaches InService:

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

We get the following output:

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:611951037680:endpoint/nim-llama-3-2-nv-embedqa-1b-v2
Status: InService

After you deploy the model, your endpoint is ready for inference. In the following section, we use a sample text to do an inference request. For inference request format, NIM on SageMaker supports the OpenAI API inference protocol (at the time of writing). For an explanation of supported parameters, see Create an embedding vector from the input text.

Inference example with NeMo Retriever text embedding NIM microservice

The NVIDIA NeMo Retriever Llama3.2 embedding model is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). In this section, we provide examples of running real-time inference and batch inference.

Real-time inference example

The following code example illustrates how to perform real-time inference using the NeMo Retriever Llama3.2 embedding model:

import json
import pprint

import boto3

# SageMaker runtime client used to invoke the endpoint
client = boto3.client("sagemaker-runtime")

pp1 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=3)

input_embedding = '''{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}'''

print("Example input data for embedding model endpoint:")
print(input_embedding)

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=input_embedding
)

print("\nEmbedding endpoint response:")
response = json.load(response["Body"])
pp1.pprint(response)

We get the following output:

Example input data for embedding model endpoint:
{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", 
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}

Embedding endpoint response:
{ 'data': [ {'embedding': [...], 'index': 0, 'object': 'embedding'},
            {'embedding': [...], 'index': 1, 'object': 'embedding'}],
  'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
  'object': 'list',
  'usage': {'prompt_tokens': 14, 'total_tokens': 14}}

Batch inference example

When you have many documents, you can vectorize each of them with a for loop. This will often result in many requests. Alternatively, you can send requests consisting of batches of documents to reduce the number of requests to the API endpoint. We use the following example with a dataset of 10 documents. Let’s test the model with a number of documents in different languages:

documents = [
"El futuro de la computación cuántica en aplicaciones criptográficas.",
"L’application des réseaux neuronaux dans les systèmes de véhicules autonomes.",
"Analyse der Rolle von Big Data in personalisierten Gesundheitslösungen.",
"L’evoluzione del cloud computing nello sviluppo di software aziendale.",
"Avaliando o impacto da IoT na infraestrutura de cidades inteligentes.",
"Потенциал граничных вычислений для обработки данных в реальном времени.",
"评估人工智能在欺诈检测系统中的有效性。",
"倫理的なAIアルゴリズムの開発における課題と機会。",
"دمج تقنية الجيل الخامس (5G) في تعزيز الاتصال بالإنترنت للأشياء (IoT).",
"सुरक्षित लेनदेन के लिए बायोमेट्रिक प्रमाणीकरण विधियों में प्रगति।"
]

The following code demonstrates how to group the documents into batches and invoke the endpoint repeatedly to vectorize the whole dataset. Specifically, the example code loops over the 10 documents in batches of size 5 (batch_size=5).

pp2 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=2)

encoded_data = []
batch_size = 5

# Loop over the documents in increments of the batch size
for i in range(0, len(documents), batch_size):
    input = json.dumps({
        "input": documents[i:i+batch_size],
        "input_type": "passage",
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    })

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=input,
    )

    response = json.load(response["Body"])

    # Concatenate vectors into a single list; preserve the original index
    encoded_data.extend(
        {"embedding": data[1]["embedding"], "index": data[0]}
        for data in zip(range(i, i + batch_size), response["data"])
    )

# Print the response data
pp2.pprint(encoded_data)

We get the following output:

[ {'embedding': [...], 'index': 0}, {'embedding': [...], 'index': 1},
  {'embedding': [...], 'index': 2}, {'embedding': [...], 'index': 3},
  {'embedding': [...], 'index': 4}, {'embedding': [...], 'index': 5},
  {'embedding': [...], 'index': 6}, {'embedding': [...], 'index': 7},
  {'embedding': [...], 'index': 8}, {'embedding': [...], 'index': 9}]

Inference example with NeMo Retriever text reranking NIM microservice

The NVIDIA NeMo Retriever Llama3.2 reranking NIM microservice is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens).

In the following example, we create an input payload for a list of emails in multiple languages:

payload_model = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
query = {"text": "What emails have been about returning items?"}
documents = [
    {"text":"Contraseña incorrecta. Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"text":"Confirmation Email Missed. Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"text":"أسئلة حول سياسة الإرجاع. مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"text":"Customer Support is Busy. Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"text":"Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"text":"Customer Service is Unavailable. Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"text":"Return Policy for Defective Product. Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"text":"收到错误物品. 早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"text":"Return Defective Product. Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

payload = {
  "model": payload_model,
  "query": query,
  "passages": documents,
  "truncate": "END"
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(f'Documents: {response}')
print(json.dumps(output, indent=2))

In this example, the model returns raw relevance (logit) scores rather than scores normalized to a fixed range. Higher scores indicate stronger relevance to the query, and lower (more negative) scores indicate lower relevance.

Documents: {'ResponseMetadata': {'RequestId': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 04 Mar 2025 21:46:39 GMT', 'content-type': 'application/json', 'content-length': '349', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fbb00ff94b0>}
{
  "rankings": [
    {
      "index": 4,
      "logit": 0.0791015625
    },
    {
      "index": 8,
      "logit": -0.1904296875
    },
    {
      "index": 7,
      "logit": -2.583984375
    },
    {
      "index": 2,
      "logit": -4.71484375
    },
    {
      "index": 6,
      "logit": -5.34375
    },
    {
      "index": 1,
      "logit": -5.64453125
    },
    {
      "index": 5,
      "logit": -11.96875
    },
    {
      "index": 3,
      "logit": -12.2265625
    },
    {
      "index": 0,
      "logit": -16.421875
    }
  ],
  "usage": {
    "prompt_tokens": 513,
    "total_tokens": 513
  }
}

Let’s see the top-ranked document for our query:

# 1. Extract the array of rankings
rankings = output["rankings"]  # or output.get("rankings", [])

# 2. Get the top-ranked entry (highest logit)
top_ranked_entry = rankings[0]
top_index = top_ranked_entry["index"]  # e.g. 4 in your example

# 3. Retrieve the corresponding document
top_document = documents[top_index]

print("Top-ranked document:")
print(top_document)

The following is the top-ranked document based on the provided relevance scores:

Top-ranked document:
{'text': 'Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.'}

This translates to the following:

"Wrong item received. Hello, I have a question about my last order. I received the wrong item and need to return it."

Based on the preceding results from the model, we see that a higher logit indicates stronger alignment with the query, whereas lower (or more negative) values indicate lower relevance. In this case, the document discussing receiving the wrong item (in German) was ranked first with the highest logit, confirming that the model quickly and effectively identified it as the most relevant passage regarding product returns.
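
If you want scores in a fixed [0, 1] range for display or thresholding, you can map the raw logits through a sigmoid on the client side; this is a convenience transformation, not something the reranking NIM returns for you. The following sketch reuses the rankings output shown above.

import math

# Map raw reranker logits to (0, 1) for easier comparison and thresholding.
for entry in output["rankings"]:
    score = 1.0 / (1.0 + math.exp(-entry["logit"]))
    print(f"index={entry['index']} logit={entry['logit']:.3f} score={score:.3f}")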

Clean up

To clean up your resources, use the following commands:

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

The NVIDIA NeMo Retriever Llama 3.2 NIM microservices bring powerful multilingual capabilities to enterprise search and retrieval systems. These models excel in diverse use cases, including cross-lingual search applications, enterprise knowledge bases, customer support systems, and content recommendation engines. The text embedding NIM’s dynamic embedding size (Matryoshka Embeddings) reduces storage footprint by 35-fold while supporting 26 languages and documents up to 8,192 tokens. The reranking NIM accurately scores document relevance across languages, enabling precise information retrieval even for multilingual content. For organizations managing global knowledge bases or customer-facing search experiences, these NVIDIA-optimized microservices provide a significant advantage in latency, accuracy, and efficiency, allowing developers to quickly deploy sophisticated search capabilities without compromising on performance or linguistic diversity.

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language FMs for text embedding and reranking. Through the UI or just a few lines of code, you can deploy a highly accurate text embedding model to generate dense vector representations that capture semantic meaning and a reranking model to find semantic matches and retrieve the most relevant information from various data stores at scale and cost-efficiently.


About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by Amazon SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Read More

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

As generative AI adoption accelerates across enterprises, maintaining safe, responsible, and compliant AI interactions has never been more critical. Amazon Bedrock Guardrails provides configurable safeguards that help organizations build generative AI applications with industry-leading safety protections. With Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple foundation models (FMs), improving user experiences and standardizing safety controls across generative AI applications. Beyond Amazon Bedrock models, the service offers the flexible ApplyGuardrails API that enables you to assess text using your pre-configured guardrails without invoking FMs, allowing you to implement safety controls across generative AI applications—whether running on Amazon Bedrock or on other systems—at both input and output levels.
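
As a minimal sketch of that ApplyGuardrails capability, the following boto3 call assesses a piece of input text against a pre-configured guardrail without invoking a foundation model. The guardrail ARN and version are placeholders for your own guardrail.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# guardrailIdentifier and guardrailVersion are placeholders for your own guardrail.
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="arn:aws:bedrock:us-east-1:123456789012:guardrail/exampleguardrail",
    guardrailVersion="1",
    source="INPUT",  # assess user input; use "OUTPUT" for model responses
    content=[{"text": {"text": "Example user prompt to evaluate"}}],
)

print(response["action"])  # for example, GUARDRAIL_INTERVENED or NONE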

Today, we’re announcing a significant enhancement to Amazon Bedrock Guardrails: AWS Identity and Access Management (IAM) policy-based enforcement. This powerful capability enables security and compliance teams to establish mandatory guardrails for every model inference call, making sure organizational safety policies are consistently enforced across AI interactions. This feature enhances AI governance by enabling centralized control over guardrail implementation.

Challenges with building generative AI applications

Organizations deploying generative AI face critical governance challenges: content appropriateness, where models might produce undesirable responses to problematic prompts; safety concerns, with potential generation of harmful content even from innocent prompts; privacy protection requirements for handling sensitive information; and consistent policy enforcement across AI deployments.

Perhaps most challenging is making sure that appropriate safeguards are applied consistently across AI interactions within an organization, regardless of which team or individual is developing or deploying applications.

Amazon Bedrock Guardrails capabilities

Amazon Bedrock Guardrails enables you to implement safeguards in generative AI applications customized to your specific use cases and responsible AI policies. Guardrails currently supports six types of policies:

  • Content filters – Configurable thresholds across six harmful categories: hate, insults, sexual, violence, misconduct, and prompt injections
  • Denied topics – Definition of specific topics to be avoided in the context of an application
  • Sensitive information filters – Detection and removal of personally identifiable information (PII) and custom regex entities to protect user privacy
  • Word filters – Blocking of specific words in generative AI applications, such as harmful words, profanity, or competitor names and products
  • Contextual grounding checks – Detection and filtering of hallucinations in model responses by verifying if the response is properly grounded in the provided reference source and relevant to the user query
  • Automated reasoning – Prevention of factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with known facts and aren’t based on fabricated or inconsistent data

Policy-based enforcement of guardrails

Security teams often have organizational requirements to enforce the use of Amazon Bedrock Guardrails for every inference call to Amazon Bedrock. To support this requirement, Amazon Bedrock Guardrails provides the new IAM condition key bedrock:GuardrailIdentifier, which can be used in IAM policies to enforce the use of a specific guardrail for model inference. The condition key applies to model inference APIs such as InvokeModel and InvokeModelWithResponseStream, the actions used in the following policy examples.

The following diagram illustrates the policy-based enforcement workflow.

If the guardrail configured in your IAM policy doesn’t match the guardrail specified in the request, the request is rejected with an access denied exception, enforcing compliance with organizational policies.
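
For context, the following sketch shows how a caller passes a guardrail on an inference request; the Converse API used here requires the bedrock:InvokeModel permission governed by the policies below. With enforcement in place, a request whose guardrailConfig doesn’t match the enforced guardrail fails with an access denied exception. The model ID and guardrail ARN are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# The guardrail here must match the one enforced by the IAM policy,
# otherwise the call fails with an AccessDeniedException.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize our return policy."}]}],
    guardrailConfig={
        "guardrailIdentifier": "arn:aws:bedrock:us-east-1:123456789012:guardrail/exampleguardrail",
        "guardrailVersion": "1",
    },
)

print(response["output"]["message"]["content"][0]["text"])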

Policy examples

In this section, we present several policy examples demonstrating how to enforce guardrails for model inference.

Example 1: Enforce the use of a specific guardrail and its numeric version

The following example illustrates the enforcement of exampleguardrail and its numeric version 1 during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

The added explicit deny rejects requests that call the listed actions with any other GuardrailIdentifier and GuardrailVersion values, regardless of other permissions the user might have.

Example 2: Enforce the use of a specific guardrail and its draft version

The following example illustrates the enforcement of exampleguardrail and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 3: Enforce the use of a specific guardrail and its numeric versions

The following example illustrates the enforcement of exampleguardrail and its numeric versions during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 4: Enforce the use of a specific guardrail and its versions, including the draft

The following example illustrates the enforcement of exampleguardrail and its versions, including the draft, during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 5: Enforce the use of a specific guardrail and version pair from a list of guardrail and version pairs

The following example illustrates the enforcement of exampleguardrail1 and its version 1, exampleguardrail2 and its version 2, or exampleguardrail3 and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
            ]
        }
    ]
}
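
After attaching a policy like the ones above to a role, every model invocation from that role must pass the matching guardrail identifier and version, or the request is denied. The following is a minimal sketch of such a call using the AWS SDK for Python (Boto3); the guardrail ID, version, model ID, Region, and prompt are placeholder values to replace with your own.

import json

import boto3

# Bedrock Runtime client in the same Region as the guardrail (placeholder Region)
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder values; these must match the guardrail and version enforced by the IAM policy
guardrail_id = "exampleguardrail"  # guardrail ID or ARN
guardrail_version = "1"            # numeric version, or "DRAFT" for the working draft

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Summarize our refund policy."}]}
    ],
})

# guardrailIdentifier and guardrailVersion populate the bedrock:GuardrailIdentifier
# condition key that the IAM policy evaluates
response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=body,
    guardrailIdentifier=guardrail_id,
    guardrailVersion=guardrail_version,
)

print(json.loads(response["body"].read()))

If the identifier or version passed here doesn't match the policy's condition, the call fails with an AccessDenied error.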

Known limitations

When implementing policy-based guardrail enforcement, be aware of these limitations:

  • At the time of this writing, Amazon Bedrock Guardrails doesn’t support resource-based policies for cross-account access.
  • If a user assumes a role that has a specific guardrail configured using the bedrock:GuardrailIdentifier condition key, the user can strategically use input tags to help avoid having guardrail checks applied to certain parts of their prompt. Input tags allow users to mark specific sections of text that should be processed by guardrails, leaving other sections unprocessed. For example, a user could intentionally leave sensitive or potentially harmful content outside of the tagged sections, preventing those portions from being evaluated against the guardrail policies. However, regardless of how the prompt is structured or tagged, the guardrail will still be fully applied to the model’s response.
  • If a user has a role configured with a specific guardrail requirement (using the bedrock:GuardrailIdentifier condition), they shouldn’t use that same role to access services like Amazon Bedrock Knowledge Bases RetrieveAndGenerate or Amazon Bedrock Agents InvokeAgent. These higher-level services work by making multiple InvokeModel calls behind the scenes on the user’s behalf. Although some of these calls might include the required guardrail, others don’t. When the system attempts to make these guardrail-free calls using a role that requires guardrails, it results in AccessDenied errors, breaking the functionality of these services. To help avoid this issue, organizations should separate permissions—using different roles for direct model access with guardrails versus access to these composite Amazon Bedrock services.
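
The ApplyGuardrail statement in the example policies also allows content to be evaluated against the guardrail directly, without invoking a model, which can be useful when a role restricted to a specific guardrail needs to check text on its own. The following is a minimal sketch, assuming the same placeholder guardrail ID, version, and Region as in the previous example.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Evaluate a piece of user input against the guardrail (placeholder ID and version)
result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="exampleguardrail",
    guardrailVersion="1",
    source="INPUT",  # use "OUTPUT" to assess model responses instead
    content=[{"text": {"text": "Text to evaluate against the guardrail policies."}}],
)

# "action" indicates whether the guardrail intervened; "outputs" holds any masked or blocked text
print(result["action"], result.get("outputs"))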

Conclusion

The new IAM policy-based guardrail enforcement in Amazon Bedrock represents a crucial advancement in AI governance as generative AI becomes integrated into business operations. By enabling centralized policy enforcement, security teams can maintain consistent safety controls across AI applications regardless of who develops or deploys them, effectively mitigating risks related to harmful content, privacy violations, and bias. This approach offers significant advantages: it scales efficiently as organizations expand their AI initiatives without creating administrative bottlenecks, helps prevent technical debt by standardizing safety implementations, and enhances the developer experience by allowing teams to focus on innovation rather than compliance mechanics.

This capability demonstrates organizational commitment to responsible AI practices through comprehensive monitoring and audit mechanisms. Organizations can use model invocation logging in Amazon Bedrock to capture complete request and response data in Amazon CloudWatch Logs or Amazon Simple Storage Service (Amazon S3) buckets, including specific guardrail trace documentation showing when and how content was filtered. Combined with AWS CloudTrail integration that records guardrail configurations and policy enforcement actions, businesses can confidently scale their generative AI initiatives with appropriate safety mechanisms protecting their brand, customers, and data—striking the essential balance between innovation and ethical responsibility needed to build trust in AI systems.
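
As an example of the monitoring described above, model invocation logging can be enabled account-wide with a single API call. The following is a sketch using the AWS SDK for Python (Boto3); the Region, bucket name, and key prefix are placeholders, and the bucket policy must already allow Amazon Bedrock to deliver logs to it.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Deliver full request/response logs to a placeholder S3 bucket
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "s3Config": {"bucketName": "my-bedrock-invocation-logs", "keyPrefix": "bedrock-logs/"},
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)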

Get started today with Amazon Bedrock Guardrails and implement configurable safeguards that balance innovation with responsible AI governance across your organization.


About the Authors

Shyam Srinivasan is on the Amazon Bedrock Guardrails product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at AWS. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.

Read More

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

SQL is one of the key languages widely used across businesses, and it requires an understanding of databases and table metadata. This can be overwhelming for nontechnical users who lack proficiency in SQL. Today, generative AI can help bridge this knowledge gap for nontechnical users to generate SQL queries by using a text-to-SQL application. This application allows users to ask questions in natural language and then generates a SQL query for the user’s request.

Large language models (LLMs) are trained to generate accurate SQL queries for natural language instructions. However, off-the-shelf LLMs can’t be used without some modification. First, LLMs don’t have access to enterprise databases, and the models need to be customized to understand the specific database of an enterprise. Additionally, the complexity increases due to the presence of column synonyms and internal metrics.

The limitation of LLMs in understanding enterprise datasets and human context can be addressed using Retrieval Augmented Generation (RAG). In this post, we explore using Amazon Bedrock to create a text-to-SQL application using RAG. We use Anthropic’s Claude 3.5 Sonnet model to generate SQL queries and Amazon Titan Text Embeddings v2 for text embedding, accessing both models through Amazon Bedrock.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

This solution is primarily based on the following services:

  1. Foundational model – We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as our LLM to generate SQL queries for user inputs.
  2. Vector embeddings – We use Amazon Titan Text Embeddings v2 on Amazon Bedrock for embeddings. Embedding is the process by which text, images, and audio are given a numerical representation in a vector space. Embedding is usually performed by a machine learning (ML) model. The following diagram provides more details about embeddings.
  3. RAG – We use RAG to provide more context about table schema, column synonyms, and sample queries to the FM. RAG is a framework for building generative AI applications that can make use of enterprise data sources and vector databases to overcome knowledge limitations. RAG works by using a retriever module to find relevant information from an external data store in response to a user’s prompt. This retrieved data is used as context, combined with the original prompt, to create an expanded prompt that is passed to the LLM. The language model then generates a SQL query that incorporates the enterprise knowledge. The following diagram illustrates the RAG framework.
  4. Streamlit – This open source Python library makes it straightforward to create and share beautiful, custom web apps for ML and data science. In just a few minutes, you can build powerful data apps using only Python.

The following diagram shows the solution architecture.

We need to update the LLM with the enterprise-specific database information. This makes sure that the model can correctly understand the database and generate responses tailored to the enterprise’s data schema and tables. There are multiple file formats available for storing this information, such as JSON, PDF, TXT, and YAML. In our case, we created JSON files to store table schema, table descriptions, columns with synonyms, and sample queries. JSON’s inherently structured format allows for clear and organized representation of complex data such as table schemas, column definitions, synonyms, and sample queries. This structure facilitates quick parsing and manipulation of data in most programming languages, reducing the need for custom parsing logic.

There can be multiple tables with similar information, which can lower the model’s accuracy. To increase the accuracy, we categorized the tables into four different types based on the schema and created four JSON files to store the different tables. We added a dropdown menu with four choices. Each choice represents one of these four categories and is linked to an individual JSON file. After the user selects a value from the dropdown menu, the relevant JSON file is passed to Amazon Titan Text Embeddings v2, which converts the text into embeddings. These embeddings are stored in a vector database for faster retrieval.

We added the prompt template to the FM to define the roles and responsibilities of the model. You can add additional information such as which SQL engine should be used to generate the SQL queries.

When the user provides input through the chat prompt, we use similarity search to find the relevant table metadata from the vector database for the user’s query. The user input is combined with the relevant table metadata and the prompt template, and the result is passed to the FM as a single input. The FM generates the SQL query based on this final input.

To evaluate the model’s accuracy and provide a tracking mechanism, we store every user input and output in Amazon Simple Storage Service (Amazon S3).

Prerequisites

To create this solution, complete the following prerequisites:

  1. Sign up for an AWS account if you don’t already have one.
  2. Enable model access for Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
  3. Create an S3 bucket named simplesql-logs-****, replacing **** with your unique identifier. Bucket names are globally unique across the entire Amazon S3 service (see the sketch after this list for a programmatic example).
  4. Choose your testing environment. We recommend that you test in Amazon SageMaker Studio, although you can use other available local environments.
  5. Install the following libraries to execute the code:
    pip install streamlit
    pip install jq
    pip install openpyxl
    pip install "faiss-cpu"
    pip install langchain
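
The following is a minimal sketch for creating the logging bucket from prerequisite 3 using the AWS SDK for Python (Boto3); the bucket name suffix is a placeholder to replace with your own unique identifier.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Placeholder name; bucket names are globally unique, so replace the suffix with your own identifier
bucket_name = "simplesql-logs-your-unique-suffix"

# In us-east-1, no CreateBucketConfiguration is needed; other Regions require
# CreateBucketConfiguration={"LocationConstraint": "<region>"}
s3.create_bucket(Bucket=bucket_name)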

Procedure

There are three main components in this solution:

  1. JSON files store the table schema and configure the LLM
  2. Vector indexing using Amazon Bedrock
  3. Streamlit for the front-end UI

You can download all three components and code snippets provided in the following section.

Generate the table schema

We use the JSON format to store the table schema. To provide more inputs to the model, we added a table name and its description, columns and their synonyms, and sample queries in our JSON files. Create a JSON file named Table_Schema_A.json by copying the following code into it:

{
  "tables": [
    {
      "separator": "table_1",
      "name": "schema_a.orders",
      "schema": "CREATE TABLE schema_a.orders (order_id character varying(200), order_date timestamp without time zone, customer_id numeric(38,0), order_status character varying(200), item_id character varying(200) );",
      "description": "This table stores information about orders placed by customers.",
      "columns": [
        {
          "name": "order_id",
          "description": "unique identifier for orders.",
          "synonyms": ["order id"]
        },
        {
          "name": "order_date",
          "description": "timestamp when the order was placed",
          "synonyms": ["order time", "order day"]
        },
        {
          "name": "customer_id",
          "description": "Id of the customer associated with the order",
          "synonyms": ["customer id", "userid"]
        },
        {
          "name": "order_status",
          "description": "current status of the order, sample values are: shipped, delivered, cancelled",
          "synonyms": ["order status"]
        },
        {
          "name": "item_id",
          "description": "item associated with the order",
          "synonyms": ["item id"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(order_id) as total_orders from schema_a.orders where customer_id = '9782226' and order_status = 'cancelled'",
          "user_input": "Count of orders cancelled by customer id: 978226"
        }
      ]
    },
    {
      "separator": "table_2",
      "name": "schema_a.customers",
      "schema": "CREATE TABLE schema_a.customers (customer_id numeric(38,0), customer_name character varying(200), registration_date timestamp without time zone, country character varying(200) );",
      "description": "This table stores the details of customers.",
      "columns": [
        {
          "name": "customer_id",
          "description": "Id of the customer, unique identifier for customers",
          "synonyms": ["customer id"]
        },
        {
          "name": "customer_name",
          "description": "name of the customer",
          "synonyms": ["name"]
        },
        {
          "name": "registration_date",
          "description": "registration timestamp when customer registered",
          "synonyms": ["sign up time", "registration time"]
        },
        {
          "name": "country",
          "description": "customer's original country",
          "synonyms": ["location", "customer's region"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(customer_id) as total_customers from schema_a.customers where country = 'India' and to_char(registration_date, 'YYYY') = '2024'",
          "user_input": "The number of customers registered from India in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id where c.customer_name = 'john' and to_char(o.order_date, 'YYYY-MM') = '2024-01'",
          "user_input": "Total orders placed in January 2024 by customer name john"
        }
      ]
    },
    {
      "separator": "table_3",
      "name": "schema_a.items",
      "schema": "CREATE TABLE schema_a.items (item_id character varying(200), item_name character varying(200), listing_date timestamp without time zone );",
      "description": "This table stores the complete details of items listed in the catalog.",
      "columns": [
        {
          "name": "item_id",
          "description": "Id of the item, unique identifier for items",
          "synonyms": ["item id"]
        },
        {
          "name": "item_name",
          "description": "name of the item",
          "synonyms": ["name"]
        },
        {
          "name": "listing_date",
          "description": "listing timestamp when the item was registered",
          "synonyms": ["listing time", "registration time"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(item_id) as total_items from schema_a.items where to_char(listing_date, 'YYYY') = '2024'",
          "user_input": "how many items are listed in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id join schema_a.items i on o.item_id = i.item_id where c.customer_name = 'john' and i.item_name = 'iphone'",
          "user_input": "how many orders are placed for item 'iphone' by customer name john"
        }
      ]
    }
  ]
}

Configure the LLM and initialize vector indexing using Amazon Bedrock

Create a Python file named library.py by following these steps:

  1. Add the following import statements to add the necessary libraries:
    import boto3  # AWS SDK for Python
    from langchain_community.document_loaders import JSONLoader  # Utility to load JSON files
    from langchain_community.chat_models import BedrockChat  # Chat interface for Bedrock LLM
    from langchain.embeddings import BedrockEmbeddings  # Embeddings for Titan model
    from langchain.memory import ConversationBufferWindowMemory  # Memory to store chat conversations
    from langchain.indexes import VectorstoreIndexCreator  # Create vector indexes
    from langchain.vectorstores import FAISS  # Vector store using FAISS library
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split text into chunks
    from langchain.chains import ConversationalRetrievalChain  # Conversational retrieval chain
    from langchain.callbacks.manager import CallbackManager

  2. Initialize the Amazon Bedrock client and configure Anthropic’s Claude 3.5 Sonnet. You can limit the number of output tokens to optimize the cost:
    # Create a Boto3 client for Bedrock Runtime
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    
    # Function to get the LLM (Large Language Model)
    def get_llm():
        model_kwargs = {  # Configuration for Anthropic model
            "max_tokens": 512,  # Maximum number of tokens to generate
            "temperature": 0.2,  # Sampling temperature for controlling randomness
            "top_k": 250,  # Consider the top k tokens for sampling
            "top_p": 1,  # Consider the top p probability tokens for sampling
            "stop_sequences": ["nnHuman:"]  # Stop sequence for generation
        }
        # Create a callback manager with a default callback handler
        callback_manager = CallbackManager([])
        
        llm = BedrockChat(
            model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # Set the foundation model
            model_kwargs=model_kwargs,  # Pass the configuration to the model
            callback_manager=callback_manager
        )
    
        return llm

  3. Create and return an index for the given schema type. This approach is an efficient way to filter tables and provide relevant input to the model:
    # Function to load the schema file based on the schema type
    def load_schema_file(schema_type):
        if schema_type == 'Schema_Type_A':
            schema_file = "Table_Schema_A.json"  # Path to Schema Type A
        elif schema_type == 'Schema_Type_B':
            schema_file = "Table_Schema_B.json"  # Path to Schema Type B
        elif schema_type == 'Schema_Type_C':
            schema_file = "Table_Schema_C.json"  # Path to Schema Type C
        return schema_file
    
    # Function to get the vector index for the given schema type
    def get_index(schema_type):
        embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                       client=bedrock_runtime)  # Initialize embeddings
    
        db_schema_loader = JSONLoader(
            file_path=load_schema_file(schema_type),  # Load the schema file
            # file_path="Table_Schema_RP.json",  # Uncomment to use a different file
            jq_schema='.',  # Select the entire JSON content
            text_content=False)  # Load the content as JSON rather than plain text
    
        db_schema_text_splitter = RecursiveCharacterTextSplitter(  # Create a text splitter
            separators=["separator"],  # Split chunks at the "separator" string
            chunk_size=10000,  # Divide into 10,000-character chunks
            chunk_overlap=100  # Allow 100 characters to overlap with previous chunk
        )
    
        db_schema_index_creator = VectorstoreIndexCreator(
            vectorstore_cls=FAISS,  # Use FAISS vector store
            embedding=embeddings,  # Use the initialized embeddings
            text_splitter=db_schema_text_splitter  # Use the text splitter
        )
    
        db_index_from_loader = db_schema_index_creator.from_loaders([db_schema_loader])  # Create index from loader
    
        return db_index_from_loader

  4. Use the following function to create and return memory for the chat session:
    # Function to get the memory for storing chat conversations
    def get_memory():
        memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True)  # Create memory
    
        return memory

  5. Use the following prompt template to generate SQL queries based on user input:
    # Template for the question prompt
    template = """ Read table information from the context. Each table contains the following information:
    - Name: The name of the table
    - Description: A brief description of the table
    - Columns: The columns of the table, listed under the 'columns' key. Each column contains:
      - Name: The name of the column
      - Description: A brief description of the column
      - Type: The data type of the column
      - Synonyms: Optional synonyms for the column name
    - Sample Queries: Optional sample queries for the table, listed under the 'sample_queries' key
    
    Given this structure, your task is to provide a SQL query using Amazon Redshift syntax that would retrieve the data for the following question. The produced query should be functional, efficient, and adhere to best practices in SQL query optimization.
    
    Question: {}
    """

  6. Use the following function to get a response from the RAG chat model:
    # Function to get the response from the conversational retrieval chain
    def get_rag_chat_response(input_text, memory, index):
        llm = get_llm()  # Get the LLM
    
        conversation_with_retrieval = ConversationalRetrievalChain.from_llm(
            llm, index.vectorstore.as_retriever(), memory=memory, verbose=True)  # Create conversational retrieval chain
    
        chat_response = conversation_with_retrieval.invoke({"question": template.format(input_text)})  # Invoke the chain
    
        return chat_response['answer']  # Return the answer

Configure Streamlit for the front-end UI

Create the file app.py by following these steps:

  1. Import the necessary libraries:
    import streamlit as st
    import library as lib
    from io import StringIO
    import boto3
    from datetime import datetime
    import csv
    import pandas as pd
    from io import BytesIO

  2. Initialize the S3 client:
    s3_client = boto3.client('s3')
    bucket_name = 'simplesql-logs-****'
    # Replace 'simplesql-logs-****' with your S3 bucket name
    log_file_key = 'logs.xlsx'

  3. Configure Streamlit for UI:
    st.set_page_config(page_title="Your App Name")
    st.title("Your App Name")
    
    # Define the available menu items for the sidebar
    menu_items = ["Home", "How To", "Generate SQL Query"]
    
    # Create a sidebar menu using radio buttons
    selected_menu_item = st.sidebar.radio("Menu", menu_items)
    
    # Home page content
    if selected_menu_item == "Home":
        # Display introductory information about the application
        st.write("This application allows you to generate SQL queries from natural language input.")
        st.write("")
        st.write("**Get Started** by selecting the button Generate SQL Query !")
        st.write("")
        st.write("")
        st.write("**Disclaimer :**")
        st.write("- Model's response depends on user's input (prompt). Please visit How-to section for writing efficient prompts.")
               
    # How-to page content
    elif selected_menu_item == "How To":
        # Provide guidance on how to use the application effectively
        st.write("The model's output completely depends on the natural language input. Below are some examples which you can keep in mind while asking the questions.")
        st.write("")
        st.write("")
        st.write("")
        st.write("")
        st.write("**Case 1 :**")
        st.write("- **Bad Input :** Cancelled orders")
        st.write("- **Good Input :** Write a query to extract the cancelled order count for the items which were listed this year")
        st.write("- It is always recommended to add required attributes, filters in your prompt.")
        st.write("**Case 2 :**")
        st.write("- **Bad Input :** I am working on XYZ project. I am creating a new metric and need the sales data. Can you provide me the sales at country level for 2023 ?")
        st.write("- **Good Input :** Write an query to extract sales at country level for orders placed in 2023 ")
        st.write("- Every input is processed as tokens. Do not provide un-necessary details as there is a cost associated with every token processed. Provide inputs only relevant to your query requirement.") 

  4. Generate the query:
    # SQL-AI page content
    elif selected_menu_item == "Generate SQL Query":
        # Define the available schema types for selection
        schema_types = ["Schema_Type_A", "Schema_Type_B", "Schema_Type_C"]
        schema_type = st.sidebar.selectbox("Select Schema Type", schema_types)

  5. Use the following for SQL generation:
    if schema_type:
            # Initialize or retrieve conversation memory from session state
            if 'memory' not in st.session_state:
                st.session_state.memory = lib.get_memory()
    
            # Initialize or retrieve chat history from session state
            if 'chat_history' not in st.session_state:
                st.session_state.chat_history = []
    
            # Initialize or update vector index based on selected schema type
            if 'vector_index' not in st.session_state or 'current_schema' not in st.session_state or st.session_state.current_schema != schema_type:
                with st.spinner("Indexing document..."):
                    # Create a new index for the selected schema type
                    st.session_state.vector_index = lib.get_index(schema_type)
                    # Update the current schema in session state
                    st.session_state.current_schema = schema_type
    
            # Display the chat history
            for message in st.session_state.chat_history:
                with st.chat_message(message["role"]):
                    st.markdown(message["text"])
    
            # Get user input through the chat interface, set the max limit to control the input tokens.
            input_text = st.chat_input("Chat with your bot here", max_chars=100)
            
            if input_text:
                # Display user input in the chat interface
                with st.chat_message("user"):
                    st.markdown(input_text)
    
                # Add user input to the chat history
                st.session_state.chat_history.append({"role": "user", "text": input_text})
    
                # Generate chatbot response using the RAG model
                chat_response = lib.get_rag_chat_response(
                    input_text=input_text, 
                    memory=st.session_state.memory,
                    index=st.session_state.vector_index
                )
                
                # Display chatbot response in the chat interface
                with st.chat_message("assistant"):
                    st.markdown(chat_response)
    
                # Add chatbot response to the chat history
                st.session_state.chat_history.append({"role": "assistant", "text": chat_response})

  6. Log the conversations to the S3 bucket:
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
                try:
                    # Attempt to download the existing log file from S3
                    log_file_obj = s3_client.get_object(Bucket=bucket_name, Key=log_file_key)
                    log_file_content = log_file_obj['Body'].read()
                    df = pd.read_excel(BytesIO(log_file_content))
    
                except s3_client.exceptions.NoSuchKey:
                    # If the log file doesn't exist, create a new DataFrame
                    df = pd.DataFrame(columns=["User Input", "Model Output", "Timestamp", "Schema Type"])
    
                # Create a new row with the current conversation data
                new_row = pd.DataFrame({
                    "User Input": [input_text], 
                    "Model Output": [chat_response], 
                    "Timestamp": [timestamp],
                    "Schema Type": [schema_type]
                })
                # Append the new row to the existing DataFrame
                df = pd.concat([df, new_row], ignore_index=True)
                
                # Prepare the updated DataFrame for S3 upload
                output = BytesIO()
                df.to_excel(output, index=False)
                output.seek(0)
                
                # Upload the updated log file to S3
                s3_client.put_object(Body=output.getvalue(), Bucket=bucket_name, Key=log_file_key)
    

Test the solution

Open your terminal and invoke the following command to run the Streamlit application.

streamlit run app.py

To visit the application using your browser, navigate to localhost (by default, Streamlit runs on port 8501, so http://localhost:8501).

To visit the application using SageMaker, copy your notebook URL and replace ‘default/lab’ in the URL with ‘default/proxy/8501/’. It should look something like the following:

https://your_sagemaker_lab_url.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/

Choose Generate SQL Query to open the chat window. Test your application by asking questions in natural language. We tested the application with the following questions, and it generated accurate SQL queries.

Count of orders placed from India last month?
Write a query to extract the canceled order count for the items that were listed this year.
Write a query to extract the top 10 item names having highest order for each country.

Troubleshooting tips

Use the following solutions to address errors:

Error – An error raised by the inference endpoint: An error occurred (AccessDeniedException) when calling the InvokeModel operation. You don’t have access to the model with the specified model ID.
Solution – Make sure you have access to the FMs used in this solution on Amazon Bedrock: Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet.

Error – app.py does not exist
Solution – Make sure your JSON file and Python files are in the same folder and you’re invoking the command in the same folder.

Error – No module named streamlit
Solution – Open the terminal and install the streamlit module by running the command pip install streamlit

Error – An error occurred (NoSuchBucket) when calling the GetObject operation. The specified bucket doesn’t exist.
Solution – Verify your bucket name in the app.py file and update the name based on your S3 bucket name.

Clean up

Clean up the resources you created to avoid incurring charges. To clean up your S3 bucket, refer to Emptying a bucket.

Conclusion

In this post, we showed how Amazon Bedrock can be used to create a text-to-SQL application based on enterprise-specific datasets. We used Amazon S3 to store the outputs generated by the model for corresponding inputs. These logs can be used to test the accuracy and enhance the context by providing more details in the knowledge base. With the aid of a tool like this, you can create automated solutions that are accessible to nontechnical users, empowering them to interact with data more efficiently.

Ready to get started with Amazon Bedrock? Start learning with these interactive workshops.

For more information on SQL generation, refer to these posts:

We recently launched a managed NL2SQL module to retrieve structured data in Amazon Bedrock Knowledge Bases. To learn more, visit Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the Author

Rajendra Choudhary is a Sr. Business Analyst at Amazon. With 7 years of experience in developing data solutions, he possesses profound expertise in data visualization, data modeling, and data engineering. He is passionate about supporting customers by leveraging generative AI–based solutions. Outside of work, Rajendra is an avid foodie and music enthusiast, and he enjoys swimming and hiking.

Read More