December 2024 – Vedere AI

Dancing on Their Own: NVIDIA-Powered Robots to Adore From 2024

Artificial intelligence made big moves this year — as did the robots the technology is working behind.

From Silicon Valley to India, Boston to Japan, here are some of the autonomous machines and robotics technologies, powered by NVIDIA AI, that offered helping hands in 2024.

Clearbot Tidies Up Ocean Litter

Hong Kong- and India-based Clearbot, a member of the NVIDIA Inception program for cutting-edge startups, is making a splash with its autonomous trash-collection boats enabled by the NVIDIA Jetson platform for edge AI and robotics.

With plastic making up at least 85% of marine waste across the globe, Clearbot aims to remove waste from waterways before it gets into the seas.

The startup’s autonomous ocean vessels are equipped with two cameras — one for navigation and another for waste identification of what boats have scooped up.

Clearbot trained its garbage- and obstacle-identifying AI models using NVIDIA accelerated computing. The energy-efficient NVIDIA Jetson Xavier NX system-on-module allows the battery-powered water-cleaning boats to collect for eight hours at a time.

Figure Unveils Latest Humanoid Robot

Silicon Valley-based Figure introduced its Figure 02 conversational humanoid robot that taps into the NVIDIA Omniverse platform and accelerated computing for fully autonomous tasks.

New human-scale hands, six RGB cameras and perception AI models trained with synthetic data generated in the NVIDIA Isaac Sim robotics developer simulation platform enable Figure 02 to perform high-precision pick-and-place tasks required for smart manufacturing applications.

The company recently tested Figure 02 for data collection and use-case training at BMW Group’s Spartanburg, South Carolina, production line.

Figure is a part of NVIDIA Inception and among the initial members to join the NVIDIA Humanoid Robot Developer Program, which provides early access to advanced tools and computing technologies for humanoid robot development. This includes the latest releases of NVIDIA Isaac Sim, Isaac Lab, NIM microservices, OSMO, Jetson Thor and Project GR00T general-purpose humanoid foundation models.

ORBIT-Surgical Researchers Prep Robots for Surgery

The ORBIT-Surgical simulation framework — developed by researchers from the University of Toronto, UC Berkeley, ETH Zürich, Georgia Tech and NVIDIA — helps train robots that could augment the skills of surgical teams while reducing surgeons’ cognitive loads.

It supports more than a dozen maneuvers inspired by the training curriculum for minimally invasive surgeries.

Presented at this year’s ICRA robotics conference in Japan, the needle-moving robotics research — based on Isaac Sim and Omniverse — also introduced benchmarks for one-handed tasks such as picking up a piece of gauze, inserting a shunt into a blood vessel and lifting a suture needle to a specific position.

The benchmarks also include two-handed tasks, like passing a threaded needle through a ring pole.

Zordi Robots Grow Strawberries Indoors

Boston-based autonomous agriculture startup Zordi — with farms in southern New Jersey and western New York — is tapping into robotics to grow strawberries indoors.

Autonomous greenhouse systems can support regional markets across the globe, cutting down on the carbon footprint for transportation and providing fresher, better-tasting fruits grown more sustainably.

Zordi uses two types of robots within its operations. One is a scouting robot for gathering information on plant health using foundational models, the other a harvesting robot for delicately picking and placing fruits and handling other tasks.

The startup is exploring NVIDIA Jetson AGX Orin modules for gathering sensor data and running its AI models.

Cutting-edge companies weren’t the only ones to drive major robotics advancements this year. For example, high school and university students are developing robot guide dogs for the visually impaired and training robots to perform 1,000 household chores.

Learn more about the latest in generative AI and autonomous machines — including NVIDIA’s three-computer approach to robotics — by joining NVIDIA at CES, running Jan. 6-10 in Las Vegas.

Research Galore From 2024: Recapping AI Advancements in 3D Simulation, Climate Science and Audio Engineering

The pace of technology innovation has accelerated in the past year, most dramatically in AI. And in 2024, there was no better place to be a part of creating those breakthroughs than NVIDIA Research.

NVIDIA Research is comprised of hundreds of extremely bright people pushing the frontiers of knowledge, not just in AI, but across many areas of technology.

In the past year, NVIDIA Research laid the groundwork for future improvements in GPU performance with major research discoveries in circuits, memory architecture and sparse arithmetic. The team’s invention of novel graphics techniques continues to raise the bar for real-time rendering. And we developed new methods for improving the efficiency of AI — requiring less energy, taking fewer GPU cycles and delivering even better results.

But the most exciting developments of the year have been in generative AI.

We’re now able to generate, not just images and text, but 3D models, music and sounds. We’re also developing better control over what is generated: to generate realistic humanoid motion and to generate sequences of images with consistent subjects.

The application of generative AI to science has resulted in high-resolution weather forecasts that are more accurate than conventional numerical weather models. AI models have given us the ability to accurately predict how blood glucose levels respond to different foods. Embodied generative AI is being used to develop autonomous vehicles and robots.

And that was just this year. What follows is a deeper dive into some of NVIDIA Research’s greatest generative AI work in 2024. Of course, we continue to develop new models and methods for AI, and expect even more exciting results next year.

ConsiStory: AI-Generated Images With Main Character Energy

ConsiStory, a collaboration between researchers at NVIDIA and Tel Aviv University, makes it easier to generate multiple images with a consistent main character — an essential capability for storytelling use cases such as illustrating a comic strip or developing a storyboard.

The researchers’ approach introduced a technique called subject-driven shared attention, which reduces the time it takes to generate consistent imagery from 13 minutes to around 30 seconds.

Read the ConsiStory paper.

Panels of multiple AI-generated images featuring the same character — ConsiStory is capable of generating a series of images featuring the same character.

Edify 3D: Generative AI Enters a New Dimension

NVIDIA Edify 3D is a foundation model that enables developers and content creators to quickly generate 3D objects that can be used to prototype ideas and populate virtual worlds.

Edify 3D helps creators quickly ideate, lay out and conceptualize immersive environments with AI-generated assets. Novice and experienced content creators can use text and image prompts to harness the model, which is now part of the NVIDIA Edify multimodal architecture for developing visual generative AI.

Read the Edify 3D paper and watch the video on YouTube.

Fugatto: Flexible AI Sound Machine for Music, Voices and More

A team of NVIDIA researchers recently unveiled Fugatto, a foundational generative AI model that can create or transform any mix of music, voices and sounds based on text or audio prompts.

The model can, for example, create music snippets based on text prompts, add or remove instruments from existing songs, modify the accent or emotion in a voice recording, or generate completely novel sounds. It could be used by music producers, ad agencies, video game developers or creators of language learning tools.

Read the Fugatto paper.

GluFormer: AI Predicts Blood Sugar Levels Four Years Out

Researchers from the Weizmann Institute of Science, Tel Aviv-based startup Pheno.AI and NVIDIA led the development of GluFormer, an AI model that can predict an individual’s future glucose levels and other health metrics based on past glucose monitoring data.

The researchers showed that, after adding dietary intake data into the model, GluFormer can also predict how a person’s glucose levels will respond to specific foods and dietary changes, enabling precision nutrition. The research team validated GluFormer across 15 other datasets and found it generalizes well to predict health outcomes for other groups, including those with prediabetes, type 1 and type 2 diabetes, gestational diabetes and obesity.

Read the GluFormer paper.

LATTE3D: Enabling Near-Instant Generation, From Text to 3D Shape

Another 3D generator released by NVIDIA Research this year is LATTE3D, which converts text prompts into 3D representations within a second — like a speedy, virtual 3D printer. Crafted in a popular format used for standard rendering applications, the generated shapes can be easily served up in virtual environments for developing video games, ad campaigns, design projects or virtual training grounds for robotics.

Read the LATTE3D paper.

MaskedMimic: Reconstructing Realistic Movement for Humanoid Robots

To advance the development of humanoid robots, NVIDIA researchers introduced MaskedMimic, an AI framework that applies inpainting — the process of reconstructing complete data from an incomplete, or masked, view — to descriptions of motion.

Given partial information, such as a text description of movement, or head and hand position data from a virtual reality headset, MaskedMimic can fill in the blanks to infer full-body motion. It’s become part of NVIDIA Project GR00T, a research initiative to accelerate humanoid robot development.

Read the MaskedMimic paper.

StormCast: Boosting Weather Prediction, Climate Simulation

In the field of climate science, NVIDIA Research announced StormCast, a generative AI model for emulating atmospheric dynamics. While other machine learning models trained on global data have a spatial resolution of about 30 kilometers and a temporal resolution of six hours, StormCast achieves a 3-kilometer, hourly scale.

The researchers trained StormCast on approximately three-and-a-half years of NOAA climate data from the central U.S. When applied with precipitation radars, StormCast offers forecasts with lead times of up to six hours that are up to 10% more accurate than the U.S. National Oceanic and Atmospheric Administration’s state-of-the-art 3-kilometer regional weather prediction model.

Read the StormCast paper, written in collaboration with researchers from Lawrence Berkeley National Laboratory and the University of Washington.

NVIDIA Research Sets Records in AI, Autonomous Vehicles, Robotics

Through 2024, models that originated in NVIDIA Research set records across benchmarks for AI training and inference, route optimization, autonomous driving and more.

NVIDIA cuOpt, an optimization AI microservice used for logistics improvements, has 23 world-record benchmarks. The NVIDIA Blackwell platform demonstrated world-class performance on MLPerf industry benchmarks for AI training and inference.

In the field of autonomous vehicles, Hydra-MDP, an end-to-end autonomous driving framework by NVIDIA Research, achieved first place on the End-To-End Driving at Scale track of the Autonomous Grand Challenge at CVPR 2024.

In robotics, FoundationPose, a unified foundation model for 6D object pose estimation and tracking, obtained first place on the BOP leaderboard for model-based pose estimation of unseen objects.

Learn more about NVIDIA Research, which has hundreds of scientists and engineers worldwide. NVIDIA Research teams are focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.

Have You Heard? 5 AI Podcast Episodes Listeners Loved in 2024

NVIDIA’s AI Podcast gives listeners the inside scoop on the ways AI is transforming nearly every industry. Since the show’s debut in 2016, it’s garnered more than 6 million listens across 200-plus episodes, covering how generative AI is used to power applications including assistive technology for the visually impaired, wildfire alert systems and the Roblox online game platform. Here are the top five episodes of 2024:

Driving Energy Efficiency, Sustainability

The AI Podcast · NVIDIA’s Josh Parker on How AI and Accelerated Computing Drive Sustainability – Ep. 234

AI and accelerated computing are key tools in the push for sustainability. Joshua Parker, senior director of corporate sustainability at NVIDIA, discusses how these technologies are contributing to a more sustainable future by improving energy efficiency and helping address climate challenges.

Zooming In on AI for a Productivity Boost

The AI Podcast · Zoom CTO Xuedong “XD” Huang on How AI Revolutionizes Productivity – Ep. 235

Zoom helped change the way people work, playing a pivotal role for many during the COVID-19 pandemic. The company’s CTO, Xuedong Huang, shares how the company is reshaping productivity with AI.

Driving the Future of Computing

The AI Podcast · How the Ohio Supercomputer Center Drives the Future of Computing – Ep. 213

Alan Chalker, the director of strategic programs at the Ohio Supercomputer Center, shares how the center empowers Ohio higher education institutions and industries with accessible, reliable and secure computational services — and works with client companies like NASCAR, which is simulating race car designs virtually.

Supercharging Cinematic Content Creation

The AI Podcast · Exploring Filmmaking with Cuebric’s AI: Insights from Pinar Seyhan Demirdag – Ep. 214

Generative AI can help anyone become a content creator by rapidly bringing ideas to life. Pinar Seyhan Demirdag, cofounder and CEO of Cuebric, discusses how the company’s AI-powered application makes high-quality production more accessible and affordable.

Bringing Clarity to Cardiology

The AI Podcast · Cardiac Clarity: Dr. Keith Channon Talks Revolutionizing Heart Health With AI – Ep. 212

Dr. Keith Channon, cofounder and chief medical officer at health tech startup Caristo Diagnostics, discusses an AI-powered solution for detecting coronary inflammation — a key indicator of heart disease — in cardiac CT scans. These insights could help physicians improve treatment plans and risk predictions.

Subscribe to the AI Podcast

Get the AI Podcast through Amazon Music, Apple Podcasts, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, SoundCloud, Spotify, Stitcher and TuneIn.

Optimizing costs of generative AI applications on AWS

The report The economic potential of generative AI: The next productivity frontier, published by McKinsey & Company, estimates that generative AI could add an equivalent of $2.6 trillion to $4.4 trillion in value to the global economy. The largest value will be added across four areas: customer operations, marketing and sales, software engineering, and R&D.

The potential for such large business value is galvanizing tens of thousands of enterprises to build their generative AI applications in AWS. However, many product managers and enterprise architect leaders want a better understanding of the costs, cost-optimization levers, and sensitivity analysis.

This post addresses these cost considerations so you can optimize your generative AI costs in AWS.

The post assumes a basic familiarity of foundation model (FMs) and large language models (LLMs), tokens, vector embeddings, and vector databases in AWS. With Retrieval Augmented Generation (RAG) being one of the most common frameworks used in generative AI solutions, the post explains costs in the context of a RAG solution and respective optimization pillars on Amazon Bedrock.

In Part 2 of this series, we will cover how to estimate business value and the influencing factors.

Cost and performance optimization pillars

Designing performant and cost-effective generative AI applications is essential for realizing the full potential of this transformative technology and driving widespread adoption within your organization.

Forecasting and managing costs and performance in generative AI applications is driven by the following optimization pillars:

Model selection, choice, and customization – We define these as follows:
- Model selection – This process involves identifying the optimal model that meets a wide variety of use cases, followed by model validation, where you benchmark against high-quality datasets and prompts to identify successful model contenders.
- Model choice – This refers to the choice of an appropriate model because different models have varying pricing and performance attributes.
- Model customization – This refers to choosing the appropriate techniques to customize the FMs with training data to optimize the performance and cost-effectiveness according to business-specific use cases.
Token usage – Analyzing token usage consists of the following:
- Token count – The cost of using a generative AI model depends on the number of tokens processed. This can directly impact the cost of an operation.
- Token limits – Understanding token limits and what drives token count, and putting guardrails in place to limit token count can help you optimize token costs and performance.
- Token caching – Caching at the application layer or LLM layer for commonly asked user questions can help reduce the token count and improve performance.
Inference pricing plan and usage patterns – We consider two pricing options:
- On-Demand – Ideal for most models, with charges based on the number of input/output tokens, with no guaranteed token throughput.
- Provisioned Throughput – Ideal for workloads demanding guaranteed throughput, but with relatively higher costs.
Miscellaneous factors – Additional factors can include:
- Security guardrails – Applying content filters for personally identifiable information (PII), harmful content, undesirable topics, and detecting hallucinations improves the safety of your generative AI application. These filters can perform and scale independently of LLMs and have costs that are directly proportional to the number of filters and the tokens examined.
- Vector database – The vector database is a critical component of most generative AI applications. As the amount of data usage in your generative AI application grows, vector database costs can also grow.
- Chunking strategy – Chunking strategies such as fixed size chunking, hierarchical chunking, or semantic chunking can influence the accuracy and costs of your generative AI application.

Let’s dive deeper to examine these factors and associated cost-optimization tips.

Retrieval Augmented Generation

RAG helps an LLM answer questions specific to your corporate data, even though the LLM was never trained on your data.

As illustrated in the following diagram, the generative AI application reads your corporate trusted data sources, chunks it, generates vector embeddings, and stores the embeddings in a vector database. The vectors and data stored in a vector database are often called a knowledge base.

The generative AI application uses the vector embeddings to search and retrieve chunks of data that are most relevant to the user’s question and augment the question to generate the LLM response. The following diagram illustrates this workflow.

The workflow consists of the following steps:

A user asks a question using the generative AI application.
A request to generate embeddings is sent to the LLM.
The LLM returns embeddings to the application.
These embeddings are searched against vector embeddings stored in a vector database (knowledge base).
The application receives context relevant to the user question from the knowledge base.
The application sends the user question and the context to the LLM.
The LLM uses the context to generate an accurate and grounded response.
The application sends the final response back to the user.

Amazon Bedrock is a fully managed service providing access to high-performing FMs from leading AI providers through a unified API. It offers a wide range of LLMs to choose from.

In the preceding workflow, the generative AI application invokes Amazon Bedrock APIs to send text to an LLM like Amazon Titan Embeddings V2 to generate text embeddings, and to send prompts to an LLM like Anthropic’s Claude Haiku or Meta Llama to generate a response.

The generated text embeddings are stored in a vector database such as Amazon OpenSearch Service, Amazon Relational Database Service (Amazon RDS), Amazon Aurora, or Amazon MemoryDB.

A generative AI application such as a virtual assistant or support chatbot might need to carry a conversation with users. A multi-turn conversation requires the application to store a per-user question-answer history and send it to the LLM for additional context. This question-answer history can be stored in a database such as Amazon DynamoDB.

The generative AI application could also use Amazon Bedrock Guardrails to detect off-topic questions, ground responses to the knowledge base, detect and redact PII information, and detect and block hate or violence-related questions and answers.

Now that we have a good understanding of the various components in a RAG-based generative AI application, let’s explore how these factors influence costs while running your application in AWS using RAG.

Directional costs for small, medium, large, and extra large scenarios

Consider an organization that wants to help their customers with a virtual assistant that can answer their questions any time with a high degree of accuracy, performance, consistency, and safety. The performance and cost of the generative AI application depends directly on a few major factors in the environment, such as the velocity of questions per minute, the volume of questions per day (considering peak and off-peak), the amount of knowledge base data, and the LLM that is used.

Although this post explains the factors that influence costs, it can be useful to know the directional costs, based on some assumptions, to get a relative understanding of various cost components for a few scenarios such as small, medium, large, and extra large environments.

The following table is a snapshot of directional costs for four different scenarios with varying volume of user questions per month and knowledge base data.

.	SMALL	MEDIUM	LARGE	EXTRA LARGE
INPUTs	500,000	2,000,000	5,000,000	7,020,000
Total questions per month	5	25	50	100
Knowledge base data size in GB (actual text size on documents)	.	.	.	.
Annual costs (directional)*	.	.	.	.
Amazon Bedrock On-Demand costs using Anthropic’s Claude 3 Haiku	$5,785	$23,149	$57,725	$81,027
Amazon OpenSearch Service provisioned cluster costs	$6,396	$13,520	$20,701	$39,640
Amazon Bedrock Titan Text Embedding v2 costs	$396	$5,826	$7,320	$13,585
Total annual costs (directional)	$12,577	$42,495	$85,746	$134,252
Unit cost per 1,000 questions (directional)	$2.10	$1.80	$1.40	$1.60

These costs are based on assumptions. Costs will vary if assumptions change. Cost estimates will vary for each customer. The data in this post should not be used as a quote and does not guarantee the cost for actual use of AWS services. The costs, limits, and models can change over time.

For the sake of brevity, we use the following assumptions:

Amazon Bedrock On-Demand pricing model
Anthropic’s Claude 3 Haiku LLM
AWS Region us-east-1
Token assumptions for each user question:
- Total input tokens to LLM = 2,571
- Output tokens from LLM = 149
- Average of four characters per token
- Total tokens = 2,720
There are other cost components such as DynamoDB to store question-answer history, Amazon Simple Storage Service (Amazon S3) to store data, and AWS Lambda or Amazon Elastic Container Service (Amazon ECS) to invoke Amazon Bedrock APIs. However, these costs are not as significant as the cost components mentioned in the table.

We refer to this table in the remainder of this post. In the next few sections, we will cover Amazon Bedrock costs and the key factors influences its costs, vector embedding costs, vector database costs, and Amazon Bedrock Guardrails costs. In the final section, we will cover how chunking strategies will influence some of the above cost components.

Amazon Bedrock costs

Amazon Bedrock has two pricing models: On-Demand (used in the preceding example scenario) and Provisioned Throughput.

With the On-Demand model, an LLM has a maximum requests (questions) per minute (RPM) and tokens per minute (TPM) limit. The RPM and TPM are typically different for each LLM. For more information, see Quotas for Amazon Bedrock.

In the extra large use case, with 7 million questions per month, assuming 10 hours per day and 22 business days per month, it translates to 532 questions per minute (532 RPM). This is well below the maximum limit of 1,000 RPM for Anthropic’s Claude 3 Haiku.

With 2,720 average tokens per question and 532 requests per minute, the TPM is 2,720 x 532 = 1,447,040, which is well below the maximum limit of 2,000,000 TPM for Anthropic’s Claude 3 Haiku.

However, assume that the user questions grow by 50%. The RPM, TPM, or both might cross the thresholds. In such cases where the generative AI application needs cross the On-Demand RPM and TPM thresholds, you should consider the Amazon Bedrock Provisioned Throughput model.

With Amazon Bedrock Provisioned Throughput, cost is based on a per-model unit basis. Model units are dedicated for the duration you plan to use, such as an hourly, 1-month, 6-month commitment.

Each model unit offers a certain capacity of maximum tokens per minute. Therefore, the number of model units (and the costs) are determined by the input and output TPM.

With Amazon Bedrock Provisioned Throughput, you incur charges per model unit whether you use it or not. Therefore, the Provisioned Throughput model is relatively more expensive than the On-Demand model.

Consider the following cost-optimization tips:

Start with the On-Demand model and test for your performance and latency with your choice of LLM. This will deliver the lowest costs.
If On-Demand can’t satisfy the desired volume of RPM or TPM, start with Provisioned Throughput with a 1-month subscription during your generative AI application beta period. However, for steady state production, consider a 6-month subscription to lower the Provisioned Throughput costs.
If there are shorter peak hours and longer off-peak hours, consider using a Provisioned Throughput hourly model during the peak hours and On-Demand during the off-peak hours. This can minimize your Provisioned Throughput costs.

Factors influencing costs

In this section, we discuss various factors that can influence costs.

Number of questions

Cost grows as the number of questions grow with the On-Demand model, as can be seen in the following figure for annual costs (based on the table discussed earlier).

Input tokens

The main sources of input tokens to the LLM are the system prompt, user prompt, context from the vector database (knowledge base), and context from QnA history, as illustrated in the following figure.

As the size of each component grows, the number of input tokens to the LLM grows, and so does the costs.

Generally, user prompts are relatively small. For example, in the user prompt “What are the performance and cost optimization strategies for Amazon DynamoDB?”, assuming four characters per token, there are approximately 20 tokens.

System prompts can be large (and therefore the costs are higher), especially for multi-shot prompts where multiple examples are provided to get LLM responses with better tone and style. If each example in the system prompt uses 100 tokens and there are three examples, that’s 300 tokens, which is quite larger than the actual user prompt.

Context from the knowledge base tends to be the largest. For example, when the documents are chunked and text embeddings are generated for each chunk, assume that the chunk size is 2,000 characters. Assume that the generative AI application sends three chunks relevant to the user prompt to the LLM. This is 6,000 characters. Assuming four characters per token, this translates to 1,500 tokens. This is much higher compared to a typical user prompt or system prompt.

Context from QnA history can also be high. Assume an average of 20 tokens in the user prompt and 100 tokens in LLM response. Assume that the generative AI application sends a history of three question-answer pairs along with each question. This translates to (20 tokens per question + 100 tokens per response) x 3 question-answer pairs = 360 tokens.

Consider the following cost-optimization tips:

Limit the number of characters per user prompt
Test the accuracy of responses with various numbers of chunks and chunk sizes from the vector database before finalizing their values
For generative AI applications that need to carry a conversation with a user, test with two, three, four, or five pairs of QnA history and then pick the optimal value

Output tokens

The response from the LLM will depend on the user prompt. In general, the pricing for output tokens is three to five times higher than the pricing for input tokens.

Consider the following cost-optimization tips:

Because the output tokens are expensive, consider specifying the maximum response size in your system prompt
If some users belong to a group or department that requires higher token limits on the user prompt or LLM response, consider using multiple system prompts in such a way that the generative AI application picks the right system prompt depending on the user

Vector embedding costs

As explained previously, in a RAG application, the data is chunked, and text embeddings are generated and stored in a vector database (knowledge base). The text embeddings are generated by invoking the Amazon Bedrock API with an LLM, such as Amazon Titan Text Embeddings V2. This is independent of the Amazon Bedrock model you choose for inferencing, such as Anthropic’s Claude Haiku or other LLMs.

The pricing to generate text embeddings is based on the number of input tokens. The greater the data, the greater the input tokens, and therefore the higher the costs.

For example, with 25 GB of data, assuming four characters per token, input tokens total 6,711 million. With the Amazon Bedrock On-Demand costs for Amazon Titan Text Embeddings V2 as $0.02 per million tokens, the cost of generating embeddings is $134.22.

However, On-Demand has an RPM limit of 2,000 for Amazon Titan Text Embeddings V2. With 2,000 RPM, it will take 112 hours to embed 25 GB of data. Because this is a one-time job of embedding data, this might be acceptable in most scenarios.

For monthly change rate and new data of 5% (1.25 GB per month), the time required will be 6 hours.

In rare situations where the actual text data is very high in TBs, Provisioned Throughput will be needed to generate text embeddings. For example, to generate text embeddings for 500 GB in 3, 6, and 9 days, it will be approximately $60,000, $33,000, or $24,000 one-time costs using Provisioned Throughput.

Typically, the actual text inside a file is 5–10 times smaller than the file size reported by Amazon S3 or a file system. Therefore, when you see 100 GB size for all your files that need to be vectorized, there is a high probability that the actual text inside the files will be 2–20 GB.

One way to estimate the text size inside files is with the following steps:

Pick 5–10 sample representations of the files.
Open the files, copy the content, and enter it into a Word document.
Use the word count feature to identify the text size.
Calculate the ratio of this size with the file system reported size.
Apply this ratio to the total file system to get a directional estimate of actual text size inside all the files.

Vector database costs

AWS offers many vector databases, such as OpenSearch Service, Aurora, Amazon RDS, and MemoryDB. As explained earlier in this post, the vector database plays a critical role in grounding responses to your enterprise data whose vector embeddings are stored in a vector database.

The following are some of the factors that influence the costs of vector database. For the sake of brevity, we consider an OpenSearch Service provisioned cluster as the vector database.

Amount of data to be used as the knowledge base – Costs are directly proportional to data size. More data means more vectors. More vectors mean more indexes in a vector database, which in turn requires more memory and therefore higher costs. For best performance, it’s recommended to size the vector database so that all the vectors are stored in memory.
Index compression – Vector embeddings can be indexed by HNSW or IVF algorithms. The index can also be compressed. Although compressing the indexes can reduce the memory requirements and costs, it might lose accuracy. Therefore, consider doing extensive testing for accuracy before deciding to use compression variants of HNSW or IVF. For example, for a large text data size of 100 GB, assuming 2,000 bytes of chunk size, 15% overlap, vector dimension count of 512, no upfront Reserved Instance for 3 years, and HNSW algorithm, the approximate costs are $37,000 per year. The corresponding costs with compression using hnsw-fp16 and hnsw-pq are $21,000 and $10,000 per year, respectively.
Reserved Instances – Cost is inversely proportional to the number of years you reserve the cluster instance that stores the vector database. For example, in the preceding scenario, an On-Demand instance would cost approximately, $75,000 per year, a no upfront 1-year Reserved Instance would cost $52,000 per year, and a no upfront 3-year Reserved Instance would cost $37,000 per year.

Other factors, such as the number of retrievals from the vector database that you pass as context to the LLM, can influence input tokens and therefore costs. But in general, the preceding factors are the most important cost drivers.

Amazon Bedrock Guardrails

Let’s assume your generative AI virtual assistant is supposed to answer questions related to your products for your customers on your website. How will you avoid users asking off-topic questions such as science, religion, geography, politics, or puzzles? How do you avoid responding to user questions on hate, violence, or race? And how can you detect and redact PII in both questions and responses?

The Amazon Bedrock ApplyGuardrail API can help you solve these problems. Guardrails offer multiple policies such as content filters, denied topics, contextual grounding checks, and sensitive information filters (PII). You can selectively apply these filters to all or a specific portion of data such as user prompt, system prompt, knowledge base context, and LLM responses.

Applying all filters to all data will increase costs. Therefore, you should evaluate carefully which filter you want to apply on what portion of data. For example, if you want PII to be detected or redacted from the LLM response, for 2 million questions per month, approximate costs (based on output tokens mentioned earlier in this post) would be $200 per month. In addition, if your security team wants to detect or redact PII for user questions as well, the total Amazon Bedrock Guardrails costs will be $400 per month.

Chunking strategies

As explained earlier in how RAG works, your data is chunked, embeddings are generated for those chunks, and the chunks and embeddings are stored in a vector database. These chunks of data are retrieved later and passed as context along with user questions to the LLM to generate a grounded and relevant response.

The following are different chunking strategies, each of which can influence costs:

Standard chunking – In this case, you can specify default chunking, which is approximately 300 tokens, or fixed-size chunking, where you specify the token size (for example, 300 tokens) for each chunk. Larger chunks will increase input tokens and therefore costs.
Hierarchical chunking – This strategy is useful when you want to chunk data at smaller sizes (for example, 300 tokens) but send larger pieces of chunks (for example, 1,500 tokens) to the LLM so the LLM has a bigger context to work with while generating responses. Although this can improve accuracy in some cases, this can also increase the costs because of larger chunks of data being sent to the LLM.
Semantic chunking – This strategy is useful when you want chunking based on semantic meaning instead of just the token. In this case, a vector embedding is generated for one or three sentences. A sliding window is used to consider the next sentence and embeddings are calculated again to identify whether the next sentence is semantically similar or not. The process continues until you reach an upper limit of tokens (for example, 300 tokens) or you find a sentence that isn’t semantically similar. This boundary defines a chunk. The input token costs to the LLM will be similar to standard chunking (based on a maximum token size) but the accuracy might be better because of chunks having sentences that are semantically similar. However, this will increase the costs of generating vector embeddings because embeddings are generated for each sentence, and then for each chunk. But at the same time, these are one-time costs (and for new or changed data), which might be worth it if the accuracy is comparatively better for your data.
Advanced parsing – This is an optional pre-step to your chunking strategy. This is used to identify chunk boundaries, which is especially useful when you have documents with a lot of complex data such as tables, images, and text. Therefore, the costs will be the input and output token costs for the entire data that you want to use for vector embeddings. These costs will be high. Consider using advanced parsing only for those files that have a lot of tables and images.

The following table is a relative cost comparison for various chunking strategies.

Chunking Strategy	Standard	Semantic	Hierarchical
Relative Inference Costs	Low	Medium	High

Conclusion

In this post, we discussed various factors that could impact costs for your generative AI application. This a rapidly evolving space, and costs for the components we mentioned could change in the future. Consider the costs in this post as a snapshot in time that is based on assumptions and is directionally accurate. If you have any questions, reach out to your AWS account team.

In Part 2, we discuss how to calculate business value and the factors that impact business value.

About the Authors

Vinnie Saini is a Senior Generative AI Specialist Solution Architect at Amazon Web Services(AWS) based in Toronto, Canada. With a background in Machine Learning, she has over 15 years of experience designing & building transformational cloud based solutions for customers across industries. Her focus has been primarily scaling AI/ML based solutions for unparalleled business impacts, customized to business needs.

Chandra Reddy is a Senior Manager of Solution Architects team at Amazon Web Services(AWS) in Austin, Texas. He and his team help enterprise customers in North America on their AIML and Generative AI use cases in AWS. He has more than 20 years of experience in software engineering, product management, product marketing, business development, and solution architecture.

Cheers to 2024: GeForce NOW Recaps Year of Ultimate Cloud Gaming

This GFN Thursday wraps up another incredible year for cloud gaming. Take a look back at the top games and new features that made 2024 a standout for GeForce NOW members.

Enjoy it all with three new games to close the year.

Remember to mark the calendar for the CES opening keynote, to be delivered by NVIDIA founder and CEO Jensen Huang on Monday, Jan. 6.

That’s a Wrap

In another ultimate year of high-performance cloud gaming, GeForce NOW introduced new features for cloud gamers and reached a significant milestone by surpassing 2,000 games in its library, thanks to strong collaborations with celebrated publishers.

Day Pass on GeForce NOW — *Don’t pass up the chance to get all the premium benefits of the cloud for 24 hours.*

GeForce NOW also launched new data centers in Japan and Poland this year, bringing GeForce RTX 4080-powered servers to gamers in the regions. Day Passes were introduced to offer gamers more flexible ways to access the cloud, with the ability to enjoy premium benefits of Ultimate and Performance memberships for 24 hours at a time.

Members who wanted to stream their favorite PC games to Valve’s Steam Deck were provided a new beta installation method that let them automatically install Google Chrome to the device, along with all the settings needed to log in to GeForce NOW to stream their favorite games. And GeForce NOW brought an upgraded streaming experience for Performance members, providing up to 1440p resolution — an increase from the original 1080p limit.

Addons in the cloud for WoW — *Conquering in the cloud never looked so good.*

The rollout of Xbox automatic sign-in streamlined the gaming experience, enabling members to link their Xbox profile to their GeForce NOW account for seamless access to their game libraries. GeForce NOW also partnered with CurseForge to bring mod for World of Warcraft, enabling Ultimate and Priority members to easily enable and customize over 25 top WoW Addons in the cloud, enhancing their gameplay experience across various devices.

Indiana Jones on GeForce NOW — *Everything is great in the cloud.*

The highly anticipated Indiana Jones and the Great Circle made its debut in the cloud, offering players a thrilling globetrotting adventure with stunning ray-traced graphics and NVIDIA DLSS 3 support. Fans could uncover one of history’s greatest mysteries as they traveled from the pyramids of Egypt to the sunken temples of Sukhothai, all while enjoying the game’s immersive action and intriguing puzzles.

PoE2 on GeForce NOW — *The cloud is the path of least resistance.*

Early access for Path of Exile 2 arrived with deep customization options and improved visuals. Dragon Age: The Veilguard captivated players with BioWare’s rich fantasy world, while Black Myth: Wukong pushed cloud gaming graphics to new heights with its stunning take on Chinese mythology.

The long-awaited S.T.A.L.K.E.R. 2: Heart of Chornobyl brought its intense survival horror to the GeForce NOW cloud, and Call of Duty: Black Ops 6 delivered fast-paced multiplayer action to members on day one.

Diablo IV on GeForce NOW — *Raining terror on any device.*

GeForce NOW also welcomed Activision Blizzard games and Battle.net integration. Members gained access to blockbuster titles like Diablo IV, Overwatch 2, Call of Duty: Warzone, Hearthstone and more, adding to the cloud gaming library some of the most popular multiplayer titles.

And it doesn’t stop there — check back in each week to see what’s in store for GeForce NOW in the new year.

Toast to New Adventures

Look for the following games available to stream in the cloud this week:

Headquarters: World War II (Steam)
Supermarket Together (Steam)
Ys X: Nordics (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Mark your calendars, #CES2025 is fast approaching!

Be sure to tune-in on 1/6/25 https://t.co/yC5KAednSw pic.twitter.com/PlrIzPQv5g

— NVIDIA GeForce NOW (@NVIDIAGFN) December 24, 2024

The 10 most viewed publications of 2024

From cloud databases and anomaly detection on graphs to recession prediction and Amazon’s new Nova foundation models, these are the most viewed publications authored by Amazon scientists and collaborators in 2024.Read More

The 10 most viewed blog posts of 2024

Large language models remained a hot topic, but posts about cryptography and automated reasoning also drew readers.Read More

PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

Training large language models (LLMs) models has become a significant expense for businesses. For many use cases, companies are looking to use LLM foundation models (FM) with their domain-specific data. However, companies are discovering that performing full fine tuning for these models with their data isn’t cost effective. To reduce costs while continuing to use the power of AI, many companies have shifted to fine tuning LLMs on their domain-specific data using Parameter-Efficient Fine Tuning (PEFT). PEFT is a set of techniques designed to adapt pre-trained LLMs to specific tasks while minimizing the number of parameters that need to be updated. Techniques such as Low-Rank Adaptation (LoRA) and Weighted-Decomposed Low Rank Adaptation (DoRA), significantly reducing the number of trainable parameters and resulting in lower costs for fine tuning.

In addition to cost, performing fine tuning for LLMs at scale presents significant technical challenges. The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking and distributed computing. Manually managing such complexity can often be counter-productive and take away valuable resources from your businesses AI development. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.

In this blog post, we showcase how you can perform efficient supervised fine tuning for a Meta Llama 3 model using PEFT on AWS Trainium with SageMaker HyperPod. We use HuggingFace’s Optimum-Neuron software development kit (SDK) to apply LoRA to fine-tuning jobs, and use SageMaker HyperPod as the primary compute cluster to perform distributed training on Trainium. Using LoRA supervised fine-tuning for Meta Llama 3 models, you can further reduce your cost to fine tune models by up to 50% and reduce the training time by 70%.

Solution overview

SageMaker HyperPod is designed to help reduce the time required to train generative AI FMs by providing a purpose-built infrastructure for distributed training at scale. When using SageMaker HyperPod for training, SageMaker will actively monitor the cluster’s health, automatically replacing faulty nodes and resuming model training from checkpoints. The clusters come pre-configured with SageMaker distributed training libraries that enable you to split your training data and model across thousands of compute nodes, allowing data to be processed in parallel while fully utilizing the cluster’s compute and network infrastructure. You can also customize your distributed training. The architecture diagram that follows provides a high level overview of these various components:

Compute cluster: This contains a head node that orchestrates computation across a cluster of worker nodes. Because the head node is only facilitating the training, it’s typically a much smaller instance. In this post, we use Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances for the worker nodes and a single Amazon EC2 C5 instance for the head node.
Shared Volume: FSx for Lustre is used as the shared storage volume across nodes to maximize data throughput. It’s mounted at /fsx on the head and compute nodes.
External storage: Amazon Simple Storage Service (Amazon S3) is used to store the cluster’s lifecycle scripts, configuration files, datasets, and checkpoints.
Scheduler: SLURM is used as the job scheduler for the cluster.

Trainium chips are purpose-built for deep learning training of 100 billion and larger parameter models. Model training on Trainium is supported by the AWS Neuron SDK, which provides compiler, runtime, and profiling tools that unlock high-performance and cost-effective deep learning acceleration. To learn more about Trainium chips and the Neuron SDK, see Welcome to AWS Neuron.

To integrate Trainium chips with existing models and tools provided through the transformers package, Hugging Face’s Optimum-Neuron package functions as an interface with Neuron. With Optimum-Neuron, users can apply techniques such as LoRA to their fine-tuning jobs, streamlining the process of adapting LLMs for specific tasks while capitalizing on the performance gains provided by the AWS infrastructure.

Traditional fine tuning involves modifying all the parameters of a model, which can be computationally expensive and memory intensive. PEFT approaches such as LoRA focus on introducing a smaller set of trainable parameters, often in the form of low-rank matrices that adjust the model’s behavior while keeping most of its parameters frozen. The advantage of LoRA lies in its ability to maintain the performance of the base model while significantly lowering the computational burden and resource requirements. The Neuron 2.20 release supports model training with LoRA on Trainium.

In the next section, we’ll walk through the code in three steps for PEFT on Trainium with HyperPod:

Setting up and deploying a HyperPod cluster for distributed training.
Fine tuning a Meta Llama 3-8B model on Trainium instance with the dolly 15k dataset.
Model weights consolidation and inference.

Amazon SageMaker HyperPod cluster setup

In this first section, you will begin setting up your Amazon SageMaker HyperPod compute environment for fine tuning.

Prerequisites

The following are the prerequisites for configuring and deploying a SageMaker HyperPod cluster for fine tuning:

Submit a service quota increase request to get access to Trainium instances in the us-west-2 AWS Region. For the purposes of this post, you will request an increase for Amazon EC2 Trn1 instances, ml.trn1.32xlarge.
Install the AWS Command Line Interface (AWS CLI); the required minimum version needed is 2.14.3.
Install the AWS Systems Manager Session Manager Plugin in order to SSH into your cluster.

Step 1: Infrastructure setup

After completing the prerequisites, deploy an AWS CloudFormation stack that contains the necessary infrastructure components for distributed training through SageMaker HyperPod. The default Region specified in the template is us-west-2, but you can modify that. You will also need to specify the Availability Zone where your subnets will be deployed. The template configures your environment with an Amazon Virtual Private Cloud (Amazon VPC) and corresponding public and private subnets for network isolation. It establishes additional components inside your VPC including an S3 bucket for lifecycle scripts and FSx for Lustre, a file system shared across the head and compute nodes of the HyperPod cluster.

Step 2: Cluster configuration

Configure and deploy the HyperPod cluster. Begin by defining your infrastructure’s environment variables through the create_config script. This script uses the AWS CLI to extract infrastructure component variables from your CloudFormation stack including Region, resource IDs, and Amazon Resource Name (ARN).

# Set region
export AWS_REGION=us-west-2

# Fetch create_config script
curl 'https://static.us-east-1.prod.workshops.aws/public/05a78a77-24f9-4f29-867c-64c9687646e1/static/scripts/create_config.sh' --output create_config.sh

# Set environment variables
bash create_config.sh
source env_vars

After setting your environment variables, download the lifecycle scripts required for bootstrapping the compute nodes on your SageMaker HyperPod cluster and define its configuration settings before uploading the scripts to your S3 bucket.

# Download Lifecycle scripts
git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/

# upload scripts to s3
aws s3 cp --recursive awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src

After uploading the Lifecycle scripts to Amazon S3, create your cluster and file system configurations. See the Create Cluster section of the SageMaker HyperPod workshop to create these files. After generating the cluster-config.json and provisioning_parameters.json configuration files, validate them and upload the FSx for Lustre configuration file to Amazon S3.

# validate and check config for known issues
curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/validate-config.py
python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning_parameters.json

# Upload FSx configuration to S3
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

Step 3: Cluster deployment

Now that the cluster’s configuration is defined, you can create the cluster.

aws sagemaker create-cluster 
--cli-input-json file://cluster-config.json 
--region $AWS_REGION

You should be able to see your cluster by navigating to SageMaker Hyperpod in the AWS Management Console and see a cluster named ml-cluster listed. After a few minutes, its status should change from Creating to InService.

If you select your cluster, you will be able to see the details of your compute cluster including the head and worker nodes.

After installing the Systems Manager Session Manager plugin, you can ssh into your cluster’s head node using the easy-ssh script to begin training.

# Modify permissions and ssh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

# Switch to ubuntu user
sudo su - ubuntu

# Change directory
cd /fsx

Now that your cluster is running and accessible through ssh, you can begin uploading the model training scripts to the shared file system through either curl or the AWS CLI. For more instructions on setting up your cluster, see the SageMaker HyperPod workshop.

Fine tuning

Now that your SageMaker HyperPod cluster is deployed, you can start preparing to execute your fine tuning job.

Data preparation

The foundation of successful language model fine tuning lies in properly structured and prepared training data. This implementation focuses on instruction-tuned datasets, which form the backbone of modern language model adaptation. These datasets work together to create meaningful training examples through three essential components:

Instructions that guide the model’s task.
Optional context that provides background information.
Responses that represent the desired output.

Training begins by loading your dataset and formatting your dataset examples with this structure. Loading your dataset can be accomplished through the Hugging Face datasets library, which provides a straightforward interface for accessing and managing training data. Hugging Face also provides this format function for the databricks-dolly-15k dataset. Note that the format function needs to be embedded in your train.py file (as shown in the following sample). It’s referenced by the NeuronSFTTrainer to format your dataset during fine tuning.

# Load dataset
dataset = load_dataset(args.dataset, split="train")

def format_dolly(examples):
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = f"### Instructionn{examples['instruction'][i]}"
        context = f"### Contextn{examples['context'][i]}" if examples["context"][i] else None
        response = f"### Answern{examples['response'][i]}"
        prompt = "nn".join([i for i in [instruction, context, response] if i is not None])
        output_text.append(prompt)
    return output_text

The formatting function employs delimiter tokens ("###") to create clear boundaries between different components of each training example. This separation is important because it helps the model distinguish between different parts of the input during training. The function handles cases where context might be missing, making sure that the final format remains consistent regardless of whether all components are present. Double newlines between sections provide additional structural clarity that helps the model recognize the natural breaks in the input.

Tokenization

After formatting your dataset, the next step is tokenization—the process of converting your text data into a numerical format that your model can understand. Tokenization serves as the bridge between your human-readable text and the mathematical operations that drive your model’s understanding of language. To begin, you use Hugging Face’s AutoTokenizer to load your model’s tokenizer.

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token

The AutoTokenizer class automatically selects the appropriate tokenizer for your model, loading not just the vocabulary, but also the rules and special tokens that match your training configuration. The assignment of the padding token to match the end-of-sequence token is particularly important for causal language modeling, because it verifies the consistent handling of your variable-length sequences.

The tokenization process itself operates in several stages. First, it breaks down your input text into tokens based on its vocabulary. These tokens are then converted to numerical IDs that your model can process. During this process, your tokenizer also handles special tokens that mark the beginning and end of sequences, in addition to padding tokens that make sure that the sequences in your batch have the same length.

When working with tokenizers, your sequence length management becomes a critical consideration. Your maximum sequence length must balance between preserving enough information for your model to understand the context and staying within your model’s architectural limitations. Too short, and you risk losing important context; too long, and you might exceed memory constraints or introduce unnecessary computational overhead.

Model compilation and fine tuning

For this solution, you created a SageMaker HyperPod cluster with the controller node and one worker node. The worker node contains one ml.trn1.32xlarge instance which has 32 Neuron cores. You can conduct distributed fine tuning using all 32 Neuron cores within the worker node.

Step 1: Environment setup

You first need to install the required Python packages for fine tuning. The following is the bash script for the Python environment setup. Note that the solution uses the most recently released Neuron SDK. From the HOME directory, create a file touch environment.sh with the following code and run it with sbatch ./environment.sh. You might need to modify the permissions of the shell scripts throughout this post before running them with the command chmod +x environment.sh.

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/environment.out

sudo apt install -y python3.8-venv git
python3.8 -m venv $HOME/peft_ft/env_llama3_8B_peft
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate
pip install -U pip

python3 -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
python3 -m pip install torch-neuronx==2.1.2.2.3.0 neuronx-cc==2.15.128.0 neuronx_distributed==0.9.0 torchvision
python3 -m pip install datasets transformers peft huggingface_hub trl PyYAML
python3 -m pip install git+https://github.com/huggingface/optimum-neuron.git

With your environment created, switch to your fine-tuning directory before proceeding to the next step: cd $HOME/peft_ft.

Step 1: Download the base Llama 3 8B model and tokenizer from Hugging Face

Download the base Meta Llama 3 8B model and the corresponding tokenizer from Hugging Face. You will need to first request access for the model from Meta on Hugging Face and then use your Hugging Face access token to download the model. The following is the Python code for the get_model.py script to download the model and tokenizer. Create this file with touch get_model.py and copy the following code to this file before moving on to the next step.

import os
import argparse
from transformers import AutoTokenizer, LlamaForCausalLM

def download_model_and_tokenizer(model_id: str, model_output_path: str, tokenizer_output_path: str, huggingface_token: str = None) -> None:
    huggingface_token = os.environ.get("HUGGINGFACE_TOKEN", None)
    model = LlamaForCausalLM.from_pretrained(model_id, token=huggingface_token)
    model.save_pretrained(model_output_path)
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)
    tokenizer.save_pretrained(tokenizer_output_path)
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True, help="Hugging Face Model id")
    parser.add_argument("--model_output_path", type=str, required=True, help="Path to save model/weights file")
    parser.add_argument("--tokenizer_output_path", type=str, required=True, help="Path to save tokenizer file")
    args, _ = parser.parse_known_args()
    download_model_and_tokenizer(model_id=args.model_id, model_output_path=args.model_output_path, tokenizer_output_path=args.tokenizer_output_path)

Next, create the bash script touch get_model.sh with the code that follows and run it with the command sbatch ./get_model.sh. This will trigger the get_model.py script to download the model and tokenizer using Slurm. Because you’re using the Llama 3 8B model, Hugging Face requires you to authenticate with an access token prior to download. Be sure to add your access token to get_model.sh before running the script.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/get_model.out

export OMP_NUM_THREADS=1
export HUGGINGFACE_TOKEN="<YOUR TOKEN HERE>"
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 $HOME/peft_ft/get_model.py 
--model_id meta-llama/Meta-Llama-3-8B-Instruct 
--model_output_path $HOME/peft_ft/model_artifacts/llama3-8B 
--tokenizer_output_path $HOME/peft_ft/tokenizer/llama3-8B

Step 2: Pre-compile model

Training deep learning models on Trainium requires model compilation. To do that, use the neuron_parallel_compile CLI utility, which will extract graphs from a trial run of your script, and perform parallel pre-compilation of the computation graphs. Note that the scripts for model pre-compilation are identical to those for the actual training, except for max_steps. This is because pre-compilation doesn’t require the completion of the entire training cycle; rather, it necessitates approximately 10 training steps to extract the graphs. Before compiling the model, you need to create the training script, touch train.py which is used for both pre-compilation and model fine tuning steps. Add the following code after creating the file, along with the format function previously mentioned.

import os
import torch
import argparse
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer
from optimum.neuron.distributed import lazy_load_for_parallelism
import torch_xla.core.xla_model as xm

# add format_dolly function here

def training_function(args):
    dataset = load_dataset(args.dataset, split="train")    
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
    tokenizer.pad_token = tokenizer.eos_token
    with lazy_load_for_parallelism(tensor_parallel_size=args.tp_size):
        model = AutoModelForCausalLM.from_pretrained(
            args.model_path, 
            low_cpu_mem_usage=True, 
            torch_dtype=torch.bfloat16 if args.bf16 else torch.float32
        )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
        
    training_args = NeuronSFTConfig(
        output_dir=args.model_checkpoint_path,
        overwrite_output_dir=True,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_steps=args.warmup_steps,
        bf16=args.bf16,
        tensor_parallel_size=args.tp_size,
        pipeline_parallel_size=args.pp_size,
        save_steps=args.checkpoint_frequency,
        logging_steps=100,
        max_steps=args.max_steps,
        )

    trainer = NeuronSFTTrainer(
        args=training_args,
        model=model,
        peft_config=lora_config,
        tokenizer=tokenizer,
        train_dataset=dataset,
        formatting_func=format_dolly,
    )

    trainer.train()
    trainer.save_model(args.model_final_path)
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str)
    parser.add_argument("--tokenizer_path", type=str)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--train_batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--weight_decay", type=float)
    parser.add_argument("--bf16", type=bool)
    parser.add_argument("--tp_size", type=int)
    parser.add_argument("--pp_size", type=int)
    parser.add_argument("--gradient_accumulation_steps", type=int)
    parser.add_argument("--warmup_steps", type=int)
    parser.add_argument("--early_stopping_patience", type=int)
    parser.add_argument("--checkpoint_frequency", type=int)
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--max_steps", type=int)
    parser.add_argument("--max_seq_length", type=int)
    parser.add_argument("--model_checkpoint_path", type=str)
    parser.add_argument("--model_final_path", type=str)
    args = parser.parse_args()
    training_function(args)

After creating the training file, use the following code to create the compile.sh script, which will trigger finetune-llama3-8B.sh to compile the Llama 3 8B model using the neuron_parallel_compile command. You can run this with the sbatch compile.sh command.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/compile.out

source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

export NEURON_EXTRACT_GRAPHS_ONLY=0
srun bash ${HOME}/peft_ft/finetune-llama3-8B.sh

The following is the finetune-llama3-8B.sh script, which lists the hyper-parameters for your model fine tuning. The script uses tensor parallelism for the training with degree of 8. With 32 NeuronCores in the ml.trn1.32xlarge instance, you get data parallel of degree 4. Note that the script also sets XLA_USE_BF16=1 to map both torch.float and torch.double tensors to bfloat16 tensors. This can both reduce memory footprint and improve performance. The script then sets gradient_accumulation_steps to be 3 to get a larger effective batch size for gradient update.

#!/bin/bash
GPUS_PER_NODE=32
if [ $NEURON_EXTRACT_GRAPHS_ONLY -gt 0 ]; then
    MAX_STEPS=10
    MAYBE_COMPILE="neuron_parallel_compile"
else
    MAX_STEPS=-1
fi

declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_JOB_NUM_NODES
)
export TRAIN_SCRIPT=${HOME}/peft_ft/train.py

declare -a TRAINING_ARGS=(
    --bf16 True 
    --checkpoint_frequency 400 
    --dataset "databricks/databricks-dolly-15k" 
    --max_steps $MAX_STEPS 
    --max_seq_length 1024 
    --epochs 1 
    --gradient_accumulation_steps 3 
    --learning_rate 2e-05 
    --model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" 
    --tokenizer_path "/fsx/ubuntu/peft_ft/tokenizer/llama3-8B" 
    --model_checkpoint_path "/fsx/ubuntu/peft_ft/model_checkpoints" 
    --model_final_path "/fsx/ubuntu/peft_ft/model_checkpoints/final" 
    --tp_size 8 
    --pp_size 1 
    --train_batch_size 1 
    --warmup_steps 100 
    --weight_decay 0.01 
)
$MAYBE_COMPILE torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

Step 3: Model fine tuning

After the model compiling is complete, you can then start the model fine tuning by reusing the compile.sh script. To do this, prevent the neuron_parallel_compile utility from being used by setting export NEURON_EXTRACT_GRAPHS_ONLY=-1 in compile.sh, and then re-run the script to start fine tuning your model. You might need to delete the model_consolidation directory created during the previous model compilation step before you start your fine-tuning job.

Model consolidation

When working with distributed machine learning workflows, you’ll often need to manage and merge model weights efficiently. Let’s explore two essential processes that you’ll frequently encounter: checkpoint consolidation and weight merging when performing LoRA fine tuning.

Checkpoint consolidation

During distributed training, your model checkpoints are typically split across multiple devices according to the model parallelism configuration that you provide. To bring these pieces back together, you’ll use a consolidation process. Your consolidation function handles three primary tasks. First, it combines distributed checkpoints into a unified model. Then, it manages memory efficiently by processing tensors in chunks. Finally, it creates sharded outputs with an index file for quick access.

LoRA weight merging

When you’re working with LoRA, you need to merge these adapters with your base model. The merging process is straightforward but requires careful attention to detail. Start by loading your base model and LoRA configuration. Then transform the LoRA weight names to match your base model’s structure. The process concludes by merging the adapters and saving the final model in a sharded format.

To put these tools into practice, you can use the following scripts after your fine-tuning job has finished. First, create the Python file, touch consolidation.py and shell file, touch consolidation.sh using the following code.

import argparse
import json
from pathlib import Path
from huggingface_hub import split_torch_state_dict_into_shards
from safetensors.torch import save_file
from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints
import torch

def custom_consolidate_to_unified_checkpoint(checkpoint_dir: str, output_dir: str, save_format: str = "safetensors"):
    output_dir.mkdir(parents=True, exist_ok=True)
    state_dict = consolidate_model_parallel_checkpoints(checkpoint_dir)
    for key, value in state_dict.items():
        if isinstance(value, torch.Tensor):
            state_dict[key] = value.contiguous()

    split_result = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")
    # Save shards
    for shard_file, shard_tensors in split_result.filename_to_tensors.items():
        shard_dict = {name: state_dict[name] for name in shard_tensors}
        shard_path = output_dir / shard_file
        if save_format == "safetensors":
            save_file(shard_dict, shard_path, metadata={"format": "pt"})
        else:
            torch.save(shard_dict, shard_path)

    index = {
        "metadata": split_result.metadata,
        "weight_map": split_result.tensor_to_filename
    }
    
    index_file = "model.safetensors.index.json" if save_format == "safetensors" else "pytorch_model.bin.index.json"
    with open(output_dir / index_file, "w") as f:
        json.dump(index, f, indent=2)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", type=str, required=True)
    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument("--save_format", type=str, choices=["safetensors", "pytorch"])
    args = parser.parse_args()
    output_dir = Path(args.output_dir)
    checkpoint_dir = Path(args.input_dir) / "adapter_shards"
    custom_consolidate_to_unified_checkpoint(
        checkpoint_dir=checkpoint_dir,
        output_dir=output_dir,
        save_format=args.save_format
    )

This code will consolidate the sharded checkpoint files generated during training into a consolidated LoRA adaptersafetensor format. After saving the file, you can invoke this script to trigger the model checkpoint consolidation job. The input directory that you provide points to your fine-tuned model’s sharded checkpoints and the output directory for the consolidated LoRA adapter safetensor file. You trigger this with sbatch consolidation.sh.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/consolidation.py" 
--input_dir "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251" 
--output_dir "$HOME/peft_ft/model_checkpoints/adapter_shards_consolidation"
--save_format "safetensors"

After consolidation is complete, you need to merge the LoRA adapter weights from the consolidated files with the base model’s weights. Begin by creating a new Python file touch merge_lora.py and shell file merge_lora.sh using the following code.

import json
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM
import torch
import argparse
from safetensors import safe_open


def merge_lora_weights(args):
    base_model = AutoModelForCausalLM.from_pretrained(args.base_model_path)
    with open(args.adapter_config_path, "r") as f:
        config_dict = json.load(f)
    peft_config = LoraConfig(**config_dict)
    model = PeftModel(base_model, peft_config)
    
    lora_weights_tensors = {}
    with safe_open(args.lora_safetensors_path, framework="pt", device='cpu') as f:
        for k in f.keys():
            lora_weights_tensors[k] = f.get_tensor(k)
            
    for layer_name in list(lora_weights_tensors):
        if 'layer' in layer_name and 'lora' in layer_name:
            new_layer_name = layer_name.replace('weight', 'default.weight')
            lora_weights_tensors[new_layer_name] = lora_weights_tensors[layer_name].clone()
            del lora_weights_tensors[layer_name]
        else:
            del lora_weights_tensors[layer_name]

    updated_state_dict = model.state_dict().copy()
    for layer, weights in lora_weights_tensors.items():
        updated_state_dict[layer] = weights
    model.load_state_dict(updated_state_dict)    
    merged_model = model.merge_and_unload()    
    merged_model.save_pretrained(args.final_model_path, safe_serialization=True, max_shard_size="5GB")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--final_model_path", type=str)
    parser.add_argument("--adapter_config_path", type=str)
    parser.add_argument("--base_model_path", type=str)
    parser.add_argument("--lora_safetensors_path", type=str)
    args = parser.parse_args()
    merge_lora_weights(args)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=/fsx/ubuntu/peft_ft/logs/8b/lora_weights.log

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/merge_lora.py" 
    --final_model_path "/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output" 
    --adapter_config_path "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251/adapter_config.json"
    --base_model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" 
    --lora_safetensors_path "/fsx/ubuntu/peft_ft/model_checkpoints/adapter_shards_consolidation/model.safetensors"

Trigger the run with sbatch merge_lora.sh to merge the model weights. Here the base_model_path parameter is the local directory where you previously downloaded the model from Hugging Face in step 1 of “Model compilation and fine tuning.” Similarly, the adapter_config_path parameter will be the model’s configuration file previously downloaded and the lora_safetensors_path parameter will be the path to the model.safetensor file output by the LoRA consolidation in the previous step.

Inference

After consolidation and merging, the safetensors files will be saved to your final_model_path output directory containing the updated model weights after fine tuning. Using these updated weights, you can load and generate a prediction for your trained model in the context of the dolly dataset. To check that the fine-tuned model understands the databricks-dolly-15k dataset it was fine tuned on, select a question from the dataset for validation, as shown in the following figure.

Using Hugging Face’s LlamaForCausalLM class you can load your newly fine-tuned model, and generate a prediction for the question, “Who are the Smiths?” (shown in the following figure):

Comparing the generated answer to the ground truth context and response from the training dataset, it’s clear that the fine-tuned Meta Llama 3 model now understands this data and can give coherent responses to posed questions.

Results

Technique	Trainable parameters	Samples processed per second	Training time (minutes)
FPFT	7,570,591,744	2.083	90
PEFT	6,815,744	3.554	53

To benchmark the fine-tuned model’s performance with LoRA on a single ml.trn1.32xlarge, we compared it to full parameter fine tuning (FPFT) for the model over three training epochs. Measuring training samples processed per second showed a 70% increase in throughput and reduction in training time for the LoRA fine-tuned model. Subsequently, on-demand hours required to fine tune the model on the dolly 15k dataset for three epochs was halved compared to FPFT, resulting in a 50% reduction of training costs.

Clean up

To clean up the resources provisioned for this post, first delete the SageMaker HyperPod cluster. This can be done either through the AWS CLI or in the SageMaker console.

aws sagemaker delete-cluster --cluster-name ml-cluster

After the cluster is deleted, delete the CloudFormation template to delete the remaining provisioned resources.

aws cloudformation delete-stack --stack-name sagemaker-hyperpod

Conclusion

In this post, we showed you how to set up a SageMaker HyperPod compute cluster for training. Then we showed you how to perform multi-node distributed fine tuning with Trainium for a Meta Llama 3 model using LoRA. Finally, we showed you how to consolidate model weights across a distributed training environment to generate coherent predictions for the newly fine-tuned model.

About the Authors

Georgios Ioannides is a Deep Learning Architect with the AWS Generative AI Innovation Center. Before AWS, Georgios worked in startups, where he specialized in signal processing, deep learning, and multi-modal and cross-modal machine learning systems for speech, vision, and text applications. He holds Master’s degrees from Imperial College London and Carnegie Mellon University.

Bingchen Liu is a Machine Learning Engineer with the AWS Generative AI Innovation Center. Before AWS, he worked as a lead MLE in ADP focusing on RAG applications, vector database, model development, and serving. He holds a Master’s degree in Computer Science from Columbia University and a PhD in Statistics from Southern Methodist University.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a PhD in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Jeremy Roghair is a Machine Learning Engineer with the AWS Generative AI Innovation Center, where he focuses on developing generative AI solutions for distributed training workloads and model hosting for customers. Prior to joining AWS, Jeremy worked as a Data Scientist in the finance/insurance industry and earned a Master’s degree in Computer Science with research in reinforcement learning from Iowa State University.

From Generative to Agentic AI, Wrapping the Year’s AI Advancements

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

The AI Decoded series over the past year has broken down all things AI — from simplifying the complexities of large language models (LLMs) to highlighting the power of RTX AI PCs and workstations.

Recapping the latest AI advancements, this roundup highlights how the technology has changed the way people write, game, learn and connect with each other online.

NVIDIA GeForce RTX GPUs offer the power to deliver these experiences on PC laptops, desktops and workstations. They feature specialized AI Tensor Cores that can deliver more than 1,300 trillion operations per second (TOPS) of processing power for cutting-edge performance in gaming, creating, everyday productivity and more. For workstations, NVIDIA RTX GPUs deliver over 1,400 TOPS, enabling next-level AI acceleration and efficiency.

Unlocking Productivity and Creativity With AI-Powered Chatbots

AI Decoded earlier this year explored what LLMs are, why they matter and how to use them.

For many, tools like ChatGPT were their first introduction to AI. LLM-powered chatbots have transformed computing from basic, rule-based interactions to dynamic conversations. They can suggest vacation ideas, write customer service emails, spin up original poetry and even write code for users.

Introduced in March, ChatRTX is a demo app that lets users personalize a GPT LLM with their own content, such as documents, notes and images.

With features like retrieval-augmented generation (RAG), NVIDIA TensorRT-LLM and RTX acceleration, ChatRTX enables users to quickly search and ask questions about their own data. And since the app runs locally on RTX PCs or workstations, results are both fast and private.

NVIDIA offers the broadest selection of foundation models for enthusiasts and developers, including Gemma 2, Mistral and Llama-3. These models can run locally on NVIDIA GeForce and RTX GPUs for fast, secure performance without needing to rely on cloud services.

Download ChatRTX today.

Introducing RTX-Accelerated Partner Applications

AI is being incorporated into more and more apps and use cases, including games, content creation apps, software development and productivity tools.

This expansion is fueled by the wide selection of RTX-accelerated developer and community tools, software development kits, models and frameworks have made it easier than ever to run models locally in popular applications.

AI Decoded in October spotlighted how Brave Browser’s Leo AI, powered by NVIDIA RTX GPUs and the open-source Ollama platform, enables users to run local LLMs like Llama 3 directly on their RTX PCs or workstations.

This local setup offers fast, responsive AI performance while keeping user data private — without relying on the cloud. NVIDIA’s optimizations for tools like Ollama offer accelerated performance for tasks like summarizing articles, answering questions and extracting insights, all directly within the Brave browser. Users can switch between local and cloud models, providing flexibility and control over their AI experience.

For simple instructions on how to add local LLM support via Ollama, read Brave’s blog. Once configured to point to Ollama, Leo AI will use the locally hosted LLM for prompts and queries.

Agentic AI — Enabling Complex Problem-Solving

Agentic AI is the next frontier of AI, capable of using sophisticated reasoning and iterative planning to autonomously solve complex, multi-step problems.

AI Decoded explored how the AI community is experimenting with the technology to create smarter, more capable AI systems.

Partner applications like AnythingLLM showcase how AI is going beyond simple question-answering to improving productivity and creativity. Users can harness the application to deploy built-in agents that can tackle tasks like searching the web or scheduling meetings.

Example of a user invoking an AI agent in AnythingLLM to complete a web search query.

AnythingLLM lets users interact with documents through intuitive interfaces, automate complex tasks with AI agents and run advanced LLMs locally. Harnessing the power of RTX GPUs, it delivers faster, smarter and more responsive AI workflows — all within a single local desktop application. The application also works offline and is fast and private, capable of using local data and tools typically inaccessible with cloud-based solutions.

AnythingLLM’s Community Hub lets anyone easily access system prompts that can help them steer LLM behavior, discover productivity-boosting slash commands and build specialized AI agent skills for unique workflows and custom tools.

By enabling users to run agentic AI workflows on their own systems with full privacy, AnythingLLM is fueling innovation and making it easier to experiment with the latest technologies.

AI Decoded Wrapped

Over 600 Windows apps and games today are already running AI locally on more than 100 million GeForce RTX AI PCs and workstations worldwide, delivering fast, reliable and low-latency performance. Learn more about NVIDIA GeForce RTX AI PCs and NVIDIA RTX AI workstations.

Tune into the CES keynote delivered by NVIDIA founder and CEO Jensen Huang on Jan. 6. to discover how the latest in AI is supercharging gaming, content creation and development.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Using transcription confidence scores to improve slot filling in Amazon Lex

When building voice-enabled chatbots with Amazon Lex, one of the biggest challenges is accurately capturing user speech input for slot values. For example, when a user needs to provide their account number or confirmation code, speech recognition accuracy becomes crucial. This is where transcription confidence scores come in to help ensure reliable slot filling.

What Are Transcription Confidence Scores?

Transcription confidence scores indicate how confident Amazon Lex is in converting speech to text for slot values. These scores range from low to high and are separate from intent/entity recognition scores. For each spoken slot value, Lex provides a confidence score that you can use to:

Validate if a spoken slot value was correctly understood
Decide whether to ask for confirmation or re-prompt
Branch conversation flows based on recognition confidence

Here are some ways to leverage confidence scores for better slot handling:

Progressive Confirmation
- High confidence (>0.9): Accept the slot value and continue
- Medium confidence (0.6-0.9): Ask user to confirm (“Did you say 12345?”)
- Low confidence (<0.6): Re-prompt for the slot value
Adaptive re-prompting
- Customize re-prompt messages based on confidence level
- Provide more specific guidance for low confidence inputs
- Offer alternative input methods when needed
Branching Logic
- Route to human agent if multiple low confidence attempts
- Skip confirmation for consistently high confidence inputs
- Adjust validation rules based on confidence thresholds

The post includes an AWS CloudFormation template, to demonstrate these patterns, that you can deploy in your AWS account.

Using confidence scores strategically helps create more robust slot filling experiences that:

Reduce errors in capturing critical information
Improve containment rates for self-service
Provide better handling of challenging audio conditions
Enable smarter conversation flows

By implementing these patterns, you can significantly improve the accuracy and reliability of slot filling in your Lex voice bots while maintaining a natural conversation flow.

Solution overview

The following diagram illustrates the overall architecture for the solution.

We use an Amazon Lex bot integrated with an Amazon Connect contact flow to deliver the conversational experience.

Amazon Lex asks the caller “In a few words, how can I help you today?” Amazon Lex reviews the transcription confidence score of the slot values to influence the confirmation routine.

Interaction model

The following are different customer experiences a caller may have depending on what they say and the transcription confidence scores returned by Amazon Lex:

Conversation 1:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking.
Virtual Agent:
That’s great, please say your 5 character booking reference, you will find it at the top of the information pack we sent.
Customer:
1 9 A K B (Recognized with low transcription confidence: 0.4)
Virtual Agent:
Ok, let’s try again. What is your booking reference?
Customer:
1 9 A A B (More clearly pronounced: recognized with 0.8)
Virtual Agent:
Your booking 1 9 A A B is currently being progressed.

Conversation 2:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking 1 9 A Z Y.
Virtual Agent:
Your booking 1 9 A Z Y is currently being progressed.

Conversation 3:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking.
Virtual Agent:
That’s great, please say your 5 character booking reference, you will find it at the top of the information pack we sent.
Customer:
1 9 A Z Y (recognized with a transcription confidence of 0.6)
Virtual Agent:
Did you say 1 9 A Z Y? Please say yes or no.
Customer:
Yes
Virtual Agent:
Your booking 1 9 A Z Y is currently being progressed.

In the example conversations, the IVR requests the booking reference from the customer. Once received, the transcription confidence score is evaluated by enabling conditional branching in Amazon Lex based on speech confidence scores. These conditions check the value against specific thresholds. If the transcription confidence score exceeds the high threshold (for example, greater than 0.7), the conversation progresses to the next state. If the score falls in the medium confidence range (for example, between 0.4–0.7), the user is asked to confirm the interpreted input. Finally, if the score falls below a minimum threshold (for example, lower than 0.4), the user is prompted to retry and provide the information again. This approach optimizes the conversation flow based on the quality of the input captured and prevents erroneous or redundant slot capturing, leading to an improved user experience while increasing the self-service containment rates.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

Additionally, you need an Amazon Connect instance—you use the instance Amazon Resource Name (ARN) in a later step.

Deploy the Amazon Lex bot and Amazon Connect flow

To create the sample bot and configure the runtime phrase hints, perform the following steps. For this example, we create an Amazon Lex bot called disambiguation-bot, one intent (CheckBooking), and one slot type (BookingRef).

For Stack Name, enter a name, for example contact-center-transcription-confidence-scores.
Choose Next.
Provide the following parameters:
1. For BotName, enter disambiguation-bot.
2. For ConnectInstanceARN, enter the ARN of your Amazon Connect instance.
3. For ContactFlowName, enter a name for your Amazon Connect contact flow (for example, lex-check-booking-sample-flow).
4. For LogGroupName, enter the name of the Amazon CloudWatch log group where the conversation logs are stored.
Choose Next.

Leave all remaining settings as default and choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Submit.

Wait for the CloudFormation stack to successfully deploy.
On the Amazon Connect console, assign the contact flow to an Amazon Connect claimed number.

Configure the transcript confidence score logic

After you create your intent (CheckBooking), use you can Visual conversation builder to configure your transcription confidence score logic.

The following figure is an example of how we add logic to the intent. Highlighted in red is the branch condition where we use the transcription confidence score to dynamically change the customer experience and improve accuracy.

If you choose the node, you’re presented with the following configuration options, which is where you can configure the branch condition.

Test the solution

To test the solution, we examine a conversation with words that might not be clearly understood.

Assign the Amazon Lex bot to an Amazon Connect workflow.
Make a call.

Amazon Connect will ask “Thank you for calling Acme travel, In a few words, what is the reason for your call today?”

Respond “I want to check my booking.”
When asked for the booking reference, speak any two numbers followed by three letters (for example, “1 9 A Z Y”).

This test checks the confidence score and will either say “your booking 1 9 A Z Y is currently being progressed” or it will ask you to confirm “1 9 A Z Y”.

Limitations

Audio transcription confidence scores are available only in the English (GB) (en_GB) and English (US) (en_US) languages. Confidence scores are supported only for 8 kHz audio input. Transcription confidence scores aren’t provided for audio input from the test window on the Amazon Lex V2 console because it uses 16 kHz audio input.

Clean up

To remove the infrastructure created by the CloudFormation template, open the AWS CloudFormation console and delete the stack. This will remove the services and configuration installed as part of this deployment process.

Conclusion

Optimizing the user experience is at the forefront of any Amazon Lex conversational designer’s priority list, and so is capturing information accurately. This new feature empowers designers to have choices around confirmation routines that drive a more natural dialog between the customer and the bot. Although confirming each input can slow down the user experience and cause frustration, failing to confirm when transcription confidence is low can risk accuracy. These improvements enable you to create a more natural and performant experience.

For more information about how to build effective conversations on Amazon Lex with intent confidence scores, see Build more effective conversations on Amazon Lex with confidence scores and increased accuracy.

About the Authors

Alex Buckhurst is a Senior Amazon Connect consultant at Amazon Web Services with a focus on innovation and building customer-centric designs. In his downtime, Alex enjoys playing squash, perfecting his BBQ skills, and cherishing moments with his family.

Kai Loreck is a Senior professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.

Neel Kapadia is a Senior Software Engineer at AWS where he works on designing and building scalable AI/ML services using Large Language Models and Natural Language Processing. He has been with Amazon for over 5 years and has worked on Amazon Lex and Amazon Bedrock. In his spare time, he enjoys cooking, reading, and traveling.

Anand Jumnani is a DevOps Consultant at Amazon Web Services based in United Kingdom. Outside of work, he is passionate about club cricket and enjoys spending quality time with family and friends.

Clearbot Tidies Up Ocean Litter

Figure Unveils Latest Humanoid Robot

ORBIT-Surgical Researchers Prep Robots for Surgery

Zordi Robots Grow Strawberries Indoors

ConsiStory: AI-Generated Images With Main Character Energy

Edify 3D: Generative AI Enters a New Dimension

Fugatto: Flexible AI Sound Machine for Music, Voices and More

GluFormer: AI Predicts Blood Sugar Levels Four Years Out

LATTE3D: Enabling Near-Instant Generation, From Text to 3D Shape

MaskedMimic: Reconstructing Realistic Movement for Humanoid Robots

StormCast: Boosting Weather Prediction, Climate Simulation

NVIDIA Research Sets Records in AI, Autonomous Vehicles, Robotics

Subscribe to the AI Podcast

Cost and performance optimization pillars

Retrieval Augmented Generation

Directional costs for small, medium, large, and extra large scenarios

Amazon Bedrock costs

Factors influencing costs

Number of questions

Input tokens

Output tokens

Vector embedding costs

Vector database costs

Amazon Bedrock Guardrails

Chunking strategies

Conclusion

About the Authors

That’s a Wrap

Toast to New Adventures

Solution overview

Amazon SageMaker HyperPod cluster setup

Prerequisites

Step 1: Infrastructure setup

Step 2: Cluster configuration

Step 3: Cluster deployment

Fine tuning

Data preparation

Tokenization

Model compilation and fine tuning

Step 1: Download the base Llama 3 8B model and tokenizer from Hugging Face

Step 2: Pre-compile model

Step 3: Model fine tuning

Model consolidation

Checkpoint consolidation

LoRA weight merging

Inference

Results

Clean up

Conclusion

About the Authors

Unlocking Productivity and Creativity With AI-Powered Chatbots

Introducing RTX-Accelerated Partner Applications

Agentic AI — Enabling Complex Problem-Solving

AI Decoded Wrapped

What Are Transcription Confidence Scores?

Solution overview

Interaction model

Prerequisites

Deploy the Amazon Lex bot and Amazon Connect flow

Configure the transcript confidence score logic

Test the solution

Limitations

Clean up

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.