Advances to low-bit quantization enable LLMs on edge devices

Large language models (LLMs) are increasingly being deployed on edge devices—hardware that processes data locally near the data source, such as smartphones, laptops, and robots. Running LLMs on these devices supports advanced AI and real-time services, but their massive size, often billions of parameters, requires significant memory and computational power, limiting widespread adoption. Low-bit quantization, a technique that compresses models and reduces memory demands, offers a solution by enabling more efficient operation.

Recent advances in low-bit quantization have made mixed-precision matrix multiplication (mpGEMM) viable for LLMs. In mpGEMM, the two matrix operands use different data formats, such as int8*int1, int8*int2, or FP16*int4. By combining a variety of precision levels, mpGEMM strikes a balance among speed, memory efficiency, and computational accuracy. 

However, most hardware supports only symmetric computations—operations on data of similar formats—creating challenges for mixed-precision calculations during General Matrix Multiplication (GEMM), a critical operation for LLMs. Overcoming these hardware limitations is essential to fully benefit from mpGEMM and support asymmetrical computations. 

To unlock the potential of low-bit quantization on resource-constrained edge devices, hardware must natively support mpGEMM. To address this, we developed the following three approaches for computing kernels and hardware architectures: 

  • Ladder data type compiler: Supports various low-precision data types by converting unsupported types into hardware-compatible ones without data loss, while also generating high-performance conversion code. 
  • T-MAC mpGEMM library: Implements GEMM using a lookup table (LUT) approach, eliminating multiplications to significantly reduce computational overhead. Optimized for diverse CPUs, T-MAC delivers several times the speed of other libraries. 
  • LUT Tensor Core hardware architecture: Introduces a cutting-edge design for next-generation AI hardware, tailored for low-bit quantization and mixed-precision computations.

The following sections describe these techniques in detail.

Ladder: Bridging the gap between custom data and hardware limits

Cutting-edge hardware accelerators, such as GPUs, TPUs, and specialized chips, are designed to speed up computationally intensive tasks like deep learning by efficiently handling large-scale operations. These accelerators now integrate lower-bit computing units, such as FP32, FP16, and even FP8, into their architectures.  

However, constraints in chip area and hardware costs limit the availability of these units for standard data types. For instance, the NVIDIA V100 Tensor Core GPU supports only FP16, while the A100 supports int2, int4, and int8 but not newer formats like FP8 or OCP-MXFP. Additionally, the rapid development of LLMs often outpaces hardware upgrades, leaving many new data formats unsupported and complicating deployment.

At the same time, even when hardware accelerators lack direct support for a custom data type, their memory systems can still store it as fixed-width data blocks that can hold any format, and the data can be converted to a supported type before computation. For instance, NF4 tensors can be converted into FP16 or FP32 for floating-point operations.

Building on these insights, we developed the Ladder data type compiler, a method that separates data storage from computation, enabling broader support for custom data types. It bridges the gap between emerging custom data formats and the precision types supported by current hardware.

Ladder offers a flexible system for converting between algorithm-specific and hardware-supported data types without data loss. For low-bit applications, it optimizes performance by translating low-bit data into the most efficient formats for the hardware being used. As shown in Figure 1, this includes mapping low-bit computations to supported instructions and efficiently managing data storage across the memory hierarchy. 
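
To make this concrete, the following is a minimal sketch, in plain NumPy, of the storage/computation separation idea; it is not Ladder's generated code, and the 16-entry codebook is illustrative rather than the real NF4 table. A custom 4-bit type is kept packed in memory and expanded into a hardware-supported type such as FP16 just before the computation runs on standard units.

import numpy as np

# Illustrative 16-entry codebook for a hypothetical 4-bit float-like type
# (not the actual NF4 table; Ladder would generate the real conversion code).
CODEBOOK = np.linspace(-1.0, 1.0, 16).astype(np.float16)

def pack_4bit(codes):
    """Pack an even-length array of 4-bit codes (0..15) into bytes for storage."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_to_fp16(packed, scale):
    """Convert the packed custom type into FP16, a format the hardware supports."""
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2], codes[1::2] = lo, hi
    return (CODEBOOK[codes] * np.float16(scale)).astype(np.float16)

# Storage uses the packed custom 4-bit type; computation uses FP16 units.
codes = np.random.randint(0, 16, size=64)
w_fp16 = unpack_to_fp16(pack_4bit(codes), 0.1)
x = np.random.randn(64).astype(np.float16)
y = w_fp16 @ x  # executed with hardware-supported FP16 arithmetic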

Figure 1: A diagram illustrating the Ladder architecture. At the top, the tTile-Graph shows a computational flow where inputs in NF4 and FP16 formats feed into a matrix multiplication (MatMul) operation, which outputs in FP16. This output, along with another FP16 input, proceeds to an addition (Add) operation, also in FP16. Below, the tTile-Device schematic depicts a hierarchical memory structure with L2/Global Memory, L1/Shared Memory, and L0/Register, organized under 'Core.' Transformations occur in the loading and storing stages around computation, with arrows indicating data flow. The scheduling mechanism assigns operations to different layers of the memory hierarchy to optimize performance.
Figure 1: The Ladder architecture

Evaluating Ladder

Evaluations of Ladder on NVIDIA and AMD GPUs show that it outperforms existing deep neural network (DNN) compilers for natively supported data types. It also handles custom data types not supported by GPUs, achieving speedups of up to 14.6 times. 

As the first system to support custom low-precision data types for running DNNs on modern hardware accelerators, Ladder provides researchers with flexibility in optimizing data types. It also enables hardware developers to support a wider range of data types without requiring hardware modifications. 

T-MAC: Table-lookup for mpGEMM without multiplication

Deploying low-bit quantized LLMs on edge devices often requires dequantizing models to ensure hardware compatibility. However, this approach has two major drawbacks: 

  1. Performance: Dequantization overhead can result in poor performance, negating the benefits of low-bit quantization.
  2. Development: Developers must redesign data layouts and kernels for different mixed precisions.

To address these challenges, we introduce T-MAC, a novel LUT-based method that enables mpGEMM without dequantization or multiplication. 

T-MAC replaces traditional multiplication operations with bit-wise table lookups, offering a unified and scalable solution for mpGEMM. It incorporates techniques to reduce the size of tables and store them directly on the chip, minimizing the overhead of accessing data from memory. By eliminating dequantization and lowering computational costs, T-MAC enables efficient inference of low-bit LLMs on resource-constrained edge devices. Figure 2 illustrates T-MAC’s architecture. 
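
The core trick can be sketched in a few lines of NumPy. This is a simplification for illustration, not T-MAC's actual kernels: all possible partial sums over small groups of activations are precomputed once, and each 1-bit plane of the weights then selects table entries instead of performing multiplications, with bit planes combined by shifted accumulation.

import numpy as np

G = 4  # activation group size; the LUT has 2**G entries per group

def precompute_lut(x):
    """lut[g, p] = sum over the g-th activation group of (bit i of p) * x_i."""
    groups = x.reshape(-1, G)
    patterns = (np.arange(2 ** G)[:, None] >> np.arange(G)) & 1  # (16, G) bit patterns
    return groups @ patterns.T  # shape (num_groups, 16)

def lut_mpgemm_row(w_row, lut, bits):
    """Dot product of one low-bit (unsigned) weight row with the activations, using only lookups."""
    acc = 0.0
    for j in range(bits):                                   # bit-serial over weight bit planes
        plane = (w_row >> j) & 1                             # 1-bit plane of the weights
        idx = plane.reshape(-1, G) @ (1 << np.arange(G))     # G bits of each group -> LUT index
        acc += (1 << j) * lut[np.arange(lut.shape[0]), idx].sum()  # table lookups, no multiplications with x
    return acc

# Tiny check against an ordinary matrix-vector product.
rng = np.random.default_rng(0)
x = rng.standard_normal(16).astype(np.float32)
W = rng.integers(0, 4, size=(3, 16))  # 2-bit unsigned weights (illustrative)
lut = precompute_lut(x)
out = np.array([lut_mpgemm_row(row, lut, bits=2) for row in W])
assert np.allclose(out, W @ x, atol=1e-4)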

Figure 2: A diagram showing offline and online processes for bit-serial computation. Offline: integer weights are decomposed into 1-bit indices and permuted into tiles. Online: activations are precomputed with 1-bit patterns, processed via a lookup table (LUT), and aggregated using weighted summation in bit-serial aggregation.
Figure 2. Overview of the T-MAC system

Evaluating T-MAC

Performance evaluations of T-MAC on low-bit models demonstrated substantial benefits in efficiency and speed. On the Surface Laptop 7 with the Qualcomm Snapdragon X Elite chipset, T-MAC achieved: 

  • 48 tokens per second for the 3B BitNet-b1.58 model 
  • 30 tokens per second for the 2-bit 7B Llama model 
  • 20 tokens per second for the 4-bit 7B Llama model

These speeds far exceed average human reading rates, outperforming llama.cpp by 4–5 times and doubling the speed of a dedicated NPU accelerator. Even on lower-end devices like the Raspberry Pi 5, T-MAC made it possible for the 3B BitNet-b1.58 model to generate 11 tokens per second. It also proved highly power-efficient, matching llama.cpp’s generation rate while using only 1/4 to 1/6 of the CPU cores.

These results establish T-MAC as a practical solution for deploying LLMs on edge devices with standard CPUs, without relying on GPUs or NPUs. T-MAC allows LLMs to run efficiently on resource-constrained devices, expanding their applicability across a wider range of scenarios.

LUT Tensor Core: Driving hardware for mpGEMM

While T-MAC and Ladder optimize mpGEMM on existing CPU and GPU architectures, improving computational efficiency, they cannot match the performance of dedicated hardware accelerators with built-in LUT support. Achieving significant improvements in performance, power, and area (PPA) requires overcoming four key challenges:

  1. Table precompute and storage: Precomputing and storing LUTs add overhead, increasing area usage, latency, and storage requirements, which can reduce overall efficiency gains.
  2. Bit-width flexibility: Hardware must support various precision levels, such as int4/2/1 for weights and FP16/8 or int8 for activations, along with their combinations. This flexibility is crucial for accommodating diverse model architectures and use cases.
  3. LUT tiling shape: Inefficient tiling shapes can raise storage costs and limit reuse opportunities, adversely affecting performance and efficiency.
  4. Instruction and compilation: LUT-based mpGEMM requires a new instruction set. Existing compilation stacks, designed for standard GEMM hardware, may not optimally map and schedule these instructions, complicating integration with LLM inference software.

In response, we developed LUT Tensor Core, a software-hardware codesign for low-bit LLM inference. To address precomputation overhead in conventional LUT-based methods, we introduce techniques like software-based DFG transformation, operator fusion, and table symmetrization to optimize table precomputation and storage. Additionally, we propose a hardware design with an elongated tiling shape to support table reuse and a bit-serial design to handle various precision combinations in mpGEMM.
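
One way to picture the table symmetrization step, as we understand it: when weight bits encode symmetric values such as ±1, the lookup table over an activation group is antisymmetric under complementing the index, so only half of it needs to be computed and stored. A tiny NumPy check of that property:

import numpy as np

G = 4
x = np.random.randn(G).astype(np.float32)               # one activation group
signs = lambda p: 2 * ((p >> np.arange(G)) & 1) - 1      # map G index bits to {-1, +1} weights

# Full table: T[p] = sum_i signs(p)_i * x_i for all 2**G indices.
T = np.array([signs(p) @ x for p in range(2 ** G)])

# Complementing the index flips every sign, so T[~p] == -T[p]:
for p in range(2 ** (G - 1)):                            # store only the first half...
    assert np.isclose(T[p ^ (2 ** G - 1)], -T[p])        # ...and derive the rest by negation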

To integrate with existing GPU microarchitectures and software stacks, we extended the MMA instruction set, added new LMMA instructions, and developed a cuBLAS-like software stack for easy integration into existing DNN frameworks. We also created a compiler for end-to-end execution planning on GPUs with LUT Tensor Core. This design and workflow, illustrated in Figure 3, enabled the quick and seamless adoption of LUT Tensor Core.

Figure 3: Diagram of the LUT Tensor Core workflow. The left side shows operator fusion, where 'Norm' produces activations for pre-computation, and 'Weight Reinterpretation' processes low-bit weights. Both feed into LUT-mpGEMM, utilizing an activation LUT table and reinterpreted weights. The right side illustrates the LUT Tensor Core, comprising a LUT table for precomputed values, low-bit weights, and multiplexers (MUX) for computation.
Figure 3. The LUT Tensor Core workflow

Evaluating LUT Tensor Core

Testing LUT Tensor Core on low-bit LLMs, such as BitNet and Llama, showed significant performance gains, achieving 6.93 times the inference speed while using just 38.3% of the area of a traditional Tensor Core. With nearly identical model accuracy, this results in a 20.9-fold increase in computational density and an 11.2-fold boost in energy efficiency. As AI models grow in scale and complexity, LUT Tensor Core enables low-bit LLMs to be applied in new and diverse scenarios.

We believe the LUT technique could drive a paradigm shift in AI model inference. Traditional methods rely on multiplication and accumulation operations, whereas LUT implementations provide higher transistor density, greater throughput per chip area, lower energy costs, and better scalability. As large models adopt low-bit quantization, the LUT method could become the standard for system and hardware design, advancing the next generation of AI hardware innovation.

Unlocking new possibilities for embodied AI

Low-bit quantization improves the efficiency of running large models on edge devices while also enabling model scaling by reducing the bits used to represent each parameter. This scaling enhances model capabilities, generality, and expressiveness, as shown by the BitNet model, which starts with a low-bit configuration and expands.

Technologies like T-MAC, Ladder, and LUT Tensor Core provide solutions for running low-bit quantized LLMs, supporting efficient operation across edge devices and encouraging researchers to design and optimize LLMs using low-bit quantization. By reducing memory and computational demands, low-bit LLMs could power embodied AI systems, such as robots, enabling dynamic perception and real-time environmental interaction.

T-MAC and Ladder are open source and available on GitHub. We invite you to test and explore these innovations in AI technology with Microsoft Research.

The post Advances to low-bit quantization enable LLMs on edge devices appeared first on Microsoft Research.

Read More

Build a multi-interface AI assistant using Amazon Q and Slack with Amazon CloudFront clickable references from an Amazon S3 bucket

There is consistent customer feedback that AI assistants are the most useful when users can interface with them within the productivity tools they already use on a daily basis, to avoid switching applications and context. Web applications like Amazon Q Business and Slack have become essential environments for modern AI assistant deployment. This post explores how diverse interfaces enhance user interaction, improve accessibility, and cater to varying preferences.

By offering seamless experiences across environments, organizations can increase user satisfaction and adoption rates. The assistant employs Retrieval Augmented Generation (RAG), a technique that integrates credible and authoritative sources within responses across these interfaces, bolstering trustworthiness and educational value. This multi-interface, RAG-powered approach not only strives to meet the flexibility demands of modern users, but also fosters a more informed and engaged user base, ultimately maximizing the assistant’s effectiveness and reach. By combining RAG with multiple interfaces, the assistant delivers consistent, accurate, and contextually relevant information regardless of the user’s preferred environment and productivity tools.

Solution overview

The following diagram illustrates the application’s architectural design.

You can find the complete code and the steps to deploy the solution in the GitHub repository.

Prerequisites

You must have the following prerequisites:

Deploy the solution

For the set-up steps, refer to the README in the GitHub repo.

Solution components

In this section, we discuss two key components to the solution: the data sources and vector database.

Data sources

We use Spack documentation reStructuredText (RST) files uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. Whenever the assistant cites one of these files as a source, it links to the specific portion of the Spack documentation rather than the top of a source page (for example, the section on Spack images on Docker Hub).

Spack is a versatile package manager for supercomputers, Linux, and macOS that revolutionizes scientific software installation by allowing multiple versions, configurations, environments, and compilers to coexist on a single machine. Developed by Todd Gamblin at the Lawrence Livermore National Laboratory in 2013, Spack addresses the limitations of traditional package managers in high-performance computing (HPC) environments. Brian Weston, Cloud Transformation for Mission Science Program Lead at LLNL, advised on the development of this assistant.

Additionally, we use text files uploaded to an S3 bucket that is accessible through an Amazon CloudFront link. There is also an automated ingestion job from Slack conversation data to the S3 bucket powered by an AWS Lambda function. This enables the assistant to also use previous conversations from users to answer questions and cite its sources. We opted to use CloudFront links as opposed to using Slack links because when this source is cited in Amazon Q, the user might not have access to the Slack data. There is also an alternative to this methodology using the Slack connector for Amazon Kendra.

With some code changes, this solution could support other data types, such as PDFs and Word documents, as long as their text can be extracted and fed into the vector database. The raw files can then be served through a CloudFront distribution.

The following screenshot illustrates a sample CloudFront URL.

Upon deployment, existing data is automatically uploaded into an S3 bucket and processed to be used by the assistant. The solution also includes automatic daily ingestion of data from Slack into the application using Amazon EventBridge.

Vector database

This solution uses Amazon Kendra as its vector database, offering significant advantages in simplicity and cost-effectiveness. As a fully managed AWS service, Amazon Kendra reduces both development and maintenance costs. Amazon Q, which supports two types of retrievers (native retriever and Amazon Kendra), is seamlessly integrated into this setup. By using Amazon Kendra, the solution efficiently employs the same retriever for both the Amazon Q and Slack interfaces. This approach not only streamlines the overall architecture but also provides a more consistent user experience across both environments. The result is a cohesive, cost-efficient system that maintains uniformity in information retrieval and presentation, regardless of the user’s chosen interface.

Amazon Kendra also supports the use of metadata for each source file, which enables both UIs to provide a link to its sources, whether it is the Spack documentation website or a CloudFront link. Furthermore, Amazon Kendra supports relevance tuning, which makes it possible to boost certain data sources. For this solution, we boosted results from the Spack documentation.

User interfaces

In this section, we discuss the UIs used in this solution.

Amazon Q Business

Amazon Q Business uses RAG to offer a secure, knowledge-enhanced AI assistant tailored to your organization. As an AWS native solution, it seamlessly integrates with other AWS services and features its own user-friendly interface. This integration, combined with its straightforward setup and deployment process, provides a smooth implementation experience. By fusing generative AI capabilities with intelligent information retrieval from your enterprise systems, Amazon Q Business delivers precise, context-aware responses firmly rooted in your organization’s specific data and documents, enhancing its relevance and accuracy.

The following screenshot is an example of the Amazon Q Business UI.

Slack

Slack is a popular collaboration service that has become an integral part of many organizations’ communication forums. Its versatility extends beyond team messaging to serve as an effective interface for assistants. By integrating AI-powered assistants into Slack, companies can use its familiar environment to provide users with instant access to information.

The following screenshot shows an example of the Slack UI with a message thread.

Monitoring

Amazon Q has a built-in feature for an analytics dashboard that provides insights into user engagement within a specific Amazon Q Business application environment. It offers valuable data on usage patterns, conversation dynamics, user feedback, and query trends, allowing you to analyze and optimize your AI assistant’s performance and user interaction.

For Slack, we are collecting user feedback, as shown in the preceding screenshot of the UI. Users can add a “thumbs up” or a “thumbs down” to the assistant response to keep track of its performance. Furthermore, we have built a custom solution that uses an Amazon CloudWatch dashboard to mimic the Amazon Q analytics dashboard to further align the experience between the two applications.

The following screenshot shows an example of the Slack CloudWatch dashboard.

In addition, there is a daily scheduled Slack message that summarizes the Slackbot data for the past day, as shown in the following screenshot.

Clean up

To avoid incurring ongoing charges, clean up the resources you created as part of this post with the command mentioned in the readme.

Conclusion

The implementation of a multi-interface AI assistant using RAG represents a leap in AI-driven organizational communication. By integrating Amazon Q Business and Slack interfaces with a robust backend powered by Amazon Kendra, this solution offers seamless, environment-agnostic access to accurate, context-aware information. The architecture’s strengths lie in its consistency across environments, automatic data ingestion processes, and comprehensive monitoring capabilities. This approach not only enhances user engagement and productivity, but also positions organizations to adapt swiftly to evolving communication needs in an increasingly AI-centric landscape, marking a pivotal step towards more efficient and intelligent information management systems.

To learn more about the AWS services used in this solution, refer to the Amazon Q User Guide, Deploy a Slack gateway for Amazon Bedrock, and the Amazon Kendra Developer Guide.


About the Authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Dr. Ian Lunsford is an Aerospace Cloud Consultant at AWS Professional Services. He integrates cloud services into aerospace applications. Additionally, Ian focuses on building AI/ML solutions using AWS services.

Read More

Building More Builders: Gooey.AI Makes AI More Accessible Across Communities

When non-technical users can create and deploy reliable AI workflows, organizations can do more to serve their clientele

Platforms for developing no- and low-code solutions are bridging the gap between powerful AI models and everyone who’d like to harness them.

Gooey.AI, a member of the NVIDIA Inception program for cutting-edge startups, offers one such platform, enabling teams to tap into multiple AI tools to improve productivity for frontline workers across the globe. Cofounders Sean Blagsvedt and Archana Prasad join the NVIDIA AI Podcast to discuss how the startup’s platform is making AI development accessible to developers and non-coders alike.

The founders detail Gooey.AI’s evolution from a British Council-funded arts project to a comprehensive, open-source, cloud-hosted platform serving over 1 million users in diverse industries like agriculture, healthcare and frontline services. The company’s vision centers on democratizing AI development through shareable AI recipes, as well as helping ensure responsible implementation and representation of historically underserved communities in AI model-building.

Prasad and Blagsvedt discuss unique applications, such as multilingual chatbots that support African farmers via messaging apps and AI assistants that help heating, ventilation, and air conditioning technicians access technical documentation.

The rapid adoption of low-code AI platforms is helping organizations of all sizes and charters overcome technical barriers while improving access to expertise. As Blagsvedt noted, “You can’t [create] good technology that changes the world just by focusing on the technology — you have to find the problem worth solving.”

Learn more about the latest advancements in AI by registering for NVIDIA GTC, the conference for the era of AI, taking place March 17-21.

Time Stamps

00:31 – How a development platform began life as a British Council arts project called Dara.network.

17:53 – Working with the Gates Foundation, DigitalGreen and Opportunity International on agricultural chatbots.

33:21 – The influence of HTML standards and Kubernetes on Gooey.AI’s approach.

You Might Also Like… 

NVIDIA’s Louis Stewart on How AI Is Shaping Workforce Development

Louis Stewart, head of strategic initiatives for NVIDIA’s global developer ecosystem, discusses why workforce development is crucial for maximizing AI benefits. He emphasizes the importance of AI education, inclusivity and public-private partnerships in preparing the global workforce for the future. Engaging with AI tools and understanding their impact on the workforce landscape is vital for ensuring these changes benefit everyone.

Living Optics CEO Robin Wang on Democratizing Hyperspectral Imaging

Step into the realm of the unseen with Robin Wang, CEO of Living Optics. Living Optics’ hyperspectral imaging camera, which can capture visual data across 96 colors, reveals details invisible to the human eye. Potential applications are as diverse as monitoring plant health to detecting cracks in bridges. Living Optics aims to empower users across industries to gain new insights from richer, more informative datasets fueled by hyperspectral imaging technology.

Yotta CEO Sunil Gupta on Supercharging India’s Fast-Growing AI Market 

India’s AI market is expected to be massive. Yotta Data Services is setting its sights on supercharging it. Sunil Gupta, cofounder, managing director and CEO of Yotta Data Services, details the company’s Shakti Cloud offering, which provides scalable GPU services for enterprises of all sizes. Yotta is the first Indian cloud services provider in the NVIDIA Partner Network, and its Shakti Cloud is India’s fastest AI supercomputing infrastructure, with 16 exaflops of compute capacity supported by over 16,000 NVIDIA H100 GPUs.

Subscribe to the AI Podcast

Get the AI Podcast through Amazon Music, Apple Podcasts, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, SoundCloud, Spotify, Stitcher and TuneIn.

Read More

How GeForce RTX 50 Series GPUs Are Built to Supercharge Generative AI on PCs

NVIDIA’s GeForce RTX 5090 and 5080 GPUs — which are based on the groundbreaking NVIDIA Blackwell architecture — offer up to 8x faster frame rates with NVIDIA DLSS 4 technology, lower latency with NVIDIA Reflex 2 and enhanced graphical fidelity with NVIDIA RTX neural shaders.

These GPUs were built to accelerate the latest generative AI workloads, delivering up to 3,352 trillion AI operations per second (TOPS) and enabling incredible experiences for AI enthusiasts, gamers, creators and developers.

To help AI developers and enthusiasts harness these capabilities, NVIDIA at the CES trade show last month unveiled NVIDIA NIM and AI Blueprints for RTX. NVIDIA NIM microservices are prepackaged generative AI models that let developers and enthusiasts easily get started with generative AI, iterate quickly and harness the power of RTX for accelerating AI on Windows PCs. NVIDIA AI Blueprints are reference projects that show developers how to use NIM microservices to build the next generation of AI experiences.

NIM and AI Blueprints are optimized for GeForce RTX 50 Series GPUs. These technologies work together seamlessly to help developers and enthusiasts build, iterate and deliver cutting-edge AI experiences on AI PCs.

NVIDIA NIM Accelerates Generative AI on PCs

While AI model development is rapidly advancing, bringing these innovations to PCs remains a challenge for many people. Models posted on platforms like Hugging Face must be curated, adapted and quantized to run on PCs. They also need to be integrated into new AI application programming interfaces (APIs) to ensure compatibility with existing tools, and converted to optimized inference backends for peak performance.

NVIDIA NIM microservices for RTX AI PCs and workstations can ease the complexity of this process by providing access to community-driven and NVIDIA-developed AI models. These microservices are easy to download and connect to via industry-standard APIs and span the key modalities essential for AI PCs. They are also compatible with a wide range of AI tools and offer flexible deployment options, whether on PCs, in data centers, or in the cloud.

NIM microservices include everything needed to run optimized models on PCs with RTX GPUs, including prebuilt engines for specific GPUs, the NVIDIA TensorRT software development kit (SDK), the open-source NVIDIA TensorRT-LLM library for accelerated inference using Tensor Cores, and more.

Microsoft and NVIDIA worked together to enable NIM microservices and AI Blueprints for RTX in Windows Subsystem for Linux (WSL2). With WSL2, the same AI containers that run on data center GPUs can now run efficiently on RTX PCs, making it easier for developers to build, test and deploy AI models across platforms.

In addition, NIM and AI Blueprints harness key innovations of the Blackwell architecture that the GeForce RTX 50 series is built on, including fifth-generation Tensor Cores and support for FP4 precision.

Tensor Cores Drive Next-Gen AI Performance

AI calculations are incredibly demanding and require vast amounts of processing power. Whether generating images and videos or understanding language and making real-time decisions, AI models rely on hundreds of trillions of mathematical operations to be completed every second. To keep up, computers need specialized hardware built specifically for AI.

NVIDIA GeForce RTX desktop GPUs deliver up to 3,352 AI TOPS for unmatched speed and efficiency in AI-powered workflows.

In 2018, NVIDIA GeForce RTX GPUs changed the game by introducing Tensor Cores — dedicated AI processors designed to handle these intensive workloads. Unlike traditional computing cores, Tensor Cores are built to accelerate AI by performing calculations faster and more efficiently. This breakthrough helped bring AI-powered gaming, creative tools and productivity applications into the mainstream.

Blackwell architecture takes AI acceleration to the next level. The fifth-generation Tensor Cores in Blackwell GPUs deliver up to 3,352 AI TOPS to handle even more demanding AI tasks and simultaneously run multiple AI models. This means faster AI-driven experiences, from real-time rendering to intelligent assistants, that pave the way for greater innovation in gaming, content creation and beyond.

FP4 — Smaller Models, Bigger Performance

Another way to optimize AI performance is through quantization, a technique that reduces model sizes, enabling the models to run faster while reducing the memory requirements.

Enter FP4 — an advanced quantization format that allows AI models to run faster and leaner without compromising output quality. Compared with FP16, it reduces model size by up to 60% and more than doubles performance, with minimal degradation.
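
As a rough illustration of what this looks like in practice, here is a small sketch assuming the commonly used E2M1 layout, whose non-negative representable values are 0, 0.5, 1, 1.5, 2, 3, 4, and 6; it is not NVIDIA's production quantizer. Each block of weights shares one scale and each value is rounded to the nearest representable FP4 magnitude.

import numpy as np

# Non-negative magnitudes representable by an E2M1 (1 sign, 2 exponent, 1 mantissa bit) FP4 value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_fp4(block):
    """Quantize a block of FP16/FP32 weights to FP4 values with one shared scale per block."""
    scale = np.abs(block).max() / FP4_GRID[-1]                         # map the largest magnitude to 6.0
    idx = np.abs(np.abs(block)[:, None] / scale - FP4_GRID).argmin(1)  # nearest representable magnitude
    return np.sign(block) * FP4_GRID[idx], scale                       # FP4 values (in grid units) and the scale

w = np.random.randn(32).astype(np.float32)
w_q, s = quantize_fp4(w)
print("max abs error:", np.abs(w - w_q * s).max())                     # dequantize with the block scale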

For example, Black Forest Labs’ FLUX.1 [dev] model at FP16 requires over 23GB of VRAM, meaning it can only be supported by the GeForce RTX 4090 and professional GPUs. With FP4, FLUX.1 [dev] requires less than 10GB, so it can run locally on more GeForce RTX GPUs.

On a GeForce RTX 4090 with FP16, the FLUX.1 [dev] model can generate images in 15 seconds with just 30 steps. With a GeForce RTX 5090 with FP4, images can be generated in just over five seconds.

FP4 is natively supported by the Blackwell architecture, making it easier than ever to deploy high-performance AI on local PCs. It’s also integrated into NIM microservices, effectively optimizing models that were previously difficult to quantize. By enabling more efficient AI processing, FP4 helps to bring faster, smarter AI experiences for content creation.

AI Blueprints Power Advanced AI Workflows on RTX PCs

NVIDIA AI Blueprints, built on NIM microservices, provide prepackaged, optimized reference implementations that make it easier to develop advanced AI-powered projects — whether for digital humans, podcast generators or application assistants.

At CES, NVIDIA demonstrated PDF to Podcast, a blueprint that allows users to convert a PDF into a fun podcast, and even create a Q&A with the AI podcast host afterwards. This workflow integrates seven different AI models, all working in sync to deliver a dynamic, interactive experience.

The blueprint for PDF to podcast harnesses several AI models to seamlessly convert PDFs into engaging podcasts, complete with an interactive Q&A feature hosted by an AI-powered podcast host.

With AI Blueprints, users can quickly go from experimenting with to developing AI on RTX PCs and workstations.

NIM and AI Blueprints Coming Soon to RTX PCs and Workstations

Generative AI is pushing the boundaries of what’s possible across gaming, content creation and more. With NIM microservices and AI Blueprints, the latest AI advancements are no longer limited to the cloud — they’re now optimized for RTX PCs. With RTX GPUs, developers and enthusiasts can experiment, build and deploy AI locally, right from their PCs and workstations.

NIM microservices and AI Blueprints are coming soon, with initial hardware support for GeForce RTX 50 Series, GeForce RTX 4090 and 4080, and NVIDIA RTX 6000 and 5000 professional GPUs. Additional GPUs will be supported in the future.

Read More

AI Pays Off: Survey Reveals Financial Industry’s Latest Technological Trends

The financial services industry is reaching an important milestone with AI, as organizations move beyond testing and experimentation to successful AI implementation, driving business results.

NVIDIA’s fifth annual State of AI in Financial Services report shows how financial institutions have consolidated their AI efforts to focus on core applications, signaling a significant increase in AI capability and proficiency.

AI Helps Drive Revenue and Save Costs 

Companies investing in AI are seeing tangible benefits, including increased revenue and cost savings.

Nearly 70% of respondents report that AI has driven a revenue increase of 5% or more, with a dramatic rise in those seeing a 10-20% revenue boost. In addition, more than 60% of respondents say AI has helped reduce annual costs by 5% or more. Nearly a quarter of respondents are planning to use AI to create new business opportunities and revenue streams.

The top generative AI use cases in terms of return on investment (ROI) are trading and portfolio optimization, which account for 25% of responses, followed by customer experience and engagement at 21%. These figures highlight the practical, measurable benefits of AI as it transforms key business areas and drives financial gains.

Overcoming Barriers to AI Success

Half of management respondents said they’ve deployed their first generative AI service or application, with an additional 28% planning to do so within the next six months. A 50% decline in the number of respondents reporting a lack of AI budget suggests increasing dedication to AI development and resource allocation.

The challenges associated with early AI exploration are also diminishing. The survey revealed fewer companies reporting data issues and privacy concerns, as well as reduced concern over insufficient data for model training. These improvements reflect growing expertise and better data management practices within the industry.

As financial services firms allocate budget and grow more savvy at data management, they can better position themselves to harness AI for enhanced operational efficiency, security and innovation across business functions.

Generative AI Powers More Use Cases  

After data analytics, generative AI has emerged as the second-most-used AI workload in the financial services industry. The applications of the technology have expanded significantly, from enhancing customer experience to optimizing trading and portfolio management.

Notably, the use of generative AI for customer experience, particularly via chatbots and virtual assistants, has more than doubled, rising from 25% to 60%. This surge is driven by the increasing availability, cost efficiency and scalability of generative AI technologies for powering more sophisticated and accurate digital assistants that can enhance customer interactions.

More than half of the financial professionals surveyed are now using generative AI to enhance the speed and accuracy of critical tasks like document processing and report generation.

Financial institutions are also poised to benefit from agentic AI — systems that harness vast amounts of data from various sources and use sophisticated reasoning to autonomously solve complex, multistep problems. Banks and asset managers can use agentic AI systems to enhance risk management, automate compliance processes, optimize investment strategies and personalize customer services.

Advanced AI Drives Innovation

Recognizing the transformative potential of AI, companies are taking proactive steps to build AI factories — specially built accelerated computing platforms equipped with full-stack AI software — through cloud providers or on premises. This strategic focus on implementing high-value AI use cases is crucial to enhancing customer service, boosting revenue and reducing costs.

By tapping into advanced infrastructure and software, companies can streamline the development and deployment of AI models and position themselves to harness the power of agentic AI.

With industry leaders predicting at least 2x ROI on AI investments, financial institutions remain highly motivated to implement their highest-value AI use cases to drive efficiency and innovation.

Download the full report to learn more about how financial services companies are using accelerated computing and AI to transform services and business operations.

Read More

Enabling advanced GPU features in PyTorch - Warp Specialization

Meta: Hongtao Yu, Manman Ren, Bert Maher, Shane Nay
NVIDIA: Gustav Zhu, Shuhao Jiang

Over the past few months, we have been working on enabling advanced GPU features for PyTorch and Triton users through the Triton compiler. One of our key goals has been to introduce warp specialization support on NVIDIA Hopper GPUs. Today, we are thrilled to announce that our efforts have resulted in the rollout of fully automated Triton warp specialization, now available to users in the upcoming release of Triton 3.2, which will ship with PyTorch 2.6. PyTorch users can leverage this feature by implementing user-defined Triton kernels. This work leveraged an initial implementation of warp specialization in Triton by NVIDIA and we look forward to further development with the community in the future.

Warp specialization (WS) is a GPU programming technique where warps (a group of 32 threads on NVIDIA GPUs) within a threadblock are assigned distinct roles or tasks. This approach optimizes performance by enabling efficient execution of workloads that require task differentiation or cooperative processing. It enhances kernel performance by leveraging an asynchronous execution model, where different parts of the kernel are managed by separate hardware units. Data communication between these units, facilitated via shared memory on the NVIDIA H100, is highly efficient. Compared to a uniform warp approach, warp specialization allows the hardware multitasking warp scheduler to operate more effectively, maximizing resource utilization and overall performance.

Using GEMM as an example, a typical uniform warp approach on the H100 GPU involves 8 warps per thread block collectively computing a tile of the output tensor. These 8 warps are divided into two warp groups (WG), with each group cooperatively computing half of the tile using efficient warp-group-level MMA (WGMMA) instructions, as illustrated in Figure 1.

Figure 1. GEMM K-loop Body with Uniform Warps

The implementation is clean, easy to understand, and generally performs well, thanks to an elegant software pipeliner. The pipeliner’s purpose is to enhance instruction-level parallelism by executing non-dependent operations on different hardware units. For instance, load operations from the next loop iteration can be executed simultaneously with WGMMA operations in the current iteration. However, this approach relies heavily on the compiler to craft an instruction sequence that ensures load and WGMMA instructions are issued at precisely the right time. While this is relatively straightforward for GEMM, which involves a limited number of operations, it becomes significantly more challenging for more complex kernels, such as flash attention.

On the other hand, warp specialization addresses programming challenges by separating operations intended to run simultaneously on different hardware units into distinct warps, synchronizing them efficiently using low-cost barriers in shared memory. This allows each warp to have its own instruction sequence, enabling instructions to be issued and executed continuously without being interrupted by other operations, thanks to the multi-way warp scheduler. An illustration of a warp-specialized GEMM can be seen in Figure 2.

Figure 2. GEMM K-loop Body with Specialized Warps

How to enable WS

To enable warp specialization, users simply need to specify two autotune flags: num_consumer_groups and num_buffers_warp_spec. For example, a warp-specialized GEMM implementation might look as shown below. Users can enable warp specialization by setting a non-zero value for num_consumer_groups, which defines the number of consumer warp groups. There is no corresponding flag to set the number of producer warp groups, as currently only one producer is supported. The num_buffers_warp_spec flag specifies the number of buffers the producer warp group will use to communicate with the consumer warp groups. A working example of a warp-specialized kernel is provided in the persistent GEMM tutorial.

@triton.autotune(
    configs=[
        triton.Config(
            {
                # Config keys must match the kernel's tl.constexpr parameter names below.
                "BLOCK_M": 128,
                "BLOCK_N": 256,
                "BLOCK_K": 64,
            },
            num_stages=2,
            num_warps=4,
            num_consumer_groups=2,
            num_buffers_warp_spec=3,
        ),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def matmul_persistent_ws_kernel(
   a_ptr, b_ptr, c_ptr, M, N, K,
   stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
   pid = tl.program_id(axis=0)
   num_pid_m = tl.cdiv(M, BLOCK_M)
   num_pid_n = tl.cdiv(N, BLOCK_N)
   # Map the 1D program id to a 2D output tile index (row-major over tiles).
   pid_m = pid // num_pid_n
   pid_n = pid % num_pid_n
   offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
   offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
   offs_k = tl.arange(0, BLOCK_K)
   a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
   b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
   acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
   for k in range(0, tl.cdiv(K, BLOCK_K)):
       a = tl.load(a_ptrs)
       b = tl.load(b_ptrs)
       acc += tl.dot(a, b)
       a_ptrs += BLOCK_K * stride_ak
       b_ptrs += BLOCK_K * stride_bk
   c = acc.to(tl.float16)
   c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
   tl.store(c_ptrs, c)
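
For completeness, here is a hedged sketch of how such a kernel is typically launched from PyTorch. This is standard Triton launch boilerplate rather than code from the tutorial, and it assumes a Triton build that accepts the warp-specialization flags above; the grid simply covers all output tiles, and no masking is done, so the dimensions should be multiples of the chosen block sizes.

import torch
import triton

def matmul_ws(a, b):
    # a: (M, K), b: (K, N), both fp16 on an H100-class GPU.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = lambda META: (triton.cdiv(M, META["BLOCK_M"]) * triton.cdiv(N, META["BLOCK_N"]),)
    matmul_persistent_ws_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )
    return c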

Under the Hood

Warp specialization uses a set of hierarchical compiler transformations and IR changes to translate a user’s non-warp-specialized kernel into warp-specialized machine code. These include:

  • Task Partitioning: The entire kernel is automatically divided into asynchronous tasks based on predefined heuristics. The compiler determines how to utilize one producer warp group and a user-specified number of consumer warp groups to execute the kernel. It assigns task IDs to specific anchor operations, which then influence the task assignments for remaining operations through asynchronous task ID propagation and dependency analysis. Since shared memory is the most efficient method for data transfer between warp groups across all supported platforms, the compiler optimizes task partitions to minimize register spills to shared memory, ensuring efficient execution.
  • Data Partitioning for Multiple Consumer Groups: Efficiently partitioning data among multiple consumer groups is key to optimizing workload distribution. On the H100 GPU, the compiler, by default, attempts to partition the input tensor A along the M dimension, allowing each consumer group to compute half of the output tensor independently. This strategy, known as cooperative partitioning, maximizes efficiency under most conditions. However, if this split leads to inefficiencies—such as producing a workload smaller than the native WGMMA instruction size—the compiler dynamically adjusts and partitions along the N dimension instead.
  • Dataflow Pipelining: The compiler creates cyclic shared memory buffers to pipeline dataflows across multiple-dimensional loops. Warp-specialized pipelining supports complex control flow. For example, our warp-specialized persistent GEMM kernel uses a doubly-nested loop, allowing the producer to begin fetching data for the next output tile while the consumer is finishing the compute for the prior tile.
  • Communication Operations: We introduced four high-level Triton GPU IR (TTGIR) communication operations—ProducerAcquireOp, ProducerCommitOp, ConsumerWaitOp, and ConsumerReleaseOp—to manage pipelined dataflows. These support both TMA and non-TMA memory operations. (A conceptual sketch of this handshake follows the list.)
  • Code Partitioning: Each async task is outlined into its own standalone code region, guarded by warp group ID checks. Control dependencies are duplicated as needed.
  • TTGIR to LLVM/PTX Materialization: TTGIR communication operations are materialized into corresponding LLVM/PTX barrier operations.
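
The four communication operations can be pictured with a simple CPU-side analogy: a producer thread and a consumer thread hand off slots of a small cyclic buffer using two semaphores. This is only a conceptual sketch in plain Python, not Triton IR, and the real mechanism uses low-cost barriers in shared memory.

import threading

NUM_BUFFERS = 3
buffers = [None] * NUM_BUFFERS
free_slots = threading.Semaphore(NUM_BUFFERS)   # ProducerAcquire waits on this
full_slots = threading.Semaphore(0)             # ConsumerWait waits on this

def producer(num_tiles):
    for i in range(num_tiles):
        free_slots.acquire()                    # ProducerAcquireOp: wait for a free buffer slot
        buffers[i % NUM_BUFFERS] = f"tile-{i}"  # async copy into shared memory (here: a list)
        full_slots.release()                    # ProducerCommitOp: mark the slot as ready

def consumer(num_tiles):
    for i in range(num_tiles):
        full_slots.acquire()                    # ConsumerWaitOp: wait until data is committed
        _ = buffers[i % NUM_BUFFERS]            # compute on the tile (e.g., WGMMA)
        free_slots.release()                    # ConsumerReleaseOp: hand the slot back

t1 = threading.Thread(target=producer, args=(8,))
t2 = threading.Thread(target=consumer, args=(8,))
t1.start(); t2.start(); t1.join(); t2.join()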

Performance

The warp specialization release introduces a range of Triton compiler transformations that collectively convert user code into warp-specialized kernels. This feature has been applied to several key kernels, including Flash Attention and FP8 row-wise GEMM, resulting in significant performance gains of 10% to 15%. Below, we highlight the latest performance metrics for these high-impact kernels.

Future Work

Looking ahead, we plan to further enhance Triton’s warp specialization support by introducing new features such as Ping-Pong scheduling, expanded buffer sharing support, improved transparent handling for TMA, and refined partitioning heuristics for upcoming NVIDIA hardware.

Read More

Reinforcement Learning for Long-Horizon Interactive LLM Agents

Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction-tuned large language models (LLMs) can react to feedback from interface invocations in multi-step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov… (Apple Machine Learning Research)

Orchestrate seamless business systems integrations using Amazon Bedrock Agents

Generative AI has revolutionized technology through generating content and solving complex problems. To fully take advantage of this potential, seamless integration with existing business systems and efficient access to data are crucial. Amazon Bedrock Agents provides the integration capabilities to connect generative AI models with the wealth of information and workflows already in place within an organization, enabling the creation of efficient and impactful generative AI applications.

Amazon Bedrock is a fully managed service that enables the development and deployment of generative AI applications using high-performance foundation models (FMs) from leading AI companies through a single API. Amazon Bedrock Agents allows you to streamline workflows and automate repetitive tasks across your company systems and data sources, while maintaining security, privacy, and responsible AI practices. Using these agents, you can enable generative AI applications to execute multiple tasks across your company systems and data sources. Businesses can now unlock the power of generative AI to automate tasks, generate content, and solve complex problems—all while maintaining connectivity to critical enterprise systems and data sources.

This post showcases how generative AI can apply logic and reasoning to orchestrate integrations, using a fictitious business process. It demonstrates strategies and techniques for orchestrating Amazon Bedrock agents and action groups to seamlessly integrate generative AI with existing business systems, enabling efficient data access and unlocking the full potential of generative AI.

This solution also integrates with Appian Case Management Studio. Cases are a vital part of case management applications and represent a series of tasks to complete or a multi-step problem to solve. Appian Case Management Studio is an out-of-the box suite of applications that facilitates rapid development of case management apps. The fictitious business process used in this post creates a case in Appian for further review.

Business workflow

The following workflow shows the fictitious business process.

The workflow consists of the following steps:

  1. The user asks the generative AI assistant to determine if a device needs review.
  2. If a device type is provided, the assistant checks if it’s a Type 3 device.
  3. If it’s a Type 3 device, the assistant asks the user for the device name.
  4. The assistant checks if a document exists with the provided name.
  5. If the document exists, the assistant creates a case in Appian to start a review.
  6. If the document doesn’t exist, the assistant sends an email for review.

Solution overview

The following diagram illustrates the architecture of the solution.

The system workflow includes the following steps:

  1. The user interacts with the generative AI application, which connects to Amazon Bedrock Agents.
  2. The application uses Amazon Bedrock Knowledge Bases to answer the user questions. These knowledge bases are created with Amazon Simple Storage Service (Amazon S3) as the data source and Amazon Titan (or another model of your choice) as the embedding model.
  3. Amazon Bedrock Agents uses action groups to integrate with different systems.
  4. The action groups call different AWS Lambda functions within private subnet of a virtual private cloud (VPC).
  5. The agent uses a tree-of-thought (ToT) prompt to execute different actions from the action groups.
  6. A Lambda function fetches the classification of the device from Amazon DynamoDB. The function invokes DynamoDB using a gateway endpoint. (A minimal sketch of this function follows the list.)
  7. A Lambda function checks if quality documents exist in Amazon S3. The function invokes Amazon S3 using interface endpoints.
  8. A Lambda function calls the Appian REST API using a NAT gateway in a public subnet.
  9. The Appian key is stored in AWS Secrets Manager.
  10. A Lambda function uses AWS Identity and Access Management (IAM) permissions to make an SDK call to Amazon Simple Email Service (Amazon SES). Amazon SES sends an email using SMTP to verified emails provided by the user.
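
To make step 6 concrete, the following is a minimal sketch of such a Lambda function. The table name, key schema, and attribute names are hypothetical, and the response wrapper reflects our understanding of the Amazon Bedrock Agents action-group contract, so verify it against the current documentation before relying on it.

import json
import os
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("DEVICE_TABLE", "DeviceClassification"))  # hypothetical table name

def lambda_handler(event, context):
    # The agent passes the parameters defined in the action group's OpenAPI spec.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    device_type = params.get("deviceType")  # hypothetical parameter name

    item = table.get_item(Key={"deviceType": device_type}).get("Item", {})  # hypothetical key schema
    body = {"deviceType": device_type, "classification": item.get("classification", "UNKNOWN")}

    # Response wrapper expected by Amazon Bedrock Agents action groups (shape may evolve).
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": event.get("apiPath"),
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": 200,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }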

Prerequisites

You will need the following prerequisites before you can build the solution:

  • A valid AWS account.
  • Access to Anthropic’s Claude 3 Sonnet or the model you intend to use (for more information, see Access Amazon Bedrock foundation models). For this post, we use Anthropic’s Claude 3 Sonnet, and all instructions pertain to that model. If you want to use another FM, update the prompts accordingly.
  • An IAM role in the account that has sufficient permissions to create the necessary resources.
  • AWS CloudTrail logging enabled for operational and risk auditing. For more details, see Creating a trail for your AWS account.
  • AWS Budgets policy notifications enabled to protect you from unwanted billing. For more details, see Enable Budget policy.
  • Two email addresses to send and receive emails. Do not use existing verified identities in Amazon SES for these email addresses. The AWS CloudFormation template will fail otherwise.

This solution is supported only in the us-east-1 AWS Region. You can make the necessary changes to the CloudFormation template to deploy to other Regions.

Create an Appian account

Depending on your needs, follow the corresponding steps to create an Appian account.

Sign up for Appian Community Edition for personal use

The Appian Community Edition provides a personal environment for learning and exploration at no additional cost. To sign up for the Appian Community Edition, complete the following steps:

  1. Visit the Appian Community Edition page.
  2. Enter your email address and choose Submit to receive confirmation and login details.
  3. Check your inbox for a verification email from Appian.
  4. Choose the link in the email to validate your email address and finish setting up your account by providing your first name, last name, email, and password, then accept the terms.
  5. Choose Register to complete the registration.
  6. Choose the activation link and log in with your email address and password.
  7. Complete your profile by entering information about your company, phone number, and learning interests, among other details.
  8. Choose Access Environment.
  9. Choose your region (USA, India, or Germany) by choosing the appropriate link.
  10. Navigate to Appian Designer and start exploring Appian’s features and capabilities.

Purchase Appian Platform for business use

If you’re evaluating Appian for your organization, complete the following steps:

  1. Visit the Appian Platform listing at AWS Marketplace.
  2. Choose View purchase options.
  3. Fill out the contract form by providing your duration, renewal settings, and contract options.
  4. Choose Create Contract to submit your request.

An Appian representative will contact you to discuss your needs. They might provide access to a trial environment or schedule a personalized demo.

  5. Follow the instructions provided by the Appian representative to access your account.

By following these steps, you can create an Appian account suited to your personal learning or business evaluation needs. Whether you’re exploring Appian’s platform individually or assessing it for your organization, Appian provides resources and support to help you get started.

Note the following values, which we will use in the CloudFormation template below.

  • AppianHostEndpoint
  • AppianAPIKey

Deploy the CloudFormation template

Complete the following steps to deploy the CloudFormation template:

  1. Download the CloudFormation template.
  2. Open the AWS CloudFormation console in the us-east-1 Region.
  3. Choose Stacks in the navigation pane, then choose Create stack.
  4. Upload the template and choose Next.
  5. For Stack name, enter a name, such as QualityReviewStack.
  6. In the Parameters section, provide the following information:
    1. For DynamoDBTableName, enter the name of the DynamoDB table.
    2. For Fromemailaddress, enter the email address to send emails.
    3. For Toemailaddress, enter the email address to receive emails.
    4. For AppianHostEndpoint, enter the AppianHostEndpoint value captured earlier.
    5. For AppianAPIKey, enter the AppianAPIKey value captured earlier.
  7. Leave other settings as default and choose Next.

  8. Under Capabilities on the last page, select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Submit to create the CloudFormation stack.
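
If you prefer to script the deployment, the following is a minimal boto3 sketch under the same assumptions as the console steps above; the template file name and all parameter values are placeholders to substitute with your own.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Hypothetical local file name for the downloaded template
with open("quality-review-template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="QualityReviewStack",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "DynamoDBTableName", "ParameterValue": "<table-name>"},
        {"ParameterKey": "Fromemailaddress", "ParameterValue": "sender@example.com"},
        {"ParameterKey": "Toemailaddress", "ParameterValue": "reviewer@example.com"},
        {"ParameterKey": "AppianHostEndpoint", "ParameterValue": "<AppianHostEndpoint>"},
        {"ParameterKey": "AppianAPIKey", "ParameterValue": "<AppianAPIKey>"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # acknowledge that the stack creates IAM resources
)

# Block until the stack finishes creating
cfn.get_waiter("stack_create_complete").wait(StackName="QualityReviewStack")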

After the stack deploys successfully, a verification email is sent to each email address you provided earlier.

  10. Verify the newly created email identities by choosing the link in each email.
  11. On the Resources tab of the CloudFormation stack, make a note of the physical IDs for the following resource logical IDs; you will need them later. You can also look them up programmatically, as shown in the sketch after this list.
    1. OpenAPISpecsS3Bucket
    2. QualityFormsBucket
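
The following is a small boto3 sketch for retrieving these physical IDs, assuming the stack name used above (QualityReviewStack):

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

resources = cfn.describe_stack_resources(StackName="QualityReviewStack")["StackResources"]
bucket_ids = {
    r["LogicalResourceId"]: r["PhysicalResourceId"]
    for r in resources
    if r["LogicalResourceId"] in ("OpenAPISpecsS3Bucket", "QualityFormsBucket")
}
print(bucket_ids)  # maps each logical ID to the actual bucket name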

This post does not cover auto scaling of AWS Lambda. To integrate Lambda with AWS Application Auto Scaling, see AWS Lambda and Application Auto Scaling.

Upload Open API files to the S3 bucket

Complete the following steps to upload the Open API specifications to Amazon S3:

  1. Download the following Open API specifications:
    1. Device Classification (deviceclassification.json)
    2. Verify Quality Documents (verifyQualityDocuments.json)
    3. Email Reviewers (emailReviewers.json)
    4. Appian Case (appian-case.json)
  2. On the Amazon S3 console, navigate to the OpenAPISpecsS3Bucket captured earlier.
  3. Upload the downloaded files to the bucket.
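
If you prefer to script the upload, here is a minimal boto3 sketch; the bucket name placeholder is the OpenAPISpecsS3Bucket physical ID noted earlier, and the files are assumed to be in the current working directory.

import boto3

s3 = boto3.client("s3")
bucket = "<OpenAPISpecsS3Bucket-physical-id>"

for file_name in (
    "deviceclassification.json",
    "verifyQualityDocuments.json",
    "emailReviewers.json",
    "appian-case.json",
):
    # The object key defaults to the local file name here
    s3.upload_file(file_name, bucket, file_name)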

Upload the quality forms to the S3 bucket

Complete the following steps to upload the quality form to Amazon S3:

  1. Download the dummy quality form.
  2. On the AWS CloudFormation console, navigate to the Resources tab of the stack and choose the link next to the physical ID of QualityFormsBucket.

  3. Upload the downloaded dummy quality form to the bucket.

Create an effective prompt

Before we configure the agent, we define a prompt. Prompts are the textual inputs that guide an agent’s behavior and responses, and they are key to unlocking the full potential of Amazon Bedrock Agents. Well-designed prompts help make sure that the agent understands the context, intent, and desired output.

When creating prompts, consider the following best practices:

  • Provide clear and concise instructions
  • Include relevant background information and context
  • Follow the model best practices to format the prompt

Amazon Bedrock Agents supports advanced prompting techniques such as chain-of-thought (CoT) and tree-of-thought (ToT) prompting. CoT prompting enhances the reasoning capabilities of FMs by breaking down complex questions or tasks into smaller, more manageable steps. ToT prompting improves FM reasoning by breaking a larger problem statement into a treelike format, where each problem is divided into smaller subproblems. For this solution, we use ToT prompting: we first break the business process down into logical steps and then incorporate the model’s formatting guidance.

The following is the prompt developed for Anthropic’s Claude 3 Sonnet:

You are an agent that helps determine if a device requires a quality review, and you always use action groups to answer. To verify if a review is needed, follow these steps:

1. Ask the user to provide the device type. If not provided, prompt for it.
2. Fetch the device classification from the database based on the provided device type using deviceClassification action group
3. If the classification returned from action group is Class III or 3
4. Ask the user for the specific device name.
5. Check if the device name has quality review forms using the verifyifformsExists action group
6. If a quality review document exists:
7. Prepare an email with the relevant content.
8. Ask for the to email address and from email address
9. Send the email to the user.
10. If no quality review document exists, create a case.

Create an Amazon Bedrock Agent

The first step in configuring Amazon Bedrock Agents is to define their capabilities. Amazon Bedrock agents can be trained to perform a wide range of tasks, from natural language processing and generation to task completion and decision-making. When defining an agent’s capabilities, consider the specific use case and the desired outcomes.

To create an agent, complete the following steps:

  1. On the Amazon Bedrock console, choose Agents in the navigation pane.
  2. Choose Create Agent.

  3. In the Agent details section, enter a name for the agent and an optional description.
  4. Choose Create.

  5. In the agent builder, choose Create and use a new service role for the agent resource role.

  6. Choose Anthropic’s Claude 3 Sonnet as the model.
  7. In the Instructions for the Agent section, provide the prompt crafted earlier.

  8. In the Additional settings section, for User input, select Enabled.

  9. Choose Save and exit to save the agent.
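
The same agent can also be created programmatically with the bedrock-agent API. The following is a hedged sketch; the agent name, prompt file, model ID, and service role ARN are placeholders you would replace with your own values.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# The prompt crafted earlier, saved to a local file (hypothetical file name)
with open("agent_prompt.txt") as f:
    instruction = f.read()

response = bedrock_agent.create_agent(
    agentName="quality-review-agent",
    foundationModel="anthropic.claude-3-sonnet-20240229-v1:0",  # confirm the model ID available in your Region
    instruction=instruction,
    agentResourceRoleArn="arn:aws:iam::111122223333:role/<agent-service-role>",  # placeholder role ARN
    idleSessionTTLInSeconds=600,
)
agent_id = response["agent"]["agentId"]
print(agent_id)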

Create action groups

Complete the following steps to create the action groups for the newly created agent:

  1. On the Amazon Bedrock console, choose Agents in the navigation pane.
  2. Choose the newly created agent and choose Edit in Agent Builder.
  3. In the Action groups section, choose Add.

  4. In the Action group details section, change the automatically generated name to checkdeviceclassification and provide an optional description for your action group.
  5. In the Action group type section, select Define with API schemas to use the OpenAPI schema.

  6. In the Action group invocation section, select Select an existing Lambda function.
  7. On the drop-down menu, choose the Lambda function with the name containing DeviceClassification.

  8. In the Action group schema section, select Define via in-line schema editor to define the schema.
  9. On the drop-down menu in the schema editor, choose JSON.
  10. Open the device classification file downloaded earlier and copy the content of the schema file.
  11. Enter the content in the schema editor.

  12. Choose Create to create an action group.
  13. Repeat the preceding steps to create additional action groups. Use the following table to map the action groups to the respective Lambda functions and Open API schemas.
Action Group Name            Lambda Function Name Containing    Open API Schema
checkdeviceclassification    DeviceClassification               deviceclassification.json
verifyqualitydocuments       VerifyQualityDocuments             verifyQualityDocuments.json
emailreviewers               EmailReviewers                     emailReviewers.json
appiancase                   Appian                             appian-case.json
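
Action groups can also be registered through the bedrock-agent API. The sketch below adds the first action group from the table; the agent ID and Lambda function ARN are placeholders, and the schema is supplied inline from the downloaded file.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

with open("deviceclassification.json") as f:
    api_schema = f.read()

bedrock_agent.create_agent_action_group(
    agentId="<agent-id>",
    agentVersion="DRAFT",  # action groups are attached to the working draft
    actionGroupName="checkdeviceclassification",
    actionGroupExecutor={
        "lambda": "arn:aws:lambda:us-east-1:111122223333:function:<DeviceClassification-function>"
    },
    apiSchema={"payload": api_schema},  # an S3 location can be used instead of an inline payload
)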

To customize the agent’s behavior to your specific use case, you can modify the prompt templates for the preprocessing, orchestration, knowledge base response generation, and postprocessing steps. For more information, see Enhance agent’s accuracy using advanced prompt templates in Amazon Bedrock.

Create a knowledge base

You can create an Amazon Bedrock knowledge base to retrieve information from your proprietary data and generate responses to answer natural language questions. As part of creating a knowledge base, you configure a data source and a vector store of your choice.

The prompt crafted earlier provides instructions that are not dependent on a knowledge base. To use a knowledge base, modify the prompt accordingly.

Prepare the agent

Complete the following steps to prepare the agent for deployment:

  1. On the Amazon Bedrock console, navigate to the agent you created.
  2. In the agent builder, choose Save.

After the agent is saved, the Prepare button will be enabled.

  3. Choose Prepare to build the agent.
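
Programmatically, this step is a single bedrock-agent API call; the agent ID below is a placeholder.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Builds the working draft (DRAFT version) so it can be tested
bedrock_agent.prepare_agent(agentId="<agent-id>")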

Test the agent

To test the agent, we use the Amazon Bedrock agent console. You can embed the API calls into your applications.

If you use AWS published API calls to access Amazon Bedrock over the network instead, your client must meet standard AWS API requirements, such as support for TLS 1.2 or later and requests signed with your AWS credentials.
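
For reference, the following is a minimal sketch of invoking the agent from your own code with the bedrock-agent-runtime API. The agent ID is a placeholder, and TSTALIASID is assumed here as the built-in test alias for the working draft.

import uuid
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.invoke_agent(
    agentId="<agent-id>",
    agentAliasId="TSTALIASID",  # test alias pointing to the draft version
    sessionId=str(uuid.uuid4()),
    inputText="verify if the device requires review",
)

# The agent's reply is returned as an event stream of chunks
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)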

Complete the following steps to test the agent on the Amazon Bedrock console:

  1. On the Test page for the agent, choose the arrows icon to enlarge the test window.

  2. In the message bar, enter “verify if the device requires review.”

The agent will respond by asking for the type of device.

  3. Enter “HIV diagnostic tests.”

The CloudFormation template only deploys “HIV diagnostic tests” as a Type 3 device.

The agent fetches the classification of the device from the DynamoDB table. You can update the CloudFormation template to add more values, or add items directly to the table, as shown in the following sketch.
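
A minimal boto3 sketch for adding an item directly follows. The attribute names are illustrative assumptions; check the deployed table’s actual key schema before writing items.

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("<DynamoDBTableName>")  # the table name passed to the stack

# Hypothetical attribute names; align them with the table's real schema
table.put_item(Item={"DeviceType": "Pacemakers", "Classification": "Class III"})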

Because the classification of HIV diagnostic tests is Type 3, the agent will ask for the device name to verify if the quality document exists.

  4. Enter anytech.

The agent will verify if the document with the name anytech exists in Amazon S3. (Earlier, you uploaded a dummy document for anytech.)

The agent should now ask for an email address to receive the quality review request.

An email will be sent with the review details.

  5. Repeat the preceding steps, but this time enter anytechorg as the document name.

We did not upload a document named anytechorg, so the agent will create a case by asking for the following information:

  • First name
  • Last name
  • Mobile phone number
  • Description
  • Title of the case

  6. Provide the required information to the agent.

The agent now creates a case.

Best practices

When building generative AI applications, follow established best practices for efficient and well-architected design, such as the guidance in the AWS Well-Architected Framework.

Clean up

To avoid incurring future charges, delete the resources you created. To clean up the AWS environment, complete the following steps:

  1. Empty the contents of the S3 buckets you created as part of the CloudFormation stack.
  2. Delete the agent from Amazon Bedrock.
  3. Delete the CloudFormation stack you created.
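
The cleanup can also be scripted. The following boto3 sketch assumes the placeholders are replaced with the bucket names, agent ID, and stack name from your deployment.

import boto3

# 1. Empty the S3 buckets created by the stack
s3 = boto3.resource("s3")
for bucket_name in ("<OpenAPISpecsS3Bucket-physical-id>", "<QualityFormsBucket-physical-id>"):
    s3.Bucket(bucket_name).objects.all().delete()

# 2. Delete the Amazon Bedrock agent
boto3.client("bedrock-agent", region_name="us-east-1").delete_agent(
    agentId="<agent-id>", skipResourceInUseCheck=True
)

# 3. Delete the CloudFormation stack
boto3.client("cloudformation", region_name="us-east-1").delete_stack(
    StackName="QualityReviewStack"
)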

Conclusion

Integrating generative AI with existing systems is crucial to unlocking its transformative potential. By using tools like Amazon Bedrock Agents, organizations can seamlessly connect generative AI to core data and workflows, enabling automation, content generation, and problem-solving while maintaining connectivity. The strategies and techniques showcased in this post demonstrate how generative AI can be orchestrated to drive maximum value across a wide range of use cases, from extracting intelligence from regulatory submissions to providing prescriptive guidance to industry. As generative AI continues to evolve, the ability to integrate it with existing infrastructure will be paramount to realizing its true business impact.

To get started with integrating generative AI into your business, explore How Amazon Bedrock Agents works and discover how you can unlock the transformative potential of this technology across your organization.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.


About the Authors

Sujatha Dantuluri is a seasoned Senior Solutions Architect in the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences.

Arianna Burgman is a Solutions Architect at AWS based in NYC, supporting state and local government agencies. She is a data and AI enthusiast with experience collaborating with organizations to architect technical solutions that further their missions for continuous innovation and positive, lasting impact.

Annie Cimack is an Associate Solutions Architect based in Arlington, VA, supporting public sector customers across the federal government as well as higher education. Her area of focus is data analytics, and she works closely with customers of all sizes to support projects ranging from storage to intelligent document processing.

Sunil Bemarkar is a Sr. Partner Solutions Architect at AWS based out of San Francisco with over 20 years of experience in the information technology field. He works with various independent software vendors and AWS partners specialized in cloud management tools and DevOps segments to develop joint solutions and accelerate cloud adoption on AWS.

Marcelo Silva is a Principal Product Manager at Amazon Web Services, leading strategy and growth for Amazon Bedrock Knowledge Bases and Amazon Lex.

NVIDIA Blackwell Now Generally Available in the Cloud

AI reasoning models and agents are set to transform industries, but delivering their full potential at scale requires massive compute and optimized software. The “reasoning” process involves multiple models generating many additional tokens, and it demands infrastructure with a combination of high-speed communication, memory and compute to ensure real-time, high-quality results.

To meet this demand, CoreWeave has launched NVIDIA GB200 NVL72-based instances, becoming the first cloud service provider to make the NVIDIA Blackwell platform generally available.

With rack-scale NVIDIA NVLink across 72 NVIDIA Blackwell GPUs and 36 NVIDIA Grace CPUs, scaling to up to 110,000 GPUs with NVIDIA Quantum-2 InfiniBand networking, these instances provide the scale and performance needed to build and deploy the next generation of AI reasoning models and agents.

NVIDIA GB200 NVL72 on CoreWeave 

NVIDIA GB200 NVL72 is a liquid-cooled, rack-scale solution with a 72-GPU NVLink domain, which enables the six dozen GPUs to act as a single massive GPU.

NVIDIA Blackwell features many technological breakthroughs that accelerate inference token generation, boosting performance while reducing service costs. For example, fifth-generation NVLink enables 130TB/s of GPU bandwidth in one 72-GPU NVLink domain, and the second-generation Transformer Engine enables FP4 for faster AI performance while maintaining high accuracy.

CoreWeave’s portfolio of managed cloud services is purpose-built for Blackwell. CoreWeave Kubernetes Service optimizes workload orchestration by exposing NVLink domain IDs, ensuring efficient scheduling within the same rack. Slurm on Kubernetes (SUNK) supports the topology block plug-in, enabling intelligent workload distribution across GB200 NVL72 racks. In addition, CoreWeave’s Observability Platform provides real-time insights into NVLink performance, GPU utilization and temperatures.

CoreWeave’s GB200 NVL72 instances feature NVIDIA Quantum-2 InfiniBand networking that delivers 400Gb/s bandwidth per GPU for clusters up to 110,000 GPUs. NVIDIA BlueField-3 DPUs also provide accelerated multi-tenant cloud networking, high-performance data access and GPU compute elasticity for these instances.

Full-Stack Accelerated Computing Platform for Enterprise AI 

NVIDIA’s full-stack AI platform pairs cutting-edge software with Blackwell-powered infrastructure to help enterprises build fast, accurate and scalable AI agents.

NVIDIA Blueprints provides pre-defined, customizable, ready-to-deploy reference workflows to help developers create real-world applications. NVIDIA NIM is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI models for inference. NVIDIA NeMo includes tools for training, customization and continuous improvement of AI models for modern enterprise use cases. Enterprises can use NVIDIA Blueprints, NIM and NeMo to build and fine-tune models for their specialized AI agents.

These software components, all part of the NVIDIA AI Enterprise software platform, are key enablers to delivering agentic AI at scale and can readily be deployed on CoreWeave.

Bringing Next-Generation AI to the Cloud 

The general availability of NVIDIA GB200 NVL72-based instances on CoreWeave underscores the latest in the companies’ collaboration, focused on delivering the latest accelerated computing solutions to the cloud. With the launch of these instances, enterprises now have access to the scale and performance needed to power the next wave of AI reasoning models and agents.

Customers can start provisioning GB200 NVL72-based instances through CoreWeave Kubernetes Service in the US-WEST-01 region using the gb200-4x instance ID. To get started, contact CoreWeave.
