Getting started with PyTorch, ExecuTorch, and Ethos-U85 in three easy steps

ExecuTorch support for Ethos-U85

In the rapidly evolving landscape of machine learning, PyTorch has emerged as a leading framework for model development, given its flexibility and comprehensive ecosystem. Arm has worked with Meta to introduce support for Arm platforms in ExecuTorch, which further simplifies this process, making it seamless to deploy PyTorch models on edge devices.

The Arm Ethos-U85 NPU is the highest-performing Ethos NPU, addressing the growing demand for running advanced AI inference workloads at the edge, including transformer-based networks like LLMs. Arm offers reference designs around the Ethos-U, including the Corstone-320 IoT reference design platform, to accelerate and simplify the chip development cycle. The reference design platform includes, among other components, a Fixed Virtual Platform (FVP) that simulates an entire system, enabling cutting-edge embedded software development and neural network deployment for the Ethos-U85.

Today, Arm is extending support for developers building IoT edge applications with ExecuTorch beta support on Ethos-U85. Leveraging ExecuTorch, developers can now efficiently deploy their natively developed PyTorch models to enable intelligent and responsive IoT solutions built on Arm.

With this package now available, thousands of developers looking to create Edge AI applications can start their model and application development months before the platforms arrive on the market.

Getting started with ExecuTorch on Ethos-U85

A full development environment has been provided in the public ExecuTorch GitHub repository. This provides an integrated and tested development flow with all necessary components.

The three simple steps are:

  1. Set up ExecuTorch
  2. Set up the Arm build environment
  3. Compile and run models on the arm_executor_runner

You can then build on this flow for compiling and running models, to capture runtime behavior from the Ethos-U85 driver, such as cycle count information.

To make the process easier for end users, we have also added scripts to the ExecuTorch repository:

  1. setup.sh: Download the necessary software.
  2. run.sh: Compile and run the model on the Corstone-320 FVP.

To build other models, you can use the ahead-of-time compiler script aot_arm_compiler.py, which compiles a PyTorch module (nn.Module) into an ExecuTorch program (a .pte flatbuffer file). To write custom applications that use ExecuTorch, you can follow the application flow in the example executor_runner application.

We support approximately 40 core ATen operators and already support end-to-end deployment of models such as MobileNetV2. Ongoing efforts to support further operators will enable more PyTorch models every week.

As more functionality is added, it will be demonstrated through the tutorial materials for Ethos-U on pytorch.org.

How this deployment flow works in more detail

Leveraging the extensibility of ExecuTorch and the expressiveness of Arm’s Tensor Operator Set Architecture (TOSA), we have enabled Ethos-U support in ExecuTorch. The Ethos-U compiler, Vela, has been enhanced with a TOSA front-end, making it possible to compile models for all products in the Ethos-U family. Combining these components into a cohesive workflow involves the following steps.

  1. Convert a PyTorch model into a deployable ExecuTorch program (AOT flow)
  2. Compile the ExecuTorch program into an executable, which can be deployed on Corstone-320 (runtime flow)

The ExecuTorch Ahead-of-Time (AOT) flow

The process begins by converting a PyTorch model into a quantized TOSA representation using the PyTorch dynamo export flow. This allows us to generate an Ethos-U set of machine instructions, known as a command stream, utilizing the Vela compiler TOSA frontend. The command stream is bundled into an ExecuTorch program, represented by a flatbuffer file (.pte). This file contains everything the ExecuTorch runtime needs to perform inference using Ethos-U hardware.
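
For orientation, the following is a minimal Python sketch of the shape of this AOT flow. It is not the contents of aot_arm_compiler.py: the Arm-specific imports and names (ArmQuantizer, ArmPartitioner, ArmCompileSpecBuilder) and the "ethos-u85-128" target string are assumptions based on the ExecuTorch Arm examples and may differ between releases, so treat it as a starting point only.

import torch
from executorch.exir import to_edge
# The Arm-specific imports below are assumptions; check aot_arm_compiler.py for the current paths.
from executorch.backends.arm.arm_backend import ArmCompileSpecBuilder
from executorch.backends.arm.arm_partitioner import ArmPartitioner
from executorch.backends.arm.quantizer.arm_quantizer import ArmQuantizer
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

class SmallModel(torch.nn.Module):  # hypothetical example model
    def forward(self, x):
        return torch.nn.functional.relu(x)

model, example_inputs = SmallModel().eval(), (torch.randn(1, 8),)

# 1. Quantize with the PT2E flow so that Vela receives an integer TOSA graph.
exported = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(exported, ArmQuantizer())  # quantizer configuration omitted for brevity
prepared(*example_inputs)                          # calibration pass
quantized = convert_pt2e(prepared)

# 2. Export, delegate to the Ethos-U85 backend (Vela emits the command stream),
#    and serialize the ExecuTorch program to a .pte flatbuffer file.
spec = ArmCompileSpecBuilder().ethosu_compile_spec("ethos-u85-128").build()
edge = to_edge(torch.export.export(quantized, example_inputs))
edge = edge.to_backend(ArmPartitioner(spec))
with open("small_model.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)

The .pte file produced at the end of this flow is what the arm_executor_runner application loads at runtime.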

[Figure: the ExecuTorch ahead-of-time (AOT) flow for Ethos-U]

The ExecuTorch Runtime flow

The ExecuTorch runtime, written in C/C++, is designed to support multiple backends. We have extended it to include support for the Ethos-U device driver. Following this flow will produce a self-contained compiled executable. Deploying the executable on the Corstone-320 FVP is straightforward and requires only the appropriate flags when calling the FVP.

[Figure: the ExecuTorch runtime flow for Ethos-U]

Ethos-U85 and Corstone-320

The Ethos-U family of NPUs offers high performance and energy-efficient solutions for edge AI. The Ethos-U55 (also supported by ExecuTorch) is widely deployed in many Cortex-M heterogeneous systems, while the Ethos-U65 extends the applicability of the Ethos-U family to Cortex-A-based systems and increases the performance.

Ethos-U85 further extends the Ethos-U product line, supporting current and future workloads at the edge using transformer-based networks. Ethos-U85 delivers a 4x performance uplift and 20% higher energy efficiency compared to its predecessor, with up to 85% utilization on popular networks. Notable features of Ethos-U85 include:

  • Configurations from 128 to 2048 MACs/cycle, delivering up to 4 TOP/s at 1 GHz
  • Compatibility with Cortex-A- and Cortex-M-based systems
  • Native support for major neural networks through support for TOSA
  • Full hardware acceleration of all major neural networks
  • For a full list of features, see the Ethos-U85 Technical Overview

A typical compute subsystem design with Ethos-U85

What’s next

We are adding new operator support every week, extending ExecuTorch core ATen operator coverage, and enabling a wider range of models to run on Ethos-U. Our ongoing efforts focus on improving performance to ensure models run as optimally as possible on Ethos-U.

The ExecuTorch delegate framework supports fallback to running operators not supported by Ethos-U on the CPU using reference kernel implementations. We will work towards optimal performance on Cortex-M CPUs using CMSIS-NN, providing the best possible support for fallback operators and ensuring optimal performance for devices without Ethos-U capability.

The package above, together with the Corstone-320 FVP, is another step to simplify application development, so please go ahead, check out the code and build process, and send us feedback. Meanwhile, we will be busy making weekly releases to enable more features and models and to extract the maximum performance from the hardware.

Read More

Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI

Introduction

At the recent PyTorch Conference, Arm highlighted the widespread impact of its technology, spanning from cloud to edge, emphasizing its commitment to delivering its advanced AI computing capabilities seamlessly to millions of developers worldwide.

[Figure: key stats from Arm’s PyTorch Conference presentation]

During the presentation, it was emphasized that Arm bears the immense responsibility of equipping 20+ million developers and billions of users with advanced AI computing features without friction. Achieving this requires crucial software collaborations across a vast ecosystem of software and hardware partners.

Just a few months ago, Arm launched Arm Kleidi, a set of developer enablement technologies and resources to drive technical collaboration and innovation across the ML stack. This includes the KleidiAI software library, which provides optimized software routines that, when integrated into key frameworks such as XNNPACK, enable automatic AI acceleration for developers on Arm Cortex-A CPUs.

Today, we’re excited to announce a new milestone for the AI open-source community that brings Arm even closer to realizing this vision: the integration of KleidiAI into ExecuTorch via XNNPACK, boosting AI workload performance on Arm mobile CPUs!

Thanks to the collaborative efforts of the engineering teams at Arm and Meta, AI developers can now deploy quantized Llama models that run up to 20% faster on Armv9 Cortex-A CPUs with the i8mm ISA extension.

And there’s more exciting news – the ExecuTorch team has officially launched the Beta release!

This marks an important milestone in our partnership. In this blog, we are eager to share more details about ExecuTorch capabilities, the new Meta Llama 3.2 models, the integer 4-bit with per-block quantization, and the impressive performance recorded on certain Arm CPUs. Notably, we have achieved speeds of over 350 tokens per second in the prefill stage with the quantized Llama 3.2 1B model on a Samsung S24+ device, as shown in the following screenshots.

[Figure: mobile app screenshots of quantized Llama 3.2 1B running on a Samsung S24+]

Now, let’s dive into the key components that enabled the demo creation presented in the preceding images. First up: new Llama 3.2 models!

Meta Llama 3.2

Meta recently announced the first lightweight quantized Llama models, which are designed to run on popular mobile devices. Meta used two techniques for quantizing Llama 3.2 1B and 3B models: Quantization-Aware Training (QAT) with LoRA adaptors (QLoRA), and SpinQuant, a state-of-the-art post-training quantization method. The quantized models were evaluated using PyTorch’s ExecuTorch framework as the inference engine, with the Arm CPU as a backend.

These instruction-tuned models retain the quality and safety of the original 1B and 3B models while achieving a 2-4x speedup and reducing model size by 56% on average and memory footprint by 41% on average compared to the original BF16 format.

In this blog post, we will demonstrate the performance improvements we observed in our experiments.

ExecuTorch

ExecuTorch is a PyTorch-native framework specifically designed for deploying AI models on-device, enhancing privacy and reducing latency. It supports the deployment of cutting-edge open-source AI models, including the Llama family of models and vision and speech models like Segment Anything and Seamless.

This unlocks new possibilities for edge devices such as mobile phones, smart glasses, VR headsets, and smart home cameras. Traditionally, deploying PyTorch-trained AI models to resource-limited edge devices has been challenging and time-consuming, often requiring conversion to other formats which could lead to errors and suboptimal performance. The varied toolchains across the hardware and edge ecosystem have also degraded the developer experience, making a universal solution impractical.

ExecuTorch addresses these issues by providing composable components that include a core runtime, an operator library, and a delegation interface that allows for portability as well as extensibility. Models can be exported using torch.export(), producing a graph that is natively compatible with the ExecuTorch runtime, capable of running on most edge devices with CPUs, and extendable to specialized hardware like GPUs and NPUs for enhanced performance.
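
As a rough illustration of that export path, the minimal sketch below lowers a small module to the XNNPACK delegate and serializes it as a .pte file; the XnnpackPartitioner import path is an assumption based on the ExecuTorch examples and may vary by release.

import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner  # assumed path

class TinyMLP(torch.nn.Module):  # hypothetical example model
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

exported = torch.export.export(TinyMLP().eval(), (torch.randn(1, 16),))
edge = to_edge(exported)                      # convert to the Edge dialect
edge = edge.to_backend(XnnpackPartitioner())  # delegate supported subgraphs to XNNPACK
with open("tiny_mlp.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)      # ready for the ExecuTorch runtime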

Working with Arm, ExecuTorch now leverages the optimized low-bit matrix multiplication kernels from the Arm KleidiAI library to improve on-device Large Language Model (LLM) inference performance via XNNPACK. We also thank the XNNPACK team at Google for supporting this effort.

In this post, we will focus on this integration, which is available in ExecuTorch.

Evolving the architecture for AI workloads

At Arm, we have been deeply committed to investing in open-source projects and advancing new technologies in our processors since the early days of the deep learning wave, focusing on making AI workloads high-performing and more power-efficient.

For instance, Arm introduced the SDOT instruction, starting with the Armv8.2-A architecture, to accelerate dot product arithmetic between 8-bit integer vectors. This feature, now widely available in mobile devices, significantly speeds up the computation of quantized 8-bit models. After the SDOT instruction, Arm introduced the BF16 data type and the MMLA instruction to further enhance the floating-point and integer matrix multiplication performance on CPUs and, most recently, announced the Scalable Matrix Extension (SME), marking a significant leap forward in machine learning capabilities.
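
As a purely illustrative aside, the arithmetic that SDOT-class instructions accelerate is an 8-bit integer dot product accumulated into 32 bits, written out below in plain NumPy (not an intrinsic or Arm code).

import numpy as np

a = np.array([12, -7, 100, 33], dtype=np.int8)
b = np.array([-5, 90, 4, -120], dtype=np.int8)

# Widen to int32 before multiplying so the products and the running sum cannot overflow int8.
acc = np.dot(a.astype(np.int32), b.astype(np.int32))
print(acc)  # -4250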

The following image shows a few examples of Arm CPU’s continuous innovations in the AI space over the last decade:

[Figure: timeline of Arm CPU AI-focused ISA additions over the last decade]

Given the widespread use of Arm CPUs, AI frameworks need to take full advantage of these technologies in key operators to maximize performance. Recognizing this, we saw the need for an open-source library to share these optimized software routines. However, we were mindful of the challenges in integrating a new library into AI frameworks, such as concerns about library size, dependencies, and documentation and the need to avoid adding extra burdens for developers. So, we took extra steps to gather feedback from our partners and ensure a smooth integration process that does not require additional dependencies for AI developers. This effort led to KleidiAI, an open-source library that provides optimized performance-critical routines for artificial intelligence (AI) workloads tailored for Arm CPUs. You can learn more about KleidiAI here.

Working with the ExecuTorch team at Meta, Arm provided the software optimizations for their novel 4-bit per-block quantization scheme, which is used to accelerate the matrix multiplication kernel in the Transformer layer’s torch.nn.Linear operator for Llama 3.2 quantized models. This flexible 4-bit quantization scheme from ExecuTorch strikes a balance between model accuracy and low-bit matrix multiplication performance targeting on-device LLMs.

The integer 4-bit with per-block quantization

In KleidiAI, we introduced micro-kernels optimized for this new 4-bit integer quantization scheme (matmul_clamp_f32_qai8dxp_qsi4c32p).

As shown in the following image, this 4-bit quantization uses a per-block strategy for weight (RHS matrix) quantization and an 8-bit per-row quantization for activations (LHS matrix):

[Figure: 4-bit per-block weight quantization (RHS matrix) and 8-bit per-row activation quantization (LHS matrix)]

As you can see in the preceding image, each output feature map (OFM) in the weight matrix is divided into equally sized blocks (group size), with each block having a scale factor stored in BF16 format. BF16 is advantageous because it maintains the dynamic range of 32-bit floating-point (FP32) format with half the bit size, and it’s easy to convert to and from FP32 using a simple shift operation. This makes BF16 ideal for saving model space, preserving accuracy, and ensuring backward compatibility with devices that lack BF16 hardware acceleration. You can learn more about the BF16 format in this Arm Community blog post.

For completeness, this 4-bit quantization scheme and our implementation in KleidiAI allow users to configure group size for the linear weights (RHS), allowing them to trade-off between model size, model accuracy, and model performance if the model is quantized by the user.
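
To make the scheme concrete, here is a small NumPy sketch (not KleidiAI code) of per-block int4 weight quantization with scale factors truncated to BF16, following the description above; the group size, symmetric rounding, and helper names are illustrative assumptions.

import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Truncate FP32 to BF16 by dropping the low 16 mantissa bits (keeps the FP32 dynamic range)."""
    return ((x.astype(np.float32).view(np.uint32) >> 16) << 16).view(np.float32)

def quantize_row_int4_per_block(row: np.ndarray, group_size: int = 32):
    blocks = row.reshape(-1, group_size)
    # Symmetric scaling: map the largest magnitude in each block onto the int4 limit (7).
    scales = to_bf16(np.abs(blocks).max(axis=1, keepdims=True) / 7.0)
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)  # int4 values carried in int8
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
ofm_row = rng.standard_normal(128).astype(np.float32)  # one output feature map (OFM) row
q, scales = quantize_row_int4_per_block(ofm_row)
print("max abs error:", np.abs(dequantize(q, scales) - ofm_row).max())

Smaller group sizes mean more scale factors and better accuracy at the cost of model size, which is exactly the trade-off described above.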

At this point, we are ready to unveil the incredible performance recorded on Arm CPUs with ExecuTorch when running Llama 3.2 1B and Llama 3.2 3B. Let’s first go over metrics we will use to evaluate the performance of LLM inference.

Metrics for LLM Inference

Typically, performance metrics used to evaluate LLM performance during inference include the following (a short calculation sketch follows this list):

  • Time To First Token (TTFT): This measures the time it takes to produce the first output token after a prompt is provided by the user. This latency or response time is important for a good user experience, especially on a phone. TTFT is also a function of the length of the prompt or prompt tokens. To make this metric independent of the prompt length, we use Prefill tokens/second as a proxy here. The relationship between these is inverse: lower TTFT corresponds to higher Prefill tokens/second.
  • Decode Performance: This is the average number of output tokens generated per second, thus reported in Tokens/Second. It is independent of the total number of tokens generated. For on-device inference, it is important to keep this higher than a user’s average reading speed.
  • Peak Runtime Memory: This metric reflects the amount of RAM, typically reported in mebibytes (MiB), needed to run the model with the expected performance measured using the metrics above. Given the limited amount of RAM available on Android and iOS devices, this is one of the key metrics for on-device LLM deployment. It dictates the type of models that can be deployed on a device.
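
The sketch below shows, with hypothetical timing values, how the prefill and decode figures above are typically derived from raw measurements.

# Hypothetical measurements for illustration only.
prompt_tokens = 64         # tokens in the user prompt
generated_tokens = 128     # tokens produced during decode
ttft_seconds = 0.18        # time to first token
decode_seconds = 3.0       # time spent generating the output tokens

prefill_tokens_per_s = prompt_tokens / ttft_seconds      # higher prefill rate means lower TTFT
decode_tokens_per_s = generated_tokens / decode_seconds  # should stay above reading speed

print(f"Prefill: {prefill_tokens_per_s:.0f} tok/s, decode: {decode_tokens_per_s:.0f} tok/s")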

Results

The quantized Llama 3.2 1B models, both SpinQuant and QLoRA, are designed to run efficiently on a wide range of phones with limited RAM. In this section, we demonstrate that the quantized Llama 3.2 1B models can achieve over 350 tokens per second in the prefill phase and over 40 tokens per second in the decode stage. This level of performance is sufficient to enable on-device text summarization with a reasonable user experience using only Arm CPUs. To put this into perspective, on average, 50 unread messages contain about 600 tokens. With this performance, the response time (the time it takes for the first generated word to appear on the screen) is approximately two seconds.

We present measurements from a Samsung S24+ running vanilla Android. We used Llama 3.2 1B parameter models for these experiments. Although we only demonstrate using 1B models, similar performance gains can be expected for the 3B parameter models. The experiment setup involves a single warmup run, a sequence length of 128, a prompt length of 64, using 6 of the 8 available CPUs, and measuring results over adb.

Using the ExecuTorch main branch from GitHub, we first generated the ExecuTorch PTE binary files for each model using the published checkpoints. Then, using the same repository, we generated the ExecuTorch runtime binary for Armv8. In the rest of the section, we will compare the performance of different quantized 1B models against the BF16 model using the binary built with KleidiAI. We will also compare the performance gains for quantized models between the binary with KleidiAI and the one without KleidiAI to distill the impact from KleidiAI.

Quantized Model Performance

Llama 3.2 quantized models, both SpinQuant and QLoRA, perform significantly better on prompt prefill and text generation (decode) compared to the baseline BF16. We observed a >2x improvement in decode and a >5x improvement in prefill performance.

Furthermore, the quantized model size (PTE file size in bytes) is less than half that of the BF16 model: 1.1 GiB vs. 2.3 GiB. Although int4 is a quarter of the size of BF16, some layers in the model are quantized with int8, making the PTE file size ratio larger than a quarter. We observed a runtime peak memory footprint reduction of almost 40%, from 3.1 GiB for the BF16 model to 1.9 GiB for the SpinQuant model, measured as Resident Set Size (RSS) for a maximum sequence length of 2048.

With all-around improvements, the new quantized Llama 3.2 models are ideal for on-device deployment targeting Arm CPUs. For more information on accuracy, check out the Meta Llama 3.2 blog.

[Figure: performance of quantized Llama 3.2 1B models compared with the BF16 baseline]

KleidiAI Impact

ExecuTorch relies on the Arm KleidiAI library to provide low-bit performant matrix multiplication kernels for the latest Arm CPUs with advanced Armv8/9 ISA features. These kernels are utilized for on-device quantized Llama 3.2 model inference in ExecuTorch. As depicted in the graph below, ExecuTorch achieves an average of >20% better prefill performance on S24+ with KleidiAI compared to non-KleidiAI kernels, while maintaining the same accuracy. This performance advantage is not limited to specific models or devices, and is expected to benefit all ExecuTorch models using low-bit quantized matrix multiplication on Arm CPUs.

To assess the impact of KleidiAI, we generated two ExecuTorch runtime binaries targeting Arm Cortex-A CPUs and compared their performance.

  1. The first binary was built with the Arm KleidiAI library through the XNNPACK library.
  2. The second binary was built without the Arm KleidiAI library, using the native kernels from the XNNPACK library.

[Figure: prefill performance with KleidiAI kernels vs. native XNNPACK kernels]

Try it yourself!

Ready to experience the performance improvements firsthand? Here’s how you can try out ExecuTorch with the optimizations provided by KleidiAI in your own projects: follow this learning path from Arm to start developing your own LLM application using ExecuTorch and KleidiAI.

We look forward to hearing your feedback!

Read More

Promoting Cross-Modal Representations to Improve Multimodal Foundation Models for Physiological Signals

Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in early exploration and it is unclear which pretraining strategies are most effective given the diversity of physiological signals. This is partly due to challenges in multimodal health data: obtaining data across many patients is…Apple Machine Learning Research

Smart Audit System Empowered by LLM

Manufacturing quality audits are pivotal for ensuring high product standards in mass production environments. Traditional auditing processes, however, are labor-intensive and heavily reliant on human expertise, posing challenges in maintaining transparency, accountability, and continuous improvement across complex global supply chains. To address these challenges, we propose a smart audit system empowered by large language models (LLMs). Our approach introduces three key innovations: a dynamic risk assessment model that streamlines audit procedures and optimizes resource allocation; a…Apple Machine Learning Research

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

Ahead of a visit to the hospital for a surgical procedure, patients often have plenty of questions about what to expect — and can be plenty nervous.

To help minimize presurgery jitters, Deloitte is enhancing its Quartz Frontline AI, an AI solution for customer service powered by NVIDIA AI, to now include AI agents to bring the next generation of digital, frontline teammates to patients before they even step foot inside the hospital.

“The Frontline AI Teammate offers a novel and innovative solution to help combat our health human resource crisis.” — Mathieu LeBreton, digital experience lead at The Ottawa Hospital

These virtual teammates, built with the NVIDIA AI Enterprise software platform, can have natural, human-like conversations with patients, answer a wide range of questions and provide support prior to preadmission appointments at hospitals.

Working with NVIDIA, Deloitte has added Frontline AI Teammate to its Quartz platform for use in settings like hospitals, where the digital avatar can have practical conversations — in multiple languages — that give the end user, such as a patient, instant answers to pressing questions.

“Avatar-based conversational AI agents offer an incredible opportunity to reduce the productivity paradox that our healthcare system faces with digitization,” said Niraj Dalmia, partner at Deloitte Canada. “It could possibly be the complementary innovation that reduces administrative burden, complements our healthcare human resources to free up capacity and helps solve for patient experience challenges.”

Next-Gen Technologies Powering Digital Humans

Digital human technology can provide lifelike interactions that can enhance experiences for doctors and patients.

Deloitte’s Frontline AI Teammate, built with NVIDIA AI Enterprise and Deloitte’s Conversational AI Framework, is designed to deliver human-to-machine experiences in healthcare settings. Developed on the NVIDIA Omniverse platform, Deloitte’s lifelike avatar can respond to complex, domain-specific questions that are pivotal in healthcare delivery.

Developers can tap into NVIDIA NIM microservices, which streamline the path for developing AI-powered applications and moving AI models into production, to craft digital humans for healthcare industry applications.

NVIDIA NIM Agent Blueprints offer a customizable, reference AI workflow for creating interactive, AI-driven avatars that are ideal for telehealth — and include best practices for how to use NVIDIA NeMo Retriever, an industry-leading embedding, retrieval and re-ranking model that allows for fast responses based on up-to-date healthcare data.

Customizable digital humans — like James, an interactive demo developed by NVIDIA — can handle tasks such as scheduling appointments, filling out intake forms and answering questions about upcoming health services. This can make healthcare services more efficient while also improving patient access.

In addition to NIM microservices, the James interactive demo also uses NVIDIA ACE to provide natural, low-latency responses.

NVIDIA ACE is a suite of AI, graphics and simulation technologies for bringing digital humans to life. It can integrate every aspect of a digital human into healthcare applications — from speech and translation abilities capable of understanding diverse accents and languages, to realistic animations of facial and body movements.

Personalized Experiences for Hospital Patients

Patients can get overwhelmed with the amount of preoperative information. Typically, they have only one pre-admission appointment, often many weeks before the surgery, which can leave them with lingering questions and escalating concerns. The stress of a serious diagnosis may also prevent them from asking all the necessary questions during these brief interactions, leaving them without comprehensive knowledge about the appointment’s purpose, duration, location and necessary documents — and potentially leading to delays or even rescheduling of their surgeries.

To enhance patient preparation and reduce pre-procedure anxiety, The Ottawa Hospital is using AI agents, powered by NVIDIA and Deloitte’s technologies, to provide more consistent, accurate and continuous access to information.

With the Frontline AI teammate, patients can experience benefits including:

  • 24/7 access to the digital teammate using a smartphone, tablet or home computer.
  • Reliable, preapproved answers to detailed questions, including information around anesthesia or the procedure itself.
  • Postsurgery consultation to resolve any questions about the recovery process, potentially improving treatment adherence and health outcomes.

In user acceptance testing conducted this summer, a majority of the testers noted that responses provided were clear, relevant and met the needs of the given interaction.

“The Frontline AI Teammate offers a novel and innovative solution to help combat our health human resource crisis — it has the potential to reduce the administrative burden, giving back time to healthcare providers to provide the quality care our population deserves and expects from The Ottawa Hospital,” said Mathieu LeBreton, digital experience lead at The Ottawa Hospital. “The opportunity to explore these technologies is well-timed, given the planning of the New Campus Development, a new hospital project in Ottawa. Proper identification of the problems we are trying to solve is imperative to ensure this is done responsibly and transparently.”

Deloitte is working with other hospitals and healthcare institutions to deploy digital agents. A patient-facing pilot with Ottawa Hospital is expected to go live by the end of the year.

Developers can get started by accessing the digital human NIM Agent Blueprint.

Read More

How Planview built a scalable AI Assistant for portfolio and project management using Amazon Bedrock

This post is co-written with Lee Rehwinkel from Planview.

Businesses today face numerous challenges in managing intricate projects and programs, deriving valuable insights from massive data volumes, and making timely decisions. These hurdles frequently lead to productivity bottlenecks for program managers and executives, hindering their ability to drive organizational success efficiently.

Planview, a leading provider of connected work management solutions, embarked on an ambitious plan in 2023 to revolutionize how 3 million global users interact with their project management applications. To realize this vision, Planview developed an AI assistant called Planview Copilot, using a multi-agent system powered by Amazon Bedrock.

Developing this multi-agent system posed several challenges:

  • Reliably routing tasks to appropriate AI agents
  • Accessing data from various sources and formats
  • Interacting with multiple application APIs
  • Enabling the self-serve creation of new AI skills by different product teams

To overcome these challenges, Planview developed a multi-agent architecture built using Amazon Bedrock. Amazon Bedrock is a fully managed service that provides API access to foundation models (FMs) from Amazon and leading AI startups. This allows developers to choose the FM that is best suited for their use case. This approach is both architecturally and organizationally scalable, enabling Planview to rapidly develop and deploy new AI skills to meet the evolving needs of their customers.

This post focuses primarily on the first challenge: routing tasks and managing multiple agents in a generative AI architecture. We explore Planview’s approach to this challenge during the development of Planview Copilot, sharing insights into the design decisions that provide efficient and reliable task routing.

We describe customized home-grown agents in this post because this project was implemented before Amazon Bedrock Agents was generally available. However, Amazon Bedrock Agents is now the recommended solution for organizations looking to use AI-powered agents in their operations. Amazon Bedrock Agents can retain memory across interactions, offering more personalized and seamless user experiences. You can benefit from improved recommendations and recall of prior context where required, enjoying a more cohesive and efficient interaction with the agent. We share our learnings from our solution to help you understand how to use AWS technology to build solutions that meet your goals.

Solution overview

Planview’s multi-agent architecture consists of multiple generative AI components collaborating as a single system. At its core, an orchestrator is responsible for routing questions to various agents, collecting the learned information, and providing users with a synthesized response. The orchestrator is managed by a central development team, and the agents are managed by each application team.

The orchestrator comprises two main components called the router and responder, which are powered by a large language model (LLM). The router uses AI to intelligently route user questions to various application agents with specialized capabilities. The agents can be categorized into three main types:

  • Help agent – Uses Retrieval Augmented Generation (RAG) to provide application help
  • Data agent – Dynamically accesses and analyzes customer data
  • Action agent – Runs actions within the application on the user’s behalf

After the agents have processed the questions and provided their responses, the responder, also powered by an LLM, synthesizes the learned information and formulates a coherent response to the user. This architecture allows for seamless collaboration between the centralized orchestrator and the specialized agents, which provides users with accurate and comprehensive answers to their questions. The following diagram illustrates the end-to-end workflow.
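
In pseudocode terms, the orchestrator’s control flow looks roughly like the following skeleton; the function signature and dictionary shapes are hypothetical and simply mirror the router output format shown later in this post, not Planview’s actual implementation.

from typing import Callable

def handle_user_question(
    question: str,
    chat_history: list,
    router: Callable,    # LLM call returning e.g. {"action": "callTool", "tools": [...]}
    agents: dict,        # maps a toolName (help, data, or action agent) to a callable
    responder: Callable  # LLM call that synthesizes the final answer
) -> str:
    # The router decides whether to answer directly or gather more information first.
    decision = router(question, chat_history)
    if decision["action"] == "respond":
        return responder(question, chat_history, [])

    # Fan out to the specialized agents chosen by the router, then synthesize.
    answers = [agents[t["toolName"]](t["toolQuestion"]) for t in decision.get("tools", [])]
    return responder(question, chat_history, answers)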

End-to-end workflow showing responder and router components

Technical overview

Planview used key AWS services to build its multi-agent architecture. The central Copilot service, powered by Amazon Elastic Kubernetes Service (Amazon EKS), is responsible for coordinating activities among the various services. Its responsibilities include:

  • Managing user session chat history using Amazon Relational Database Service (Amazon RDS)
  • Coordinating traffic between the router, application agents, and responder
  • Handling logging, monitoring, and collecting user-submitted feedback

The router and responder are AWS Lambda functions that interact with Amazon Bedrock. The router considers the user’s question and chat history from the central Copilot service, and the responder considers the user’s question, chat history, and responses from each agent.

Application teams manage their agents using Lambda functions that interact with Amazon Bedrock. For improved visibility, evaluation, and monitoring, Planview has adopted a centralized prompt repository service to store LLM prompts.

Agents can interact with applications using various methods depending on the use case and data availability:

  • Existing application APIs – Agents can communicate with applications through their existing API endpoints
  • Amazon Athena or traditional SQL data stores – Agents can retrieve data from Amazon Athena or other SQL-based data stores to provide relevant information
  • Amazon Neptune for graph data – Agents can access graph data stored in Amazon Neptune to support complex dependency analysis
  • Amazon OpenSearch Service for document RAG – Agents can use Amazon OpenSearch Service to perform RAG on documents

The following diagram illustrates the generative AI assistant architecture on AWS.

AWS services and data flow in Generative AI chatbot

Router and responder sample prompts

The router and responder components work together to process user queries and generate appropriate responses. The following prompts provide illustrative router and responder prompt templates. Additional prompt engineering would be required to improve reliability for a production implementation.

First, the available tools are described, including their purpose and sample questions that can be asked of each tool. The example questions help guide the natural language interactions between the orchestrator and the available agents, as represented by tools.

tools = '''
<tool>
<toolName>applicationHelp</toolName>
<toolDescription>
Use this tool to answer application help related questions.
Example questions:
How do I reset my password?
How do I add a new user?
How do I create a task?
</toolDescription>
</tool>
<tool>
<toolName>dataQuery</toolName>
<toolDescription>
Use this tool to answer questions using application data.
Example questions:
Which tasks are assigned to me?
How many tasks are due next week?
Which task is most at risk?
</toolDescription>
</tool>
'''

Next, the router prompt outlines the guidelines for the agent to either respond directly to user queries or request information through specific tools before formulating a response:

system_prompt_router = f'''
<role>
Your job is to decide if you need additional information to fully answer the User's 
questions.
You achieve your goal by choosing either 'respond' or 'callTool'.
You have access to your chat history in <chatHistory></chatHistory> tags.
You also have a list of available tools to assist you in <tools></tools> tags.
</role>
<chatHistory>
{chatHistory}
</chatHistory>
<tools>
{tools}
</tools>
<rules>
- If the chat history contains sufficient information to answer the User's questions, 
choose the 'respond' action.
- To gather more information before responding, choose the 'callTool' action.
- You may only choose from the tools in the <tools></tools> tags.
- If no tool can assist with the question, choose the 'respond' action.
- Place your chosen action within <action></action> tags.
- When you choose the 'callTool' action, provide the <toolName> and the <toolQuestion> you
would like to ask.
- Your <toolQuestion> should be verbose and avoid using pronouns.
- Start by providing your step-by-step thinking in <thinking></thinking> tags.
- Then you will give your answer in <answer></answer> tags.
- Your answer should follow the format of one of these three examples:
When choosing the 'respond' action, your answer should follow the below example EXACTLY:
<example>
<answer>
<action>respond</action>
</answer>
</example>
When choosing the 'callTool' action for a single Tool:
<example>
<answer>
<action>callTool</action>
<tools>
<tool>
<toolName>applicationHelp</toolName>
<toolQuestion>How do I reset my password?</toolQuestion>
</tool>
</tools>
</answer>
</example>
You can also call multiple Tools using this format:
<example>
<answer>
<action>callTool</action>
<tools>
<tool>
<toolName>dataQuery</toolName>
<toolQuestion>How many tasks are assigned to me?</toolQuestion>
</tool>
<tool>
<toolName>applicationHelp</toolName>
<toolQuestion>How do I add a new task?</toolQuestion>
</tool>
</tools>
</answer>
</example>
</rules>
'''

The following is a sample response from the router component that initiates the dataQuery tool to retrieve and analyze task assignments for each user:

<thinking>
To determine who has the most tasks assigned, I will need to query the application data. The "dataQuery" tool seems most appropriate for this question.
</thinking>

<answer>
<action>callTool</action>
<tools>
    <tool>
        <toolName>dataQuery</toolName>
        <toolQuestion>Which user has the most tasks currently assigned to them?</toolQuestion>
    </tool>
</tools>
</answer>

The following is a sample response from the responder component that uses the dataQuery tool to fetch information about the user’s assigned tasks. It reports that the user has five tasks assigned to them.

<thinking>
Based on the chat history, I previously called the dataQuery tool to ask "How many tasks are currently assigned to the user?". The tool responded that the user has 5 tasks assigned to them.
</thinking>

<answer>
According to the data I queried previously, you have 5 tasks assigned to you.
</answer>
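
For reference, a call along the lines of the sketch below (not Planview’s code) would send the rendered router prompt to Anthropic’s Claude 3 Sonnet on Amazon Bedrock with boto3 and pull the chosen action out of the response; the model ID and request shape follow the Bedrock Messages API for Claude.

import json
import re

import boto3

bedrock = boto3.client("bedrock-runtime")

def route(user_question: str, rendered_router_prompt: str) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "system": rendered_router_prompt,
        "messages": [{"role": "user", "content": user_question}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    completion = json.loads(response["body"].read())["content"][0]["text"]
    match = re.search(r"<action>(.*?)</action>", completion, re.DOTALL)
    return match.group(1).strip() if match else "respond"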

Model evaluation and selection

Evaluating and monitoring generative AI model performance is crucial in any AI system. Planview’s multi-agent architecture enables assessment at various component levels, providing comprehensive quality control despite the system’s complexity. Planview evaluates components at three levels:

  • Prompts – Assessing LLM prompts for effectiveness and accuracy
  • AI agents – Evaluating complete prompt chains to maintain optimal task handling and response relevance
  • AI system – Testing user-facing interactions to verify seamless integration of all components

The following figure illustrates the evaluation framework for prompts and scoring.

Evaluation framework for prompts scoring

To conduct these evaluations, Planview uses a set of carefully crafted test questions that cover typical user queries and edge cases. These evaluations are performed during the development phase and continue in production to track the quality of responses over time. Currently, human evaluators play a crucial role in scoring responses. To aid in the evaluation, Planview has developed an internal evaluation tool to store the library of questions and track the responses over time.

To assess each component and determine the most suitable Amazon Bedrock model for a given task, Planview established the following prioritized evaluation criteria:

  • Quality of response – Assuring accuracy, relevance, and helpfulness of system responses
  • Time of response – Minimizing latency between user queries and system responses
  • Scale – Making sure the system can scale to thousands of concurrent users
  • Cost of response – Optimizing operational costs, including AWS services and generative AI models, to maintain economic viability

Based on these criteria and the current use case, Planview selected Anthropic’s Claude 3 Sonnet on Amazon Bedrock for the router and responder components.

Results and impact

Over the past year, Planview Copilot’s performance has significantly improved through the implementation of a multi-agent architecture, development of a robust evaluation framework, and adoption of the latest FMs available through Amazon Bedrock. Planview saw the following results between the first generation of Planview Copilot developed mid-2023 and the latest version:

  • Accuracy – Human-evaluated accuracy has improved from 50% answer acceptance to now exceeding 95%
  • Response time – Average response times have been reduced from over 1 minute to 20 seconds
  • Load testing – The AI assistant has successfully passed load tests, where 1,000 questions were submitted simultaneously with no noticeable impact on response time or quality
  • Cost-efficiency – The cost per customer interaction has been slashed to one tenth of the initial expense
  • Time-to-market – New agent development and deployment time has been reduced from months to weeks

Conclusion

In this post, we explored how Planview was able to develop a generative AI assistant to address complex work management processes by adopting the following strategies:

  • Modular development – Planview built a multi-agent architecture with a centralized orchestrator. The solution enables efficient task handling and system scalability, while allowing different product teams to rapidly develop and deploy new AI skills through specialized agents.
  • Evaluation framework – Planview implemented a robust evaluation process at multiple levels, which was crucial for maintaining and improving performance.
  • Amazon Bedrock integration – Planview used Amazon Bedrock to innovate faster with broad model choice and access to various FMs, allowing for flexible model selection based on specific task requirements.

Planview is migrating to Amazon Bedrock Agents, which enables the integration of intelligent autonomous agents within their application ecosystem. Amazon Bedrock Agents automate processes by orchestrating interactions between foundation models, data sources, applications, and user conversations.

As next steps, you can explore Planview’s AI assistant feature built on Amazon Bedrock and stay updated with new Amazon Bedrock features and releases to advance your AI journey on AWS.


About the Authors

Sunil Ramachandra is a Senior Solutions Architect enabling hyper-growth Independent Software Vendors (ISVs) to innovate and accelerate on AWS. He partners with customers to build highly scalable and resilient cloud architectures. When not collaborating with customers, Sunil enjoys spending time with family, running, meditating, and watching movies on Prime Video.

Benedict Augustine is a thought leader in Generative AI and Machine Learning, serving as a Senior Specialist at AWS. He advises customer CxOs on AI strategy, to build long-term visions while delivering immediate ROI. As VP of Machine Learning, Benedict spent the last decade building seven AI-first SaaS products, now used by Fortune 100 companies, driving significant business impact. His work has earned him 5 patents.

Lee Rehwinkel is a Principal Data Scientist at Planview with 20 years of experience in incorporating AI & ML into Enterprise software. He holds advanced degrees from both Carnegie Mellon University and Columbia University. Lee spearheads Planview’s R&D efforts on AI capabilities within Planview Copilot. Outside of work, he enjoys rowing on Austin’s Lady Bird Lake.

Read More

Intel GPU Support Now Available in PyTorch 2.5

Support for Intel GPUs is now available in PyTorch® 2.5, providing improved functionality and performance for Intel GPUs, which include Intel® Arc™ discrete graphics, Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics, and Intel® Data Center GPU Max Series. This integration brings Intel GPUs and the SYCL* software stack into the official PyTorch stack, ensuring a consistent user experience and enabling more extensive AI application scenarios, particularly in the AI PC domain.

Developers and customers building for and using Intel GPUs will have a better user experience by directly obtaining continuous software support from native PyTorch, unified software distribution, and consistent product release time.

Furthermore, Intel GPU support provides more choices to users. Now PyTorch provides a consistent GPU programming paradigm on both front ends and back ends. Developers can now run and deploy workloads on Intel GPUs with minimal coding efforts.

Overview of Intel GPU support

Intel GPU support in PyTorch provides eager mode and graph mode support in the PyTorch built-in front end. Eager mode now has an implementation of commonly used Aten operators with the SYCL programming language. Graph mode (torch.compile) now has an enabled Intel GPU back end to implement optimizations for Intel GPUs and to integrate Triton.

Essential components of Intel GPU support were added to PyTorch, including runtime, Aten operators, oneDNN, TorchInductor, Triton, and Intel GPU toolchain integration. Meanwhile, quantization and distributed support are being actively developed in preparation for the PyTorch 2.6 release.

Features

In addition to providing key features for Intel® Client GPUs and Intel® Data Center GPU Max Series for inference and training, PyTorch keeps the same user experience as with other hardware that PyTorch supports. If you migrate code from CUDA*, you can run the existing application code on an Intel GPU with minimal code changes for the device name (from cuda to xpu). For example:

# CUDA Code
tensor = torch.tensor([1.0, 2.0]).to("cuda")

# Code for Intel GPU
tensor = torch.tensor([1.0, 2.0]).to("xpu")
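
Building on that, the short sketch below (placeholder model and shapes, assuming a PyTorch 2.5 build with XPU support) adds a device-availability check, automatic mixed precision, and torch.compile:

import torch

device = "xpu" if torch.xpu.is_available() else "cpu"

model = torch.nn.Linear(128, 64).to(device)   # placeholder model
inputs = torch.randn(32, 128, device=device)  # placeholder input batch

compiled_model = torch.compile(model)         # graph mode through the Intel GPU back end when on xpu
with torch.amp.autocast(device_type=device, dtype=torch.bfloat16):
    outputs = compiled_model(inputs)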

PyTorch 2.5 features with an Intel GPU include: 

  • Inference and training workflows.
  • Enhanced torch.compile and eager mode functionality (more ops), together with performance improvements, and full runs of the three Dynamo benchmarks (Hugging Face*, TIMM*, and TorchBench*) in eager and compile modes.
  • Data types such as FP32, BF16, FP16, and automatic mixed precision (AMP).
  • Runs on Intel® Client GPUs and Intel® Data Center GPU Max Series.
  • Supports Linux (Ubuntu, SUSE Linux and Red Hat Linux) and Windows 10/11.

Get Started

Get a tour of the environment setup, PIP wheels installation, and examples on Intel® Client GPUs and Intel® Data Center GPU Max Series from the Getting Started Guide. Support for Intel GPUs can be experienced through PyTorch PIP wheel installation of nightly and preview binary releases.

  • Try Intel® Client GPUs through Intel® Arc™ Graphics family (Codename DG2), Intel® Core™ Ultra processor family with Intel® Graphics (Codename Meteor Lake), and Intel® Core™ Ultra mobile processor family with Intel® Graphics (Codename Lunar Lake).

  • Try Intel Data Center GPU Max Series through Intel® Tiber™ AI Cloud.

    1. To learn how to create a free Standard account, see Get Started.

Performance

The performance of Intel GPUs on PyTorch was continuously optimized to achieve decent results on the three Dynamo benchmarks (Hugging Face, TIMM, and TorchBench) for eager and compile modes.

The latest performance data, measured on top of the PyTorch Dynamo Benchmarking Suite using a single Intel® Data Center GPU Max Series 1100 card, showcases the significant FP16/BF16 speedup over FP32 in eager mode (Figure 2) and the torch.compile speedup over eager mode (Figure 3). Both inference and training achieved similarly significant improvements.

Figure 2: FP16/BF16 Performance Gains Over FP32 Eager

Figure 3: Torch.compile Performance Gains Over Eager Mode

Summary

Intel GPU support in PyTorch 2.5 brings Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts) and Intel® Data Center GPU Max Series into the PyTorch ecosystem for AI workload acceleration. In particular, Client GPUs are added to the GPU-supported list for AI PC use scenarios on Windows and Linux environments.

We warmly welcome the community to evaluate and provide feedback on these enhancements to Intel GPU support in PyTorch.

Resources

Acknowledgments

We want to thank the PyTorch open source community for their technical discussions and insights: Andrey Talman, Alban Desmaison, Nikita Shulga, Eli Uriegas, Jason Ansel, and Bin Bao.

We also thank collaborators from PyTorch for their professional support and guidance.

Performance Configuration

The configurations in the table are collected with svr-info. Test by Intel on September 12, 2024.

Table 1

Name: Intel® Max Series GPU 1100 in Intel® Tiber™ Developer Cloud
Time: Thu Sep 12 08:21:27 UTC 2024
System: Supermicro SYS-521GE-TNRT
Baseboard: Supermicro X13DEG-OA
Chassis: Supermicro Other
CPU Model: Intel(R) Xeon(R) Platinum 8468V
Microarchitecture: SPR_XCC
Sockets: 2
Cores per Socket: 48
Hyperthreading: Enabled
CPUs: 192
Intel Turbo Boost: Enabled
Base Frequency: 2.4GHz
All-core Maximum Frequency: 2.4GHz
Maximum Frequency: 2.9GHz
NUMA Nodes: 2
Prefetchers: L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled, AMP: Disabled, Homeless: Disabled, LLC: Disabled
PPINs: 5e3f862ef7ba9d50, 6c85812edfcc84b1
Accelerators: DLB 2, DSA 2, IAA 2, QAT (on CPU) 2, QAT (on chipset) 0
Installed Memory: 1024GB (16x64GB DDR5 4800 MT/s [4800 MT/s])
Hugepagesize: 2048 kB
Transparent Huge Pages: madvise
Automatic NUMA Balancing: Enabled
NIC: 2 x Ethernet Controller X710 for 10GBASE-T, 4 x MT2892 Family [ConnectX-6 Dx]
Disk: 1 x 894.3G Micron_7450_MTFDKBG960TFR
BIOS: 1.4a
Microcode: 0x2b0004b1
OS: Ubuntu 22.04.2 LTS
Kernel: 5.15.0-73-generic
TDP: 330W
Power & Perf Policy: Normal (6)
Frequency Governor: performance
Frequency Driver: acpi-cpufreq
Max C-State: 9

Table 2

Single Card: Intel® Max Series GPU 1100 series on 4th Gen Intel® Xeon® processors of Intel Tiber Developer Cloud
Workload & version: Timm ac34701, TorchBench 03cde49, Torchvision d23a6e1, Torchaudio b3f6f51, Transformers 243e186
Software Stack: intel-for-pytorch-gpu-dev 0.5.3, intel-pti-dev 0.9.0, Intel xpu backend for Triton cc981fe
Framework: PyTorch 4a3dabd67f8ce63f2fc45f278421cca3cc532cfe
GPU driver: agama-ci-devel-803.61
GFX FW Version: PVC2_1.23374

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

AI disclaimer:
AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at  www.intel.com/AIPC. Results may vary.

Read More

Divide-or-Conquer? Which Part Should You Distill Your LLM?

Recent methods have demonstrated that Large Language Models (LLMs) can solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper we devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase and show that the strategy is able to outperform a single stage solution. Further, we hypothesize that the decomposition should be easier to distill into a smaller model compared to the problem solving because the latter requires large amounts of domain knowledge while the former only requires…Apple Machine Learning Research

‘India Should Manufacture Its Own AI,’ Declares NVIDIA CEO

Artificial intelligence will be the driving force behind India’s digital transformation, fueling innovation, economic growth, and global leadership, NVIDIA founder and CEO Jensen Huang said Thursday at NVIDIA’s AI Summit in Mumbai.

Addressing a crowd of entrepreneurs, developers, academics and business leaders, Huang positioned AI as the cornerstone of the country’s future.

India has an “amazing natural resource” in its IT and computer science expertise, Huang said, noting the vast potential waiting to be unlocked.

To capitalize on this country’s talent and India’s immense data resources, the country’s leading cloud infrastructure providers are rapidly accelerating their data center capacity. NVIDIA is playing a key role, with NVIDIA GPU deployments expected to grow nearly 10x by year’s end, creating the backbone for an AI-driven economy.

Together with NVIDIA, these companies are at the cutting edge of a shift Huang compared to the seismic change in computing introduced by IBM’s System 360 in 1964, calling it the most profound platform shift since then.

“This industry, the computing industry, is going to become the intelligence industry,” Huang said, pointing to India’s unique strengths to lead this industry,  thanks to its enormous amounts of data and large population.

With this rapid expansion in infrastructure, AI factories will play a critical role in India’s future, serving as the backbone of the nation’s AI-driven growth.

NVIDIA founder and CEO Jensen Huang speaking with Reliance Industries Chairman Mukesh Ambani at NVIDIA’s AI Summit in Mumbai.

“It makes complete sense that India should manufacture its own AI,” Huang said. “You should not export data to import intelligence,” he added, noting the importance of India building its own AI infrastructure.

Huang identified three areas where AI will transform industries: sovereign AI, where nations use their own data to drive innovation; agentic AI, which automates knowledge-based work; and physical AI, which applies AI to industrial tasks through robotics and autonomous systems. India, Huang noted, is uniquely positioned to lead in all three areas.

India’s startups are already harnessing NVIDIA technology to drive innovation across industries and are positioning themselves as global players, bringing the country’s AI solutions to the world.

Meanwhile, India’s robotics ecosystem is adopting NVIDIA Isaac and Omniverse to power the next generation of physical AI, revolutionizing industries like manufacturing and logistics with advanced automation.

Huang’s keynote also featured a surprise appearance by actor and producer Akshay Kumar.

Following Huang’s remarks, the focus shifted to a fireside chat between Huang and Reliance Industries Chairman Mukesh Ambani, where the two leaders explored how AI will shape the future of Indian industries, particularly in sectors like energy, telecommunications and manufacturing.

Ambani emphasized that AI is central to this continued growth. Reliance, in partnership with NVIDIA, is building AI factories to automate industrial tasks and transform processes in sectors like energy and manufacturing.

Both men discussed their companies’ joint efforts to pioneer AI infrastructure in India.

Ambani underscored the role of AI in public sector services, explaining how India’s data combined with AI is already transforming governance and service delivery.

Huang added that AI promises to democratize technology.

“The ability to program AI is something that everyone can do … if AI could be put into the hands of every citizen, it would elevate and put into the hands of everyone this incredible capability,” he said.

Huang emphasized NVIDIA’s role in preparing India’s workforce for an AI-driven future.

NVIDIA is partnering with India’s IT giants such as Infosys, TCS, Tech Mahindra and Wipro to upskill nearly half a million developers, ensuring India leads the AI revolution with a highly trained workforce.

“India’s technical talent is unmatched,” Huang said.

Ambani echoed these sentiments, stressing that “India will be one of the biggest intelligence markets,” pointing to the nation’s youthful, technically talented population.

A Vision for India’s AI-Driven Future

As the session drew to a close, Huang and Ambani reflected on their vision for India’s AI-driven future.

With its vast talent pool, burgeoning tech ecosystem and immense data resources, the country, they agreed, has the potential to contribute globally in sectors such as energy, healthcare, finance and manufacturing.

“This cannot be done by any one company, any one individual, but we all have to work together to bring this intelligence age safely to the world so that we can create a more equal world, a more prosperous world,” Ambani said.

Huang echoed the sentiment, adding: “Let’s make it a promise today that we will work together so that India can take advantage of the intelligence revolution that’s ahead of us.”

Read More