How Cato Networks uses Amazon Bedrock to transform free text search into structured GraphQL queries

This is a guest post authored by Asaf Fried, Daniel Pienica, and Sergey Volkovich from Cato Networks.

Cato Networks is a leading provider of secure access service edge (SASE), an enterprise networking and security unified cloud-centered service that converges SD-WAN, a cloud network, and security service edge (SSE) functions, including firewall as a service (FWaaS), a secure web gateway, zero trust network access, and more.

On our SASE management console, the central events page provides a comprehensive view of the events occurring on a specific account. With potentially millions of events over a selected time range, the goal is to refine these events using various filters until a manageable number of relevant events are identified for analysis. Users can review different types of events such as security, connectivity, system, and management, each categorized by specific criteria like threat protection, LAN monitoring, and firmware updates. However, the process of adding filters to the search query is manual and can be time consuming, because it requires in-depth familiarity with the product glossary.

To address this challenge, we recently enabled customers to perform free text searches on the event management page, allowing new users to run queries with minimal product knowledge. This was accomplished by using foundation models (FMs) to transform natural language into structured queries that are compatible with our products’ GraphQL API.

In this post, we demonstrate how we used Amazon Bedrock, a fully managed service that makes FMs from leading AI startups and Amazon available through a single API, allowing you to choose from a wide range of FMs to find the model best suited to your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using AWS tools, without having to manage infrastructure. Amazon Bedrock enabled us to enrich FMs with product-specific knowledge and convert free text inputs from users into structured search queries for the product API, which can greatly enhance user experience and efficiency in data management applications.

Solution overview

The Events page includes a filter bar with both event and time range filters. These filters need to be added and updated manually for each query. The following screenshot shows an example of the event filters (1) and time filters (2) as seen on the filter bar (source: Cato knowledge base).

The event filters are a conjunction of statements in the following form:

  • Key – The field name
  • Operator – The evaluation operator (for example, is, in, includes, or greater than)
  • Value – A single value or list of values

For example, the following screenshot shows a filter for action in [ Alert, Block ].
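
Expressed in the JSON format used throughout this post, that same filter could be written as follows (an illustrative snippet; the field names match the schema shown later in this post):

{
    "id": "action",
    "operator": "in",
    "values": ["Alert", "Block"]
}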

The time filter is a time range that follows the ISO 8601 time interval standard.

For example, the following screenshot shows a time filter for UTC.2024-10-{01/00:00:00--02/00:00:00}.

Converting free text to a structured query of event and time filters is a complex natural language processing (NLP) task that can be accomplished using FMs. Customizing an FM that is specialized on a specific task is often done using one of the following approaches:

  • Prompt engineering – Add instructions in the context/input window of the model to help it complete the task successfully.
  • Retrieval Augmented Generation (RAG) – Retrieve relevant context from a knowledge base based on the input query, and append that context to the original query. This approach reduces the context provided to the model to only the relevant data.
  • Fine-tuning – Train the FM on data relevant to the task. In this case, the relevant context will be embedded into the model weights, instead of being part of the input.

For our specific task, we’ve found prompt engineering sufficient to achieve the results we needed.

Because the event filters on the Events page are specific to our product, we need to provide the FM with the exact instructions for how to generate them, based on free text queries. The main considerations when creating the prompt are:

  • Include the relevant context – This includes the following:
    • The available keys, operators, and values the model can use.
    • Specific instructions. For example, numeric operators can only be used with keys that have numeric values.
  • Make sure it’s simple to validate – Given the extensive number of instructions and limitations, we can’t trust the model output without checking the results for validity. For example, what if the model generates a filter with a key not supported by our API?

Instead of asking the FM to generate the GraphQL API request directly, we can use the following method:

  1. Instruct the model to return a response that follows a well-defined JSON schema, per the IETF JSON Schema standard.
  2. Validate the response against the JSON schema.
  3. Translate the validated response into a GraphQL API request.
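
The following minimal Python sketch shows what steps 2 and 3 might look like, using the jsonschema package. The helper and argument names (such as timeFrame) are our own illustrative assumptions, not Cato's production code:

import json
from jsonschema import validate

def to_api_request(model_output: str, schema: dict) -> dict:
    """Validate the model's JSON output, then map it to API arguments."""
    payload = json.loads(model_output)         # step 1 instructed the model to emit JSON text
    validate(instance=payload, schema=schema)  # step 2: raises ValidationError on any violation
    # Step 3: deterministic translation into the shape the GraphQL API expects.
    # "timeFrame" is a hypothetical argument name used only for illustration.
    return {"filters": payload["filters"], "timeFrame": payload.get("time")}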

Request prompt

Based on the preceding considerations, the system prompt is structured as follows:

# General Instructions

Your task is to convert free text queries to a JSON format that will be used to query security and network events in a SASE management console of Cato Networks. You are only allowed to output text in JSON format. Your output will be validated against the following schema that is compatible with the IETF standard:

# Schema definition
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Query Schema",
   "description": "Query object to be executed in the 'Events' management console page. ",
    "type": "object",
    "properties":
    {
        "filters":
        {
            "type": "array",
           "description": "List of filters to apply in the query, based on the free text query provided.",
            "items":
            {
                "oneOf":
                [
                    {
                        "$ref": "#/$defs/Action"
                    },
                    .
                    .
                    .
                ]
            }
        },
        "time":
        {
            "description": "Start datetime and end datetime to be used in the query.",
            "type": "object",
            "required":
            [
                "start",
                "end"
            ],
            "properties":
            {
                "start":
                {
                    "description": "start datetime",
                    "type": "string",
                    "format": "date-time"
                },
                "end":
                {
                    "description": "end datetime",
                    "type": "string",
                    "format": "date-time"
                }
            }
        },
        "$defs":
        {
            "Operator":
            {
                "description": "The operator used in the filter.",
                "type": "string",
                "enum":
                [
                    "is",
                    "in",
                    "not_in",
                    .
                    .
                    .
                ]
            },
            "Action":
            {
                "required":
                [
                    "id",
                    "operator",
                    "values"
                ],
                "description": "The action taken in the event.",
                "properties":
                {
                    "id":
                    {
                        "const": "action"
                    },
                    "operator":
                    {
                        "$ref": "#/$defs/Operator"
                    },
                    "values":
                    {
                        "type": "array",
                        "minItems": 1,
                        "items":
                        {
                            "type": "string",
                            "enum":
                            [
                                "Block",
                                "Allow",
                                "Monitor",
                                "Alert",
                                "Prompt"
                            ]
                        }
                    }
                }
            },
            .
            .
            .
        }
    }
}

Each user query (appended to the system prompt) will be structured as follows:

# Free text query
Query: {free_text_query}

# Add current timestamp for context (used for time filters) 
Context: If you need a reference to the current datetime, it is {datetime}, and the current day of the week is {day_of_week}
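
In code, assembling this user prompt is a straightforward formatting step. The following sketch (illustrative Python; the function and variable names are our own) fills in the placeholders:

from datetime import datetime, timezone

def build_user_prompt(free_text_query: str) -> str:
    now = datetime.now(timezone.utc)
    return (
        "# Free text query\n"
        f"Query: {free_text_query}\n\n"
        "# Add current timestamp for context (used for time filters)\n"
        f"Context: If you need a reference to the current datetime, it is {now.isoformat()}, "
        f"and the current day of the week is {now.strftime('%A')}"
    )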

The same JSON schema included in the prompt can also be used to validate the model’s response. This step is crucial, because model behavior is inherently non-deterministic, and responses that don’t comply with our API will break the product functionality.

In addition to confirming that a response is valid, the JSON schema can also point out the exact schema violation. This allows us to create a policy based on different failure types. For example:

  • If there are missing fields marked as required, output a translation failure to the user
  • If the value given for an event filter doesn’t comply with the format, remove the filter and create an API request from other values, and output a translation warning to the user
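
To illustrate how the validator's error details can drive such a policy, here is a simplified sketch using the jsonschema package (our own code under assumed error shapes, not the production service; nested oneOf violations would require extra handling):

from jsonschema import Draft202012Validator

class TranslationFailure(Exception):
    """Raised when the response cannot be salvaged (missing required fields)."""

def enforce_policy(payload: dict, schema: dict) -> tuple[dict, list[str]]:
    warnings = []
    dropped = set()
    for error in Draft202012Validator(schema).iter_errors(payload):
        if error.validator == "required":
            # Missing required fields: output a translation failure to the user.
            raise TranslationFailure(error.message)
        path = list(error.absolute_path)  # e.g. ["filters", 2, "values", 0]
        if len(path) >= 2 and path[0] == "filters":
            # Non-compliant value: remove the filter and warn the user.
            dropped.add(path[1])
            warnings.append(f"Dropped filter {path[1]}: {error.message}")
    payload["filters"] = [
        f for i, f in enumerate(payload.get("filters", [])) if i not in dropped
    ]
    return payload, warnings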

After the FM successfully translates the free text into structured output, converting it into an API request—such as GraphQL—is a straightforward and deterministic process.

To validate this approach, we’ve created a benchmark with hundreds of text queries and their corresponding expected JSON outputs. For example, let’s consider the following text query:

Security events with high risk level from IPS and Anti Malware engines

For this query, we expect the following response from the model, based on the JSON schema provided:

{
    "filters":
    [
        {
            "id": "risk_level",
            "operator": "is",
            "values":
            [
                "High"
            ]
        },
        {
            "id": "event_type",
            "operator": "is",
            "values":
            [
                "Security"
            ]
        },
        {
            "id": "event_subtype ",
            "operator": "in",
            "values":
            [
                "IPS",
                "Anti Malware"
            ]
        }
    ]
}

For each response of the FM, we define three different outcomes:

  • Success:
    • Valid JSON
    • Valid by schema
    • Full match of filters
  • Partial:
    • Valid JSON
    • Valid by schema
    • Partial match of filters
  • Error:
    • Invalid JSON or invalid by schema

Because translation failures lead to a poor user experience, releasing the feature was contingent on achieving an error rate below 0.05, and the selected FM was the one with the highest success rate (ratio of responses with full match of filters) passing this criterion.
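
A benchmark harness can assign these outcomes automatically. The following sketch (our illustrative code) mirrors the definitions above by comparing the generated filters against the expected ones:

import json
from jsonschema import Draft202012Validator

def score_response(response_text: str, expected: dict, schema: dict) -> str:
    try:
        payload = json.loads(response_text)
        Draft202012Validator(schema).validate(payload)
    except Exception:
        return "error"    # invalid JSON or invalid by schema
    got = {json.dumps(f, sort_keys=True) for f in payload.get("filters", [])}
    want = {json.dumps(f, sort_keys=True) for f in expected.get("filters", [])}
    return "success" if got == want else "partial"    # full vs. partial match of filters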

Working with Amazon Bedrock

Amazon Bedrock is a fully managed service that simplifies access to a wide range of state-of-the-art FMs through a single, serverless API. It offers a production-ready service capable of efficiently handling large-scale requests, making it ideal for enterprise-level deployments.

Amazon Bedrock enabled us to efficiently transition between different models, making it simple to benchmark and optimize for accuracy, latency, and cost, without the complexity of managing the underlying infrastructure. Additionally, some model providers available through Amazon Bedrock, such as Cohere and Anthropic (with its Claude family), offer models with a native understanding of JSON schemas and structured data, further enhancing their applicability to our specific task.

Using our benchmark, we evaluated several FMs on Amazon Bedrock, taking into account accuracy, latency, and cost. Based on the results, we selected anthropic.claude-3-5-sonnet-20241022-v2:0, which met the error rate criterion and achieved the highest success rate while maintaining reasonable costs and latency. Following this, we proceeded to develop the complete solution, which includes the following components:

  • Management console – Cato’s management application that the user interacts with to view their account’s network and security events.
  • GraphQL server – A backend service that provides a GraphQL API for accessing data in a Cato account.
  • Amazon Bedrock – The cloud service that handles hosting and serving requests to the FM.
  • Natural language search (NLS) service – An Amazon Elastic Kubernetes Service (Amazon EKS) hosted service to bridge between Cato’s management console and Amazon Bedrock. This service is responsible for creating the complete prompt for the FM and validating the response using the JSON schema.
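
As a rough sketch of the NLS service's call path (not Cato's production code), invoking the selected model through the boto3 Bedrock Runtime client with the Anthropic Messages request format looks like the following:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"

def translate_free_text(system_prompt: str, user_prompt: str) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "system": system_prompt,  # schema definition plus general instructions
        "messages": [{"role": "user", "content": user_prompt}],
    }
    response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    # The returned JSON text is validated against the schema before any GraphQL call.
    return json.loads(response["body"].read())["content"][0]["text"]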

The following diagram illustrates the workflow from the user’s manual query to the extraction of relevant events.

With the new capability, users can also use free text query mode, which is processed as shown in the following diagram.

The following screenshot of the Events page displays free text query mode in action.

Business impact

The recent feature update has received positive customer feedback. Users, especially those unfamiliar with Cato, have found the new search capability more intuitive, making it straightforward to navigate and engage with the system. Additionally, the inclusion of multi-language input, natively supported by the FM, has made the Events page more accessible for non-native English speakers to use, helping them interact and find insights in their own language.

One of the standout impacts is the significant reduction in query time—cut down from minutes of manual filtering to near-instant results. Account admins using the new feature have reported near-zero time to value, experiencing immediate benefits with minimal learning curve.

Conclusion

Accurately converting free text inputs into structured data is crucial for applications that involve data management and user interaction. In this post, we introduced a real business use case from Cato Networks that significantly improved user experience.

By using Amazon Bedrock, we gained access to state-of-the-art generative language models with built-in support for JSON schemas and structured data. This allowed us to optimize for cost, latency, and accuracy without the complexity of managing the underlying infrastructure.

Although a prompt engineering solution met our needs, users handling complex JSON schemas might want to explore alternative approaches to reduce costs. Including the entire schema in the prompt can lead to a significantly high token count for a single query. In such cases, consider using Amazon Bedrock to fine-tune a model, to embed product knowledge more efficiently.


About the Authors

Asaf Fried leads the Data Science team in Cato Research Labs at Cato Networks and is a member of Cato Ctrl. Asaf has more than six years of academic and industry experience in applying state-of-the-art and novel machine learning methods to networking and cybersecurity. His main research interests include asset discovery, risk assessment, and network-based attacks in enterprise environments.

Daniel Pienica is a Data Scientist at Cato Networks with a strong passion for large language models (LLMs) and machine learning (ML). With six years of experience in ML and cybersecurity, he brings a wealth of knowledge to his work. Holding an MSc in Applied Statistics, Daniel applies his analytical skills to solve complex data problems. His enthusiasm for LLMs drives him to find innovative solutions in cybersecurity. Daniel’s dedication to his field is evident in his continuous exploration of new technologies and techniques.

Sergey Volkovich is an experienced Data Scientist at Cato Networks, where he develops AI-based solutions in cybersecurity & computer networks. He completed an M.Sc. in physics at Bar-Ilan University, where he published a paper on theoretical quantum optics. Before joining Cato, he held multiple positions across diverse deep learning projects, ranging from publishing a paper on discovering new particles at the Weizmann Institute to advancing computer networks and algorithmic trading. Presently, his main area of focus is state-of-the-art natural language processing.

Omer Haim is a Senior Solutions Architect at Amazon Web Services, with over 6 years of experience dedicated to solving complex customer challenges through innovative machine learning and AI solutions. He brings deep expertise in generative AI and container technologies, and is passionate about working backwards from customer needs to deliver scalable, efficient solutions that drive business value and technological transformation.

Read More

How AI Helps Fight Fraud in Financial Services, Healthcare, Government and More

Companies and organizations are increasingly using AI to protect their customers and thwart the efforts of fraudsters around the world.

Voice security company Hiya found that 550 million scam calls were placed per week in 2023, with INTERPOL estimating that scammers stole $1 trillion from victims that same year. In the U.S., one in four calls from numbers outside a person’s contact list was flagged as suspected spam, with fraudsters often luring people into Venmo-related or extended warranty scams.

Traditional methods of fraud detection include rules-based systems, statistical modeling and manual reviews. These methods have struggled to scale to the growing volume of fraud in the digital era without sacrificing speed and accuracy. For instance, rules-based systems often have high false-positive rates, statistical modeling can be time-consuming and resource-intensive, and manual reviews can’t scale rapidly enough.

In addition, traditional data science workflows lack the infrastructure required to analyze the volumes of data involved in fraud detection, leading to slower processing times and limiting real-time analysis and detection.

Plus, fraudsters themselves can use large language models (LLMs) and other AI tools to trick victims into investing in scams, giving up their bank credentials or buying cryptocurrency.

But AI — coupled with accelerated computing systems — can be used to check AI and help mitigate all of these issues.

Businesses that integrate robust AI fraud detection tools have seen up to a 40% improvement in fraud detection accuracy — helping reduce financial and reputational damage to institutions.

These technologies offer robust infrastructure and solutions for analyzing vast amounts of transactional data and can quickly and efficiently recognize fraud patterns and identify abnormal behaviors.

AI-powered fraud detection solutions provide higher detection accuracy by looking at the whole picture instead of individual transactions, catching fraud patterns that traditional methods might overlook. AI can also help reduce false positives, tapping into quality data to provide context about what constitutes a legitimate transaction. And, importantly, AI and accelerated computing provide better scalability, capable of handling massive data networks to detect fraud in real time.

How Financial Institutions Use AI to Detect Fraud

Financial services and banking are the front lines of the battle against fraud such as identity theft, account takeover, false or illegal transactions, and check scams. Financial losses worldwide from credit card transaction fraud are expected to reach $43 billion by 2026.

AI is helping enhance security and address the challenge of escalating fraud incidents.

Banks and other financial service institutions can tap into NVIDIA technologies to combat fraud. For example, the NVIDIA RAPIDS Accelerator for Apache Spark enables faster data processing to handle massive volumes of transaction data. Banks and financial service institutions can also use the new NVIDIA AI workflow for fraud detection — harnessing AI tools like XGBoost and graph neural networks (GNNs) with NVIDIA RAPIDS, NVIDIA Triton and NVIDIA Morpheus — to detect fraud and reduce false positives.
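
To give a flavor of the modeling side, the following is a minimal sketch of a gradient-boosted fraud classifier on synthetic transaction features, using the XGBoost scikit-learn API. It illustrates the general technique only; it is not NVIDIA's workflow, and the features and labels are fabricated for the example:

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
# Synthetic features: amount, hour of day, merchant risk score, transactions in last 24h
X = rng.normal(size=(10_000, 4))
# Synthetic label: fraud correlates with amount and merchant risk, plus noise
y = (X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=10_000) > 2.5).astype(int)

# With xgboost >= 2.0, adding device="cuda" trains on an NVIDIA GPU
model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X, y)
fraud_probability = model.predict_proba(X[:5])[:, 1]  # score transactions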

BNY Mellon improved fraud detection accuracy by 20% using NVIDIA DGX systems. PayPal improved real-time fraud detection by 10% running on NVIDIA GPU-powered inference, while lowering server capacity by nearly 8x. And Swedbank trained generative adversarial networks on NVIDIA GPUs to detect suspicious activities.

US Federal Agencies Fight Fraud With AI

The United States Government Accountability Office estimates that the government loses up to $521 billion annually to fraud, based on an analysis of fiscal years 2018 to 2022. Tax fraud, check fraud and improper payments to contractors, along with improper payments under the Social Security and Medicare programs, have become a massive drag on the government’s finances.

While some of this fraud was inflated by the recent pandemic, finding new ways to combat fraud has become a strategic imperative. As such, federal agencies have turned to AI and accelerated computing to improve fraud detection and prevent improper payments.

For example, the U.S. Treasury Department began using machine learning in late 2022 to analyze its trove of data and mitigate check fraud. The department estimated that AI helped officials prevent or recover more than $4 billion in fraud in fiscal year 2024.

Along with the Treasury Department, agencies such as the Internal Revenue Service have looked to AI and machine learning to close the tax gap — including tax fraud — which was estimated at $606 billion in tax year 2022. The IRS has explored the use of NVIDIA’s accelerated data science frameworks such as RAPIDS and Morpheus to identify anomalous patterns in taxpayer records, data access and common vulnerability and exposures. LLMs combined with retrieval-augmented generation and RAPIDS have also been used to highlight records that may not be in alignment with policies.

How AI Can Help Healthcare Stem Potential Fraud

According to the U.S. Department of Justice, healthcare fraud, waste and abuse may account for as much as 10% of all healthcare expenditures. Other estimates put that percentage closer to 3%, and Medicare and Medicaid fraud alone could be near $100 billion. Regardless, healthcare fraud is a problem worth hundreds of billions of dollars.

The additional challenge with healthcare fraud is that it can come from all directions. Unlike the IRS or the financial services industry, the healthcare industry is a fragmented ecosystem of hospital systems, insurance companies, pharmaceutical companies, independent medical or dental practices, and more. Fraud can occur at both provider and patient levels, putting pressure on the entire system.

Common types of potential healthcare fraud include:

  • Billing for services not rendered
  • Upcoding: billing for a more expensive service than the one rendered
  • Unbundling: multiple bills for the same service
  • Falsifying records
  • Using someone else’s insurance
  • Forged prescriptions

The same AI technologies that help combat fraud in financial services and the public sector can also be applied to healthcare. Insurance companies can use pattern and anomaly detection to look for claims that seem atypical, either from the provider or the patient, and scrutinize billing data for potentially fraudulent activity. Real-time monitoring can detect suspicious activity at the source, as it’s happening. And automated claims processing can help reduce human error and detect inconsistencies while improving operational efficiency.
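
As one deliberately simplified example of anomaly detection on claims, an isolation forest can flag atypical billing records for manual review. The data and threshold below are synthetic illustrations, not a production pipeline:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Synthetic claims: billed amount, number of procedures, days since last claim
claims = rng.normal(loc=[200.0, 2.0, 30.0], scale=[50.0, 1.0, 10.0], size=(5_000, 3))
claims[:25] *= 5  # inject a handful of implausible claims

detector = IsolationForest(contamination=0.01, random_state=0).fit(claims)
flagged = detector.predict(claims) == -1  # -1 marks anomalous claims
print(f"{flagged.sum()} claims flagged for manual review")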

Data processing through NVIDIA RAPIDS can be combined with machine learning and GNNs or other types of AI to help better detect fraud at every layer of the healthcare system, assisting patients and practitioners everywhere dealing with high costs of care.

AI for Fraud Detection Could Save Billions of Dollars

Financial services, the public sector and the healthcare industry are all using AI for fraud detection to provide a continuous defense against one of the world’s biggest drains on economic activity.

The NVIDIA AI platform supports the entire fraud detection and identity verification pipeline — from data preparation to model training to deployment — with tools like NVIDIA RAPIDS, NVIDIA Triton Inference Server and NVIDIA Morpheus on the NVIDIA AI Enterprise software platform.

Learn more about NVIDIA solutions for fraud detection with AI and accelerated computing.

Read More

The Future of Marketing: How AI Agents Can Enhance Customer Journeys in Retail

AI agents — which can understand, adapt to and support each user’s unique journey — are making online shopping and digital marketing more efficient and personalized. Plus, these intelligent systems are poised to turn marketing interactions into valuable customer research data.

In this episode of the NVIDIA AI Podcast, Jon Heller, co-CEO and founder of Firsthand, discusses how the company’s Brand Agents are revolutionizing the relationship between consumers, marketers and publishers. By using a company’s own knowledge, Firsthand Brand Agents act as AI-powered guides that engage customers on a brand’s website and beyond — assisting at every step of the customer’s journey, from finding solutions to making purchases.

Drawing on decades of industry experience including leadership roles at advertising companies DoubleClick and FreeWheel, Heller explains Firsthand’s vision of AI as a new medium rather than just a technology.

AI agents are enabling companies in the retail and consumer-packaged goods (CPG) industries to increase internal efficiency and productivity while improving customer service.

Two of the top use cases for generative AI in retail are: personalized marketing and advertising, and digital shopping assistants or copilots. Learn more about AI’s rapid integration across businesses in NVIDIA’s second annual State of AI in Retail and CPG survey report.

Time Stamps

2:10 — How large language models revealed a new approach to digital marketing.

12:46 — How Firsthand Brand Agents can use various AI capabilities beyond traditional chat.

16:33 — How Firsthand Brand Agents create a connected customer journey by maintaining context across touchpoints.

23:57 — The technical challenges in building agents while maintaining brand safety.

30:10 — How AI can generate unprecedented insights into consumer needs and preferences.

You Might Also Like… 

Imbue CEO Kanjun Qiu on Transforming AI Agents Into Personal Collaborators

Imbue is building software infused with intelligence through collaborative AI systems that work alongside users. CEO Kanjun Qiu discusses her company’s approach to AI agents and compares the personal computer revolution of the late 1970s and 80s to today’s AI agent transformation.

Media.Monks’ Lewis Smithingham on Enhancing Media and Marketing With AI

Media.Monks’ platform Wormhole streamlines marketing and content creation with AI-powered insights. Hear from Lewis Smithingham, senior vice president of innovation and special operations at Media.Monks, as he addresses AI’s potential in entertainment and advertising.

Snowflake’s Baris Gultekin on Unlocking the Value of Data With Large Language Models

Snowflake is using AI to help enterprises transform data into insights and applications. Baris Gultekin, head of AI at Snowflake, explains how the company’s AI Data Cloud platform separates the storage of data from compute, enabling organizations across the world to connect via cloud technology and work on a unified platform.

Subscribe to the AI Podcast

Get the AI Podcast through Amazon Music, Apple Podcasts, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, SoundCloud, Spotify, Stitcher and TuneIn.

Read More

Into the Omniverse: OpenUSD Workflows Advance Physical AI for Robotics, Autonomous Vehicles

Editor’s note: This post is part of Into the Omniverse, a series focused on how developers, 3D practitioners and enterprises can transform their workflows using the latest advances in Universal Scene Description (OpenUSD) and NVIDIA Omniverse.

The next frontier of AI is physical AI. Physical AI models can understand instructions and perceive, interact and perform complex actions in the real world to power autonomous machines like robots and self-driving cars.

Similar to how large language models can process and generate text, physical AI models can understand the world and generate actions. To do this, these models must be trained in simulation environments to comprehend physical dynamics, like gravity, friction or inertia — and understand geometric and spatial relationships, as well as the principles of cause and effect.

Global leaders in software development and professional services are using NVIDIA Omniverse, powered by OpenUSD, to build new products and services that accelerate AI development and controllable simulation. These enable the creation of true-to-reality virtual worlds, known as digital twins, which can be used to train physical AI with unprecedented accuracy and detail.

Generate Exponentially More Synthetic Data With Omniverse and NVIDIA Cosmos

At CES, NVIDIA announced generative AI models and blueprints that expand Omniverse integration further into physical AI applications such as robotics, autonomous vehicles and vision AI.

Among these announcements was NVIDIA Cosmos, a platform of state-of-the-art generative world foundation models, advanced tokenizers, guardrails and an accelerated video processing pipeline — all designed to accelerate physical AI development.

Developing physical AI models is a costly, resource- and time-intensive process that requires vast amounts of real-world data and testing. Cosmos’ world foundation models (WFMs), which predict future world states as videos based on multimodal inputs, provide an easy way for developers to generate massive amounts of photoreal, physics-based synthetic data to train and evaluate AI for robotics, autonomous vehicles and machines. Developers can also fine-tune Cosmos WFMs to build downstream world models or improve quality and efficiency for specific physical AI use cases.

When paired with Omniverse, Cosmos creates a powerful synthetic data multiplication engine. Developers can use Omniverse to create 3D scenarios, then feed the outputs into Cosmos to generate controlled videos and variations. This can drastically accelerate the development of physical AI systems such as autonomous vehicles and robots by rapidly generating exponentially more training data covering a variety of environments and interactions.

OpenUSD ensures the data in these scenarios is seamlessly integrated and consistently represented, enhancing the realism and effectiveness of the simulations.

Leading robotics and automotive companies, including 1X, Agile Robots, Agility Robotics, Figure AI, Foretellix, Fourier, Galbot, Hillbot, IntBot, Neura Robotics, Skild AI, Virtual Incision, Waabi and XPENG, along with ridesharing giant Uber, are among the first to adopt Cosmos.

Learn more about how world foundation models will advance physical AI by listening to the NVIDIA AI Podcast episode with Ming-Yu Liu, vice president of research at NVIDIA.

See Cosmos in Action for Physical AI Use Cases

Cosmos WFMs are revolutionizing industries by providing a unified framework for developing, training and deploying large-scale AI models across various applications. Enterprises in the automotive, industrial and robotics sectors can harness the power of generative physical AI and simulation to accelerate innovation and operational efficiency.

  • Humanoid robots: The NVIDIA Isaac GR00T Blueprint for synthetic motion generation helps developers generate massive synthetic motion datasets to train humanoid robots using imitation learning. With GR00T workflows, users can capture human actions and use Cosmos to exponentially increase the size and variety of the dataset, making it more robust for training physical AI systems.
  • Autonomous vehicles: Autonomous vehicle (AV) simulation powered by Omniverse Sensor RTX application programming interfaces lets AV developers replay driving data, generate new ground-truth data and perform closed-loop testing to accelerate their pipelines. With Cosmos, developers can generate synthetic driving scenarios to amplify training data by orders of magnitude, accelerating physical AI model development for autonomous vehicles. Global ridesharing giant Uber is partnering with NVIDIA to accelerate autonomous mobility. Rich driving datasets from Uber, combined with Cosmos and NVIDIA DGX Cloud, can help AV partners build stronger AI models more efficiently.
  • Industrial settings: Mega is an Omniverse Blueprint for developing, testing and optimizing physical AI and robot fleets at scale in a USD-based digital twin before deployment in factories and warehouses. The blueprint uses Omniverse Cloud Sensor RTX APIs to simultaneously render multisensor data from any type of intelligent machine, enabling high-fidelity sensor simulation at scale. Cosmos can enhance Mega by generating synthetic edge case scenarios to amplify training data, significantly improving the robustness and efficiency of training robots in simulation. KION Group, a supply chain solutions company, is among the first to adopt Mega to drive warehouse automation in retail, consumer packaged goods, parcel services and more.

Get Plugged Into the World of OpenUSD

For more on Cosmos, watch the replay of NVIDIA CEO Jensen Huang’s CES keynote, and get started with Cosmos WFMs available now under an open model license on Hugging Face and the NVIDIA NGC catalog. Join the upcoming livestream on Wednesday, February 5 for a deep dive into Cosmos WFMs and physical AI workflows.

Continue to optimize OpenUSD workflows with the new self-paced Learn OpenUSD curriculum for 3D developers and practitioners, available at no cost through the NVIDIA Deep Learning Institute. For more resources on OpenUSD, explore the Alliance for OpenUSD forum and the AOUSD website.

Meet Cosmos, OpenUSD and physical AI experts at NVIDIA GTC, the conference for the era of AI, taking place March 17-21 at the San Jose Convention Center.

Stay up to date by subscribing to NVIDIA news, joining the community, and following NVIDIA Omniverse on Instagram, LinkedIn, Medium and X.

Read More

Bringing the PyTorch Community Together

As we step into a new year, it’s a great moment to reflect on the incredible community events that made 2024 a memorable year for the PyTorch Foundation. Global meetups, events, and conferences brought the community together to learn, connect, and grow. Here’s a quick recap of the year’s highlights and what to expect in 2025.

PyTorch Seattle Meetup (May 23)

We hosted a PyTorch Meetup in Seattle in May at the Meta Bellevue Office where Meta, Microsoft, and Google gave technical talks and about 60 attendees participated in discussion and networking.

PyTorch Docathon 2024 (June 4-20)

The PyTorch Docathon returned for its third edition, spanning over two weeks in June. This unique event focused on improving PyTorch’s documentation with contributions from community members worldwide. Documentation is the backbone of any successful open source project, and PyTorch’s Docathon fostered inclusivity and collaboration, making it easier for new users to adopt the framework and for experienced developers to maximize its potential. The 2024 Docathon resulted in more than 50 merged pull requests and was a testament to the collaborative spirit of the PyTorch community and its commitment to enhancing accessibility and usability. Watch the PyTorch Docathon Kickoff on YouTube.

PyTorch Shanghai Meetup (August 15)

In August, the PyTorch Shanghai Meetup brought together developers, researchers, and enthusiasts in Shanghai, China. This event served as a platform for knowledge sharing, with engaging talks and networking opportunities. Highlights from the agenda included insights into PyTorch’s latest developments, community-led presentations showcasing innovative use cases, and networking sessions fostering collaboration among attendees.

PyTorch Conference 2024 (September 18-19)

The PyTorch Conference in San Francisco was undoubtedly one of the year’s most significant events. This two-day gathering brought together top-tier researchers, developers, and academic communities, fostering collaboration and innovation in machine learning.

What Made It Special:

  • Keynote speeches from industry leaders and PyTorch maintainers.
  • In-depth sessions covering PyTorch’s end-to-end machine learning capabilities.
  • Hands-on workshops and breakout sessions.
  • A vibrant expo area showcasing cutting-edge tools and applications.
  • Startup Showcase where early-stage founders pitched their AI startups to a panel of top venture capitalists.
  • DL Compiler Mini-Summit that took a deep dive into the advances in deep learning (DL) compilers that are transforming AI workloads.
  • Fine-Tuning Mini-Summit that covered everything from memory efficiency, parameter-efficient fine-tuning and quantization to performance at scale and reproducible evaluations.
  • Poster Session showcasing innovations in PyTorch, including model optimization, hardware integration, generative AI, quantization, and tools for enhanced performance and usability, with contributions from industry leaders.

The conference’s focus on fostering collaboration underscored PyTorch’s role as a driving force in the open source ML community. Missed out? You can watch the PyTorch Conference 2024 Playlist to catch any sessions you might have missed.

GPU MODE IRL Hackathon (September 21)

PyTorch sponsored this in-person meetup in San Francisco, where attendees made friends, watched keynotes, hacked all day, took breaks with afternoon talks, and then hacked all night. We heard about torchao, our new quantization and sparsity library; vLLM, which deploys PyTorch models in production; llm.c; and more. Key takeaways included:

  • The first-place winner was inspired by PyTorch FlexAttention to improve CUTLASS.
  • NCCL in Triton would help with distributed programming through a minimal NCCL reimplementation in pure Python.
  • Building PyTorch binaries without libtorch dramatically reduces binary sizes for on-device deployments.

Consumer AI Edge Hackathon (November 22-23)

The PyTorch team served as mentors and coaches in a hackathon in Paris, co-sponsored by Hugging Face, Scaleway, and Entrepreneur First, challenging teams to create innovative consumer (B2C) applications using Hugging Face, PyTorch and other open source on-device tools and models. More than 120 people across 22 teams hacked for 2 days (and nights!) building the future of AI-powered on-device solutions based on open source models and tools. Participants created innovative applications powered by PyTorch, ExecuTorch and Hugging Face resources, such as an on-device yoga coach, a magical storytelling companion and a Kinect-like experience brought to mobile phones. The PyTorch team is planning similar events in other geographies in 2025 around innovative on-device AI applications.

PyTorch Korea User Group Meetup (November 30)

The PyTorch Korea User Group, founded in 2018, is a community dedicated to introducing PyTorch to Korean-speaking users and growing together. The group began by translating PyTorch 0.3 tutorials into Korean and has since supported PyTorch’s growth in Korea. The group focuses on three primary activities:

  1. Sharing knowledge for PyTorch learning and application,
  2. Sharing insights and experiences in the field of artificial intelligence, and
  3. Fostering growth through online and offline networking.

The PyTorch Korea User Group reaches tens of thousands of Korean AI developers every month. If you’re interested in their activities, visit the community at https://discuss.pytorch.kr.

PyTorch Korea User Group 2025 Events Overview

The PyTorch Korea User Group has planned three major activities for the year:

  1. PyTorch CoreSIG
    Since December 2024, this weekly online event has been held every Wednesday afternoon. Led by Kim Hong-Seok, CSO of Rebellions (a PyTorch member company), it provides in-depth knowledge and experience regarding PyTorch internals. Approximately 150 Korean developers participate weekly, reflecting growing interest in PyTorch Core development in Korea.
  2. Offline Meetup
    These meetups provide opportunities to share insights and experiences in PyTorch and artificial intelligence, along with networking. Around 3–4 sessions are planned for this year, focusing on key topics in PyTorch and AI.
  3. Online Community Engagement
    This activity involves sharing and discussing various projects and papers in the AI field. For more information, visit: https://discuss.pytorch.kr.

Open Source AI Night at NeurIPS 2024 (December 10)

The PyTorch Foundation co-hosted a social event at NeurIPS along with The Fin AI and Open Finance Foundation that featured engaging discussions on open source AI and applications in finance.

PyTorch Webinars

Throughout 2024, PyTorch hosted virtual webinars across several series: Expert Exchanges, a Summer Series, Release Live Q&As, and Live Webinars.

Each of these events underscored the importance of collaboration and community engagement in advancing AI research and applications. Thank you to everyone who participated, organized, and supported these events—your contributions make all the difference!


Looking Ahead

2024 was packed with opportunities to connect, learn, and contribute, and there will be even more ways to connect with the PyTorch community in 2025.

Mark your calendar! The PyTorch Conference is returning to San Francisco on October 22-23, 2025. Get ready for an exciting event filled with technical deep dives, exciting announcements, insightful sessions, and enhanced opportunities for community collaboration.

Stay tuned for more upcoming events and opportunities to get involved by subscribing to our newsletter.

Read More

Mapping Cells Through Time and Space With Moscot

Single-cell genomics technologies enable multimodal profiling of millions of cells across temporal and spatial dimensions. Experimental limitations prevent the measurement of all-encompassing cellular states in their native temporal dynamics or spatial tissue niche. Optimal transport theory has emerged as a powerful tool to overcome such constraints, enabling the recovery of the original cellular context. However, most algorithmic implementations currently available have not kept up the pace with increasing dataset complexity, so that current methods are unable to incorporate multimodal…

Apple Machine Learning Research

Solve forecasting challenges for the retail and CPG industry using Amazon SageMaker Canvas

Businesses today deal with a reality that is increasingly complex and volatile. Companies across retail, manufacturing, healthcare, and other sectors face pressing challenges in accurate planning and forecasting. Predicting future inventory needs, setting achievable strategic goals, and budgeting effectively involve grappling with ever-changing consumer demand and global market forces. Inventory shortages, surpluses, and unmet customer expectations pose constant threats. Supply chain forecasting is critical to helping businesses tackle these uncertainties.

By using historical sales and supply data to anticipate future shifts in demand, supply chain forecasting supports executive decision-making on inventory, strategy, and budgeting. Analyzing past trends while accounting for impacts ranging from seasons to world events provides insights to guide business planning. Organizations that tap predictive capabilities to inform decisions can thrive amid fierce competition and market volatility. Overall, mastering demand predictions allows businesses to fulfill customer expectations by providing the right products at the right times.

In this post, we show you how Amazon Web Services (AWS) helps in solving forecasting challenges by customizing machine learning (ML) models for forecasting. We dive into Amazon SageMaker Canvas and explain how SageMaker Canvas can solve forecasting challenges for retail and consumer packaged goods (CPG) enterprises.

Introduction to Amazon SageMaker Canvas

Amazon SageMaker Canvas is a powerful no-code ML service that gives business analysts and data professionals the tools to build accurate ML models without writing a single line of code. This visual, point-and-click interface democratizes ML so users can take advantage of the power of AI for various business applications. SageMaker Canvas supports multiple ML modalities and problem types, catering to a wide range of use cases based on data types, such as tabular data (our focus in this post), computer vision, natural language processing, and document analysis. To learn more about the modalities that Amazon SageMaker Canvas supports, visit the Amazon SageMaker Canvas product page.

For time-series forecasting use cases, SageMaker Canvas uses AutoML to train six algorithms on your historical time-series dataset and combines them using a stacking ensemble method to create an optimal forecasting model. The algorithms are: Convolutional Neural Network – Quantile Regression (CNN-QR), DeepAR+, Prophet, Non-Parametric Time Series (NPTS), Autoregressive Integrated Moving Average (ARIMA), and Exponential Smoothing (ETS). To learn more about these algorithms, visit Algorithms support for time-series forecasting in the Amazon SageMaker documentation.

How Amazon SageMaker Canvas can help retail and CPG manufacturers solve their forecasting challenges

The combination of a user-friendly interface and automated ML technology in SageMaker Canvas gives users the tools to efficiently build, deploy, and maintain ML models with little to no coding required. For example, business analysts with no coding or cloud engineering expertise can quickly use Amazon SageMaker Canvas to upload their time-series data and make forecasting predictions. And this isn’t a service for business analysts only: any team at a retail or CPG company can use it to generate forecasting data through the same interface.

To effectively use Amazon SageMaker Canvas for retail forecasting, customers should use their sales data for the set of SKUs for which they want to forecast demand. It’s crucial to have data across all months of the year, considering the seasonal variation in demand in a retail environment. Additionally, it’s essential to provide a few years’ worth of data so that anomalies or outliers within the data don’t dominate the signal.

Retail and CPG organizations rely on industry standard methods in their approach to forecasting. One of these methods is quantiles. Quantiles in forecasting represent specific points in the predicted distribution of possible future values. They allow ML models to provide probabilistic forecasts rather than merely single point estimates. Quantiles help quantify the uncertainty in predictions by showing the range and spread of possible outcomes. Common quantiles used are the 10th, 50th (median), and 90th percentiles. For example, the 90th percentile forecast means there’s a 90% chance the actual value will be at or below that level.
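
As a quick, self-contained illustration of reading these quantiles (simulated numbers, not SageMaker Canvas output):

import numpy as np

simulated_demand = np.random.default_rng(42).lognormal(mean=4.0, sigma=0.4, size=10_000)
p10, p50, p90 = np.percentile(simulated_demand, [10, 50, 90])
# Stocking at p90 means actual demand should exceed the stocked level only ~10% of the time
print(f"p10={p10:.0f}, p50={p50:.0f}, p90={p90:.0f}")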

By providing a probabilistic view of future demand, quantile forecasting enables retail and CPG organizations to make more informed decisions in the face of uncertainty, ultimately leading to improved operational efficiency and financial performance.

Amazon SageMaker Canvas addresses this need with ML models coupled with quantile regression. With quantile regression, you can select from a wide range of planning scenarios, which are expressed as quantiles, rather than rely on single point forecasts. It’s these quantiles that offer choice.

What do these quantiles mean? Check the following figure, which is a sample time-series forecasting prediction from Amazon SageMaker Canvas. The figure visualizes a time-series forecast with multiple outcomes, made possible through quantile regression. The red line, denoted p05, marks the level below which the actual value is expected to fall about 5% of the time. Conversely, 95% of the time the true number will likely fall above the p05 line.

Retail or CPG organizations can evaluate multiple quantile prediction points with a consideration for the over- and under-supply costs of each item to automatically select the quantile likely to provide the most profit in future periods. When necessary, you can override the selection when business rules desire a fixed quantile over a dynamic one.

quantiles

To learn more about how to use quantiles for your business, check out Beyond forecasting: The delicate balance of serving customers and growing your business.

Another powerful feature that Amazon SageMaker Canvas offers is what-if analysis, which complements quantile forecasting with the ability to interactively explore how changes in input variables affect predictions. Users can change model inputs and immediately observe how these changes impact individual predictions. This feature allows for real-time exploration of different scenarios without needing to retrain the model.

What-if analysis in SageMaker Canvas can be applied to various scenarios, such as:

  • Forecasting inventory in coming months
  • Predicting sales for the next quarter
  • Assessing the effect of price reductions on holiday season sales
  • Estimating customer footfall in stores over the next few hours

How to generate forecasts

The following example illustrates the steps users follow to generate forecasts from a time-series dataset. We use a consumer electronics dataset to forecast 5 months of sales based on current and historic demand. To download a copy of this dataset, visit .

In order to access Amazon SageMaker Canvas, you can either directly sign in using the AWS Management Console and navigate to Amazon SageMaker Canvas, or you can access Amazon SageMaker Canvas directly using single sign-on as detailed in Enable single sign-on access of Amazon SageMaker Canvas using AWS IAM Identity Center. In this post, we access Amazon SageMaker Canvas through the AWS console.

Generate forecasts

To generate forecasts, follow these steps:

  1. On the Amazon SageMaker console, in the left navigation pane, choose Canvas.
  2. Choose Open Canvas on the right side under Get Started, as shown in the following screenshot. If this is your first time using SageMaker Canvas, you need to create a SageMaker Canvas user by following the prompts on the screen. A new browser tab will open for the SageMaker Canvas console.

SageMaker Canvas

  3. In the left navigation pane, choose Datasets.
  4. To import your time-series dataset, choose the Import data dropdown menu and then choose Tabular, as shown in the following screenshot.

Import Data

  5. In Dataset name, enter a name such as Consumer_Electronics and then choose Create, as shown in the following screenshot.

Create Dataset

  6. Upload your dataset (in CSV or Parquet format) from your computer or an Amazon Simple Storage Service (Amazon S3) bucket.
  7. Preview the data, then choose Create dataset, as shown in the following screenshot.

Preview Dataset

Under Status, your dataset import will show as Processing. When it shows as Complete, proceed to the next step.

Processing Dataset Import

  8. Now that you have your dataset created and your time-series data file uploaded, create a new model to generate forecasts for your dataset. In the left navigation pane, choose My Models, then choose New model, as shown in the following screenshot.

Create Model

  9. In Model name, enter a name such as consumer_electronics_forecast. Under Problem type, select your use case type. Our use case is Predictive analysis, which builds models using tabular datasets for different problems, including forecasts.
  10. Choose Create.

Model Type

  11. You will be transferred to the Build tab. In the Target column dropdown menu, select the column where you want to generate the forecasts. This is the demand column in our dataset, as shown in the following screenshot. After you select the target column, SageMaker Canvas will automatically select Time series forecasting as the Model type.
  12. Choose Configure model.

Configure Model

  13. A window will pop up asking you to provide more information, as shown in the following screenshot. Enter the following details:
    1. Choose the column that uniquely identifies the items in your dataset – This configuration determines how you uniquely identify each item in the dataset. For this use case, select item_id because we’re planning to forecast sales per store.
    2. Choose a column that groups the forecast by the values in the column – If you have logical groupings of the items selected in the previous field, you can choose that feature here. We don’t have one for this use case, but examples would be state, region, country, or other groupings of stores.
    3. Choose the column that contains the time stamps – The timestamp is the feature that contains the timestamp information. SageMaker Canvas requires data timestamp in the format YYYY-MM-DD HH:mm:ss (for example, 2022-01-01 01:00:00).
    4. Specify the number of months you want to forecast into the future – SageMaker Canvas forecasts values up to the point in time specified in the timestamp field. For this use case, we will forecast values up to 5 months in the future. You may choose to enter any valid value, but be aware a higher number will impact the accuracy of predictions and also may take longer to compute.
    5. You can use a holiday schedule to improve your prediction accuracy – (Optional) You can enable Use holiday schedule and choose a relevant country if you want to learn how it helps with accuracy. However, it might not have much impact on this use case because our dataset is synthetic.

Configure Model 2

Configure Model 3

  14. To change the quantiles from the default values as explained previously, in the left navigation pane, choose Forecast quantiles. In the Forecast quantiles field, enter your own values, as shown in the following screenshot.

Change Quantiles

SageMaker Canvas chooses an AutoML algorithm based on your data and then trains an ensemble model to make predictions for time-series forecasting problems. Using time-series forecasts, you can make predictions that can vary with time, such as forecasting:

  • Your inventory in the coming months
  • Your sales for the next months
  • The effect of reducing the price on sales during the holiday season
  • The number of customers entering a store in the next several hours
  • How a reduction in the price of a product affects sales over a time period

If you’re not sure which forecasting algorithms to try, select all of them. To help you decide which algorithms to select, refer to Algorithms support for time-series forecasting, where you can learn more details and compare algorithms.

  15. Choose Save.

Train the model

Now that the configuration is done, you can train the model. SageMaker Canvas offers two build options:

  • Quick build – Builds a model in a fraction of the time compared to a standard build. Potential accuracy is exchanged for speed.
  • Standard build – Builds the best model from an optimized process powered by AutoML. Speed is exchanged for greatest accuracy.
  1. For this walkthrough, we choose Standard build, as shown in the following screenshot.

Build Model

  2. When the model training finishes, you will be routed to the Analyze tab. There, you can find the average prediction accuracy and the column impact on prediction outcome.

Your numbers might differ from what the following screenshot shows. This is due to the stochastic nature of the ML process.

Monitor Model

Here are explanations of what these metrics mean and how you can use them (a short sketch after the list shows how three of them are computed):

  • wQL – The average Weighted Quantile Loss (wQL) evaluates the forecast by averaging the accuracy at the P10, P50, and P90 quantiles (unless the user has changed them). A lower value indicates a more accurate model. In our example, we used the default quantiles. If you choose quantiles with different percentiles, wQL will be computed over the quantiles you choose.
  • MAPE – Mean absolute percentage error (MAPE) is the percentage error (percent difference of the mean forecasted value compared to the actual value) averaged over all time points. A lower value indicates a more accurate model, where MAPE = 0 is a model with no errors.
  • WAPE – Weighted Absolute Percent Error (WAPE) is the sum of the absolute error normalized by the sum of the absolute target, which measures the overall deviation of forecasted values from observed values. A lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
  • RMSE – Root mean square error (RMSE) is the square root of the average squared errors. A lower RMSE indicates a more accurate model, where RMSE = 0 is a model with no errors.
  • MASE – Mean absolute scaled error (MASE) is the mean absolute error of the forecast normalized by the mean absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model, where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse than the baseline.
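
As a quick illustration of the definitions above, the following sketch computes MAPE, WAPE, and RMSE on made-up demand numbers; this is illustrative only, not SageMaker Canvas code:

import numpy as np

actual = np.array([120.0, 80.0, 100.0, 90.0])
forecast = np.array([110.0, 95.0, 100.0, 80.0])

# MAPE: per-point percentage errors, averaged over all time points
mape = np.mean(np.abs((forecast - actual) / actual))
# WAPE: total absolute error normalized by total absolute demand
wape = np.abs(forecast - actual).sum() / np.abs(actual).sum()
# RMSE: square root of the mean squared error; penalizes large misses more
rmse = np.sqrt(np.mean((forecast - actual) ** 2))

print(f"MAPE={mape:.3f}, WAPE={wape:.3f}, RMSE={rmse:.2f}")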

wQL is the default metric, but you can change it based on your needs. Companies should choose a metric that aligns with their specific business goals and is straightforward for stakeholders to interpret. The choice of metric should be driven by the specific characteristics of the demand data, the business objectives, and the interpretability requirements of stakeholders.

For instance, a high-traffic grocery store that sells perishable items requires the lowest possible wQL. This is crucial to prevent lost sales from understocking while also avoiding overstocking, which can lead to spoilage of those perishables.

It’s often recommended to evaluate multiple metrics and select the one that best aligns with the company’s forecasting goals and data patterns. For example, wQL is a robust metric that can handle intermittent demand and provide a more comprehensive evaluation of forecast accuracy across different quantiles. However, RMSE gives higher weight to larger errors due to the squaring operation, making it more sensitive to outliers.

  1. Choose Predict to open the Predict tab.

To generate forecast predictions for all the items in the dataset, select Batch prediction. To generate forecast predictions for a specific item (for example, to predict demand in real time), select Single prediction. The following steps show how to perform both operations.

Predictions

To generate forecast predictions for a specific item, follow these steps:

  1. Choose Single item and select any of the items from the item dropdown list. SageMaker Canvas generates a prediction for the item, showing the average prediction (that is, demand of that item with respect to timestamp). SageMaker Canvas provides results for the upper bound, lower bound, and expected forecast.

It’s a best practice to have bounds rather than a single prediction point so that you can pick whichever best fits your use case. For example, you might want to reduce the waste of resources from overstock by using the lower bound, or you might want to follow the upper bound to make sure that you meet customer demand. For instance, a highly advertised item in a promotional flyer might be stocked at the 90th percentile (p90) to ensure availability and prevent customer disappointment. On the other hand, accessories or bulky items that are less likely to drive customer traffic could be stocked at the 40th percentile (p40). It’s generally not advisable to stock below the 40th percentile, to avoid being consistently out of stock.

  1. To download the forecast, select the Download prediction dropdown menu and choose whether to download the forecast prediction chart as an image or the forecast prediction values as a CSV file.

View Predictions

You can use the What if scenario button to explore how changing the price will affect the demand for an item. To use this feature, you must leave the future-dated rows empty for the feature you’re predicting. This dataset has empty cells for a few items, which means this feature is enabled for them. Choose What if scenario and edit the values for the different dates to view how changing the price will affect demand. This feature helps organizations test specific scenarios without making changes to the underlying data.

To generate batch predictions on the entire dataset, follow these steps:

  1. Choose All items and then choose Start Predictions. The Status will show as Generating predictions, as shown in the following screenshot.

Generate Predictions

  1. When it’s complete, the Status will show as Ready, as shown in the following screenshot. Select the three-dot additional options icon and choose Preview. This will open the prediction results in a preview page.

Preview Predictions

  1. Choose Download to export these results to your local computer or choose Send to Amazon QuickSight for visualization, as shown in the following screenshot.

Download Predictions

Training time and performance

SageMaker Canvas provides efficient training times and offers valuable insights into model performance. You can inspect model accuracy, perform backtesting, and evaluate various performance metrics for the underlying models. By combining multiple algorithms in the background, SageMaker Canvas significantly reduces the time required to train models compared to training each model individually. Additionally, by using the model leaderboard dashboard, you can assess the performance of each trained algorithm against your specific time-series data, ranked based on the selected performance metric (wQL by default).

This dashboard also displays other metrics, which you can use to compare different algorithms trained on your data across various performance measures, facilitating informed decision-making and model selection.

To view the leaderboard, choose Model leaderboard, as shown in the following screenshot.

Model Leader board

The model leaderboard shows you the different algorithms used to train your data along with their performance based on all the available metrics, as shown in the following screenshot.

Algorithms used

Integration

Retail and consumer packaged goods (CPG) organizations often rely on applications such as inventory lifecycle management, order management systems, and business intelligence (BI) dashboards, which incorporate forecasting capabilities. In these scenarios, organizations can seamlessly integrate the SageMaker Canvas forecasting service with their existing applications, enabling them to harness the power of forecasting data. To use the forecasting data within these applications, an endpoint for the forecasting model is required. Although SageMaker Canvas models can be deployed to provide endpoints, this process may require additional effort from a machine learning operations (MLOps) perspective. Fortunately, Amazon SageMaker streamlines the deployment and integration of SageMaker Canvas models.

The following steps show how you can deploy SageMaker Canvas models using SageMaker:

  1. On the SageMaker console, in the left navigation pane, choose My Models.
  2. Select the three-dot additional options icon next to the model you want to deploy and choose Deploy, as shown in the following screenshot.

Deploy Model

  1. Under Instance type, select the type of instance to deploy your model to. Choose Deploy and wait until your deployment status changes to In service.

Select Instance

  1. After your deployment is in service, in the left navigation pane, choose ML Ops to get your deployed model endpoint, as shown in the following screenshot. You can test your deployment or start using the endpoint in your applications; a sketch of invoking the endpoint follows the screenshot.

Deployed Model Endpoint
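
The following is a hedged sketch of calling the deployed endpoint from an application with boto3; the endpoint name and CSV payload layout are assumptions you should match to your model’s input schema:

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="canvas-demand-forecast",         # hypothetical endpoint name
    ContentType="text/csv",
    Body="item_001,store_12,2023-06-01 00:00:00",  # hypothetical input row
)
print(response["Body"].read().decode("utf-8"))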

Reproducibility and API management

For details on reproducing and managing Amazon SageMaker Canvas models through APIs, see Speed up your time series forecasting by up to 50 percent with Amazon SageMaker Canvas UI and AutoML APIs in the AWS Machine Learning Blog.

Insights

Retail and CPG enterprises typically use visualization tools such as Amazon QuickSight or third-party software such as Tableau to understand forecast results and share them across business units. To streamline visualization, SageMaker Canvas provides embedded visualization for exploring forecast results. For those retail and CPG enterprises that want to visualize the forecasting data in their own BI dashboard systems (such as Amazon QuickSight, Tableau, and Qlik), SageMaker Canvas forecasting models can be deployed to generate forecasting endpoints. Users can also send a batch prediction file to Amazon QuickSight directly from the predict window, as shown in the following screenshot.

Quicksight Integration

The following screenshot shows the batch prediction file in QuickSight as a dataset that you can use for analysis.

Dataset selection from Quicksight

When your dataset is in Amazon QuickSight, you can start analyzing or even visualizing your data using the visualization tools, as shown in the following screenshot.

Quicksight Analysis

Cost

Amazon SageMaker Canvas offers a flexible, cost-effective pricing model based on three key components: workspace instance runtime, utilization of pre-built models, and resource consumption for custom model creation and prediction generation. The billing cycle commences upon launching the SageMaker Canvas application, encompassing a range of essential tasks including data ingestion, preparation, exploration, model experimentation, and analysis of prediction and explainability results. This comprehensive approach means that users only pay for the resources they actively use, providing a transparent and efficient pricing structure. To learn more about pricing examples, check out Amazon SageMaker Canvas pricing.

Ownership and portability

Many retail and CPG enterprises have embraced multi-cloud deployments. To streamline portability of models built and trained on Amazon SageMaker Canvas to other cloud providers or on-premises environments, Amazon SageMaker Canvas provides downloadable model artifacts.

Also, several retail and CPG companies have many business units (such as merchandising, planning, or inventory management) within the organization, all of which use forecasting for solving different use cases. To streamline ownership of a model and facilitate straightforward sharing between business units, Amazon SageMaker Canvas now extends its Model Registry integration to time series forecasting models. With a single click, customers can register the ML models built on Amazon SageMaker Canvas with the SageMaker Model Registry, as shown in the following screenshot. Register a Model Version in the Amazon SageMaker Developer Guide shows you where to find the S3 bucket location where your model’s artifacts are stored; the sketch after the following screenshot shows one way to locate these artifacts programmatically.

Model Registry
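
As a hedged sketch, you can also locate a registered model’s artifacts with boto3; the model package group name here is an illustrative assumption:

import boto3

sm = boto3.client("sagemaker")

# List versions in a (hypothetical) model package group created from Canvas.
packages = sm.list_model_packages(ModelPackageGroupName="canvas-forecast-models")
latest_arn = packages["ModelPackageSummaryList"][0]["ModelPackageArn"]

# ModelDataUrl points to the S3 location of the model artifacts.
details = sm.describe_model_package(ModelPackageName=latest_arn)
for container in details["InferenceSpecification"]["Containers"]:
    print(container["ModelDataUrl"])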

Clean up

To avoid incurring unnecessary costs, you can delete the model you just built, then delete the dataset, and sign out of your Amazon SageMaker Canvas domain. If you also signed up for Amazon QuickSight, you can unsubscribe and remove your Amazon QuickSight account.

Conclusion

Amazon SageMaker Canvas empowers retail and CPG companies with a no-code forecasting solution. It delivers automated time-series predictions for inventory planning and demand anticipation, featuring an intuitive interface and rapid model development. With seamless integration capabilities and cost-effective insights, it enables businesses to enhance operational efficiency, meet customer expectations, and gain a competitive edge in the fast-paced retail and consumer goods markets.

We encourage you to evaluate how you can improve your forecasting capabilities using Amazon SageMaker Canvas. Use the intuitive no-code interface to analyze and improve the accuracy of your demand predictions for retail and CPG products, enhancing inventory management and operational efficiency. To get started, you can review the workshop Amazon SageMaker Canvas Immersion Day.


About the Authors

Aditya Pendyala is a Principal Solutions Architect at AWS based out of NYC. He has extensive experience in architecting cloud-based applications. He is currently working with large enterprises to help them craft highly scalable, flexible, and resilient cloud architectures, and guides them on all things cloud. He has a Master of Science degree in Computer Science from Shippensburg University and believes in the quote “When you cease to learn, you cease to grow.”

Julio Hanna, an AWS Solutions Architect based in New York City, specializes in enterprise technology solutions and operational efficiency. With a career focused on driving innovation, he currently leverages Artificial Intelligence, Machine Learning, and Generative AI to help organizations navigate their digital transformation journeys. Julio’s expertise lies in harnessing cutting-edge technologies to deliver strategic value and foster innovation in enterprise environments.

Read More

Enabling generative AI self-service using Amazon Lex, Amazon Bedrock, and ServiceNow

Enabling generative AI self-service using Amazon Lex, Amazon Bedrock, and ServiceNow

Chat-based assistants have become an invaluable tool for providing automated customer service and support. This post builds on a previous post, Integrate QnABot on AWS with ServiceNow, and explores how to build an intelligent assistant using Amazon Lex, Amazon Bedrock Knowledge Bases, and a custom ServiceNow integration to create an automated incident management support experience.

Amazon Lex is powered by the same deep learning technologies used in Alexa. With it, developers can quickly build conversational interfaces that can understand natural language, engage in realistic dialogues, and fulfill customer requests. Amazon Lex can be configured to respond to customer questions using Amazon Bedrock foundation models (FMs) to search and summarize FAQ responses. Amazon Bedrock Knowledge Bases provides the capability of amassing data sources into a repository of information. Using knowledge bases, you can effortlessly create an application that uses Retrieval Augmented Generation (RAG), a technique where the retrieval of information from data sources enhances the generation of model responses.

ServiceNow is a cloud-based platform for IT workflow management and automation. With its robust capabilities for ticketing, knowledge management, human resources (HR) services, and more, ServiceNow is already powering many enterprise service desks.

By connecting an Amazon Lex chat assistant with Amazon Bedrock Knowledge Bases and ServiceNow, companies can provide 24/7 automated support and self-service options to customers and employees. In this post, we demonstrate how to integrate Amazon Lex with Amazon Bedrock Knowledge Bases and ServiceNow.

Solution overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. The ServiceNow knowledge bank is exported into Amazon Simple Storage Service (Amazon S3), which will be used as the data source for Amazon Bedrock Knowledge Bases. Data in Amazon S3 is encrypted by default. You can further enhance security by using server-side encryption with AWS KMS keys (SSE-KMS).
  2. Amazon AppFlow can be used to sync between ServiceNow and Amazon S3. Other alternatives like AWS Glue can also be used to ingest data from ServiceNow.
  3. Amazon Bedrock Knowledge Bases is created with Amazon S3 as the data source and Amazon Titan (or any other model of your choice) as the embedding model.
  4. When users of the Amazon Lex chat assistant ask queries, Amazon Lex fetches answers from Amazon Bedrock Knowledge Bases.
  5. If the user requests that a ServiceNow ticket be created, Amazon Lex invokes the AWS Lambda function (see the sketch following this list).
  6. The Lambda function fetches secrets from AWS Secrets Manager and makes an HTTP call to create a ServiceNow ticket.
  7. Application Auto Scaling is enabled on AWS Lambda to automatically scale Lambda according to user interactions.
  8. The solution conforms to responsible AI policies, and Guardrails for Amazon Bedrock enforces the organization’s responsible AI policies.
  9. The solution is monitored using Amazon CloudWatch, AWS CloudTrail, and Amazon GuardDuty.

Be sure to follow least privilege access policies while giving access to any system resources.
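
The following is a hedged sketch of the Lambda function in step 6, which fetches credentials from Secrets Manager and creates an incident through ServiceNow’s Table API; the secret name, instance host, and field values are illustrative assumptions:

import base64
import json
import urllib.request

import boto3

def lambda_handler(event, context):
    # Fetch the ServiceNow credentials stored in AWS Secrets Manager.
    secret = json.loads(
        boto3.client("secretsmanager").get_secret_value(
            SecretId="servicenow/credentials"  # hypothetical secret name
        )["SecretString"]
    )

    # Create an incident through the ServiceNow Table API.
    url = "https://example.service-now.com/api/now/table/incident"  # hypothetical host
    payload = json.dumps({"short_description": event.get("description", "")})
    token = base64.b64encode(
        f"{secret['username']}:{secret['password']}".encode()
    ).decode()

    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())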

Prerequisites

The following prerequisites need to be completed before building the solution.

  1. On the Amazon Bedrock console, sign up for access to the Anthropic Claude model of your choice using the instructions at Manage access to Amazon Bedrock foundation models. For information about pricing for using Amazon Bedrock, see Amazon Bedrock pricing.
  2. Sign up for a ServiceNow account if you do not have one. Save your username and password. You will need to store them in AWS Secrets Manager later in this walkthrough.
  3. Create a ServiceNow instance following the instructions in Integrate QnABot on AWS with ServiceNow.
  4. Create a user with permissions to create incidents in ServiceNow using the instructions at Create a user. Make a note of these credentials for use later in this walkthrough.

The instructions provided in this walkthrough are for demonstration purposes. Follow ServiceNow documentation to create community instances and follow their best practices.

Solution overview

To integrate Amazon Lex with Amazon Bedrock Knowledge Bases and ServiceNow, follow the steps in the next sections.

Deployment with AWS CloudFormation console

In this step, you first create the solution architecture discussed in the solution overview, except for the Amazon Lex assistant, which you will create later in the walkthrough. Complete the following steps:

  1. On the CloudFormation console, verify that you are in the correct AWS Region and choose Create stack to create the CloudFormation stack.
  2. Download the CloudFormation template and upload it in the Specify template section. Choose Next.
  3. For Stack name, enter a name such as ServiceNowBedrockStack.
  4. In the Parameters section, for ServiceNow details, provide the values of ServiceNow host and ServiceNow username created earlier.
  5. Keep the other values as default. Under Capabilities on the last page, select I acknowledge that AWS CloudFormation might create IAM resources. Choose Submit to create the CloudFormation stack.
  6. After the successful deployment of the whole stack, from the Outputs tab, make a note of the output key value BedrockKnowledgeBaseId because you will need it later during creation of the Amazon Lex assistant.

Integration of Lambda with Application Auto Scaling is beyond the scope of this post. For guidance, refer to the instructions at AWS Lambda and Application Auto Scaling.

Store the secrets in AWS Secrets Manager

Follow these steps to store your ServiceNow username and password in AWS Secrets Manager:

  1. On the CloudFormation console, on the Resources tab, enter the word “secrets” to filter search results. Under Physical ID, select the console URL of the AWS Secrets Manager secret you created using the CloudFormation stack.
  2. On the AWS Secrets Manager console, on the Overview tab, under Secret value, choose Retrieve secret value.
  3. Select Edit and enter the username and password of the ServiceNow instance you created earlier. Make sure that both the username and password are correct.

Download knowledge articles

You need access to ServiceNow knowledge articles. Follow these steps:

  1. Create a knowledge base if you don’t have one. Periodically, you may need to sync your knowledge base to keep it up to date.
  2. Sync the data from ServiceNow to Amazon S3 using Amazon AppFlow by following instructions at ServiceNow. Alternatively, you can use AWS Glue to ingest data from ServiceNow to Amazon S3 by following instructions at the blog post, Extract ServiceNow data using AWS Glue Studio in an Amazon S3 data lake and analyze using Amazon Athena.
  3. Download a sample article.

Sync Amazon Bedrock Knowledge Bases

This solution uses the fully managed Knowledge Base for Amazon Bedrock to seamlessly power a RAG workflow, eliminating the need for custom integrations and data flow management. As the data source for the knowledge base, the solution uses Amazon S3. The following steps outline uploading ServiceNow articles to an S3 bucket created by a CloudFormation template.

  1. On the CloudFormation console, on the Resources tab, enter “S3” to filter search results. Under Physical ID, select the URL for the S3 bucket created using the CloudFormation stack.
  2. Upload the previously downloaded knowledge articles to this S3 bucket.

Next you need to sync the data source.

  1. On the CloudFormation console, on the Outputs tab, enter “Knowledge” to filter search results. Under Value, select the console URL of the knowledge bases that you created using the CloudFormation stack. Open that URL in a new browser tab.
  2. Scroll down to Data source and select the data source. Choose Sync.

You can test the knowledge base by choosing the model in the Test the knowledge base section and asking the model a question.
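
If you prefer to script the upload and sync instead of using the console, the following hedged boto3 sketch shows the same two steps; the bucket name, knowledge base ID, and data source ID are illustrative assumptions (take the real values from the CloudFormation outputs):

import boto3

# Upload a knowledge article to the S3 data source bucket.
s3 = boto3.client("s3")
s3.upload_file("sample_article.pdf", "servicenow-kb-bucket", "articles/sample_article.pdf")

# Start an ingestion job to sync the knowledge base with the data source.
bedrock_agent = boto3.client("bedrock-agent")
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KB12345678",  # the BedrockKnowledgeBaseId stack output
    dataSourceId="DS12345678",     # hypothetical data source ID
)
print(job["ingestionJob"]["status"])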

Responsible AI using Guardrails for Amazon Bedrock

Conversational AI applications require robust guardrails to safeguard sensitive user data, adhere to privacy regulations, enforce ethical principles, and mitigate hallucinations, fostering responsible development and deployment. Guardrails for Amazon Bedrock allow you to configure your organizational policies against the knowledge bases. They help keep your generative AI applications safe by evaluating both user inputs and model responses.

To set up guardrails, follow these steps:

  1. Follow the instructions at the Amazon Bedrock User Guide to create a guardrail.

You can reduce hallucinations in the model responses by enabling the grounding check and relevance check and adjusting their thresholds.

  2. Create a version of the guardrail.
  3. Select the newly created guardrail and copy the guardrail ID. You will use this ID later in the intent creation.
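
If you prefer to script these steps, the following hedged boto3 sketch creates a guardrail with grounding and relevance checks and publishes a version; the name, messages, and thresholds are illustrative assumptions:

import boto3

bedrock = boto3.client("bedrock")

guardrail = bedrock.create_guardrail(
    name="servicenow-assistant-guardrail",
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that information.",
    # Grounding and relevance checks reduce hallucinations by filtering out
    # responses that are unsupported by, or irrelevant to, the source data.
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},
            {"type": "RELEVANCE", "threshold": 0.75},
        ]
    },
)
version = bedrock.create_guardrail_version(guardrailIdentifier=guardrail["guardrailId"])
print(guardrail["guardrailId"], version["version"])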

Amazon Lex setup

In this section, you configure your Amazon Lex chat assistant with intents to call Amazon Bedrock. This walkthrough uses Amazon Lex V2.

  1. On the CloudFormation console, on the Outputs tab, copy the value of BedrockKnowledgeBaseId. You will need this ID later in this section.
  2. On the Outputs tab, under Outputs, enter “bot” to filter search results. Choose the console URL of the Amazon Lex assistant you created using the CloudFormation stack. Open that URL in a new browser tab.
  3. On the Amazon Lex Intents page, choose Create another intent. On the Add intent dropdown menu, choose Use built-in intent.
  4. On the Use built-in intent screen, under Built-in intent, choose QnAIntent - GenAI feature.
  5. For Intent name, enter BedrockKb and select Add.
  6. In the QnA configuration section, under Select model, choose Anthropic and Claude 3 Haiku or a model of your choice.
  7. Expand Additional Model Settings and enter the Guardrail ID for the guardrails you created earlier. Under Guardrail Version, enter a number that corresponds to the number of versions you have created.
  8. Enter the Knowledge base for Amazon Bedrock Id that you captured earlier in the CloudFormation outputs section. Choose Save intent at the bottom.

You can now add more QnAIntents pointing to different knowledge bases.

  9. Return to the intents list by choosing Back to intents list in the navigation pane.
  10. Select Build to build the assistant.

A green banner on the top of the page with the message Successfully built language English (US) in bot: servicenow-lex-bot indicates the Amazon Lex assistant is now ready.

Test the solution

To test the solution, follow these steps:

  1. In the navigation pane, choose Aliases. Under Aliases, select TestBotAlias.
  2. Under Languages, choose English (US). Choose Test.
  3. A new test window will pop up at the bottom of the screen.
  4. Enter the question “What benefits does AnyCompany offer to its employees?” Then press Enter.

The chat assistant generates a response based on the content in the knowledge base.

  5. To test whether Amazon Lex creates a ServiceNow ticket for information not present in the knowledge base, enter the question “Create a ticket for password reset” and press Enter.

The chat assistant generates a new ServiceNow ticket because this information is not available in the knowledge base.

To search for the incident, log in to the ServiceNow endpoint that you configured earlier.

Monitoring

You can use CloudWatch logs to review the performance of the assistant and to troubleshoot issues with conversations. From the CloudFormation stack that you deployed, you have already configured your Amazon Lex assistant CloudWatch log group with appropriate permissions.

To view the conversation logs from the Amazon Lex assistant, follow these directions.

On the CloudFormation console, on the Outputs tab, enter “Log” to filter search results. Under Value, choose the console URL of the CloudWatch log group that you created using the CloudFormation stack. Open that URL in a new browser tab.

To protect sensitive data, Amazon Lex obscures slot values in conversation logs. As a security best practice, do not store any slot values in request or session attributes. Amazon Lex V2 doesn’t obscure slot values in audio. You can selectively capture only text using the instructions at Selective conversation log capture.

Enable logging for Amazon Bedrock ingestion jobs

You can monitor Amazon Bedrock ingestion jobs using CloudWatch. To configure logging for an ingestion job, follow the instructions at Knowledge bases logging.

AWS CloudTrail logs

AWS CloudTrail is an AWS service that tracks actions taken by a user, role, or AWS service. CloudTrail is enabled on your AWS account when you create the account. When activity occurs in your AWS account, that activity is recorded in a CloudTrail event along with other AWS service events in Event history. You can view, search, and download recent events in your AWS account. For more information, see Working with CloudTrail Event history.

As a security best practice, you should monitor any access to your environment. You can configure Amazon GuardDuty to identify any unexpected and potentially unauthorized activity in your AWS environment.

Cleanup

To avoid incurring future charges, delete the resources you created. To clean up the AWS environment, use the following steps:

  1. Empty the contents of the S3 bucket you created as part of the CloudFormation stack.
  2. Delete the CloudFormation stack you created.

Conclusion

As customer expectations continue to evolve, embracing innovative technologies like conversational AI and knowledge management systems becomes essential for businesses to stay ahead of the curve. By implementing this integrated solution, companies can enhance operational efficiency and deliver superior service to both their customers and employees, while also adapting the responsible AI policies of the organization.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.


About the Authors

Marcelo Silva is an experienced tech professional who excels in designing, developing, and implementing cutting-edge products. Starting off his career at Cisco, Marcelo worked on various high-profile projects including deployments of the first ever carrier routing system and the successful rollout of ASR9000. His expertise extends to cloud technology, analytics, and product management, having served as senior manager for several companies such as Cisco, Cape Networks, and AWS before joining GenAI. Currently working as a Conversational AI/GenAI Product Manager, Marcelo continues to excel in delivering innovative solutions across industries.

Sujatha Dantuluri is a seasoned Senior Solutions Architect on the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences. She has contributed to IEEE standards and is passionate about empowering others through her engaging presentations and thought-provoking ideas.

NagaBharathi Challa is a solutions architect on the US federal civilian team at Amazon Web Services (AWS). She works closely with customers to effectively use AWS services for their mission use cases, providing architectural best practices and guidance on a wide range of services. Outside of work, she enjoys spending time with family and spreading the power of meditation.

Pranit Raje is a Cloud Architect on the AWS Professional Services India team. He specializes in DevOps, operational excellence, and automation using DevSecOps practices and infrastructure as code. Outside of work, he enjoys going on long drives with his beloved family, spending time with them, and watching movies.

Read More

NoTraffic Reduces Road Delays, Carbon Emissions With NVIDIA AI and Accelerated Computing

NoTraffic Reduces Road Delays, Carbon Emissions With NVIDIA AI and Accelerated Computing

More than 90 million new vehicles are introduced to roads across the globe every year, leading to an annual 12% increase in traffic congestion — according to NoTraffic, a member of the NVIDIA Inception program for cutting-edge startups and the NVIDIA Metropolis vision AI ecosystem.

Still, 99% of the world’s traffic signals run on fixed timing plans, leading to unnecessary congestion and delays.

To reduce such inefficiencies, mitigate car accidents and reduce carbon emissions from vehicles, NoTraffic’s AI Mobility platform predicts road scenarios, helps ensure continuous traffic flow, minimizes stops and optimizes safety at intersections across the U.S., Canada and elsewhere.

The platform — which enables road infrastructure management at both local-intersection and city-grid scale — integrates NVIDIA-powered software and hardware at the edge, under a cloud-based operating system.

It’s built using the NVIDIA Jetson edge AI platform, NVIDIA accelerated computing and the NVIDIA Metropolis vision AI developer stack.

“With NVIDIA accelerated computing, we achieved a 3x speedup in AI training and doubled AI Mobility’s energy efficiency,” said Uriel Katz, cofounder and chief technology officer of NoTraffic. “These optimizations in time, money and energy efficiency are all bolstered by NVIDIA Jetson, which sped our image preprocessing tasks by 40x compared with a CPU-only workflow. Plus, GPU-accelerated NVIDIA CUDA libraries increased our model throughput by 30x.”

These libraries include the NVIDIA TensorRT ecosystem of application programming interfaces for high-performance deep learning inference and the NVIDIA cuDNN library of primitives for deep neural networks.

Taming Traffic in Tucson, Vancouver and Beyond

In Tucson, Arizona, more than 80 intersections are tapping into the NoTraffic AI Mobility platform, which has enabled up to a 46% reduction in road delays during rush hours — and a half-mile reduction in peak queue length.

The work is an expansion of NoTraffic’s initial deployment on Tucson’s West Ajo Way. That effort led to an average delay reduction of 23% for drivers.

Since installation, NoTraffic technology has helped free Tucson drivers from over 1.25 million hours stuck in traffic, the company estimates, representing an economic benefit of over $24.3 million. The company has also tracked a nearly 80% reduction in red-light runners since its platform was deployed, helping improve safety at Tucson intersections.

By reducing travel times, drivers have also saved over $1.6 million in gas, cutting emissions and improving air quality to make the equivalent impact of planting 650,000 trees.

In Vancouver, Canada, the University of British Columbia (UBC) is using the NoTraffic platform and Rogers Communications’ 5G-connected, AI-enabled smart-traffic platform to reduce both pedestrian delays and greenhouse gas emissions.

Rogers Communications’ 5G networks provide robust and stable connectivity to the sensors embedded on the traffic poles.

This advanced network infrastructure enhances the NoTraffic platform’s efficacy and scalability, as the improved speed and reduced latency of 5G networks means traffic data can be processed in real time. This is critical for predicting numerous potential traffic scenarios, adjusting signal timings and prioritizing road users accordingly.

With AI Mobility deployed at seven intersections across the campus, the university experienced an up to 40% reduction in pedestrian delays and significant decreases in vehicle wait time.

In addition, UBC reduces 74 tons of carbon dioxide emissions each year thanks to the NoTraffic and Rogers solution, which is powered by NVIDIA edge AI and accelerated computing.

The platform is also in action on the roads of Phoenix, Arizona; Baltimore, Maryland; and in 35 states through 200+ agencies across the U.S. and Canada.

Honk If You Love Reducing Congestion, Carbon Emissions

The NoTraffic AI Mobility platform offers local AI-based predictions that, based on sensor inputs at multiple intersections, analyze numerous traffic scenarios up to two minutes in advance.

It can adapt to real-time changes in traffic patterns and volumes, send messages between intersections and run optimization algorithms that control traffic signals to improve overall transportation efficiency and safety through cloud connectivity.

Speedups in the AI Mobility platform mean quicker optimizations of traffic signals — and reduced congestion on the roads means reduced carbon emissions from vehicles.

NoTraffic estimates that for every city optimized with this platform, eight hours of traffic time could be saved per driver. Plus, with over 300,000 signalized intersections in the U.S., the company says this could result in a total of $14 billion in economic savings per year.

Learn more about the NVIDIA Metropolis platform and how it’s used in smart cities and spaces.

Read More

Accelerating LLM Inference with GemLite, TorchAO and SGLang

Accelerating LLM Inference with GemLite, TorchAO and SGLang

Large Language Models (LLMs) are typically very resource-intensive, requiring significant amounts of memory, compute, and power to operate effectively. Quantization provides a solution by reducing weights and activations from 16-bit floats to lower bitrates (e.g., 8-bit, 4-bit, 2-bit), achieving significant speedups and memory savings while also enabling support for larger batch sizes.

Existing solutions for low-precision inference work well for small batch sizes, but suffer from the following issues:

  • Performance drops when we increase the batch size
  • Restrictions on types of quantization, for example, some kernels only support symmetric quantization, which could affect the accuracy of the model at lower bits
  • Interplay between quantization, serialization, and tensor parallelism (TP) makes it difficult to load quantized models and requires changes to user models

To address these challenges, we created an end-to-end, performant, modular and extensible low-precision inference solution integrating the following libraries:

  • GemLite, a Triton kernel library, tackles the performance limitations of large batch sizes and restrictions on the types of quantization
  • TorchAO, a PyTorch-native library, provides a streamlined experience for quantization, sparsity, and tensor parallelism (with DTensor)
  • SGLang, a fast, efficient and hackable serving framework for Large Language Model (LLM) and Vision Language Models (VLM) with extensive model support

If you’re interested in trying this out in SGLang, please follow these repro instructions. For the rest of the blog, we’ll walk through relevant details for GemLite, TorchAO, and SGLang, covering both the design of each library and the integrations that address the problems mentioned above. At the end, we’ll present benchmarking results for the Llama 3.1-8B model across different batch sizes and tensor parallel sizes.

1. Teaser of Results

Following is a summary of the results on an 8xH100 machine with Llama 3.1-8B for decode. For all experiments, the baseline is a bfloat16 torch.compile’d model:

| | bfloat16 w/ torch.compile | int4 weight only quantization, group size 64 | float8 per row dynamic quantization |
|---|---|---|---|
| Batch size 1, TP size 1 | 131 tokens/sec | 255 tokens/sec (1.95x speedup) | 166 tokens/sec (1.27x speedup) |
| Batch size 32, TP size 1 | 2799 tokens/sec | 3241 tokens/sec (1.16x speedup) | 3586 tokens/sec (1.28x speedup) |
| Batch size 32, TP size 4 | 5575 tokens/sec | 6334 tokens/sec (1.14x speedup) | 6159 tokens/sec (1.10x speedup) |

Our solution supports NVIDIA GPUs, including H100 and A100, and achieves speedup over the compiled bfloat16 baseline across batch sizes and TP sizes for both int4 weight only (from 1.14x to 1.95x) and float8 dynamic quantization (from 1.10x to 1.28x). Note that quantization may have a small impact on accuracy, which is outside the scope of this blogpost. Our int4 weight-only quantization is compatible with accuracy preserving techniques like HQQ. Please refer to TorchAO’s README, this benchmark, and this blog for more information.

2. GemLite: Kernel Development

The kernels were developed as part of GemLite, a project dedicated to optimizing low-bit matrix multiplication kernels. Developed using Triton, GemLite provides highly flexible and performant solutions across various activations, bitrates and hardware. In a nutshell, the kernels offer:

  • Support for various activation data types: fp16, int8 and fp8
  • Compatibility: works seamlessly with non-packed (e.g., int8, fp8) and packed formats (e.g., uint4, uint2, uint1)
  • Performance Optimization: includes optimized kernels and autotuning tools to achieve high performance across different hardware and batch sizes
  • Integration: Compatible with torch.compile and CUDA graphs, ensuring support for advanced features like tensor parallelism

Kernel Selection

Optimizing kernel selection for large language model (LLM) generation requires addressing the distinct needs of different batch sizes. LLM workloads involve a mix of compute-bound and memory-bound iterations: smaller batch sizes are memory-bound, while larger batch sizes become compute-bound. GemLite kernels are designed to adapt to these varying demands, ensuring optimal execution for each scenario.

In memory-bound scenarios, where data transfer is the limiting factor, the processor often waits for data to be fetched, leading to underutilized computational resources. For batch size = 1, a GEMV kernel performs best, whereas for larger batch sizes, GEMM kernels are more efficient. For batch sizes between 2 and 64, when matrices are “skinny,” a GEMM-SPLITK kernel is used to enable better GPU utilization (arXiv).

GemLite includes the following kernels optimized for each of these scenarios:

Single Sample Inference

For single-sample inferences, we use GEMV kernels. However, asymmetric quantization methods require additional metadata, such as scales and zero points, to be loaded for each block. This can lead to increased memory transfer, so careful handling is essential.

Specifically, for packed data, our experiments indicate that loading scales and zero points only once per two consecutive blocks minimizes redundant operations. Since these blocks share the same metadata, this approach results in:

  • 5–8% end-to-end inference speedup compared to the default GEMV kernel
  • 30–40% improvement over the traditional Split-K method

This new kernel/algorithm, GEMV_REVSPLITK, is available here.

For non-packed data, the GEMV_SPLITK algorithm is employed. This algorithm iterates over the k-dimension to compute the dot product without relying on Triton’s tl.dot.

Batched Inference

For moderate batch sizes, we use the GEMM-based Split-K method (arXiv) which splits the k-dimension (weight rows) into multiple jobs. The optimal-split SPLIT_K parameter is found by autotuning values ranging from 1 to 16. Setting SPLIT_K=1 enables a fallback implementation to a GEMM kernel, allowing the same kernel code to be used for compute-bound batch sizes starting from 32 and 64, depending on the matrix shape and the device.

Maximizing High Performance: Key Implementation Insights

Various implementation details must be carefully addressed to achieve high performance. Following are some of the key aspects we focused on to ensure high performance:

  1. Autotuning for Performance

    Autotuning is critical for achieving optimal kernel performance. Since this process can be time-intensive, GemLite provides tools to automatically save and load autotuning results for all kernels. This ensures that the autotuning process is performed only once per GPU device, minimizing runtime, reducing repetitive overhead, and maintaining consistent performance across runs.

  2. Ensuring Kernel Correctness

    Ensuring kernel correctness across different quantization and configuration settings is essential. Triton’s early configuration pruning plays a key role in this process. For example, during Split-K tuning, configurations are selected only if K is divisible by BLOCK_SIZE_K × SPLIT_K, and BLOCK_SIZE_K is further pruned based on the group-size value. This approach ensures both efficiency and correctness in kernel operation.

  3. Overcoming Bit-Unpacking Bottlenecks

    When deploying on data center-grade GPUs like NVIDIA’s A100 and H100, performance bottlenecks related to bit-unpacking were observed. To mitigate these, various bit-packing configurations were explored, including packing along columns versus rows and experimenting with different bit-packing widths (e.g., 8-bit vs. 32-bit). Notably, transitioning from 32-bit to 8-bit packing delivered performance improvements of up to 18% on the A100 and 6% on the H100.

  4. torch.compile compatibility

    To ensure seamless compatibility with PyTorch’s torch.compile, kernel calls are wrapped in a custom_op. This integration allows advanced features such as pre-hooks and early configuration pruning to function correctly, delivering accurate results without sacrificing performance. While some of these features are not yet fully supported in PyTorch, the custom_op implementation effectively bridges the gap, ensuring smooth integration and high performance.

3. TorchAO

TorchAO is a PyTorch native quantization and sparsity library for both training and inference, featuring simple user APIs to train, quantize and deploy low precision models, and composability with other PyTorch features like distributed inference and torch.compile.

PyTorch does not support low precision dtypes or different packing formats by default. With Tensor Subclass, we extend PyTorch native Tensor abstractions and model quantization as dtype conversion, while different packing formats for custom kernels are handled through layouts. For example, we support quantized linear operations with int4 weights, packed in a Tensor Core friendly layout, with tinygemm or GemLite kernel implementations. More details can be found here.

flow diagram

Apart from more PyTorch native abstractions for developers, we want to highlight two benefits of this design for modeling users.

  1. Serialization: Save and load quantized weights into a state_dict just like a floating point model, eliminating the need to transform floating point model to quantized model before the quantized weights are loaded. This reduces friction of distributing and deploying quantized models.

  2. Composability: Seamless integration with downstream features like tensor parallel, allowing users to focus on modeling without worrying about compatibility with tensor parallel, torch.compile, and other PyTorch features. Since these features are implemented with Tensor level abstraction, users can quantize and do distributed inference with no model changes most of the time.

GemLite Kernel Integration

To achieve the aforementioned benefits for the GemLite kernel, we integrated GemLite into TorchAO. This integration takes advantage of GemLite’s wide support and flexibility to allow for weight-only quantization at 4 and 8 bits, under asymmetric and symmetric quantization schemes, 32- and 8-bit packing sizes, as well as grouped and ungrouped quantization. We enable this integration via the quantize_ API, which can be used alongside the GemLite constructor as follows:

quantize_(model, gemlite_uintx_weight_only(group_size, bit_width, packing_bitwidth))

The primary difficulty in creating this integration was making sure that the TorchAO composability guarantees were satisfied for the entire breadth of GemLite quantization kernel options. While the primary integration was relatively straightforward, making sure every quantization type and its associated kernels worked well with tensor parallel was non-trivial.

Torch Tensor Parallel

Tensor Parallelism is an effective way to speed up LLM inference. TP shards large matrices of linear or embedding modules onto multiple devices, typically in column-wise or row-wise styles. As the weight matrix gets distributed, computation is decomposed too. For example, the column-wise pattern below enables simultaneous matrix-vector multiply on four devices:

A x = [A_1; A_2; A_3; A_4] x = [A_1 x; A_2 x; A_3 x; A_4 x], where device i holds the shard A_i and computes A_i x in parallel

PyTorch implements TP by converting a regular tensor (e.g. matrix A) into a DTensor:

dtensor = _shard_tensor(mA, device_mesh, (Shard(0),))

Since DTensor stores meta information about the sharding, it knows how to reconstruct the full result when needed. Take Transformers’ feedforward module for example: as the up projection and down projection use column-wise and row-wise sharding respectively, DTensor will automatically perform an all-reduce on the ranks’ results as they move into the next operation. Such automation allows model authors to focus on computation without worrying about the communication needed for distributed execution.

Tensor Parallel and Quantization Order

Since both DTensor and quantization are tensor-level transformations, the application order matters for ensuring a workflow generally works across different setups. We have two observations: (i) checkpoints are typically saved in quantized formats, to avoid the quantization overhead before each run; and (ii) TP may run on a different number of devices, depending on resource constraints or service agreements. As such, we first apply quantization to the original tensor and save it to disk if reuse is desired. At service launch time, we load the quantized checkpoint and shard the tensors into DTensors on the fly as we load them into the model.
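
A minimal sketch of this order, assuming a CUDA bfloat16 model and TorchAO’s int4 weight-only config (the stand-in model and checkpoint path are illustrative assumptions):

import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()  # stand-in model

# 1. Quantize once and persist the quantized weights.
quantize_(model, int4_weight_only(group_size=64))
torch.save(model.state_dict(), "quantized_checkpoint.pt")

# 2. At service launch, load the quantized weights into a fresh model
#    (assign=True swaps in the quantized tensor subclasses); the tensors can
#    then be sharded into DTensors on the fly, so the TP degree may differ per run.
fresh = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()
fresh.load_state_dict(torch.load("quantized_checkpoint.pt"), assign=True)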

Tensor Parallel Support in TorchAO

Since we quantize the model first and then distribute the tensor, we’ll have DTensor(QuantizedTensor(weight)), where DTensor is the distributed Tensor class and QuantizedTensor is a quantized tensor class in TorchAO. QuantizedTensor should support the operators called when constructing a DTensor, including slice and view ops. To make sure the overall execution is efficient, slicing the packed weight along dimensions 0 and 1 should match the result of first slicing the unpacked weight and then packing it (the pack and slice operations should commute); otherwise, the packing format is not compatible with tensor parallelism.
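
The following toy sketch illustrates the commutativity requirement with a simplified scheme that packs pairs of int4 values into one uint8 byte; it is conceptual, not GemLite’s actual packing code:

import torch

def pack_int4(w: torch.Tensor) -> torch.Tensor:
    # Pack pairs of 4-bit values along the last dimension into uint8 bytes.
    lo, hi = w[..., 0::2], w[..., 1::2]
    return lo | (hi << 4)

w = torch.randint(0, 16, (8, 16), dtype=torch.uint8)

# Slicing along dimension 0 (rows) commutes with packing, because packing
# only touches the last dimension.
assert torch.equal(pack_int4(w)[:4], pack_int4(w[:4]))

# Slicing along dimension 1 commutes only when the slice is aligned to the
# packing width: 4 packed bytes correspond to 8 unpacked columns.
assert torch.equal(pack_int4(w)[:, :4], pack_int4(w[:, :8]))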

4. SGLang

SGLang is a fast serving framework for large language models and vision language models. It is known for its almost zero-overhead batch scheduler and fast constrained decoding. It is mainly implemented in Python, lightweight, and easy to hack. It is also one of the first frameworks to integrate torch.compile.

TorchAO integration in SGLang

We integrated the TorchAO quantize_ API, which applies a specific type of quantization to a model, into SGLang. So far, the integration supports int4 weight-only quantization (both the tinygemm and GemLite versions), float8 dynamic quantization, and a few other quantization types. Users can enable quantization by adding the --torchao-config argument to the benchmarking script. The currently enabled options also support tensor parallelism through composition with DTensor, which is enabled with the --tp-size option.

Torch Native Tensor Parallel Support in SGLang

Existing model definitions in SGLang use special linear modules that are coupled with the tensor parallelism style, for example: MergedColumnParallelLinear, QKVParallelLinear and RowParallelLinear. To decouple the model definition from the tensor parallelization style, we defined a PyTorch-native model that uses the plain nn.Linear module from PyTorch and relies on PyTorch tensor parallelism APIs for parallelization and torch.compile for speedup. At the relevant module hierarchies, we add a dictionary describing how a submodule should be parallelized. For example, in class LlamaAttention, we define:

_tp_plan = {
    "qkv_proj": "Colwise_Sharded",
    "o_proj": "Rowwise",
}

where "qkv_proj" and "o_proj" are the FQNs of the wqkv and wo projections, and the values are their TP styles.

We then define a TP engine in model_parallel.py. It searches for _tp_plan recursively within the model, and applies the indicated TP styles to the submodules using PyTorch’s parallelize_module API.
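
A hedged sketch of how such a plan could be applied with PyTorch’s tensor parallel APIs; the style mapping and setup below are illustrative assumptions rather than SGLang’s actual engine code:

import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Assumes torch.distributed is already initialized (e.g., via torchrun).
mesh = init_device_mesh("cuda", (4,))  # 4-way tensor parallelism

def apply_tp_plan(module: nn.Module, tp_plan: dict) -> None:
    # Map the plan's style names onto PyTorch parallel styles (assumed mapping).
    style_map = {"Colwise_Sharded": ColwiseParallel(), "Rowwise": RowwiseParallel()}
    parallelize_module(
        module,
        mesh,
        {name: style_map[style] for name, style in tp_plan.items()},
    )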

5. Results

The evaluation focused on two popular quantization techniques for H100 machines: int4 weight-only quantization and float8 dynamic quantization. These methods were chosen due to their widespread use in optimizing memory efficiency and computational performance on H100 machines, making them ideal candidates for benchmarking against various workloads.

  • int4 Weight-Only Quantization: This method significantly reduces memory footprint and accelerates decode for memory-bound workloads, with minimal impact on performance in compute-intensive scenarios like prefill or larger batch sizes. We present results for bf16, GemLite, and tinygemm kernels below, across various batch sizes and tensor parallel configurations
  • float8 Dynamic Quantization: While offering less memory savings, this method often provides higher accuracy and balanced speedups for both memory-bound and compute-bound tasks. With Hopper-grade hardware and native fp8 support, the efficient cutlass/cuBLAS kernels used by AO contribute to a significant speedup

The graphs below show the decode tokens/sec for different tp sizes, each graph shows the results across different batch sizes and for different types of quantization:

  • BF16 is our bfloat16, torch.compile’d baseline
  • tinygemm-4-64 is using int4_weight_only quantization in TorchAO, it’s a 4 bit groupwise quantization with group size of 64, using tinygemm kernel
  • gemlite-4-64 is using gemlite_uintx_weight_only quantization in TorchAO, 4 means 4 bit, and 64 is also the group size, using GemLite kernel
  • fp8dq-per_row is using float8_dynamic_activation_float8_weight quantization in TorchAO, both activation and weights are quantized with per row scales

bar chart

bar chart

bar chart

For int4 weight-only quantization, at batch size 1, the tinygemm kernel achieved the best performance. However, its efficiency declined with increasing batch sizes. Conversely, GemLite effectively bridged this gap, delivering superior performance at larger batch sizes. GemLite also achieved a 9–10x speedup during the prefill phase compared to tinygemm, despite ongoing performance optimizations constrained by Triton.

Float8 dynamic quantization showed 1.3x speedup over bfloat16 consistently with tensor parallel size 1 across different batch sizes and 1.1x to 1.2x speedup in larger tensor parallel sizes. As the tensor parallel size increases, the overall speedup decreases, which is expected due to the reduction in matmul size. Note that we do expect to get speedup for prefill as well, but since we rely on torch.compile for speedup and prefill compile is not enabled in SGLang yet, we will leave this for future work.

Repro Instructions

We conducted benchmarks on an 8xH100 machine using GemLite 0.4.1, SGLang built from commit feb2b76, TorchAO nightly 0.8.0.dev20241223+cu124, and PyTorch 2.5.1. The Llama-3.1 Instruct models were chosen as the architecture for evaluation.

BATCH_SIZE=16
# Note: gemlite is only compatible with float16
# while int4wo-64 (tinygemm-4-64 as shown in the graph) and fp8dq-per_row should use bfloat16
DTYPE=float16
# int4wo-64, fp8dq-per_tensor
TORCHAO_CONFIG=gemlite-4-64
TP_SIZE=2
# Decode performance
python3 -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --json-model-override-args '{"architectures": ["TorchNativeLlamaForCausalLM"]}' --dataset-name random --random-input 1024 --random-output 512 --random-range 1 --num-prompts $BATCH_SIZE --enable-torch-compile --dtype $DTYPE --torchao-config $TORCHAO_CONFIG --tp-size $TP_SIZE

# Example output
# Benchmark...
# [2024-12-20 12:42:16 TP0] Prefill batch. #new-seq: 2, #new-token: 2046, #cached-token: 4, cache hit rate: .06%, token usage: 0.00, #running-req: 0, #queue-req: 0
# ...
# [2024-12-20 12:45:35 TP0] Decode batch. #running-req: 16, #token: 16763, token usage: 0.01, gen throughput (token/s): 2.20, #queue-req: 0
# [2024-12-20 12:45:38 TP0] Decode batch. #running-req: 16, #token: 24443, token usage: 0.02, gen throughput (token/s): 2739.89, #queue-req: 0

# We reported the last throughput (token/s) as the performance for decode

Conclusion

With performant and extensible kernels from GemLite, the PyTorch-native architecture optimization library TorchAO, and the high-performance inference framework SGLang, we showcased fast end-to-end quantized inference for both int4 and float8 across different batch sizes and tensor parallel sizes, with simple and composable user APIs that reduce the resource requirements for LLMs. This integration is our first step toward meeting the needs of fast inference across different models, workloads, precisions, and hardware, and we look forward to continuing to advance the state of the art for end-to-end mixed and low-precision LLM inference.

Our immediate future work focuses on the following:

  • Exploring diverse combinations of weight and activation quantization to strike the best balance between speed and accuracy
  • Extending support to additional GPU architectures to broaden accessibility
  • Enhancing compatibility with MoE models to address growing demands in scalable inference
  • Allow for easy integration of fast custom kernels in TorchAO so that they can be easily leveraged by SGLang and other inference frameworks
  • While we didn’t measure the accuracy impact in this blogpost, we can develop an auto-quantization tool in TorchAO to allow users to trade off between performance and accuracy
  • Better integration with tensor parallelism in SGLang to support running larger models
  • Enable torch.compile for prefill phase in SGLang

We also invite the community to actively test, provide feedback, and contribute to shaping the future of fast and efficient LLM inference.

Read More