Build an ecommerce product recommendation chatbot with Amazon Bedrock Agents

Many ecommerce applications want to provide their users with a human-like chatbot that guides them to choose the best product as a gift for their loved ones or friends. To enhance the customer experience, the chatbot needs to engage in a natural, conversational manner to understand the user’s preferences and requirements, such as the recipient’s gender, the occasion for the gift, and the desired product category. Based on the discussion with the user, the chatbot should be able to query the ecommerce product catalog, filter the results, and recommend the most suitable products.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Amazon Bedrock Agents is a feature that enables generative AI applications to run multistep tasks across company systems and data sources. In this post, we show you how to build an ecommerce product recommendation chatbot using Amazon Bedrock Agents and FMs available in Amazon Bedrock.

Solution overview

Traditional rule-based chatbots often struggle to handle the nuances and complexities of open-ended conversations, leading to frustrating experiences for users. Furthermore, manually coding all the possible conversation flows and product filtering logic is time-consuming and error-prone, especially as the product catalog grows.

To address this challenge, you need a solution that uses the latest advancements in generative AI to create a natural conversational experience. The solution should seamlessly integrate with your existing product catalog API and dynamically adapt the conversation flow based on the user’s responses, reducing the need for extensive coding.

With Amazon Bedrock Agents, you can build intelligent chatbots that can converse naturally with users, understand their preferences, and efficiently retrieve and recommend the most relevant products from the catalog. Amazon Bedrock Agents simplifies the process of building and deploying generative AI models, enabling businesses to create engaging and personalized conversational experiences without the need for extensive machine learning (ML) expertise.

For our use case, we create a recommender chatbot using Amazon Bedrock Agents that prompts users to describe who they want to buy the gift for and the relevant occasion. The agent queries the product information stored in an Amazon DynamoDB table, using an API implemented as an AWS Lambda function. The agent adapts the API inputs to filter products based on its discussion with the user, for example gender, occasion, and category. After obtaining the user’s gift preferences by asking clarifying questions, the agent responds with the most relevant products that are available in the DynamoDB table based on user preferences.

The following diagram illustrates the solution architecture.

ecommerce recommender chatbot architecture

As shown in the preceding diagram, the ecommerce application first uses the agent to drive the conversation with users and generate product recommendations. The agent uses an API backed by Lambda to get product information. Lastly, the Lambda function looks up product data from DynamoDB.

Prerequisites

You need to have an AWS account with a user or role that has at minimum the following AWS Identity and Access Management (IAM) policies and permissions:

  • AWS managed policies:
    • AmazonBedrockFullAccess
    • AWSMarketplaceManageSubscriptions
    • AWSLambda_ReadOnlyAccess
    • AmazonDynamoDBReadOnlyAccess
  • IAM actions:
    • iam:CreateRole
    • iam:CreatePolicy
    • iam:AttachRolePolicy

Deploy the solution resources with AWS CloudFormation

Before you create your agent, you need to set up the product database and API. We use an AWS CloudFormation template to create a DynamoDB table to store product information and a Lambda function to serve as the API for retrieving product details.

At the time of writing this post, you can use any of the following AWS Regions to deploy the solution: US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai, Sydney), Europe (Frankfurt, Paris), Canada (Central), or South America (São Paulo). Visit Supported regions and models for Amazon Bedrock Agents for updates.

To deploy the template, choose Launch Stack:

Launch Stack to create solution resources

This template creates a DynamoDB table named Products with the following attributes: product_name (partition key), category, gender, and occasion. It also defines a global secondary index (GSI) for each of these attributes to enable efficient querying.

Additionally, the template sets up a Lambda function named GetProductDetailsFunction that acts as an API for retrieving product details. This Lambda function accepts query parameters such as category, gender, and occasion. It constructs a filter expression based on the provided parameters and scans the DynamoDB table to retrieve matching products. If no parameters are provided, it retrieves all the products in the table and returns the first 100 products.
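
The deployed function’s exact code ships with the CloudFormation template; the following is a minimal sketch of the described filtering logic, assuming the table and attribute names above (the event parsing and response formatting follow the Amazon Bedrock Agents Lambda contract, but details may differ from the actual template):

import json
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("Products")

def lambda_handler(event, context):
    # Parameters sent by the agent, e.g. [{"name": "category", "type": "string", "value": "Electronics"}, ...]
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    # Build a filter expression from whichever of the supported parameters were provided
    filter_expr = None
    for attr in ("category", "gender", "occasion"):
        if params.get(attr):
            condition = Attr(attr).eq(params[attr])
            filter_expr = condition if filter_expr is None else filter_expr & condition

    # Scan the table (with the filter if one was built) and cap the result at 100 items
    scan_kwargs = {"FilterExpression": filter_expr} if filter_expr is not None else {}
    items = table.scan(**scan_kwargs).get("Items", [])[:100]

    # Return the result in the response format expected by Amazon Bedrock Agents
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": 200,
            # default=str handles DynamoDB Decimal values during serialization
            "responseBody": {"application/json": {"body": json.dumps({"products": items}, default=str)}},
        },
    }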

The template also creates another Lambda function called PopulateProductsTableFunction that generates sample data to store in the Products table. The CloudFormation template includes a custom resource that runs PopulateProductsTableFunction one time as part of the template deployment to add 100 sample product entries to the Products DynamoDB table, with various combinations of product names, descriptions, categories, genders, and occasions.
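
For reference, generating sample items like these could look roughly like the following sketch; the attribute values below are illustrative placeholders, not the ones the template actually writes:

import random
import boto3

table = boto3.resource("dynamodb").Table("Products")

# Hypothetical value pools; the template's actual sample data will differ
categories = ["Electronics", "Books", "Jewelry", "Toys"]
genders = ["Male", "Female", "Unisex"]
occasions = ["Birthday", "Graduation", "Anniversary", "Valentine's Day"]

with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(Item={
            "product_name": f"Sample Product {i}",
            "description": "Sample product generated for testing",
            "category": random.choice(categories),
            "gender": random.choice(genders),
            "occasion": random.choice(occasions),
        })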

You can optionally update the sample product entries or replace them with your own product data. To do so, open the DynamoDB console, choose Explore items, and select the Products table. Choose Scan and choose Run to view and edit the current items, or choose Create item to add a new item. If your data has different attributes than the sample product entries, you need to adjust the code of the Lambda function GetProductDetailsFunction, the OpenAPI schema, and the instructions for the agent that are used in the following section.

Create the agent

Now that you have the infrastructure in place, you can create the agent. The first step is to request model access.

  1. On the Amazon Bedrock console, choose Model access in the navigation pane.
  2. Choose Enable specific models.

Model Access Enable specific model

  3. Select the model you need access to (for this post, we select Claude 3 Sonnet).

edit model access page and select claude 3 sonnet

Wait for the model access status to change to Access granted.

Model access granted

Now you can create your agent. We use a CloudFormation template to create the agent and the action group that will invoke the Lambda function.

  1. To deploy the template, choose Launch Stack:

Launch Stack to create Agent

Now you can check the details of the agent that was created by the stack.

  1. On the Amazon Bedrock console, choose Agents under Builder tools in the navigation pane.
  2. Choose the agent product-recommendation-agent, then choose Edit in Agent Builder.
  3. The Instructions for the Agent section includes a set of instructions that guides the agent in how to communicate with the user and use the API. You can adjust the instructions based on different use cases and business scenarios as well as the available APIs.

The agent’s primary goal is to engage in a conversation with the user to gather information about the recipient’s gender, the occasion for the gift, and the desired category. Based on this information, the agent will query the Lambda function to retrieve and recommend suitable products.

Your next step is to check the action group that enables the agent to invoke the Lambda function.

  1. In the Action groups section, choose the Get-Product-Recommendations action group.

You can see the GetProductDetailsFunction Lambda function is selected in the Action group invocation section.

action group invocation details

In the Action group schema section, you can see the OpenAPI schema, which enables the agent to understand the description, inputs, outputs, and the actions of the API that it can use during the conversation with the user.

action group schema

Now you can use the Test Agent pane to have conversations with the chatbot.

Test the chatbot

The following screenshots show example conversations, with the chatbot recommending products after calling the API.

Agent Test sample for a gift for brother graduation

Agent Test sample for a gift for wife on Valentine's Day

In the sample conversation, the chatbot asks relevant questions to determine the gift recipient’s gender, the occasion, and the desired category. After it has gathered enough information, it queries the API and presents a list of recommended products matching the user’s preferences.

You can see the rationale for each response by choosing Show trace. The following screenshots show how the agent decided to use different API filters based on the discussion.

show trace and rationale

Another show trace and rationale

You can see in the rationale field how the agent made its decision for each interaction. This trace data can help you understand the reasons behind a recommendation. Logging this information can be beneficial for future refinements of your agent’s recommendations.

Clean up

Complete the following steps to clean up your resources:

  1. On the AWS CloudFormation console, delete the stack AgentStack.
  2. Then delete the stack Productstableandapi.

Conclusion

This post showed you how to use Amazon Bedrock Agents to create a conversational chatbot that can assist users in finding the perfect gift. The chatbot intelligently gathers user preferences, queries a backend API to retrieve relevant product details, and presents its recommendations to the user. This approach demonstrates the power of Amazon Bedrock Agents in building engaging and context-aware conversational experiences.

We recommend you follow best practices while using Amazon Bedrock Agents. For instance, using AWS CloudFormation to create and configure the agent allows you to minimize human error and recreate the agent across different environments and Regions. Also, automating your agent testing using a set of golden questions and their expected answers enables you to test the quality of the instructions for the agent and compare the outputs of the different models on Amazon Bedrock in relation to your use case.
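
For example, a minimal golden-question test harness could look like the following sketch, which uses the boto3 InvokeAgent API; the agent ID, alias ID, questions, and expected keywords are placeholders you would supply:

import uuid
import boto3

client = boto3.client("bedrock-agent-runtime")

# Hypothetical golden questions with keywords expected in the answer; replace with your own test set
golden_set = [
    ("I need a gift for my brother's graduation", ["graduation"]),
]

def ask_agent(agent_id, alias_id, question):
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=str(uuid.uuid4()),
        inputText=question,
    )
    # The completion is returned as an event stream of chunks
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

for question, expected_keywords in golden_set:
    answer = ask_agent("YOUR_AGENT_ID", "YOUR_AGENT_ALIAS_ID", question)
    passed = all(k.lower() in answer.lower() for k in expected_keywords)
    print(f"{question} -> {'PASS' if passed else 'FAIL'}")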

Visit Amazon Bedrock Agents to learn more about features and details.


About the Author

Mahmoud Salaheldin is a Senior Solutions Architect in AWS, working with customers in the Middle East, North Africa, and Turkey, where he helps enterprises, digital-centered businesses, and independent software vendors innovate new products that can enhance their customer experience and increase their business efficiency. He is a generative AI ambassador as well as a containers community member. He lives in Dubai, United Arab Emirates, and enjoys riding motorcycles and traveling.

Read More

How Thomson Reuters Labs achieved AI/ML innovation at pace with AWS MLOps services

This post is co-written by Danilo Tommasina and Andrei Voinov from Thomson Reuters.

Thomson Reuters (TR) is one of the world’s most trusted information organizations for businesses and professionals. TR provides companies with the intelligence, technology, and human expertise they need to find trusted answers, enabling them to make better decisions more quickly. TR’s customers span the financial, risk, legal, tax, accounting, and media markets.

Thomson Reuters Labs (TR Labs) is the dedicated applied research division within TR. TR Labs is focused on the research, development, and application of artificial intelligence (AI) and emerging trends in technologies that can be infused into existing TR products or new offerings. TR Labs works collaboratively with various product teams to experiment, prototype, test, and deliver AI-powered innovation in pursuit of smarter and more valuable tools for our customers. The TR Labs team includes over 150 applied scientists, machine learning specialists, and machine learning engineers.

In this post, we explore how TR Labs was able to develop an efficient, flexible, and powerful MLOps process by adopting a standardized MLOps framework that uses Amazon SageMaker, SageMaker Experiments, SageMaker Model Registry, and SageMaker Pipelines. The goal is to accelerate how quickly teams can experiment and innovate using AI and machine learning (ML), whether using natural language processing (NLP), generative AI, or other techniques. We discuss how this has helped decrease the time to market for fresh ideas and helped build a cost-efficient machine learning lifecycle. Lastly, we go through the MLOps toolchain that TR Labs built to standardize the MLOps process for developers, scientists, and engineers.

The challenge

Machine learning operations (MLOps) is the combination of people, processes, and technology required to gain business value from machine learning. An MLOps practice is essential for an organization with large teams of ML engineers and data scientists. Correctly using AI/ML tools to increase productivity directly influences the efficiency and cost of development. TR Labs was founded in 1992 with a vision to be a world-leading AI/ML research and development practice, forming the core innovation team that works alongside the tax, legal, and news divisions of TR to ensure that their offerings remain at the cutting edge of their markets.

TR Labs started off as a small team in its early days, with a directive to spearhead ML innovation to help the company in various domains, including but not limited to text summarization, document categorization, and various other NLP tasks. The team made remarkable progress from an early stage, with AI/ML models being integrated into TR’s products and internal editorial systems to help with efficiency and productivity.

However, as the company grew, so did the team’s size and task complexity. The team had grown to over 100 people, and they were facing new challenges. Model development and training processes were becoming increasingly complex and challenging to manage. The team had different members working on different use cases, and therefore, models. Each researcher also had their own way of developing the models. This led to a situation where there was little standardization in the process for model development. Each researcher needed to configure all the underlying resources manually, and a large amount of boilerplate code was created in parallel by different teams. A significant portion of time was spent on tasks that could be performed more efficiently.

The TR Labs leadership recognized that the existing MLOps process wasn’t scalable and needed to be standardized. It lacked sufficient automation and assistance for those new to the platform. The idea was to take well architected practices for ML model development and operations and create a customized workflow specific to Labs that uses Amazon Web Services (AWS). The vision was to harmonize and simplify the model development process and accelerate the pace of innovation. They also aimed to set the path to quickly mature research and development solutions into an operational state that would support a high degree of automation for monitoring and retraining.

In this post, we will focus on the MLOps process parts involved in the research and model development phases.

The overview section will take you through the innovative solution that TR Labs created and how it helped lower the barrier to entry and increase the adoption of AI/ML services for new ML users on AWS, while decreasing time to market for new projects.

Solution overview

The existing ML workflow required a TR user to start from scratch every time they started a new project. Research teams would have to familiarize themselves with the TR Labs standards and deploy and configure the entire MLOps toolchain manually, with little automation in place. Inconsistent practices within the research community meant extra work was needed to align with production-grade deployments. Many research projects had to be refactored when handing code over to MLOps engineers, who often had to reverse engineer the code to achieve a similar level of functionality and make it ready to deploy to production. The team had to create an environment where researchers and engineers work on one shared codebase and use the same toolchain, reducing the friction between experimentation and production stages. A shared codebase is also a key element for long-term maintenance: changes to the existing system should be integrated directly in the production-level code and not reverse engineered and re-merged out of a research repository into the production codebase. This is an anti-pattern that leads to large costs and risks over time.

Regardless of the chosen model architecture, or even if the chosen model is a third-party large language model (LLM) used without any fine-tuning, a robust ML system requires validation on a relevant dataset. There are multiple testing methods. For example, a team might start with zero-shot learning, a machine learning technique that allows a model to classify objects from previously unseen classes without receiving any specific training for those classes, and later introduce fine-tuning to improve the model’s performance. How many iterations are necessary to obtain the expected initial quality, and to maintain or even improve that level over time, depends on the use case and the model type being developed. However, when thinking about long-term systems, teams go through tens or even hundreds of repetitions. These repetitions contain several recurring steps, such as pre-processing, training, and post-processing, which are similar, if not the same, no matter which approach is taken. Repeating the process manually, without following a harmonized approach, is also an anti-pattern.

This process inefficiency presented an opportunity to create a coherent set of MLOps tools that would enforce TR Labs standards for how to configure and deploy SageMaker services and expose these MLOps capabilities to a user by providing standard configuration and boilerplate code. The initiative was named TR MLTools and joined several MLOps libraries developed in TR Labs under one umbrella. Under this umbrella, the team provided a command line interface (CLI) tool that would support a standard project structure and deliver boilerplate code abstracting the underlying infrastructure deployment process and promoting a standardized TR ML workflow.

MLTools and MLTools CLI were designed to be flexible and extendable while incorporating a TR Labs-opinionated view on how to run MLOps in line with TR enterprise cloud platform standards.

MLTools CLI

MLTools CLI is a Python package and a command-line tool that promotes the standardization of TR Labs ML experiments workflow (ML model development, training, and so on) by providing code and configuration templates directly into the users’ code repository. At its core, MLTools CLI aims to connect all ML experiment-related entities (Python scripts, Jupyter notebooks, configuration files, data, pipeline definitions, and so on) and provide an easy way to bootstrap new experiments, conduct trials, and run user-defined scripts, testing them locally and remotely running them at scale as SageMaker jobs.

MLTools CLI is added as a development dependency to a new or existing Python project, where code for the planned ML experiments will be developed and tracked, for example in GitHub. As part of an initial configuration step, this source-code project is associated with specific AI Platform Machine Learning Workspaces. The users can then start using the MLTools CLI for running their ML experiments using SageMaker capabilities like Processing and Training jobs, Experiments, Pipelines, and so on.

Note: AI Platform Workspaces is an internal service, developed in TR, that provides secure access to Amazon Simple Storage Service (Amazon S3)-hosted data and AWS resources like SageMaker or SageMaker Studio Notebook instances for our ML researchers. You can find more information about the AI Platform Workspaces in this AWS blog: How Thomson Reuters built an AI platform using Amazon SageMaker to accelerate delivery of ML projects.

MLTools CLI acts effectively as a frontend or as a delivery channel for the set of capabilities (libraries, tools, and templates) that TR collectively refers to as MLTools. The following diagram shows a typical TR Labs ML experiments workflow, with a focus on the role of MLTools and MLTools CLI:

MLTools CLI offers various templates that can be generated using a command-line, including the following:

  • Separate directory structure for new ML experiments and experiment trials.
  • Script templates for launching SageMaker processing, training, and batch transform jobs.
  • Complete experiment pipeline template based on SageMaker Pipeline, with user scripts as steps.
  • Docker image templates for packaging user scripts. For example, for delivery to production.

MLTools CLI also provides the following features to support effective ML experiments:

  • User scripts can be run directly as SageMaker jobs without the need to build Docker images.
  • Each experiment runs in a sandboxed Poetry environment and can have its own code package and dependency tree.
  • The main, project-level code package is shared and can be used by all project experiments and user scripts code, allowing re-use of common code with no copy-paste.
  • Context-aware API resolves and loads experiment and trial metadata based on the current working directory.
  • Created AWS resources are automatically tagged with the experiment metadata.
  • Utilities to query these experiment-related AWS resources are available.

ML experiment workflow

After MLTools CLI is installed and configured on a laptop or notebook instance, a user can begin ML experimentation work. The first step is to create a new experiment using the MLTools CLI create-experiment command:

> mltools-cli create-experiment --experiment-name my-demo-experiment

An experiment template is generated in a sub-directory of the user’s project. The generated experiment folder has a standard structure, including the initial experiment’s configuration, a sandboxed Poetry package, and sample Jupyter notebooks to help quickly bootstrap new ML experiments:

experiments
└── my_demo_experiment
    ├── data
    ├── notebooks
    ├── scripts
    ├── src
    │   └── my_demo_experiment
    │       └── __init__.py
    ├── config.yaml
    ├── poetry.toml
    ├── pyproject.toml
    ├── README.md
    └── setup.sh

The user can then create script templates for the planned ML experiment steps:

> cd experiments/my_demo_experiment
> mltools-cli create-script --script-name preprocess --job-config PROCESS
> mltools-cli create-script --script-name train --job-config TRAIN
> mltools-cli create-script --script-name evaluate --job-config INFERENCE

Generated script templates are placed under the experiment directory:

experiments
└── my_demo_experiment
    ├── ...
    └── scripts
        ├── evaluate
        │   ├── evaluate.py
        │   ├── evaluate_job.py
        │   └── requirements.txt
        ├── preprocess
        │   ├── preprocess.py
        │   ├── preprocess_job.py
        │   └── requirements.txt
        └── train
            ├── train.py
            ├── train_job.py
            └── requirements.txt

Script names should be short and unique within their parent experiment, because they’re used to generate standardized AWS resource names. Script templates are supplemented by a job configuration for a specific type of job, as specified by the user. Templates and configurations for SageMaker processing, training, and batch transform jobs are currently supported by MLTools—these offerings will be expanded in the future. A requirements.txt file is also included where users can add any dependencies required by the script code to be automatically installed by SageMaker at runtime. The script’s parent experiment and project packages are added to the requirements.txt by default, so the user can import and run code from the whole project hierarchy.

The user would then proceed to add or adapt code in the generated script templates. Experiment scripts are ordinary Python scripts that contain common boilerplate code to give users a head start. They can be run locally while adapting and debugging the code. After the code is working, the same scripts can be launched directly as SageMaker jobs. The required SageMaker job configuration is defined separately in a <script_name>_job.py file, and job configuration details are largely abstracted from the notebook experiment code. As a result, an experiment script can be launched as a SageMaker job with a few lines of code:
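
(The snippet referenced below appears as an image in the original post. The following is an illustrative reconstruction based on the API names described in the next paragraphs; the import path, exact signatures, and hyperparameter values are assumptions.)

# NOTE: illustrative reconstruction of the MLTools usage described in this post
from mltools import load_experiment  # import path is an assumption

# Load the experiment context based on the current working directory
experiment = load_experiment()

# Load the job context for the "train" script; dependent experiment and project
# packages are built and packaged automatically
job = experiment.load_job("train")

# Launch the script as a SageMaker training job, overriding selected defaults:
# custom hyperparameters and SageMaker local mode (local=True)
job.run(
    hyperparameters={"epochs": 3, "learning_rate": 2e-5},  # hypothetical values
    local=True,
)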

Let’s explore the previous code snippet in detail.

First, the MLTools experiment context is loaded based on the current working directory using the load_experiment() factory method. The experiment context concept is a central point of the MLTools API. It provides access to the experiment’s user configuration, the experiment’s scripts, and the job configuration. All project experiments are also integrated with the project-linked AI Platform workspace and therefore have access to the resources and metadata of this workspace. For example, the experiments can access the workspace AWS Identity and Access Management (IAM) role, S3 bucket and its default Amazon Elastic Container Registry (Amazon ECR) repository.

From the experiment, a job context can be loaded, providing one of the experiment’s script names—load_job("train") in this instance. During this operation, the job configuration is loaded from the script’s <script_name>_job.py module. Also, if the script code depends on the experiment or the project packages, they’re automatically built (as Python wheels) and pre-packaged together with the script code, ready to be uploaded to S3.

Next, the training script is launched as a SageMaker training job. In the background, the MLTools factory code ensures that the respective SageMaker estimator or processor instances are created with the default configuration and conform to the rules and best practices accepted in TR. This includes naming conventions, virtual private cloud (VPC) and security configurations, and tagging. Note that SageMaker local mode is fully supported (set in the example by local=True) while its specific configuration details are abstracted from the code. Although the externalized job configuration provides all the defaults, these can be overwritten by the user. In the previous example, custom hyperparameters are provided.

SageMaker jobs that were launched as part of an experiment can be listed directly from the notebook using the experiment’s list_training_jobs() and list_processing_jobs() utilities. SageMaker ExperimentAnalytics data is also available for analysis and can be retrieved by calling the experiment’s experiment_analytics() method.

Integration with SageMaker Experiments

For every new MLTools experiment, a corresponding entity is automatically created in SageMaker Experiments. Experiment names in SageMaker are standardized and made unique by adding a prefix that includes the associated workspace ID and the root commit hash of the user repository. For any job launched from within an MLTools experiment context (that is by using job.run() as shown in the preceding code snippet), a SageMaker Experiments Run instance is created and the job is automatically launched within the SageMaker Experiments Run context. This means all MLTools job runs are automatically tracked in SageMaker Experiments, ensuring that all job run metadata is recorded. This also means that users can then browse their experiments and runs directly in the experiments browser in SageMaker Studio, create visualizations for analysis, and compare model metrics, among other tasks.
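
For readers unfamiliar with SageMaker Experiments, the pattern that MLTools automates corresponds roughly to the following public SageMaker Python SDK usage; the experiment and run names here are illustrative, not MLTools internals:

from sagemaker.experiments.run import Run

# Any SageMaker job launched inside this context is associated with the run,
# so its parameters and metrics appear in the SageMaker Studio experiments browser
with Run(
    experiment_name="ws1234-abcdef0-my-demo-experiment",  # illustrative standardized name
    run_name="train-run-1",
) as run:
    run.log_parameter("learning_rate", 2e-5)
    run.log_metric(name="eval:accuracy", value=0.91, step=1)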

As shown in the following diagram, the MLTools experiment workflow is fully integrated with SageMaker Experiments:

Integration with SageMaker Pipelines

Some of the important factors that make ML experiments scalable are their reproducibility and their operationalization level. To support this, MLTools CLI provides users with a capability to add a template with boilerplate code that links the steps of their ML experiment into a deployable workflow (pipeline) that can be automated and delivers reproducible results. The MLTools experiment pipeline implementation is based on Amazon SageMaker Pipelines. The same experiment scripts that might have been run and tested as standalone SageMaker jobs can naturally form the experiment pipeline steps.

MLTools currently offers the following standard experiment pipeline template:

We made a deliberate design decision to offer a simple, linear, single-model experiment pipeline template with well-defined standard steps. Oftentimes, our project work on multi-model solutions involves an ensemble of ML models that might ultimately be trained on the same set of training data. In such cases, pipelines with more complex flows, or even integrated multi-model experiment pipelines, can be perceived as more efficient. Nevertheless, from a reproducibility and standardization standpoint, a decision to develop a customized experiment pipeline needs to be justified and is generally better suited for the later stages of ML operations, where efficient model deployment might be a factor.

On the other hand, using the standard MLTools experiment pipeline template, users can create and start running their experiment pipelines in the early stages of their ML experiments. The underlying pipeline template implementation allows users to easily configure and deploy partial pipelines where only some of the defined steps are implemented. For example, a user can start with a pipeline that only has a single step implemented, such as a DataPreparation step, then add ModelTraining and ModelEvaluation steps and so on. This approach aligns well with the iterative nature of ML experiments and allows for gradually creating a complete experiment pipeline as the ML experiment itself matures.
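
Using the public SageMaker Python SDK directly, a partial pipeline containing only a DataPreparation step could be assembled roughly as follows; the MLTools template generates the equivalent configuration, and the image URI, role, and script path below are placeholders:

from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

# Processor that runs the experiment's preprocessing script
processor = ScriptProcessor(
    image_uri="<processing-image-uri>",
    command=["python3"],
    role="<execution-role-arn>",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

data_preparation = ProcessingStep(
    name="DataPreparation",
    processor=processor,
    code="scripts/preprocess/preprocess.py",
)

# Start with a single-step pipeline; ModelTraining and ModelEvaluation steps can be added later
pipeline = Pipeline(name="my-demo-experiment-pipeline", steps=[data_preparation])
pipeline.upsert(role_arn="<execution-role-arn>")
pipeline.start()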

As shown in the following diagram, MLTools allows users to deploy and run their complete experiment pipelines based on SageMaker Pipelines integrated with SageMaker Model Registry and SageMaker Studio.

Results and future improvements

TR Labs’s successful creation of the MLTools toolchain helps to standardize the MLOps framework throughout the organization and provides several benefits—the first of these is faster model development times. With a consistent process, team members can now work more efficiently by using project templates that deliver a modular setup, facilitating all phases of the ML development process. The structure delivers out-of-the-box integration with TR’s AWS-based AI Platform and the ability to switch between phases of the development including research and data analysis, running experiments at scale, and delivering end-to-end ML pipeline automation. This allows the team to focus on the critical aspects of model development while technicalities are handled and provisioned in advance.

The toolchain is designed to support a close collaboration between researchers and engineers who can work on different aspects of an ML delivery while sharing a codebase that follows software development best practices.

By following a standardized MLOps process, the TR Labs team can also quickly identify issues and model performance drifts more efficiently. It becomes easier to pinpoint where errors are occurring and how to fix them. This can help to reduce downtime and improve the overall efficiency of the development and maintenance processes. The standardized process also ensures that researchers working in model development are using the same environment as ML engineers. This leads to a more efficient transition from ideation and development to deploying the output as models in production and entering the maintenance phase.

Standardizing the MLOps platform has also led to cost savings through efficiencies. With a defined process, the team can reduce the time and resources required to develop and deploy models. This leads to cost savings in the long run, making the development, and particularly the long-term maintenance processes, more cost-effective.

A difficulty the team observed was in measuring how much the toolchain improved time to market and reduced costs. Thoroughly evaluating this would require a dedicated study in which independent teams work on the same use cases with and without the toolchain and the results are compared. Such a study would be very costly and would still contain a high degree of imprecision, because there are subjective components and possibly different approaches teams could take to solve the same problem.

The TR Labs team found an alternate solution for how to measure success. At a yearly interval we run an assessment with the userbase of the toolchain. The assessment covers a variety of aspects ranging over the entire AI/ML lifecycle. Toolchain users are asked to provide subjective assessments on how much of their development time is considered “wasted” on infrastructure issues, configuration issues, or manual tasks that are repetitive. Other questions cover the level of satisfaction with the current toolchain and the perceived improvement in productivity comparing current and past work without the toolchain or earlier versions of the toolchain. The resulting values are averaged over the entire userbase, which includes a mix of job roles ranging from engineers to data scientists to researchers.

The reduction of time spent on inefficiencies, the increase in perceived productivity, and user satisfaction can be used to compute the approximate monetary savings, improvement in code quality, and reduction in time to market. These combined factors contribute to user satisfaction and improvement in the retention of talent within the ML community at TR.

As a measure of success, the TR Labs team was able to achieve reductions in accumulated time spent on inefficiencies, which it found range between 3 and 5 days per month per person. Measuring the impact over a period of 12 months, TR has seen improvements of up to 40 percent in perceived productivity in several areas of the lifecycle and a measurable increase in user satisfaction. These numbers are based on what the users of the toolchain reported in the self-assessments.

Conclusion

A standardized MLOps framework can lead to a reduction in bugs, faster model development times, faster troubleshooting of issues, faster reaction to model performance drifts, and cost savings gained through a more efficient end-to-end machine learning process that facilitates experimentation and model creation at scale. By adopting a standardized MLOps framework that uses Amazon SageMaker, SageMaker Experiments, SageMaker Model Registry, and SageMaker Pipelines, TR Labs was able to ensure that their machine learning models were developed and deployed efficiently and effectively. This has resulted in a faster time to market and accelerated business value through development.

To learn more about how AWS can help you with your AI/ML and MLOps journey, see What is Amazon SageMaker.


About the Authors

Andrei Voinov is a Lead Software Engineer at Thomson Reuters (TR). He is currently leading a team of engineers in TR Labs with the mandate to develop and support capabilities that help researchers and engineers in TR to efficiently transition ML projects from inception, through research, integration, and delivery into production. He brings over 25 years of experience with software engineering in various sectors and extended knowledge both in the cloud and ML spaces.

Danilo Tommasina is a Distinguished Engineer at Thomson Reuters (TR). He has over 20 years of experience in technology roles, ranging from Software Engineer to Director of Engineering and now Distinguished Engineer. As a passionate generalist, proficient in multiple programming languages, cloud technologies, and DevOps practices, and with engineering knowledge in the ML space, he has contributed to the scaling of TR Labs’ engineering organization. He is also a big fan of automation, including but not limited to MLOps processes and Infrastructure as Code principles.

Simone Zucchet is a Manager of Solutions Architecture at AWS. With close to a decade of experience as a Cloud Architect, Simone enjoys working on innovative projects that help transform the way organizations approach business problems. He helps support large enterprise customers at AWS and is part of the Machine Learning TFC. Outside of his professional life, he enjoys working on cars and photography.

Jeremy Bartosiewicz is a Senior Solutions Architect at AWS with over 15 years of experience working in technology in multiple roles. Coming from a consulting background, Jeremy enjoys working on a multitude of projects that help organizations grow using cloud solutions. He helps support large enterprise customers at AWS and is part of the Advertising and Machine Learning TFCs.

Read More

Manufacturing Intelligence: Deltia AI Delivers Assembly Line Gains With NVIDIA Metropolis and Jetson

It all started at Berlin’s Merantix venture studio in 2022, when Silviu Homoceanu and Max Fischer agreed AI could play a big role in improving manufacturing. So the two started Deltia.ai, which runs NVIDIA Metropolis vision AI on NVIDIA Jetson AGX Orin modules to measure and help optimize assembly line processes.

Hailing from AI backgrounds, Homoceanu had previously led self-driving software at Volkswagen, while Fischer had founded a startup that helped digitize more than 40 factories.

Deltia, an NVIDIA Metropolis partner, estimates that today its software platform can provide as much as a 20% performance jump on production lines for its customers.

Customers using the Deltia platform include Viessman, a maker of heating pumps, and industrial electronics company ABB, among others. Viessman is running Deltia at 15 stations, and plans to add it to even more lines in the future. Once all lines are linked to Deltia, production managers say that they expect up to a 50% increase in overall productivity.

“We provide our users with a dashboard that is basically the Google Analytics of manufacturing,” said Homoceanu, Deltia’s CTO. “We install these sensors, and two weeks later they get the keys to this dashboard, and the magic happens in the background.”

Capturing Assembly Line Insights for Digital Transformations  

Once the cameras start gathering data on assembly lines, Deltia uses that information to train models on NVIDIA-accelerated computing that can monitor activities on the line. It then uses those models deployed on Jetson AGX Orin modules at the edge to gather operational insights.

These Jetson-based systems continuously monitor the camera streams and extract metadata. This metadata identifies the exact points in time when a product arrives at a specific station, when it is being worked on and when it leaves the station. This digital information is available to line managers and process improvement personnel via Deltia’s custom dashboard, helping to identify bottlenecks and accelerate line output.

“TensorRT helps us compress complex AI models to a level where we can serve, in an economical fashion, multiple stations with a single Jetson device,” said Homoceanu.

Tapping Into Jetson Orin for Edge AI-Based Customer Insights 

Beyond identifying quick optimizations, Deltia’s analytics help visualize production flows hour-by-hour. This means that Deltia can send rapid alerts when production slips away from predicted target ranges, and it can continuously track output, cycle times and other critical key performance indicators.

It also helps map how processes flow throughout a factory floor, and it suggests improvements for things like walking routes and shop-floor layouts. One of Deltia’s customers used the platform to identify that materials shelves were too far from workers, which caused unnecessarily long cycle times and limited production. Once the shelves were moved, production went up more than 30%.

Deltia’s applications extend beyond process improvements. The platform can be used to help monitor machine states at a granular level, assisting to predict when machine parts are worn out and recommend preemptive replacements, saving time and money down the line. The platform can also suggest optimizations for energy usage, saving on operational costs and reducing maintenance expenses.

“Our vision is to empower manufacturers with the tools to achieve unprecedented efficiency,” said Fischer, CEO of Deltia.ai. “Seeing our customers experience as much as a 30% increase in productivity with our vision models running on NVIDIA Jetson Orin validates the transformative potential of our technology.”

Deltia is a member of the NVIDIA Inception program for cutting-edge startups.

Learn more about NVIDIA Metropolis and NVIDIA Jetson.

Read More

Hammer Time: Machina Labs’ Edward Mehr on Autonomous Blacksmith Bots and More

Edward Mehr works where AI meets the anvil.  The company he cofounded, Machina Labs, blends the latest advancements in robotics and AI to form metal into countless shapes for use in defense, aerospace, and more. The company’s applications accelerate design and innovation, enabling rapid iteration and production in days instead of the months required by conventional processes. NVIDIA AI Podcast host Noah Kravitz speaks with Mehr, CEO of Machina Labs, on how the company uses AI to develop the first-ever robotic blacksmith. Its Robotic Craftsman platform integrates seven-axis robots that can shape, scan, trim and drill a wide range of materials — all capabilities made possible through AI.

Time Stamps

1:12: What does Machina Labs do?
3:37: Mehr’s background
  8:45: Machina Labs’ manufacturing platform, the Robotic Craftsman
  10:39: Machina Labs’ history and how AI plays a role in its work
  15:07: The versatility of the Robotic Craftsman
21:48: How the Robotic Craftsman was trained in simulations using AI-generated manufacturing data
28:10: From factory to household — Mehr’s insight on the future of robotic applications

You Might Also Like:

How Two Stanford Students Are Building Robots for Handling Household Chores – Ep. 224

BEHAVIOR-1K is a benchmark that challenges robots to perform 1,000 household chores, including picking up fallen objects or cooking. In this episode, Stanford Ph.D. students Chengshu Eric Li and Josiah David Wong discuss the breakthroughs and challenges they experienced while developing BEHAVIOR-1K.

Hittin’ the Sim: NVIDIA’s Matt Cragun on Conditioning Autonomous Vehicles in Simulation – Ep. 185

NVIDIA DRIVE Sim, built on Omniverse, provides a virtual proving ground for AV testing and validation. It’s a highly accurate simulation platform that can enable groundbreaking tools — including synthetic data and neural reconstruction — to build digital twins of driving environments. In this episode, Matt Cragun, senior product manager for AV simulation at NVIDIA, details the origins and inner workings of DRIVE Sim.

NVIDIA’s Liila Torabi Talks the New Era of Robotics Through Isaac Sim – Ep. 147

Robotics are not just limited to the assembly line. Liila Torabi, senior product manager for NVIDIA Isaac Sim, works on making the next generation of robotics possible. In this episode, she discusses the new era of robotics — one driven by making robots smarter through AI.

Art(ificial) Intelligence: Pindar Van Arman Builds Robots That Paint – Ep. 129

Pindar Van Arman is an American artist and roboticist who designs painting robots that explore the intersection of human and computational creativity. He’s built multiple artificially creative robots, the most famous of which is Cloud Painter, which was awarded first place at Robotart 2018. Tune in to hear how Van Arman deconstructs his own artistic process and teaches it to robots.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Read More

Do the Math: New RTX AI PC Hardware Delivers More AI, Faster

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for RTX PC users.

At the IFA Berlin consumer electronics and home appliances trade show this week, new RTX AI PCs will be announced, powered by RTX GPUs for advanced AI in gaming, content creation, development and academics, plus a neural processing unit (NPU) for offloading lightweight AI tasks.

RTX GPUs, built with specialized AI hardware called Tensor Cores, provide the compute performance needed to run the latest and most demanding AI models. They now accelerate more than 600 AI-enabled games and applications, with more than 100 million GeForce RTX and NVIDIA RTX GPUs in users’ hands worldwide.

Since the launch of NVIDIA DLSS — the first widely deployed PC AI technology — more than five years ago, on-device AI has expanded beyond gaming to livestreaming, content creation, software development, productivity and STEM use cases.

Accelerating AI 

AI boils down to massive matrix multiplication — in other words, incredibly complex math. CPUs can do math, but, as serial processors, they can only perform one operation per CPU core at a time. This makes them far too slow for practical use with AI.

GPUs, on the other hand, are parallel processors, performing multiple operations at once. With hundreds of Tensor Cores each, optimized specifically for AI, RTX GPUs can accelerate these incredibly complex mathematical operations.

RTX-powered systems give users a powerful GPU accelerator for demanding AI workloads in gaming, content creation, software development and STEM subjects. Some also include an NPU, a lightweight accelerator for offloading select low-power workloads.

Local accelerators make AI capabilities always available (even without an internet connection), offer low latency for high responsiveness and increase privacy so that users don’t have to upload sensitive materials to an online database before they become usable by an AI model.

Advanced Processing Power

NVIDIA powers much of the world’s AI — from data center to the edge to an install base of over 100 million PCs worldwide.

The GeForce RTX and NVIDIA RTX GPUs found in laptops, desktops and workstations share the same architecture as cloud servers and provide up to 686 trillion operations per second (TOPS) of AI performance across the GeForce RTX 40 Series Laptop GPU lineup.

RTX GPUs unlock top-tier performance and power a wider range of AI and generative AI than systems with just an integrated system-on-a-chip (SoC).

“Many projects, especially within Windows, are built for and expect to run on NVIDIA cards. In addition to the wide software support base, NVIDIA GPUs also have an advantage in terms of raw performance.” — Jon Allman, industry analyst at Puget Systems

Gamers can use DLSS for AI-enhanced performance and can look forward to NVIDIA ACE digital human technology for next-generation in-game experiences. Creators can use AI-accelerated video and photo editing tools, asset generators, AI denoisers and more. Everyday users can tap RTX Video Super Resolution and RTX Video HDR for improved video quality, and NVIDIA ChatRTX and NVIDIA Broadcast for productivity improvements. And developers can use RTX-powered coding and debugging tools, as well as the NVIDIA RTX AI Toolkit to build and deploy AI-enabled apps for RTX.

Large language models — like Google’s Gemma, Meta’s Llama and Microsoft’s Phi — all run faster on RTX AI PCs, as systems with GPUs load LLMs into VRAM. Add in NVIDIA TensorRT-LLM acceleration and RTX GPUs can run LLMs 10-100x faster than on CPUs.

New RTX AI PCs Available Now

New systems from ASUS and MSI are now shipping with up to GeForce RTX 4070 Laptop GPUs — delivering up to 321 AI TOPS of performance — and power-efficient SoCs with Windows 11 AI PC capabilities. Windows 11 AI PCs will receive a free update to Copilot+ PC experiences when available.

ASUS’ Zephyrus G16 comes with up to a GeForce RTX 4070 Laptop GPU to supercharge photo and video editing, image generation and coding, while game-enhancing features like DLSS create additional high-quality frames and improve image quality. The 321 TOPS of local AI processing power available from the GeForce RTX 4070 Laptop GPU enables multiple AI applications to run simultaneously, changing the way gamers, creators and engineers work and play.

The ASUS ProArt P16 is the first AI PC built for advanced AI workflows across creativity, gaming, productivity and more. Its GeForce RTX 4070 Laptop GPU provides creatives with RTX AI acceleration in top 2D, 3D, video editing and streaming apps. The ASUS ProArt P13 also comes with state-of-the-art graphics and an OLED touchscreen for ease of creation. Both laptops also come NVIDIA Studio-validated, enabling and accelerating your creativity.

The MSI Stealth A16 AI+ features the latest GeForce RTX 40 Series Laptop GPUs, delivering up to 321 AI TOPS with a GeForce RTX 4070 Laptop GPU. This fast and intelligent AI-powered PC is designed to excel in gaming, creation and productivity, offering access to next-level technology.

These laptops join hundreds of RTX AI PCs available today from top manufacturers, with support for the 600+ AI applications and games accelerated by RTX.

Generative AI is transforming graphics and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

CUDA-Free Inference for LLMs

In this blog, we discuss the methods we used to achieve FP16 inference with popular LLM models such as Meta’s Llama3-8B and IBM’s Granite-8B Code, where 100% of the computation is performed using OpenAI’s Triton Language.
For single token generation times using our Triton kernel based models, we were able to approach 0.76-0.78x performance relative to the CUDA kernel dominant workflows for both Llama and Granite on Nvidia H100 GPUs, and 0.62-0.82x on Nvidia A100 GPUs.

Why explore using 100% Triton? Triton provides a path for enabling LLMs to run on different types of GPUs – NVIDIA, AMD, and in the future Intel and other GPU based accelerators. It also provides a higher layer of abstraction in Python for programming GPUs and has allowed us to write performant kernels faster than authoring them using vendor specific APIs. In the rest of this blog, we will share how we achieve CUDA-free compute, micro-benchmark individual kernels for comparison, and discuss how we can further improve future Triton kernels to close the gaps.

Figure 1. Inference throughput benchmarks with Triton and CUDA variants of Llama3-8B and Granite-8B, on NVIDIA H100 and A100
Settings: batch size = 2, input sequence length = 512, output sequence length = 256

2.0 Composition of a Transformer Block

We start with a breakdown of the computations that happen in Transformer-based models. The figure below shows the “kernels” of a typical Transformer block.

Figure 2. Transformer Block by core kernels

The core operations for a Llama3 architecture are summarized in this list:

  1. RMSNorm
  2. Matrix multiplication: Fused QKV
  3. RoPE
  4. Attention
  5. Matrix multiplication: Output Projection
  6. RMSNorm
  7. Matrix multiplication: Fused Gate + Up Projection
  8. Activation function: SiLU
  9. Element Wise Multiplication
  10. Matrix multiplication: Down Projection

Each of these operations is computed on the GPU through the execution of one (or multiple) kernels. While the specifics of each of these kernels can vary across different transformer models, the core operations remain the same. For example, IBM’s Granite 8B Code model uses bias in the MLP layer, different from Llama3. Such changes do require modifications to the kernels. A typical model is a stack of these transformer blocks wired together with embedding layers.

3.0 Model Inference

Typical model architecture code is shared as a Python model.py file that is launched by PyTorch. In the default PyTorch eager execution mode, these kernels are all executed with CUDA. To achieve 100% Triton for end-to-end Llama3-8B and Granite-8B inference, we need to write and integrate handwritten Triton kernels as well as leverage torch.compile (to generate Triton ops). First, we replace smaller ops with compiler-generated Triton kernels, and second, we replace more expensive and complex computations (e.g. matrix multiplication and flash attention) with handwritten Triton kernels.

Torch.compile generates Triton kernels automatically for RMSNorm, RoPE, SiLU and Element Wise Multiplication. Using tools like Nsight Systems we can observe these generated kernels; they appear as tiny dark green kernels in-between the matrix multiplications and attention.
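
As a small, self-contained illustration (not the model code from this post), torch.compile will generate a fused Triton kernel for an RMSNorm-like function; running with TORCH_LOGS="output_code" prints the generated kernel:

import torch

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: scale by the reciprocal root-mean-square over the last dimension
    rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x * rms * weight

compiled_rmsnorm = torch.compile(rmsnorm)

x = torch.randn(2, 512, 4096, device="cuda", dtype=torch.float16)
w = torch.ones(4096, device="cuda", dtype=torch.float16)

# Launch with TORCH_LOGS="output_code" set in the environment to inspect
# the Triton code that TorchInductor generates for this op
out = compiled_rmsnorm(x, w)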

Figure 3. Trace of Llama3-8B with torch.compile, showing CUDA kernels being used for matrix multiplications and flash attention

For the above trace, we note that the two major ops that make up 80% of the E2E latency in a Llama3-8B style model are matrix multiplication and attention kernels and both remain CUDA kernels. Thus to close the remaining gap, we replace both matmul and attention kernels with handwritten Triton kernels.

4.0 Triton SplitK GEMM Kernel

For the matrix multiplications in the linear layers, we wrote a custom FP16 Triton GEMM (General Matrix-Matrix Multiply) kernel that leverages a SplitK work decomposition. We have previously discussed this parallelization in other blogs as a way to accelerate the decoding portion of LLM inference.

5.0 GEMM Kernel Tuning

To achieve optimal performance we used the exhaustive search approach to tune our SplitK GEMM kernel. Granite-8B and Llama3-8B have linear layers with the following shapes:

Linear Layer | Shape (in_features, out_features)
Fused QKV Projection | (4096, 6144)
Output Projection | (4096, 4096)
Fused Gate + Up Projection | (4096, 28672)
Down Projection | (14336, 4096)

Figure 4. Granite-8B and Llama3-8B Linear Layer Weight Matrix Shapes

Each of these linear layers has a different weight matrix shape. Thus, for optimal performance, the Triton kernel must be tuned for each of these shape profiles. After tuning for each linear layer, we were able to achieve a 1.20x E2E speedup on Llama3-8B and Granite-8B over the untuned Triton kernel.
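
Triton’s built-in autotuner is one way to run this kind of per-shape search. The following is a simplified sketch using a standard (non-SplitK) Triton GEMM rather than the kernel from this post; the block sizes, warp counts, and example shape are illustrative:

import torch
import triton
import triton.language as tl

# Candidate configurations to search over; the best one is cached per (M, N, K) shape
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=4),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=8),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=8),
    ],
    key=["M", "N", "K"],  # re-run the search whenever the problem shape changes
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):
        k_offs = k0 + offs_k
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + k_offs[None, :] * stride_ak,
                    mask=(offs_m[:, None] < M) & (k_offs[None, :] < K), other=0.0)
        b = tl.load(b_ptr + k_offs[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=(k_offs[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn,
             acc.to(tl.float16), mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def matmul(a, b):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]), triton.cdiv(N, meta["BLOCK_N"]))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1))
    return c

# Tuning happens separately per linear-layer shape, for example the fused QKV projection
x = torch.randn(512, 4096, device="cuda", dtype=torch.float16)
w_qkv = torch.randn(4096, 6144, device="cuda", dtype=torch.float16)
y = matmul(x, w_qkv)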

6.0 Flash Attention Kernel

We evaluated a suite of existing Triton flash attention kernels with different configurations, namely:

  1. AMD Flash
  2. OpenAI Flash
  3. Dao AI Lab Flash
  4. XFormers Flash
  5. PyTorch FlexAttention

We evaluated the text generation quality of each of these kernels, first, in eager mode and then (if we were able to torch.compile the kernel with standard methods) compile mode. For kernels 2-5, we noted the following:

Kernel | Text Generation Quality | Torch.compile | Support for Arbitrary Sequence Length
AMD Flash | Coherent | Yes | Yes
OpenAI Flash | Incoherent | Did not evaluate; WIP to debug precision in eager mode first | No
Dao AI Lab Flash | Incoherent | Did not evaluate; WIP to debug precision in eager mode first | Yes
Xformers FlashDecoding | Hit a compilation error before we were able to evaluate text quality | WIP | No (this kernel is optimized for decoding)
PyTorch FlexAttention | Coherent | WIP | WIP

Figure 5. Table of combinations we tried with different Flash Attention Kernels

The above table summarizes what we observed out-of-the box. With some effort we expect that kernels 2-5 can be modified to meet the above criteria. However, this also shows that having a kernel that works for benchmarking is often only the start of having it usable as an end to end production kernel.
We chose to use the AMD flash attention kernel in our subsequent tests as it can be compiled via torch.compile and produces legible output in both eager and compiled mode.

To satisfy torch.compile compatibility with the AMD flash attention kernel, we had to define it as a torch custom operator. This process is explained in detail here. The tutorial link discusses how to wrap a simple image crop operation. However, we note that wrapping a more complex flash attention kernel follows a similar process. The two step approach is as follows:

  1. Wrap the function into a PyTorch Custom Operator

  2. Add a FakeTensor kernel to the operator, which, given the shapes of the input tensors of flash (q, k, and v), provides a way to compute the output shape of the flash kernel

After defining the Triton flash kernel as a custom op, we were able to successfully compile it for our E2E runs.
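
As a rough illustration of this two-step wrapping with the torch.library API, the sketch below uses a stand-in attention body (PyTorch SDPA) instead of the actual AMD Triton kernel; the op name and shapes are placeholders:

import torch

# Step 1: wrap the kernel call in a PyTorch custom operator
@torch.library.custom_op("mylib::triton_flash_attention", mutates_args=())
def triton_flash_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Stand-in body: a real implementation would launch the Triton flash attention kernel here
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

# Step 2: register a FakeTensor kernel so torch.compile can infer the output shape
@triton_flash_attention.register_fake
def _(q, k, v):
    return torch.empty_like(q)

# The custom op can now be used inside a compiled model
compiled_attn = torch.compile(lambda q, k, v: triton_flash_attention(q, k, v))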

Figure 6. Trace of Llama3-8B with torch.compile, after swapping in Triton matmul and Triton flash attention kernels

From Figure 6, we note that now, after integrating both the SplitK matrix multiplication kernel and the torch op wrapped flash attention kernel, and then running torch.compile, we are able to achieve a forward pass that uses 100% Triton computation kernels.

7.0 End-to-End Benchmarks

We performed end-to-end measurements on NVIDIA H100s and A100s (single GPU) with Granite-8B and Llama3-8B models. We performed our benchmarks with two different configurations.

The Triton kernel configuration uses:

  1. Triton SplitK GEMM
  2. AMD Triton Flash Attention

The CUDA Kernel configuration uses:

  1. cuBLAS GEMM
  2. cuDNN Flash Attention – Scaled Dot-Product Attention (SDPA)

We found the following throughput and inter-token latencies for both eager and torch compiled modes, with typical inference settings:

GPU | Model | Kernel Config | Median Latency (Eager) [ms/tok] | Median Latency (Compiled) [ms/tok]
H100 | Granite-8B | Triton | 27.42 | 11.59
H100 | Granite-8B | CUDA | 18.84 | 9.50
H100 | Llama3-8B | Triton | 20.36 | 10.61
H100 | Llama3-8B | CUDA | 16.59 | 8.59
A100 | Granite-8B | Triton | 53.44 | 16.88
A100 | Granite-8B | CUDA | 37.13 | 14.25
A100 | Llama3-8B | Triton | 44.44 | 17.94
A100 | Llama3-8B | CUDA | 32.45 | 12.96

Figure 7. Granite-8B and Llama3-8B Single Token Generation Latency on H100 and A100,
(batch size = 2, input sequence length = 512, output sequence length = 256)

To summarize, the Triton models can get up to 78% of the performance of the CUDA models on the H100 and up to 82% on the A100.

The performance gap can be explained by the kernel latencies we observe for matmul and flash attention, which are discussed in the next section.

8.0 Microbenchmarks

Kernel | Triton [us] | CUDA [us]
QKV Projection Matmul | 25 | 21
Flash Attention | 13 | 8
Output Projection Matmul | 21 | 17
Gate + Up Projection Matmul | 84 | 83
Down Projection Matmul | 58 | 42

Figure 8. Triton and CUDA Kernel Latency Comparison (Llama3-8B on NVIDIA H100)
Input was an arbitrary prompt (batch size = 1, prompt length = 44 tokens); latencies were measured during decoding

From the above, we note the following:

  1. Triton matmul kernels are 1.2-1.4x slower than CUDA

  2. AMD’s Triton Flash Attention kernel is 1.6x slower than CUDA SDPA

These results highlight the need to further improve the performance of kernels that are core primitives like GEMM and Flash Attention. We leave this as future research, as recent works (e.g. FlashAttention-3, FlexAttention) provide ways to leverage the underlying hardware better as well as Triton pathways that we hope to be able to build on to produce greater speedups. To illustrate this, we compared FlexAttention with SDPA and AMD’s Triton Flash kernel.

We are working to verify E2E performance with FlexAttention. For now, initial microbenchmarks with Flex show promise for longer context lengths and decoding problem shapes, where the query vector is small:

Figure 9. FlexAttention Kernel Benchmarks on NVIDIA H100 SXM5 80GB
(batch=1, num_heads=32, seq_len=seq_len, head_dim=128)
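
A rough sketch of this kind of microbenchmark using the public flex_attention API is shown below; the shapes loosely follow the figure above, the short query stands in for decoding-like problem shapes, and the harness details differ from the one used for our numbers:

import torch
import triton
from torch.nn.attention.flex_attention import flex_attention

batch, num_heads, head_dim = 1, 32, 128
kv_len, q_len = 8192, 128  # long context with a short query block

q = torch.randn(batch, num_heads, q_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, num_heads, kv_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, num_heads, kv_len, head_dim, device="cuda", dtype=torch.float16)

# flex_attention is designed to be used with torch.compile
compiled_flex = torch.compile(flex_attention)

sdpa_ms = triton.testing.do_bench(
    lambda: torch.nn.functional.scaled_dot_product_attention(q, k, v))
flex_ms = triton.testing.do_bench(lambda: compiled_flex(q, k, v))
print(f"SDPA: {sdpa_ms:.3f} ms, FlexAttention: {flex_ms:.3f} ms")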

9.0 Future Work

For future work we plan to explore ways to further optimize our matmuls that leverage the hardware better, such as this blog we published on utilizing TMA for H100, as well as different work decompositions (persistent kernel techniques like StreamK etc.) to get greater speedups for our Triton-based approach. For flash attention, we plan to explore FlexAttention and FlashAttention-3 as the techniques used in these kernels can be leveraged to help further close the gap between Triton and CUDA.
We also note that our prior work has shown promising results for FP8 Triton GEMM kernel performance versus cuBLAS FP8 GEMM, thus in a future post we will explore E2E FP8 LLM inference.

Read More