May 2024 – Page 14

Dial It In: Data Centers Need New Metric for Energy Efficiency

Data centers need an upgraded dashboard to guide their journey to greater energy efficiency, one that shows progress running real-world applications.

The formula for energy efficiency is simple: work done divided by energy used. Applying it to data centers calls for unpacking some details.

Today’s most widely used gauge — power usage effectiveness (PUE) — compares the total energy a facility consumes to the amount its computing infrastructure uses. Over the last 17 years, PUE has driven the most efficient operators closer to an ideal where almost no energy is wasted on processes like power conversion and cooling.

Finding the Next Metrics

PUE served data centers well during the rise of cloud computing, and it will continue to be useful. But it’s insufficient in today’s generative AI era, when workloads and the systems running them have changed dramatically.

That’s because PUE doesn’t measure the useful output of a data center, only the energy that it consumes. That’d be like measuring the amount of gas an engine uses without noticing how far the car has gone.

Many standards exist for data center efficiency. A 2017 paper lists nearly three dozen of them, several focused on specific targets such as cooling, water use, security and cost.

Understanding What’s Watts

When it comes to energy efficiency, the computer industry has a long and somewhat unfortunate history of describing systems and the processors they use in terms of power, typically in watts. It’s a worthwhile metric, but many fail to realize that watts only measure input power at a point in time, not the actual energy computers use or how efficiently they use it.

So, when modern systems and processors report rising input power levels in watts, that doesn’t mean they’re less energy efficient. In fact, they’re often much more efficient in the amount of work they do with the amount of energy they use.

Modern data center metrics should focus on energy, what the engineering community knows as kilowatt-hours or joules. The key is how much useful work they do with this energy.

Reworking What We Call Work

Here again, the industry has a practice of measuring in abstract terms, like processor instructions or math calculations. So, MIPS (millions of instructions per second) and FLOPS (floating point operations per second) are widely quoted.

Only computer scientists care how many of these low-level jobs their system can handle. Users would prefer to know how much real work their systems put out, but defining useful work is somewhat subjective.

Data centers focused on AI may rely on the MLPerf benchmarks. Supercomputing centers tackling scientific research typically use additional measures of work. Commercial data centers focused on streaming media may want others.

The resulting suite of applications must be allowed to evolve over time to reflect the state of the art and the most relevant use cases. For example, the last MLPerf round added tests using two generative AI models that didn’t even exist five years ago.

A Gauge for Accelerated Computing

Ideally, any new benchmarks should measure advances in accelerated computing. This combination of parallel processing hardware, software and methods is running applications dramatically faster and more efficiently than CPUs across many modern workloads.

For example, on scientific applications, the Perlmutter supercomputer at the National Energy Research Scientific Computing Center demonstrated an average of 5x gains in energy efficiency using accelerated computing. That’s why it’s among the 39 of the top 50 supercomputers — including the No. 1 system — on the Green500 list that use NVIDIA GPUs.

Chart of GPU vs CPU energy efficiency — Because they execute lots of tasks in parallel, GPUs execute more work in less time than CPUs, saving energy.

Companies across many industries share similar results. For example, PayPal improved real-time fraud detection by 10% and lowered server energy consumption nearly 8x with accelerated computing.

The gains are growing with each new generation of GPU hardware and software.

In a recent report, Stanford University’s Human-Centered AI group estimated GPU performance “has increased roughly 7,000 times” since 2003, and price per performance is “5,600 times greater.”

Chart depicts relationships among various data center energy efficiency graphics — Data centers need a suite of benchmarks to track energy efficiency across their major workloads.

Two Experts Weigh In

Experts see the need for a new energy-efficiency metric, too.

With today’s data centers achieving scores around 1.2 PUE, the metric “has run its course,” said Christian Belady, a data center engineer who had the original idea for PUE. “It improved data center efficiency when things were bad, but two decades later, they’re better, and we need to focus on other metrics more relevant to today’s problems.”

Looking forward, “the holy grail is a performance metric. You can’t compare different workloads directly, but if you segment by workloads, I think there is a better likelihood for success,” said Belady, who continues to work on initiatives driving data center sustainability.

Jonathan Koomey, a researcher and author on computer efficiency and sustainability, agreed.

“To make good decisions about efficiency, data center operators need a suite of benchmarks that measure the energy implications of today’s most widely used AI workloads,” said Koomey.

“Tokens per joule is a great example of what one element of such a suite might be,” Koomey added. “Companies will need to engage in open discussions, share information on the nuances of their own workloads and experiments, and agree to realistic test procedures to ensure these metrics accurately characterize energy use for hardware running real-world applications.”

“Finally, we need an open public forum to conduct this important work,” he said.

It Takes a Village

Thanks to metrics like PUE and rankings like the Green500, data centers and supercomputing centers have made enormous progress in energy efficiency.

More can and must be done to extend efficiency advances in the age of generative AI. Metrics of energy consumed doing useful work on today’s top applications can take supercomputing and data centers to a new level of energy efficiency.

To learn more about available energy-efficiency solutions, explore NVIDIA sustainable computing.

Enhancing Deep Learning Workflows: PyTorch Ecosystem Tools

Welcome to the thriving PyTorch ecosystem, where a wealth of tools and libraries await, purpose-built to elevate your experience in deep learning as a developer or researcher. The Ecosystem Tools pages host many projects from experts spanning academia, industry, application development, and machine learning.

Initially, PyTorch aimed to establish a thriving community, enabling developers to access each other’s tools, engage in meaningful discussions, and explore the wealth of resources available within the community.

Today, the PyTorch ecosystem has grown to feature over 100 projects tailored to your needs, providing robust support, enhanced speed, and effortless integration with PyTorch. If your project aligns with our mission, we invite you to submit it and join this dynamic ecosystem.

New this month, we’ve moved all of our Ecosystem blogs over to our PyTorch.org website to host a space where our community can show off the latest innovations with our users. Read on to hear about the latest projects in the ecosystem!

Explore the Latest Tools and Frameworks in the Ecosystem

As we continue into 2024, we’re thrilled to showcase an impressive array of ecosystem tools that significantly enrich the PyTorch community. These tools cover a wide range of domains, including pose estimation, profiling, and even quantum computing. Let’s explore each one to witness firsthand how they are reshaping the PyTorch landscape, opening up exciting possibilities for developers.

Anomalib

Anomalib is a deep learning library that aims to collect state-of-the-art anomaly detection algorithms for benchmarking on both public and private datasets. Anomalib provides several ready-to-use implementations of anomaly detection algorithms described in the recent literature, as well as a set of tools that facilitate the development and implementation of custom models. The library has a strong focus on image-based anomaly detection, where the goal of the algorithm is to identify anomalous images, or anomalous pixel regions within images in a dataset. Anomalib is constantly updated with the latest algorithms and training/inference extensions.

Diffusers

Diffusers is a library within the PyTorch ecosystem that focuses on model interpretability. It offers a suite of tools and techniques to explain the decisions made by deep learning models. With Diffusers, developers can gain insights into model behavior, understand feature importance, and detect potential biases. By making deep learning models more transparent, Diffusers promotes fairness, accountability, and robustness in AI applications.

Pomegranate

Pomegranate is a versatile machine learning library that integrates seamlessly with PyTorch. It provides a wide range of probabilistic models and tools for probabilistic modeling tasks. Pomegranate empowers users to build complex models such as hidden Markov models (HMMs), Bayesian networks, and Gaussian mixture models (GMMs). By combining the strengths of PyTorch and Pomegranate, developers can leverage the power of deep learning and probabilistic modeling to tackle various machine learning challenges.

PyPose

PyPose is a PyTorch-based library designed for pose estimation tasks. With PyPose, developers can efficiently train and deploy models for human pose estimation, a fundamental computer vision problem. By leveraging PyTorch’s flexibility and performance, PyPose simplifies the process of building accurate pose estimation models. Its intuitive APIs and pre-trained models make it an excellent choice for researchers and developers exploring human pose estimation applications.

PyPOTS

A python toolbox/library for data mining on partially-observed time series with PyTorch, including SOTA models supporting tasks of imputation, classification, clustering, and forecasting on incomplete (irregularly-sampled) multivariate time series with missing values.

OctoML Profiler

OctoML Profiler is a performance profiling tool that aids in optimizing PyTorch models. This tool helps developers identify performance bottlenecks and inefficiencies within their deep learning models. By providing insights into memory usage, compute time, and data movement, the OctoML Profiler enables developers to fine-tune their models for improved efficiency. With this valuable feedback, developers can optimize their models for deployment on various hardware platforms.

Open Compass

OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include: Comprehensive support for models and datasets, efficient distributed evaluation, diversified evaluation paradigms, modular design with high extensibility and experiment management and reporting mechanism.

Renate

Renate is a PyTorch-based library for neural architecture search (NAS). It simplifies the process of automatically searching for optimal neural network architectures tailored to specific tasks. Renate leverages techniques like reinforcement learning and evolutionary algorithms to efficiently explore the architecture space. By using Renate, developers can save significant time and resources while discovering highly performant models.

RoMa

RoMa is a standalone library to handle rotation representations with PyTorch (rotation matrices, quaternions, rotation vectors, etc). It aims for robustness, ease-of-use, and efficiency.

Substra

Substra is an open source federated learning (FL) software. It enables the training and validation of machine learning models on distributed datasets. It provides a flexible Python interface and a web application to run federated learning training at scale. Substra’s main usage is in production environments. It has already been deployed and used by hospitals and biotech companies. Substra can also be used on a single machine to perform FL simulations and debug code.

TorchQuantum

TorchQuantum is a powerful library that combines the PyTorch framework with quantum computing concepts. It enables developers to explore quantum machine learning algorithms and build hybrid classical-quantum models. By integrating the principles of quantum computing into PyTorch, TorchQuantum opens up new possibilities for solving complex problems that traditional deep learning approaches may struggle with.

TIAToolbox

The TIAToolbox (Text-Image-Augmentation Toolbox) is a PyTorch library designed to augment text and image data for deep learning tasks. It offers a comprehensive set of tools for data augmentation, including transformations, noise injection, and image/text synthesis. By applying TIAToolbox, developers can enrich their training datasets, improve model generalization, and enhance the robustness of their deep learning models.

torchdistill

torchdistill is a coding-free framework built on PyTorch for reproducible deep learning and knowledge distillation studies. The framework is designed to enable users to design experiments by declarative PyYAML configuration files and supports high-level module abstractions.

TorchOpt

TorchOpt is a PyTorch library focused on optimization algorithms for deep learning. It provides a collection of state-of-the-art optimization techniques, such as stochastic gradient descent (SGD) variants, adaptive learning rate methods, and optimization schedules. TorchOpt empowers developers to fine-tune their models efficiently, converge faster, and achieve better performance in various deep learning tasks.

USB

USB, or Unified Speech-to-Text Benchmark, is a PyTorch-based toolkit for training and evaluating speech recognition models. It provides standardized datasets and evaluation metrics to facilitate fair and accurate comparisons between different speech recognition architectures. By using USB, researchers and developers can benchmark their models against state-of-the-art systems and drive advancements in the field of automatic speech recognition.

Zeus

Zeus is the current state-of-the-art in deep learning energy measurement and optimization. It has monitor components that allow users to measure GPU energy consumption and optimizer components that automatically optimize DNN or GPU knobs based on measurements from the monitor component.

Be Part of Our Ecosystem

Our diverse ecosystem tools are instrumental in PyTorch’s success.. They provide essential support for tasks such as pose estimation, probabilistic modeling, performance profiling, model interpretability, speech recognition, quantum computing, data augmentation, optimization, and neural architecture search.

Leveraging these tools empowers developers and researchers to accelerate their deep learning workflows and unlock new possibilities in the field of AI.

Have a tool that would be a good fit for the PyTorch Ecosystem? If you can answer the below questions, we’d love for you to submit your tool for review.

Does your project complement PyTorch, enhancing user experience, introducing new capabilities, or accelerating training and inference processes?
- Examples could include visualization tools, a kernel library or a framework that sits on top to enable research in a particular area such as NLP.
Is the project ready for broad developer usage?
- For example, is the project stable, will it be maintained, and is there adequate supporting infrastructure, documentation, and technical support to allow a developer to successfully use it?

Thank you to all of our contributors and collaborators in our ecosystem! Here’s to a great 2024.

Introducing depyf: mastering torch.compile with ease

We are thrilled to introduce depyf, a new project to the PyTorch ecosystem designed to help users understand, learn, and adapt to torch.compile!

Motivation

torch.compile is a cornerstone of PyTorch 2.x, offering a straightforward path to accelerate machine learning workflows with just a single line of code for both training and inference. The mere inclusion of @torch.compile can dramatically enhance the performance of your code. However, identifying the optimal insertion point for torch.compile is not easy, not to mention the complexity of adjusting various knobs for maximum efficiency.

The intricacies of the torch.compile stack, encompassing Dynamo, AOTAutograd, Inductor, and more, present a steep learning curve. These components, essential for deep learning performance optimization, can be daunting without a solid foundation in the subject.

Note: For an introductory example of how torch.compile works, please refer to this walk-through explanation.

A common tool: `TORCH_COMPILE_DEBUG`

To demystify torch.compile, the common approach involves leveraging the TORCH_COMPILE_DEBUG environment variable. While it provides more information, deciphering the output remains a formidable task.

For example, when we have the following code:

# test.py
import torch
from torch import _dynamo as torchdynamo
from typing import List

@torch.compile
def toy_example(a, b):
   x = a / (torch.abs(a) + 1)
   if b.sum() < 0:
       b = b * -1
   return x * b

def main():
   for _ in range(100):
       toy_example(torch.randn(10), torch.randn(10))

if __name__ == "__main__":
   main()

And run it with TORCH_COMPILE_DEBUG=1 python test.py , we will get a directory named torch_compile_debug/run_2024_02_05_23_02_45_552124-pid_9520 , under which there are these files:

.
├── torchdynamo
│   └── debug.log
└── torchinductor
   ├── aot_model___0_debug.log
   ├── aot_model___10_debug.log
   ├── aot_model___11_debug.log
   ├── model__4_inference_10.1
   │   ├── fx_graph_readable.py
   │   ├── fx_graph_runnable.py
   │   ├── fx_graph_transformed.py
   │   ├── ir_post_fusion.txt
   │   ├── ir_pre_fusion.txt
   │   └── output_code.py
   ├── model__5_inference_11.2
   │   ├── fx_graph_readable.py
   │   ├── fx_graph_runnable.py
   │   ├── fx_graph_transformed.py
   │   ├── ir_post_fusion.txt
   │   ├── ir_pre_fusion.txt
   │   └── output_code.py
   └── model___9.0
       ├── fx_graph_readable.py
       ├── fx_graph_runnable.py
       ├── fx_graph_transformed.py
       ├── ir_post_fusion.txt
       ├── ir_pre_fusion.txt
       └── output_code.py

The generated files and logs often raise more questions than they answer, leaving developers puzzled over the meaning and relationships within the data. Common puzzles for TORCH_COMPILE_DEBUG include:

What does model__4_inference_10.1 mean?
I have one function but three model__xxx.py in the directory, what is their correspondence?
What are those LOAD_GLOBAL stuff in debug.log ?

A better tool: `depyf` comes to rescue

Let’s see how depyf can help developers to resolve the above challenges. To use depyf , simply execute pip install depyf or follow the project page https://github.com/thuml/depyf to install the latest version, and then surround the main code within with depyf.prepare_debug .

# test.py
import torch
from torch import _dynamo as torchdynamo
from typing import List

@torch.compile
def toy_example(a, b):
   x = a / (torch.abs(a) + 1)
   if b.sum() < 0:
       b = b * -1
   return x * b

def main():
   for _ in range(100):
       toy_example(torch.randn(10), torch.randn(10))

if __name__ == "__main__":
   import depyf
   with depyf.prepare_debug("depyf_debug_dir"):
       main()

After executing python test.py , depyf will produce a directory named depyf_debug_dir (the argument of the prepare_debug function). Under the directory, there would be these files:

.
├── __compiled_fn_0 AFTER POST GRAD 0.py
├── __compiled_fn_0 Captured Graph 0.py
├── __compiled_fn_0 Forward graph 0.py
├── __compiled_fn_0 kernel 0.py
├── __compiled_fn_3 AFTER POST GRAD 0.py
├── __compiled_fn_3 Captured Graph 0.py
├── __compiled_fn_3 Forward graph 0.py
├── __compiled_fn_3 kernel 0.py
├── __compiled_fn_4 AFTER POST GRAD 0.py
├── __compiled_fn_4 Captured Graph 0.py
├── __compiled_fn_4 Forward graph 0.py
├── __compiled_fn_4 kernel 0.py
├── __transformed_code_0_for_torch_dynamo_resume_in_toy_example_at_8.py
├── __transformed_code_0_for_toy_example.py
├── __transformed_code_1_for_torch_dynamo_resume_in_toy_example_at_8.py
└── full_code_for_toy_example_0.py

And there are two obvious benefits:

The long and difficult-to-understand torchdynamo/debug.log is gone. Its content is cleaned up and shown as human-readable source code, in full_code_for_xxx.py and __transformed_code_{n}_for_xxx.py . It is worth to note, that the most tedious and difficult job of depyf is to decompile the bytecode inside torchdynamo/debug.log into Python source code, freeing developers from intimidating internals of Python.
The correspondence between function names and computation graphs are respected. For example, in __transformed_code_0_for_toy_example.py , we can see a function named __compiled_fn_0 , and we will immediately know its corresponding computation graphs are in __compiled_fn_0_xxx.py , because they share the same __compiled_fn_0 prefix name.

Starting with full_code_for_xxx.py , and following the functions involved, users will have a clear view of what torch.compile does to their code.

One more thing: step-through debuggability

Stepping through code line by line using debuggers is a great way to understand how code works. However, under TORCH_COMPILE_DEBUG , those files are only for users’ information, and cannot be executed with the data users concern.

Note: By “debug”, we mean the process of inspecting and improving a program, rather than correcting buggy code.

A standout feature of depyf is its capability to facilitate step-through debugging for torch.compile: all of the files it generates are linked with runtime code objects inside Python interpreter, and we can set breakpoints in these files. The usage is simple, just add one context manager with depyf.debug() , and it should do the trick:

# test.py
import torch
from torch import _dynamo as torchdynamo
from typing import List

@torch.compile
def toy_example(a, b):
   x = a / (torch.abs(a) + 1)
   if b.sum() < 0:
       b = b * -1
   return x * b

def main():
   for _ in range(100):
       toy_example(torch.randn(10), torch.randn(10))

if __name__ == "__main__":
   import depyf
   with depyf.prepare_debug("depyf_debug_dir"):
       main()
   with depyf.debug():
       main()

Just one caveat: the workflow of debugging torch.compile deviates from standard debugging workflow. With torch.compile, many codes are dynamically generated. Therefore, we need to:

launch the program
when the program exits with depyf.prepare_debug("depyf_debug_dir") , code will be available in depyf_debug_dir.
when the program enters with depyf.debug() , it will automatically set a breakpoint internally, so that the program is paused.
navigate to depyf_debug_dir to set breakpoints.
continue to run the code, and debuggers will hit these breakpoints!

Here is a screenshot of what it looks like. All code and tensor variables are live, and we can inspect any variable, and step through the code, as in our daily debugging workflow now! The only difference is that we are debugging torch.compile generated code rather than human-written code.

Conclusion

torch.compile serves as an invaluable tool for accelerating PyTorch code effortlessly. For those looking to delve deeper into torch.compile, whether to leverage its full potential or to integrate custom operations, the learning curve can be very steep though. depyf is designed to lower this barrier, offering a user-friendly experience to understand, learn, and adapt to torch.compile.

Do explore depyf and experience its benefits firsthand! The project is open-source and readily available at https://github.com/thuml/depyf. Installation is straightforward via pip install depyf. We hope depyf can enhance everyone’s development workflow with torch.compile.

Deep Learning Energy Measurement and Optimization

This post is authored by Jae-Won Chung, a PhD student at the University of Michigan and the lead of the ML.ENERGY Initiative.

Deep learning consumes quite a bit of energy. For instance, training a single 200B LLM on AWS p4d instances consumed around 11.9 GWh (source: CIDR 2024 keynote), which is an amount that can single-handedly power more than a thousand average US households for a year.

Zeus is an open-source toolbox for measuring and optimizing the energy consumption of deep learning workloads. Our goal is to make energy optimization based on accurate measurements as easy as possible for diverse deep learning workloads and setups by offering composable tools with minimal assumptions.

Zeus largely provides two types of tools:

Programmatic and command line GPU energy measurement tools
Several energy optimization tools that find the best ML and/or GPU configurations

Zeus can benefit those who would like to

measure and optimize their electricity cost
reduce heat dissipation from their GPUs (by lowering power draw)
report energy usage from research and development
reduce carbon footprint from electricity usage

Part 1: Measuring Energy

Just like performance optimization, accurate measurement is the basis of effective energy optimization. Popular proxies for estimating power consumption like the maximum power draw of the hardware can sometimes be vastly off compared to actual measurement.

To make energy measurement as easy and transparent as possible, the core utility Zeus offers is the ZeusMonitor class. Let’s take a look at the actual snippet:

from zeus.monitor import ZeusMonitor

# All four GPUs are measured simultaneously.
monitor = ZeusMonitor(gpu_indices=[0,1,2,3])

# Measure total time and energy within the window.
monitor.begin_window("training")
for e in range(100):

    # Measurement windows can arbitrarily be overlapped.
    monitor.begin_window("epoch")
    for x, y in train_dataloader:
        y_hat = model(x)
        loss = criterion(y, y_hat)
        loss.backward()
        optim.step()
    measurement = monitor.end_window("epoch")
    print(f"Epoch {e}: {measurement.time} s, {measurement.total_energy} J")

measurement = monitor.end_window("training")
print(f"Entire training: {measurement.time} s, {measurement.total_energy} J")

What you see above is a typical PyTorch training loop which uses four GPUs for data parallel training. Inside, we created an instance of ZeusMonitor and passed in a list of GPU indices to monitor. Then, using the monitor, we can measure the time and energy consumption of arbitrary execution windows within the training script by pairing calls to begin_window and end_window. Multiple windows can overlap and nest in arbitrary ways without affecting the measurement of each, as long as their names are different.

ZeusMonitor adds very little overhead – typically single digit milliseconds – around the window. This allows ZeusMonitor to be used in various applications. For instance:

The ML.ENERGY Leaderboard: The first open-source benchmark on how much energy LLM text generation consumes.
The ML.ENERGY Colosseum: An online service that lets users compare LLM responses side-by-side based on response quality and energy consumption.

See our blog post for a deeper technical dive into accurate GPU energy measurement.

Part 2: Optimizing Energy

Let me introduce you to two of the energy optimizers provided by Zeus.

GlobalPowerLimitOptimizer

GPUs allow users to configure its maximum power draw, called power limit. Typically, as you lower the GPU’s power limit from the default maximum, computation may get slightly slower, but you’ll save disproportionately more energy. The GlobalPowerLimitOptimizer in Zeus automatically finds the optimal GPU power limit globally across all GPUs.

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer

# The optimizer measures time and energy through the ZeusMonitor.
monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
plo = GlobalPowerLimitOptimizer(monitor)

for e in range(100):
    plo.on_epoch_begin()
    for x, y in train_dataloader:
        plo.on_step_begin()

        y_hat = model(x)
        loss = criterion(y, y_hat)
        loss.backward()
        optim.step()

        plo.on_step_end()
    plo.on_epoch_end()

In our familiar PyTorch training loop, we have instantiated GlobalPowerLimitOptimizer and passed it an instance of the ZeusMonitor, through which the optimizer sees the GPUs. Then, we just need to let the optimizer know about training progress (step and epoch boundaries), and the optimizer will transparently do all the necessary profiling and converge to the optimal power limit.

If you’re using the HuggingFace Trainer or SFTTrainer, integration is even easier:

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer

# ZeusMonitor actually auto-detects CUDA_VISIBLE_DEVICES.
monitor = ZeusMonitor()
pl_optimizer = HFGlobalPowerLimitOptimizer(monitor)

# Pass in the optimizer as a Trainer callback. Also works for SFTTrainer.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    ...,
    callbacks=[pl_optimizer],
)

The HFGlobalPowerLimitOptimizer wraps GlobalPowerLimitOptimizer so that it automatically detects step and epoch boundaries. We have example integrations here, including running Gemma 7B supervised fine-tuning with QLoRA.

Now, we know how to integrate the optimizer, but what is the optimal power limit? We know different users can have different preferences regarding trading off time and energy, so we allow users to specify an OptimumSelector (basically the Strategy Pattern) to express their needs.

# Built-in strategies for selecting the optimal power limit.
from zeus.optimizer.power_limit import (
    GlobalPowerLimitOptimizer,
    Time,
    Energy,
    MaxSlowdownConstraint,
)

# Minimize energy while tolerating at most 10% slowdown.
plo = GlobalPowerLimitOptimizer(
    monitor,
    MaxSlowdownConstraint(factor=1.1),
)

Some of the built-in strategies include “Minimize time” (Time, this might still reduce the power limit from the default since some workloads exhibit almost no slowdown even on lower power limits), “Minimize energy” (Energy), “Somewhere in between” (ZeusCost), and “Minimize energy given maximum slowdown” (MaxSlowdownConstraint). Users can also create their own optimum selectors as needed.

PipelineFrequencyOptimizer

The pipeline frequency optimizer, based on our research paper Perseus, is our latest work on energy optimization for large model training, like GPT-3. Perseus can reduce the energy consumption of large model training with no or negligible training throughput degradation. We’ll briefly talk about how.

The above is a visualization of one iteration of training with four stage pipeline parallelism running with the 1F1B schedule. Each box is either a forward or a backward computation, and is colored with its power consumption.

The key observation here is that when models are partitioned into pipeline stages, it’s very difficult to slice them in perfectly equal sizes. This leads to forward/backward boxes of varying widths and therefore computation idle time between boxes. You would notice that those smaller boxes can run slightly slower than wider boxes and the overall critical path (blue line) will not change at all.

That’s what Perseus automatically does. Based on profiling, it identifies computation boxes that are not on the critical path and figures out the precise amount of slowdown for each box that minimizes energy consumption. When done correctly, computations we slowed down will consume less power & energy, but the overall iteration time of the pipeline does not change.

See our guide to get started with Perseus!

Final Words

For users who run their own on-premise compute, energy consumption and the resulting electricity bill is not something that can be easily overlooked. On a larger scale, energy consumption is not just about electricity bills, but also about data center power delivery. With thousands of GPUs running in clusters, finding stable, affordable, and sustainable electricity sources to power data centers is becoming increasingly challenging. Finding ways to reduce energy disproportionately more than slowdown leads to lower average power consumption, which can help with the power delivery challenge.

With Zeus, we hope to take the first step towards deep learning energy measurement and optimization.

Wondering where to go from here? Here are a couple helpful links:

Zeus homepage/documentation
Zeus GitHub repository
Zeus usage and integration examples
ML.ENERGY Initiative (i.e., the people building Zeus)

AWS DeepRacer enables builders of all skill levels to upskill and get started with machine learning

In today’s technological landscape, artificial intelligence (AI) and machine learning (ML) are becoming increasingly accessible, enabling builders of all skill levels to harness their power. As more companies adopt AI solutions, there’s a growing need to upskill both technical and non-technical teams in responsibly expanding AI usage. Getting hands-on experience is crucial for understanding and applying ML concepts to automate tasks like content generation, language translation, and image classification. And that’s where AWS DeepRacer comes into play—a fun and exciting way to learn ML fundamentals.

Launched in 2019, DeepRacer is a fully managed service that enables builders of all skill levels to learn and perform model training and evaluation tasks such as defining a reward function, setting up the training parameters, and configuring a training job that can be evaluated and monitored for model performance in a simulated environment. By exploring the AWS DeepRacer ML training lifecycle, you’ll practice model training, evaluation, and deployment of ML models onto a 1/18th scale autonomous race car, using a human-in-the-loop experience. The model training and evaluation experience enables builders to familiarize themselves with similar concepts applicable in training and fine-tuning foundation models (FMs) that power generative AI applications.

AWS DeepRacer also offers a global racing league for competing alongside a community of ML enthusiasts, earning rewards and recognition while showcasing your ML skills. Through the AWS DeepRacer League, we have educated over 550,000 developers, crowned five AWS DeepRacer champions, recognized over 100 monthly virtual circuit winners, and rewarded over 10,000 participants worldwide with Amazon gift cards, cash prizes, and paid trips to AWS re:Invent to compete for the annual AWS DeepRacer Championship Cup.

The excitement around AWS DeepRacer extends far beyond just individual learners. To celebrate Women’s History Month, JPMorgan Chase & Co. recently hosted the “World’s Largest Global Women’s AWS DeepRacer League,” providing employees with a thrilling opportunity to gain hands-on ML experience through virtual autonomous vehicle racing. This event not only fostered a spirit of friendly competition but also celebrated empowerment and innovation in AI and ML. By embracing AWS DeepRacer, JPMorgan Chase showcased its commitment to democratizing ML knowledge and nurturing a culture of continuous learning, empowering its talented teams to drive the company’s AI transformation.

“I am super proud of the group, the firm and the TIF (Take it Forward) team. . . I couldn’t be more proud of a group of individuals being so self-motivated. The sky is the limit from here! Deep Racer is proof that learning can be fun.”

– Ebele Kemery, Head of JPMorgan Chase Tech, Data and AI Learning.

Initiatives like these demonstrate the far-reaching impact of AWS DeepRacer in bringing ML education to the forefront, inspiring learners of all backgrounds to embrace the future of intelligent technologies.

Whether you’re a seasoned developer or curious business professional, AWS DeepRacer provides a fun and exciting way to get started with AI. You’ll gain practical skills applicable to real-world ML and generative AI use cases. So get rolling with machine learning today!

About the authors

Ange Krueger is a principal AWS technologist. She leads product portfolio advancements and technological agility within the global financial sector. Utilizing over 200 AWS cloud services including leading AWS Artificial Intelligence, Machine Learning and Generative AI offerings, she delivers innovation, transformation, and scalable solutions that precisely address the complex demands of our global customers. Through a collaborative approach and a laser focus on customer-centric outcomes, Ange enhances customer experiences to achieve optimized business performance. Her commitment to continual improvement and customer obsession is unwavering, as she works to empower our clients with resilient, cloud-based financial services solutions.

Transform customer engagement with no-code LLM fine-tuning using Amazon SageMaker Canvas and SageMaker JumpStart

Fine-tuning large language models (LLMs) creates tailored customer experiences that align with a brand’s unique voice. Amazon SageMaker Canvas and Amazon SageMaker JumpStart democratize this process, offering no-code solutions and pre-trained models that enable businesses to fine-tune LLMs without deep technical expertise, helping organizations move faster with fewer technical resources.

SageMaker Canvas provides an intuitive point-and-click interface for business users to fine-tune LLMs without writing code. It works both with SageMaker JumpStart and Amazon Bedrock models, giving you the flexibility to choose the foundation model (FM) for your needs.

This post demonstrates how SageMaker Canvas allows you to fine-tune and deploy LLMs. For businesses invested in the Amazon SageMaker ecosystem, using SageMaker Canvas with SageMaker JumpStart models provides continuity in operations and granular control over deployment options through SageMaker’s wide range of instance types and configurations. For information on using SageMaker Canvas with Amazon Bedrock models, see Fine-tune and deploy language models with Amazon SageMaker Canvas and Amazon Bedrock.

Fine-tuning LLMs on company-specific data provides consistent messaging across customer touchpoints. SageMaker Canvas lets you create personalized customer experiences, driving growth without extensive technical expertise. In addition, your data is not used to improve the base models, is not shared with third-party model providers, and stays entirely within your secure AWS environment.

Solution overview

The following diagram illustrates this architecture.

In the following sections, we show you how to fine-tune a model by preparing your dataset, creating a new model, importing the dataset, and selecting an FM. We also demonstrate how to analyze and test the model, and then deploy the model via SageMaker, focusing on how the fine-tuning process can help align the model’s responses with your company’s desired tone and style.

Prerequisites

First-time users need an AWS account and AWS Identity and Access Management (IAM) role with SageMaker and Amazon Simple Storage Service (Amazon S3) access.

To follow along with this post, complete the prerequisite steps:

Create a SageMaker domain, which is a collaborative machine learning (ML) environment with shared file systems, users, and configurations.
Confirm that your SageMaker IAM role and domain roles have the necessary permissions.
On the domain details page, view the user profiles.
Choose Launch by your profile, and choose Canvas.

Prepare your dataset

SageMaker Canvas requires a prompt/completion pair file in CSV format because it does supervised fine-tuning. This allows SageMaker Canvas to learn how to answer specific inputs with properly formatted and adapted outputs.

Download the following CSV dataset of question-answer pairs.

Create a new model

SageMaker Canvas allows simultaneous fine-tuning of multiple models, enabling you to compare and choose the best one from a leaderboard after fine-tuning. For this post, we compare Falcon-7B with Falcon-40B.

Complete the following steps to create your model:

In SageMaker Canvas, choose My models in the navigation pane.
Choose New model.
For Model name, enter a name (for example, MyModel).
For Problem type¸ select Fine-tune foundation model.
Choose Create.

The next step is to import your dataset into SageMaker Canvas.

Create a dataset named QA-Pairs.
Upload the prepared CSV file or select it from an S3 bucket.
Choose the dataset.

SageMaker Canvas automatically scans it for any formatting issues. In this case, SageMaker Canvas detects an extra newline at the end of the CSV file, which can cause problems.

To address this issue, choose Remove invalid characters.
Choose Select dataset.

Select a foundation model

After you upload your dataset, select an FM and fine-tune it with your dataset. Complete the following steps:

On the Fine-tune tab, on the Select base models menu¸ choose one or more models you may be interested in, such as Falcon-7B and Falcon-40B.
For Select input column, choose question.
For Select output column, choose answer.
Choose Fine-tune.

Optionally, you can configure hyperparameters, as shown in the following screenshot.

Wait 2–5 hours for SageMaker to finish fine-tuning your models. As part of this process, SageMaker Autopilot splits your dataset automatically into an 80/20 split for training and validation, respectively. You can optionally change this split configuration in the advanced model building configurations.

SageMaker training uses ephemeral compute instances to efficiently train ML models at scale, without the need for long-running infrastructure. SageMaker logs all training jobs by default, making it straightforward to monitor progress and debug issues. Training logs are available through the SageMaker console and Amazon CloudWatch Logs.

Analyze the model

After fine-tuning, review your new model’s stats, including:

Training loss – The penalty for next-word prediction mistakes during training. Lower values mean better performance.
Training perplexity – Measures the model’s surprise when encountering text during training. Lower perplexity indicates higher confidence.
Validation loss and validation perplexity – Similar to the training metrics, but measured during the validation stage.

To get a detailed report on your custom model’s performance across dimensions like toxicity and accuracy, choose Generate evaluation report (based on the AWS open source Foundation Model Evaluations Library). Then choose Download report.

The graph’s curve reveals if you overtrained your model. If the perplexity and loss curves plateau after a certain number of epochs, the model stopped learning at that point. Use this insight to adjust the epochs in a future model version using the Configure model settings.

The following is a portion of the report, which gives you an overall toxicity score for the fine-tuned model. The report includes explanations of what the scores mean.

A dataset consisting of ~320K question-passage-answer triplets. The questions are factual naturally-occurring questions. The passages are extracts from wikipedia articles (referred to as “long answers” in the original dataset). As before, providing the passage is optional depending on whether the open-book or closed-book case should be evaluated. We sampled 100 records out of 4289 in the full dataset.Prompt Template: Respond to the following question with a short answer: $model_input

Toxicity detector model: UnitaryAI Detoxify-unbiased

Toxicity Score
A binary score from 0 (no toxicity detected) to 1 (toxicity detected) for the class: toxicity

Average Score: 0.0027243031983380205

Now that we have confirmed that the model has close to 0 toxicity detected according to the available toxicity models, let’s check out the model leaderboard to compare how Falcon-40B and Falcon-7B perform on dimensions like loss and perplexity.

On an order of magnitude, the two models performed about the same along these metrics on the provided data. Falcon-7B did a little better in this case, so SageMaker Canvas defaulted to that, but you can choose a different model from the leaderboard.

Let’s stick with Falcon-7B, because it performed slightly better and will run on more cost-efficient infrastructure.

Test the models

Although metrics and the report already provide insights into the performances of the models you’ve fine-tuned, you should always test your models by generating some predictions before putting them in production. For that, SageMaker Canvas allows you to use these models without leaving the application. To do that, SageMaker Canvas deploys for you an endpoint with the fine-tuned model, and shuts it down automatically after 2 hours of inactivity to avoid unintended costs.

To test the models, complete the following steps. Keep in mind that although fine-tuning can improve response style, it may not be a complete solution for providing factual accuracy. For factual accuracy, consider Retrieval Augmented Generation (RAG) architectures and continued pre-training.

Choose Test in Ready-to-Use Models and wait 15–30 minutes for your test endpoint to be deployed.

When the deployment is complete, you’ll be redirected to the SageMaker Canvas playground, with your model pre-selected.

2. Choose Compare and select the FM used for your custom model.
Enter a phrase directly from your training dataset, to make sure the custom model at least does better at such a question and is consistent with the level of verbosity provided in the fine-tuning data.

For this example, we enter the question, “What is the significance of the memory hierarchy in modern computer architectures?”

The fine-tuned Falcon-7B model responded succinctly, like you would expect from an FAQ document:

The memory hierarchy in modern computer architectures is the organization of memory storage within a computer system. The memory hierarchy is important because it determines how memory is accessed and used.

In contrast to the fine-tuned Falcon-7B, the base Falcon-7B model responded verbosely and with an odd beginning:

1 Answer | Add Yours [sic]

The memory hierarchy is the structure of the memory system in a computer system. It is a hierarchy because there are different levels of memory. The memory hierarchy is important because it determines how fast a computer can access memory.

The memory hierarchy is made up of levels of memory. The first level of memory is the main memory. This is the memory that is used for the data that is currently being processed. It is also used for the instructions that are currently being processed. The main memory is very fast and is able to access data very quickly.

The second level of memory is the cache memory. This is a level of memory that is much faster than the main memory. It is used to store data that is frequently accessed. It is also used to store instructions that are frequently accessed. The cache memory is much faster than the main memory.

The third level of memory is the disk memory. This is a level of memory that is much slower than the main memory and the cache memory. It is used to store data that is infrequently accessed. It is also used to store instructions that are infrequently accessed. The disk memory is much slower than the main memory and the cache memory.

The fourth level of memory is the secondary storage. This is a level of memory that is used to store data that is infrequently accessed. It is also used to store instructions that are infrequently accessed.

Let’s say you as a business user want to collaborate with your ML team on this model. You can send the model to your SageMaker model registry so the ML team can interact with the fine-tuned model in Amazon SageMaker Studio, as shown in the following screenshot.

Under the Add to Model Registry option, you can also see a View Notebook option. SageMaker Canvas offers a Python Jupyter notebook detailing your fine-tuning job, alleviating concerns about vendor lock-in associated with no-code tools and enabling detail sharing with data science teams for further validation and deployment.

Deploy the model with SageMaker

For production use, especially if you’re considering providing access to dozens or even thousands of employees by embedding the model into an application, you can deploy the model as an API endpoint. Complete the following steps to deploy your model:

On the SageMaker console, choose Inference in the navigation pane, then choose Models.
Locate the model with the prefix canvas-llm-finetuned- and timestamp.
Open the model details and note three things:
1. Model data location – A link to download the .tar file from Amazon S3, containing the model artifacts (the files created during the training of the model).
2. Container image – With this and the model artifacts, you can run inference virtually anywhere. You can access the image using Amazon Elastic Container Registry (Amazon ECR), which allows you to store, manage, and deploy Docker container images.
3. Training job – Stats from the SageMaker Canvas fine-tuning job, showing instance type, memory, CPU use, and logs.

Alternatively, you can use the AWS Command Line Interface (AWS CLI):

```bash

aws sagemaker list-models

```

The most recently created model will be at the top of the list. Make a note of the model name and the model ARN.

To start using your model, you must create an endpoint.

4. On the left navigation pane in the SageMaker console, under Inference, choose Endpoints.
Choose Create endpoint.
For Endpoint name, enter a name (for example, My-Falcon-Endpoint).
Create a new endpoint configuration (for this post, we call it my-fine-tuned-model-endpoint-config).
Keep the default Type of endpoint, which is Provisioned. Other options are not supported for SageMaker JumpStart LLMs.
Under Variants, choose Create production variant.
Choose the model that starts with canvas-llm-finetuned-, then choose Save.
In the details of the newly created production variant, scroll to the right to Edit the production variant and change the instance type to ml.g5.xlarge (see screenshot).
Finally, Create endpoint configuration and Create endpoint.

As described in Deploy Falcon-40B with large model inference DLCs on Amazon SageMaker, Falcon works only on GPU instances. You should choose the instance type and size according to the size of the model to be deployed and what will give you the required performance at minimum cost.

Alternatively, you can use the AWS CLI:

```
config_name="my-fine-tuned-model-endpoint-config"

aws sagemaker create-endpoint-config 
--endpoint-config-name $config_name 
--production-variants VariantName="cool-variant",ModelName="canvas-llm-finetuned-2024-01-16-20-11-13-119791",InstanceType="ml.g5.xlarge",InitialInstanceCount=1

aws sagemaker create-endpoint 
--endpoint-name "my-fine-tuned-model-endpoint" 
--endpoint-config-name $config_name
```

Use the model

You can access your fine-tuned LLM through the SageMaker API, AWS CLI, or AWS SDKs.

Enrich your existing software as a service (SaaS), software platforms, web portals, or mobile apps with your fine-tuned LLM using the API or SDKs. These let you send prompts to the SageMaker endpoint using your preferred programming language. Here’s an example:

```
import boto3
import json

# Create a SageMaker runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')

# Specify your endpoint name
endpoint_name = 'my-fine-tuned-model-endpoint'

def query_falcon_llm(question):
    """
    Function to query the fine-tuned Falcon LLM endpoint with a specific question.
    :param question: str, the question to ask the LLM.
    :return: str, the answer from the LLM.
    """
    # Define the prompt
    prompt = f"You are a helpful Assistant. You answer questions in the style of technical answers everything about GPUs and Machine Learning. User: {question}n Assistant:"

    # Define the payload with hyperparameters
    payload = {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "top_p": 0.7,
            "temperature": 0.5,
            "max_new_tokens": 1024,
            "repetition_penalty": 1.03,
            "stop": ["nUser:", "###"]
        }
    }

    # JSONify the payload
    payload_json = json.dumps(payload)

    # Call the SageMaker endpoint
    response = sagemaker_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                                 ContentType='application/json',
                                                 Body=payload_json)

    # Decode the response
    response_body = json.loads(response['Body'].read().decode())

    # Extract and format the answer
    assistant_response = response_body[0]["generated_text"][len(prompt):]
    assistant_response = assistant_response.replace("nUser:", "").replace("###", "").strip()

    return assistant_response

# Example usage
question = " What is the significance of the memory hierarchy in modern computer architectures?"
answer = query_falcon_llm(question)
print(f"Question: {question}nAnswer: {answer}")


```

For examples of invoking models on SageMaker, refer to the following GitHub repository. This repository provides a ready-to-use code base that lets you experiment with various LLMs and deploy a versatile chatbot architecture within your AWS account. You now have the skills to use this with your custom model.

Another repository that may spark your imagination is Amazon SageMaker Generative AI, which can help you get started on a number of other use cases.

Clean up

When you’re done testing this setup, delete your SageMaker endpoint to avoid incurring unnecessary costs:

```

aws sagemaker delete-endpoint --endpoint-name "your-endpoint-name"

```

After you finish your work in SageMaker Canvas, you can either log out or set the application to automatically delete the workspace instance, which stops billing for the instance.

Conclusion

In this post, we showed you how SageMaker Canvas with SageMaker JumpStart models enable you to fine-tune LLMs to match your company’s tone and style with minimal effort. By fine-tuning an LLM on company-specific data, you can create a language model that speaks in your brand’s voice.

Fine-tuning is just one tool in the AI toolbox and may not be the best or the complete solution for every use case. We encourage you to explore various approaches, such as prompting, RAG architecture, continued pre-training, postprocessing, and fact-checking, in combination with fine-tuning to create effective AI solutions that meet your specific needs.

Although we used examples based on a sample dataset, this post showcased these tools’ capabilities and potential applications in real-world scenarios. The process is straightforward and applicable to various datasets, such as your organization’s FAQs, provided they are in CSV format.

Take what you learned and start brainstorming ways to use language models in your organization while considering the trade-offs and benefits of different approaches. For further inspiration, see Overcoming common contact center challenges with generative AI and Amazon SageMaker Canvas and New LLM capabilities in Amazon SageMaker Canvas, with Bain & Company.

About the Author

Yann Stoneman is a Solutions Architect at AWS focused on machine learning and serverless application development. With a background in software engineering and a blend of arts and tech education from Juilliard and Columbia, Yann brings a creative approach to AI challenges. He actively shares his expertise through his YouTube channel, blog posts, and presentations.

Building commonsense knowledge graphs to aid product recommendation

Using large language models to discern commonsense relationships can improve performance on downstream tasks by as much as 60%.Read More

Through the Wormhole: Media.Monks’ Vision for Enhancing Media and Marketing With AI

Meet Media.Monks’ Wormhole, an alien-like, conversational robot with a quirky personality and the ability to offer keen marketing expertise. Lewis Smithingham, senior vice president of innovation and special ops at Media.Monks, a global marketing and advertising company, discusses the creation of Wormhole and AI’s potential to enhance media and entertainment with host Noah Kravitz in this AI Podcast episode recorded live at the NVIDIA GTC global AI conference. Wormhole was designed to showcase Monks.Flow, an AI-powered platform that streamlines marketing and content creation workflows. Smithingham delves into Media.Monks’ platforms for media, entertainment and advertising and speaks to its vision for a future where AI enhances creativity and allows for more personalized, scalable content creation.

Stay tuned for more episodes recorded live from GTC, and hear more from Smithingham in this GTC interview.

The AI Podcast · Media.Monks’ Lewis Smithingham on Enhancing Media and Marketing With AI – Ep. 222

Time Stamps

1:45: What is Media.Monks?
6:23: Description of Wormhole
8:49: Possible use cases for Wormhole
10:21: Takeaways from developing Wormhole
12:02: What is Monks.Flow?
16:54: Response from creatives on using AI in their work
21:23: Smithingham’s outlook on hyperpersonalized content
34:24: What’s next for the future of AI-powered media?

You Might Also Like…

Exploring Filmmaking With Cuebric’s AI: Insights From Pinar Seyhan Demirdag – Ep. 214

In today’s episode of NVIDIA’s AI Podcast, host Noah Kravitz talks with Pinar Seyhan Demirdag, co-founder and CEO of Cuebric. Cuebric is on a mission to offer new solutions in filmmaking and content creation through immersive, two-and-a-half-dimensional cinematic environments.

Deepdub’s Ofir Krakowski on Redefining Dubbing From Hollywood to Bollywood – Ep. 202

On the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Deepdub’s cofounder and CEO, Ofir Krakowski. Deepdub uses AI-driven dubbing to help entertainment companies boost efficiency and cut costs while increasing accessibility.

WSC Sports’ Amos Bercovich on How AI Keeps the Sports Highlights Coming – Ep. 183

On this episode of the AI Podcast, host Noah Kravitz spoke with Amos Bercovich, algorithm group leader at WSC Sports, makers of an AI cloud platform that enables over 200 sports organizations worldwide to generate personalized and customized sports videos automatically and in real time.

Maya Ackerman on LyricStudio, an AI-Based Writing Songwriting Assistant – Ep. 153

Lennon and McCartney. Ashford and Simpson. Many of our all-time favorite tunes have come from songwriting duos. Now, anyone can find a snazzy compositional partner in AI. In this episode of the AI Podcast, Maya Ackerman, CEO of WaveAI, spoke with host Noah Kravtiz about WaveAI’s LyricStudio software, an AI-based lyric and poetry writing assistant.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

How LotteON built dynamic A/B testing for their personalized recommendation system

This post is co-written with HyeKyung Yang, Jieun Lim, and SeungBum Shim from LotteON.

LotteON is transforming itself into an online shopping platform that provides customers with an unprecedented shopping experience based on its in-store and online shopping expertise. Rather than simply selling the product, they create and let customers experience the product through their platform.

LotteON has been providing various forms of personalized recommendation services throughout the LotteON customer journey and across its platform, from its main page to its shopping cart and order completion pages. Through the development of new, high-performing models and continuous experimentation, they’re providing customers with personalized recommendations, improving CTR (click-through rate) metrics and increasing customer satisfaction.

In this post, we show you how LotteON implemented dynamic A/B testing for their personalized recommendation system.

The dynamic A/B testing system monitors user reactions, such as product clicks, in real-time from the recommended item lists provided. It dynamically assigns the most responsive recommendation model among multiple models to enhance the customer experience with the recommendation list. Using Amazon SageMaker and AWS services, these solutions offer insights into real-world implementation know-how and practical use cases for deployment.

Defining the business problem

In general, there are two types of A/B testing that are useful for measuring the performance of a new model: offline testing and online testing. Offline testing evaluates the performance of a new model based on past data. Online A/B testing, also known as split testing, is a method used to compare two versions of a webpage, or in LotteON’s case, two recommendation models, to determine which one performs better. A key strength of online A/B testing is its ability to provide empirical evidence based on user behavior and preferences. This evidence-based approach to selecting a recommendation model reduces guesswork and subjectivity in optimizing both click-through rates and sales.

A typical online A/B test serves two models in a certain ratio (such as 5:5) for a fixed period of time (for example, a day or a week). When one model performs better than the other, the lower performing model is still served for the duration of the experiment, regardless of its impact on the business. To improve this, LotteON turned to dynamic A/B testing, which evaluates the performance of models in real time and dynamically updates the ratios at which each model is served, so that better performing models are served more often. To implement dynamic A/B testing, they used the multi-armed bandit (MAB) algorithm, which performs real-time optimizations.

LotteON’s dynamic A/B testing automatically selects the model that drives the highest click-through rate (CTR) on their site. To build their dynamic A/B testing solution, LotteON used AWS services such as Amazon SageMaker and AWS Lambda. By doing so, they were able to reduce the time and resources that would otherwise be required for traditional forms of A/B testing. This frees up their scientists to focus more of their time on model development and training.

Solution and implementation details

The MAB algorithm evolved from casino slot machine profit optimization. MAB’s usage method differs in selection (arm) from the existing method, which is widely used to re-rank news or products. In this implementation the selection (the arm) in MAB must be a model. There are various MAB algorithms such as ε-greedy and Thompson sampling.

The ε-greedy algorithm balances exploration and exploitation by choosing the best-known option most of the time, but randomly exploring other options with a small probability ε. Thompson sampling involves defining the β distribution for each option, with parameters alpha (α) representing the number of successes so far and beta (β) representing failures. As the algorithm collects more observations, alpha and beta are updated, shifting the distributions toward the true success rate. The algorithm then randomly samples from these distributions to decide which option to try next—balancing exploitation of the best-performing options to-date with exploration of less-tested options. In this way, MAB learns which model is best based on actual outcomes.

Based on LotteON’s evaluation of both ε-greedy and Thompson sampling, which considered the balance of exposure opportunities of the models under test, they decided to use Thompson sampling. Based on the number of clicks obtained, they were able to derive an efficiency model. For a hands-on workshop on dynamic A/B testing with MAB and Thompson sampling algorithms, see Dynamic A/B Testing on Amazon Personalize & SageMaker Workshop. LotteON’s goal was to provide real-time recommendations for high CTR efficient models.

With the option (arm) configured as a model, and the alpha value for each model configured as a click, the beta value for each model was configured as a non-click. To apply the MAB algorithm to actual services, they introduced the bTS (batched Thompson sampling) method, which processes Thompson sampling on a batch basis. Specifically, they evaluated models based on traffic over a certain period of time (24 hours), and updated parameters at a certain time interval (1 hour).

In the handler part of the Lambda function, a bTS operation is performed that reflects the parameter values for each model (arm), and the click probabilities of the two models are calculated. The ID of the model with the highest probability of clicks is then selected. One thing to keep in mind when conducting dynamic A/B testing is not to start Thompson sampling right away. You should allow warm-up time for sufficient exploration. To avoid prematurely determining the winner due to small parameter values at the beginning of the test, you must collect an adequate number of impressions or click-metrics.

Dynamic A/B test architecture

The following figure shows the architecture for the dynamic A/B test that LotteON implemented.

The architecture in the preceding figure shows the data flow of Dynamic A/B testing and consists of the following four decoupled components:

1. MAB serving flow

Step 1: The user accesses LotteON’s recommendation page.

Step 2: The recommendations API checks MongoDB for information about ongoing experiments with recommendation section codes and, if the experiment is active, sends an API request with the member ID and section code to the Amazon API Gateway.

Step 3: API Gateway provides the received data to Lambda. If there is relevant data in the API Gateway cache, a specific model code in the cache is immediately passed to the recommendation API.

Step 4: The Lambda function checks the experiment type (that is, dynamic A/B test or online static A/B test) in MongoDB and runs its algorithm. If the experiment type is dynamic A/B test, the alpha (number of clicks) and beta (number of non-clicks) required for the Thompson sampling algorithm are retrieved from MongoDB, the values are obtained, and the Thompson sampling algorithm is run. Through this, the selected model’s identifier is delivered to Amazon API Gateway by the Lambda function.

Step 5: API Gateway provides the selected model’s identifier to the recommended API and caches the selected model’s identifier for a certain period of time.

Step 6: The recommendation API calls the model inference server (that is, the SageMaker endpoint) using the selected model’s identifier to receive a recommendation list and provides it to the user’s recommendation web page.

2. The flow of an alpha and beta parameter update

Step 1: The system powering LotteON’s recommendation page stores real-time logs in Amazon S3.

Step 2: Amazon EMR downloads the logs stored in Amazon S3.

Step 3: Amazon EMR processes the data and updates the alpha and beta parameter values to MongoDB for use in the Thompson sampling algorithm.

3. The flow of business metrics monitoring

Step 1: Streamlit pulls experimental business metrics from MongoDB to visualize.

Step 2: Monitor efficiency metrics such as CTR per model over time.

4. The flow of system operation monitoring

Step 1: When a recommended API call occurs, API Gateway and Lambda are launched, and Amazon CloudWatch logs are produced.

Step 2: Check system operation metrics using CloudWatch and AWS X-Ray dashboards based on CloudWatch logs.

Implementation Details 1: MAB serving flow mainly involving API Gateway and Lambda

The APIs that can serve MAB results—that is, the selected model—are implemented using serverless compute services, Lambda, and API Gateway. Let’s take a look at the implementation and settings.

1. API Gateway configuration

When a LotteON user signs in to the recommended product area, member ID, section code, and so on are passed to API Gateway as GET parameters. Using the passed parameters, the selected model can be used for inferencing during a certain period of time through the cache function of Amazon API Gateway.

2. API Gateway cache settings

Setting up a cache in API Gateway is straightforward. To set up the cache, first enable it by selecting the appropriate checkbox under the Settings tab for your chosen stage. After it’s activated, you can define the cache time-to-live (TTL), which is the duration in seconds that cached data remains valid. This value can be set anywhere up to a maximum of 3,600 seconds.

The API Gateway caching feature is limited to the parameters of GET requests. To use caching for a particular parameter, you should insert a query string in the GET request’s query parameters within the resource. Then select the Enable API Cache option. It is essential to deploy your API using the deploy action in the API Gateway console to activate the caching function.

After the cache is set, the same model is used for inference on specific customers until the TTL has elapsed. Following that, or when the recommendation section is first exposed, API Gateway will call Lambda with the MAB function implemented.

3. Add an API Gateway mapping template

When a Lambda handler function is invoked, it can receive the HTTPS request details from API Gateway as an event parameter. To provide a Lambda function with more detailed information, you can enhance the event payload using a mapping template in the API Gateway. This template is part of the integration request setup, which defines how incoming requests are mapped to the expected format of the Lambda function.

The specified parameters are then passed to the Lambda function’s event parameters. The following code is an example of source code that uses the event parameter in Lambda.

def lambda_handler (event, context):
    event_param = event ["name"]
    return {
        'message': event_param
    }

4. Lambda for Dynamic A/B Test

Lambda receives a member ID and section code as event parameter values. The Lambda function uses the received section code to run the MAB algorithm. In the case of the MAB algorithm, a dynamic A/B test is performed by getting the model (arm) settings and aggregated results. After updating the alpha and beta values according to bTS when reading the aggregated results, the probability of a click for each model is obtained through the beta distribution (see the following code), and the model with the maximum value is returned. For example, given model A and model B, where model B has a higher probability of producing a click-through event, model B is returned.

def select_variant (self): 
    probs = []
    for v in self.variant_metrics:
        success = v["mab_alpha”]
        failure = v["mab_beta”]
        probs.append(AlgorithmBase.random_beta(1 + success, 1 + failure)) 

    variant_index = AlgorithmBase.argmax(probs) 

    return (self.variant_metrics [variant_index] ["variant_name"], probs)

The overall implementation using the bTS algorithm, including the above code, was based on the Dynamic A/B testing for machine learning models with Amazon SageMaker MLOps projects post.

Implementation details 2: Alpha and beta parameter update

A product recommendation list is displayed to the LotteON user. When the user clicks on a specific product in the recommendation list, that data is captured and logged to Amazon S3. As shown in the following figure, LotteON used AWS EMR to perform Spark Jobs that periodically pulled the logged data from S3, processed the data, and inserted the results into MongoDB.

The results generated at this stage play a key role in determining the distribution used in MAB. The following impression and click data were examined in detail.

Impression and click data

Note: Before updating the alpha and beta parameters in bTS, verify the integrity and completeness of log data, including impressions and clicks from the recommendation section.

Implementation details 3: Business metrics monitoring

To assess the most effective model, it’s essential to monitor business metrics during A/B testing. For this purpose, a dashboard was developed using Streamlit on an Amazon Elastic Compute Cloud (Amazon EC2) environment.

Streamlit is a Python library can be used to create web apps for data analysis. LotteON added the necessary Python package information for dashboard configuration to the requirements.txt file, specifying Streamlit version 1.14.1, and proceeded with the installation as demonstrated in the following:

 $ python3 -m pip install --upgrade pip 
 $ pip3 install -r requirements.txt

The default port provided by Streamlit is 8501, so it’s required to set the inbound custom TCP port 8501 to allow access to the Streamlit web browser.

When setup is complete, use the streamlit run pythoncode.py command in the terminal, where pythoncode.py is the Python script containing the Streamlit code to run the application. This command launches the Streamlit web interface for the specified application.

import streamlit as st 
    st.title ('streamlit example')

LotteON created a dashboard based on Streamlit. The functionality of this organized dashboard includes monitoring simple business metrics such as model trends over time, daily and real-time winner models, as shown in the following figure.

The dashboard allowed LotteON to analyze the business metrics of the model and check the service status in real time. It also monitored the effectiveness of model version updates and reduced the time to check the service impact of the retraining pipeline.

The following shows an enlarged view of the cumulative CTR of the two models (EXP-01-APS002-01 model A, EXP-01-NCF-01 model B) on the testing day. Let’s take a look at each model to see what that means. Model A provided customers with 29,274 recommendation lists that received 1,972 product clicks and generated a CTR of 6.7 percent (1,972/29,274).

Model B, on the other hand, served 7,390 recommended lists, received 430 product clicks, and generated a CTR of 5.8 percent (430/7,390). Alpha and beta parameters, the number of clicks and the number of non-clicks respectively, of each model were used to set the beta distribution. Model A’s alpha parameter was 1972 (number of clicks) and its beta parameter was 27,752 (number of non-clicks [29,724 – 1,972]). Model B’s alpha parameter was 430 (number of clicks) and its beta parameter was 6,960 (number of non-clicks). The larger the X-axis value corresponding to the peak in the beta distribution graph, the better the performance (CTR) model.

In the following figure, model A (EXP-01-APS002-01) shows better performance because it’s further to the right in relation to the X axis. This is also consistent with the CTR rates of 6.7 percent and 5.8 percent.

Implementation details 4: System operation monitoring with CloudWatch and AWS X-Ray

You can enable CloudWatch settings, custom access logging, and AWS X-Ray tracking features from the Logs/Tracking tab in the API Gateway menu.

CloudWatch settings and custom access logging

In the configuration step, you can change the CloudWatch Logs type to set the logging level, and after activating detailed indicators, you can check detailed metrics such as 400 errors and 500 errors. By enabling custom access logs, you can check which IP accessed the API and how.

Additionally, the retention period for CloudWatch Logs must be specified separately on the CloudWatch page to avoid storing them indefinitely.

If you select API Gateway from the CloudWatch Explorer list, you can view the number of API calls, latency, and cache hits and misses on a dashboard. Find the Cache Hit Rate as shown in the following formula and check the effectiveness of the cache on the dashboard.

Cache Hit Rate = CacheHitCount / (CacheHitCount + CacheMissCount)

By selecting Lambda as the log group in the CloudWatch Logs Insights menu, you can verify the actual model code returned by Lambda, where MAB is performed, to check whether the sampling logic is working and branch processing is being performed.

fields @timestamp, @message, @logStream, @log 
 | filter @message like 'Model A' or message like 'Model B' 
 | stats count (*) by @message

As shown in the preceding image, LotteON observed how often the two models were called by the Lambda function during the A/B test. Specifically, the model labeled LF001-01 (the champion model) was invoked 4,910 times, while the model labeled NCF-02 (the challenger model) was invoked 4,905 times. These numbers represent the degree to which each model was selected in the experiment.

AWS X-Ray

If you enable the X-Ray trace feature, trace data is sent from the enabled AWS service to X-Ray and the visualized API service flow can be monitored from the service map menu in the X-Ray section of the CloudWatch page.

As shown in the preceding figure, you can easily track and monitor latency, number of calls, and number of HTTP call status for each service section by choosing the API Gateway icon and each Lambda node.

There was no need to store performance metrics for a long time because most for Lambda functions metrics are analyzed within a week and aren’t used afterward. Because data from X-Ray is stored for 30 days by default, which is enough time to use the metrics, the data was used without changing the storage cycle. (For more information, see the AWS X-Ray FAQs.)

Conclusion

In this post, we explained how Lotte ON builds and uses a dynamic A/B testing environment. Through this project, Lotte ON was able to test the model’s performance in various ways online by combining dynamic A/B testing with the MAB function. It also allows comparison of different types of recommendation models and is designed to be comparable across model versions, facilitating online testing.

In addition, data scientists could concentrate on improving model performance and training as they can check metrics and system monitoring instantly. The dynamic A/B testing system was initially developed and applied to the LotteON main page, and then expanded to the main page recommendation tab and product detail recommendation section. Because the system is able to evaluate online performance without significantly reducing the click-through rate of existing models, we have been able to conduct more experiments without impacting users.

Dynamic A/B Test exercises can also be found in AWS Workshop – Dynamic A/B Testing on Amazon Personalize & SageMaker.

About the Authors

HyeKyung Yang is a research engineer in the Lotte E-commerce Recommendation Platform Development Team and is in charge of developing ML/DL recommendation models by analyzing and utilizing various data and developing a dynamic A/B test environment.

Jieun Lim is a data engineer in the Lotte E-commerce Recommendation Platform Development Team and is in charge of operating LotteON’s personalized recommendation system and developing personalized recommendation models and dynamic A/B test environments.

SeungBum Shim is a data engineer in the Lotte E-commerce Recommendation Platform Development Team, responsible for discovering ways to use and improve recommendation-related products through LotteON data analysis, and developing MLOps pipelines and ML/DL recommendation models.

Jesam Kim is an AWS Solutions Architect and helps enterprise customers adopt and troubleshoot cloud technologies and provides architectural design and technical support to address their business needs and challenges, especially in AIML areas such as recommendation services and generative AI.

Gonsoo Moon is an AWS AI/ML Specialist Solutions Architect and provides AI/ML technical support. His main role is to collaborate with customers to solve their AI/ML problems based on various use cases and production experience in AI/ML.

Unleashing the power of generative AI: Verisk’s journey to an Instant Insight Engine for enhanced customer support

This post is co-written with Tom Famularo, Abhay Shah and Nicolette Kontor from Verisk.

Verisk (Nasdaq: VRSK) is a leading data analytics and technology partner for the global insurance industry. Through advanced analytics, software, research, and industry expertise across over 20 countries, Verisk helps build resilience for individuals, communities, and businesses. The company is committed to ethical and responsible AI development, with human oversight and transparency. Verisk is using generative artificial intelligence (AI) to enhance operational efficiencies and profitability for insurance clients while adhering to its ethical AI principles.

Verisk’s FAST platform is a leader in the life insurance and retirement sector, providing enhanced efficiency and flexible, easily upgradable architecture. FAST has earned a fourth consecutive leader ranking in the 2024 ISG Provider Lens report for its seamless integration with Verisk’s data, analytics, and claims tools. The software as a service (SaaS) platform offers out-of-the-box solutions for life, annuity, employee benefits, and institutional annuity providers. With preconfigured components and platform configurability, FAST enables carriers to reduce product time-to-market by 75% and launch new offerings in as little as 2 months.

In this post, we describe the development of the customer support process in FAST incorporating generative AI, the data, the architecture, and the evaluation of the results. Conversational AI assistants are rapidly transforming customer and employee support. Verisk has embraced this technology and has developed their own Instant Insight Engine, or AI companion, that provides an enhanced self-service capability to their FAST platform.

The Opportunity

Verisk FAST’s initial foray into using AI was due to the immense breadth and complexity of the platform. With hundreds of thousands of hours spent on customer support every year, it became abundantly clear they needed help to scale their efforts and meet their objectives. Verisk’s talented teams were overloaded handling common inquiries, leaving less time for the type of innovation that would allow them to maintain the pole position as insurance technology providers.

Verisk FAST’s AI companion aims to alleviate this burden by not only providing 24/7 support for business processing and configuration questions related to FAST, but also tapping into the immense knowledge base to provide an in-depth, tailored response. It is designed to be deeply integrated into the FAST platform and use all of Verisk’s documentation, training materials, and collective expertise. It relies on a Retrieval Augmented Generation (RAG) approach and a mix of AWS services and proprietary configuration to instantly answer most user questions about the Verisk FAST platform’s extensive capabilities.

When the AI companion is rolled out at scale, it will allow Verisk’s staff to focus more time on complex problems, critical initiatives, and innovation while delivering a better customer experience. As part of the build-out, Verisk came across several considerations, key findings, and decisions worth sharing for any enterprise looking to take the first step in tapping into generative AI’s potential.

The Approach

When building an interactive agent with large language models (LLMs), there are often two techniques that can be used: RAG and fine-tuning. The choice between these approaches depends on the use case and available dataset. Verisk FAST started building a RAG pipeline for their AI companion and have iteratively enhanced this solution. The following are some of the reasons why continuing with a RAG architecture made sense to Verisk:

Access to Dynamic Data – The FAST platform is a constantly evolving platform adding both business functionality and technical capabilities. Verisk needed to make sure their responses were always based on the most up-to-date information. The RAG approach allows for accessing frequently updated data, enabling responses using the most recent information without frequent retraining of the model.
Multiple Data Sources – In addition to recency of data, another important aspect was the ability to tap into multiple different data sources to retrieve the right context. These data sources may be both internal and external to provide a more holistic response. The ease of expanding the knowledge domain without the need to fine-tune with new data sources makes the solution extensible.
Reduce Hallucination – Retrieval reduces the risk of hallucination compared to free-form text generation because responses derive directly from the provided excerpts.
LLM Linguistics – Although appropriate context can be retrieved from enterprise data sources, the underlying LLM handles linguistics and fluency.
Transparency – Verisk wants to continuously improve the AI companion’s ability to generate responses. A RAG architecture gave them the transparency needed into the context retrieval process, information that would ultimately be used for generating user responses. Having that transparency helped Verisk identify areas of the system where their documents were lacking and needed some restructuring.
Data governance – With a wide variety of users accessing the platform and with different users having access to different data, data governance and isolation was paramount. Verisk injected controls into the RAG pipeline that restricted access to data based on user access controls, making sure responses were highly tuned to the user.

Although both RAG and fine-tuning have trade-offs, RAG was the optimal approach for building an AI companion on the FAST platform given their requirements for real-time accuracy, explainability, and configurability. The pipeline architecture allows for iterative enhancement as Verisk FAST’s use cases evolve.

Solution Overview

The following diagram presents a high-level architectural data flow highlighting several of the AWS services used in building the solution. Verisk’s solution represents a compound AI system, involving multiple interacting components and making numerous calls to the LLM to furnish responses to the user. Using the FAST platform for orchestrating these diverse components proved to be an intuitive choice, circumventing certain challenges encountered with alternative frameworks such as LangChain.

The key components are as follows:

Amazon Comprehend

To bolster security, Verisk aimed to block the submission of personally identifiable information (PII) within user questions. Although PII isn’t typically necessary for interactions with the AI companion, Verisk employed Amazon Comprehend to detect any potential PII within queries.

Amazon Kendra

In designing an effective RAG solution, one of the most critical steps is the context retrieval from enterprise documentation. Although many options exist to store embeddings, Verisk FAST opted to use Amazon Kendra due to its powerful out-of-the-box semantic search capabilities. As a fully managed service, Verisk took advantage of its deep-learning search models without additional provisioning. Verisk compared using Amazon OpenSearch Serverless with several embedding approaches and Amazon Kendra, and saw better retrieval results with Amazon Kendra. As you’ll see further in the post, Verisk incorporated the Retrieve API and the Query API to retrieve semantically relevant passages for their queries to further improve generation by the LLM.

Amazon Bedrock

Anthropic Claude, available in Amazon Bedrock, played various roles within Verisk’s solution:

Response Generation – When building their AI companion, Verisk thoroughly evaluated the LLM options from leading providers, using their dataset to test each model’s comprehension and response quality. After this extensive testing, Verisk found Anthropic’s Claude model consistently outperformed across key criteria. Claude demonstrated superior language understanding in Verisk’s complex business domain, allowing more pertinent responses to user questions. It also did exceedingly well at SQL generation, better than any other model they tested. Given Claude’s standout results across Verisk FAST’s use cases, it was the clear choice to power their AI companion’s natural language capabilities.
Preprocessing of Images and Videos – The outputs from Amazon Rekognition and Amazon Transcribe were fed into Claude. Claude demonstrated remarkable capabilities in generating natural language descriptions, which could be effectively used for indexing purposes with Amazon Kendra. Additionally, Claude excelled at summarizing video transcriptions into concise segments corresponding to specific time intervals, enabling the display of videos at precise points. This combination of AWS services and Claude’s language processing capabilities facilitated a more intuitive and user-friendly experience for media exploration and navigation.
Relevance Ranking – Although Amazon Kendra returned confidence scores on search results, Verisk needed to further tune the search results for Query API calls for a few scenarios. Verisk was able to use Claude to rank the relevance of search results from Amazon Kendra, further improving the results returned to the user.
Tool Identification – Verisk used Claude to determine the most suitable techniques, whether API calls or SQL queries, for retrieving data from the operational database based on user requests. Furthermore, Claude generated SQL queries tailored to the provided schemas, enabling efficient data retrieval.
Conversation Summarization – When a user asks a follow-up question, the AI companion can continue the conversational thread. To enable this, Verisk used Claude to summarize the dialogue to update the context from Amazon Kendra. The full conversation summary and new excerpts are input to the LLM to generate the next response. This conversational flow allows the AI compan to answer user follow-up questions and have a more natural, contextual dialogue, bringing Verisk FAST closer to having a true AI assistant that can engage in useful back-and-forth conversations with users.

Amazon Rekognition

Primarily used for processing images containing text and process flow diagrams, the pre-trained features of Amazon Rekognition facilitated information extraction. The extracted data was then passed to Claude for transformation into a more natural language format suitable for indexing within Amazon Kendra.

Amazon Transcribe

Similar to Amazon Rekognition, Amazon Transcribe was employed to preprocess videos and generate transcripts, with a notable feature being the masking of sensitive information. The verbose transcripts, along with timestamps, were condensed using Claude before being indexed into Amazon Kendra.

Prompt Template Warehouse

Central to the solution was the dynamic selection of templates to create prompts based on question classification. Substantial effort was invested in developing and continuously improving these prompt templates.

Throughout Verisk’s journey, they worked closely with the AWS Solutioning team to brainstorm concrete suggestions to enhance the overall solution.

Data Harvesting

Before Verisk started building anything in the platform, they spent weeks amassing information, initially in the form of questions and answers. Verisk FAST’s initial dataset comprised 10,000 questions and their corresponding answers, meticulously collected and vetted to confirm accuracy and relevance. However, they understood that this was not a one-and-done effort. Verisk needed to continually expand its knowledge base by identifying new data sources across the business.

Driven by this, Verisk diligently added 15,000 more questions, making sure they covered less frequently encountered scenarios. Verisk also added user guides, technical documentation, and other text-based information. This data spanned several categories, from business processing to configuration to their delivery approach. This enriched the AI companion’s knowledge and understanding of diverse user queries, enabling it to provide more accurate and insightful responses.

The Verisk FAST team also recognized the necessity of exploring additional modalities. Videos and images, particularly those illustrating process flows and information sharing videos, proved to be invaluable sources of data. During the initial rollout phase, it became evident that certain inquiries demanded real-time data retrieval from their operational data store. Through some slick prompt engineering and using Claude’s latest capabilities to invoke APIs, Verisk seamlessly accessed their database to procure real-time information.

Structuring and Retrieving the Data

An essential element in developing the AI companion’s knowledge base was properly structuring and effectively querying the data to deliver accurate answers. Verisk explored various techniques to optimize both the organization of the content and the methods to extract the most relevant information:

Chunking – One key step in preparing the accumulated questions and answers was splitting the data into individual documents to facilitate indexing into Amazon Kendra. Rather than uploading a single large file containing all 10,000 question-answer pairs, Verisk chunked the data into 10,000 separate text documents, with each document containing one question-answer pair. By splitting the data into small, modular documents focused on a single question-answer pair, Verisk could more easily index each document and had greater success in pulling back the correct context. Chunking the data also enabled straightforward updating and reindexing of the knowledge base over time. Verisk applied the same technique to other data sources as well.
Selecting the Right Number of Results – Verisk tested configuring Amazon Kendra to return different numbers of results for each question query. Returning too few results ran the risk of not capturing the best answer, whereas too many results made it more difficult to identify the right response. Verisk found returning the top three matching results from Amazon Kendra optimized both accuracy and performance.
Multi-step Query – To further improve accuracy, Verisk implemented a multi-step query process. First, they used the Amazon Kendra Retrieve API to get multiple relevant passages and excerpts based on keyword search. Next, they took a second pass at getting excerpts through the Query API, to find any additional shorter documents that might have been missed. Combining these two query types enabled Verisk to reliably identify the correct documentation and excerpts to generate a response.
Relevance Parameters – Verisk also tuned relevance parameters in Amazon Kendra to weigh their most up-to-date documentation higher than others. This improved results over just generic text search.

By thoroughly experimenting and optimizing both the knowledge base powering their AI companion and the queries to extract answers from it, Verisk was able to achieve very high answer accuracy during the proof of concept, paving the way for further development. The techniques they explored—multi-stage querying, tuning relevance, enriching data—became core elements of their approach for extracting quality automated answers.

LLM Parameters and Models

Experimenting with prompt structure, length, temperature, role-playing, and context was key to improving the quality and accuracy of the AI companion’s Claude-powered responses. The prompt design guidelines provided by Anthropic were incredibly helpful.

Verisk crafted prompts that provided Claude with clear context and set roles for answering user questions. Setting the temperature to 0.5 helped reduce randomness and repetition in the generated responses.

Verisk also experimented with different models to improve the efficiency of the overall solution. Although Claude 3 models like Sonnet and Haiku did a great job at generating responses, as part of the overall solution, Verisk didn’t always need the LLM to generate text. For scenarios that required identification of tools, Claude Instant was a better suited model due to its quicker response times.

Metrics, Data Governance, and Accuracy

A critical component of Verisk FAST’s AI companion and its usefulness is their rigorous evaluation of its performance and the accuracy of its generated responses.

As part of the proof of concept in working with the Amazon Generative AI Innovation Center, Verisk came up with 100 questions to evaluate the accuracy and performance of the AI companion. Central to this process was crafting questions designed to assess the bot’s ability to comprehend and respond effectively across a diverse range of topics and scenarios. These questions spanned a variety of topics and varying levels of difficulty. Verisk wanted to make sure their AI companion provided accurate responses to frequently asked questions and could demonstrate proficiency in handling nuanced and less predictable or straightforward inquiries. The results provided invaluable insights into RAG’s strengths and areas for improvement, guiding Verisk’s future efforts to refine and enhance its capabilities further.

After Verisk integrated their AI companion into the platform and began testing it with real-world scenarios, their accuracy rate was approximately 40%. However, within a few months, it rapidly increased to over 70% because of all the data harvesting work, and the accuracy continues to steadily improve each day.

Contributing to the AI companion’s rising accuracy is Verisk’s evaluation heat map. This provides a visual representation of the documentation available across 20 topics that comprehensively encompasses the Verisk FAST platform’s capabilities. This is compared against the volume of inquiries within each specific topic segment and the health of the generated responses in each.

This visualized data allows the Verisk FAST team to effortlessly identify gaps. They can quickly see which capability the AI companion currently struggles with against where user questions are most focused on. The Verisk team can then prioritize expanding its knowledge in these areas through additional documentation, training data, research materials, and testing.

Business Impact

Verisk initially rolled out the AI companion to one beta customer to demonstrate real-world performance and impact. Supporting a customer in this way is a stark contrast to how Verisk has historically engaged with and supported customers in the past, where they would typically have a team allocated to interact with the customer directly. Now only a fraction of the time a person would usually spend is needed to review submissions and adjust responses. Verisk FAST’s AI companion has helped them cost-effectively scale while still providing high-quality assistance.

In analyzing this early usage data, Verisk uncovered additional areas they can drive business value for their customers. As they collect additional information, this data will help them uncover what will be needed to improve results and prepare for a wider rollout.

Ongoing development will focus on expanding these capabilities, prioritized based on the collected questions. Most exciting, though, are the new possibilities on the horizon with generative AI. Verisk knows this technology is rapidly advancing, and they are eager to harness innovations to bring even more value to their customers. As new models and techniques emerge, Verisk plans to adapt their AI companion to take advantage of the latest capabilities. Although the AI companion currently focuses on responding to user questions, this is only the starting point. Verisk plans to quickly improve its capabilities to proactively make suggestions and configure functionality directly in the system itself. The Verisk FAST team is inspired by the challenge of pushing the boundaries of what is possible with generative AI and is excited to test the limits of what’s possible.

Conclusion

Verisk’s journey in developing an AI companion for their FAST platform showcases the immense potential of generative AI to transform customer support and drive operational efficiencies. By meticulously harvesting, structuring, and retrieving data, and leveraging large language models, semantic search capabilities, and rigorous evaluation processes, Verisk has created a robust solution that provides accurate, real-time responses to user inquiries. As Verisk continues to expand the AI companion’s capabilities while adhering to ethical and responsible AI development practices, they are poised to unlock greater value for customers, enable staff to focus on innovation, and set new standards for customer support in the insurance industry.

For more information, see the following resources:

Explore generative AI on AWS
Learn about Unlocking the business value of Generative AI
Learn more about Anthropic Claude 3 models on Amazon Bedrock
Learn about Amazon Bedrock and how to build and scale generative AI applications with foundation models
Generative AI Quickstart POCs

About the Authors

Tom Famularo was Co-Founder/CEO or FAST and lead’s Verisk Life Solutions, based in NJ. Tom is responsible for platform strategy, data/analytics, AI and Verisk’s life/annuity customers. His focus and passion are for teaching customers and team members how to allow technology to enable business outcomes with far less human effort. Outside of work, he’s an avid fan of his son’s baseball and football teams.

Abhay Shah leads engineering efforts for the FAST Platform at Verisk – Life Solutions, where he offers guidance on architecture and provides technical leadership for Customer Implementations and Product Development. With over two decades of experience in the technology sector, Abhay helps insurance carriers maximize the value of their ecosystem through modern technology and is excited by the opportunities that AI provides. Beyond his professional passion, he enjoys reading, traveling, and coaching the middle school robotics team.

Nicolette Kontor is a technology enthusiast who thrives on helping customers embrace digital transformation. In her current role at Verisk – Life Solutions, she spearheads the application of artificial intelligence to the FAST Platform, which she finds tremendously rewarding and exciting. With over 10 years of experience in major customer implementations and product development, Nicolette is driven to deliver innovative solutions that unlock value for insurance carriers. Beyond her professional pursuits, Nicolette is an avid traveler, having explored 39 countries to date. She enjoys winning trivia, reading mystery novels, and learning new languages.

Ryan Doty is a Sr. Solutions Architect at AWS, based out of New York. He helps enterprise customers in the Northeast U.S. accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.

Tarik Makota is a Senior Principal Solutions Architect with Amazon Web Services. He provides technical guidance, design advice, and thought leadership to AWS’ customers across the US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

Dom Bavaro is a Senior Solutions Architect for Financial Services. While providing technical guidance to customers across many use cases, He is focused on helping customer build and productionize Generative AI solutions and workflows

Finding the Next Metrics

Understanding What’s Watts

Reworking What We Call Work

A Gauge for Accelerated Computing

Two Experts Weigh In

It Takes a Village

Explore the Latest Tools and Frameworks in the Ecosystem

Be Part of Our Ecosystem

Motivation

A common tool: TORCH_COMPILE_DEBUG

A better tool: depyf comes to rescue

One more thing: step-through debuggability

Conclusion

Part 1: Measuring Energy

Part 2: Optimizing Energy

Final Words

About the authors

Solution overview

Prerequisites

Prepare your dataset

Create a new model

Select a foundation model

Analyze the model

Test the models

Deploy the model with SageMaker

Use the model

Clean up

Conclusion

About the Author

Time Stamps

You Might Also Like…

Subscribe to the AI Podcast

Defining the business problem

Solution and implementation details

Dynamic A/B test architecture

1. MAB serving flow

2. The flow of an alpha and beta parameter update

3. The flow of business metrics monitoring

4. The flow of system operation monitoring

Implementation Details 1: MAB serving flow mainly involving API Gateway and Lambda

1. API Gateway configuration

2. API Gateway cache settings

3. Add an API Gateway mapping template

4. Lambda for Dynamic A/B Test

Implementation details 2: Alpha and beta parameter update

Implementation details 3: Business metrics monitoring

Implementation details 4: System operation monitoring with CloudWatch and AWS X-Ray

Conclusion

About the Authors

The Opportunity

The Approach

Solution Overview

Amazon Comprehend

Amazon Kendra

Amazon Bedrock

Amazon Rekognition

Amazon Transcribe

Prompt Template Warehouse

Data Harvesting

Structuring and Retrieving the Data

LLM Parameters and Models

Metrics, Data Governance, and Accuracy

Business Impact

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.

A common tool: `TORCH_COMPILE_DEBUG`

A better tool: `depyf` comes to rescue