A review of purpose-built accelerators for financial services

Data contains information, and information can be used to predict future behaviors, from the buying habits of customers to securities returns. Businesses are seeking a competitive advantage by being able to use the data they hold, apply it to their unique understanding of their business domain, and then generate actionable insights from it. The financial services industry (FSI) is no exception, being a well-established producer and consumer of data and analytics. As in every industry, FSI has its own nuances and ways of doing business; here, considerations such as regulation and zero-sum competitive pressures loom large. This mostly non-technical post is written for FSI business leader personas such as the chief data officer, chief analytics officer, chief investment officer, head quant, head of research, and head of risk. These personas are faced with making strategic decisions on issues such as infrastructure investment, product roadmap, and competitive approach. The aim of this post is to level-set and inform in a rapidly advancing field, helping readers understand competitive differentiators and formulate an associated business strategy.

Accelerated computing is a generic term often used to refer to specialist hardware called purpose-built accelerators (PBAs). In financial services, nearly every type of activity, from quant research to fraud prevention to real-time trading, can benefit from reducing runtime. By performing a calculation more quickly, the user may be able to solve an equation more accurately, provide a better customer experience, or gain an informational edge over a competitor. These activities cover disparate fields such as basic data processing, analytics, and machine learning (ML). Finally, some activities, such as those involving the latest advances in artificial intelligence (AI), are simply not practical without hardware acceleration. ML is often associated with PBAs, so we start this post with an illustrative figure. The ML paradigm is learning followed by inference. Typically, learning is offline (on historical rather than streaming real-time data) and uses large volumes of data, whereas inference is online on small volumes of streaming data. Learning means identifying and capturing historical patterns from the data, and inference means mapping a current value to the historical pattern. PBAs, such as graphics processing units (GPUs), have an important role to play in both of these phases. The following figure illustrates the idea of a large cluster of GPUs being used for learning, followed by a smaller number for inference. The distinct computational nature of the learning and inference phases means some hardware providers have developed independent solutions for each phase, whereas others have single solutions for both phases.

As shown in the preceding figure, the ML paradigm is learning (training) followed by inference. PBAs, such as GPUs, can be used for both of these steps. In this example figure, features are extracted from raw historical data, which are then fed into a neural network (NN). Due to model and data size, learning is distributed over multiple PBAs in an approach called parallelism. Labeled data is used to learn the model structure and weights. Unseen new streaming data is then applied to the model, and an inference (prediction) on that data is made.

This post starts by looking at the background of hardware accelerated computing, followed by reviewing the core technologies in this space. We then consider why and how accelerated computing is important for data processing. Then we review four important FSI use cases for accelerated computing. Key problem statements are identified and potential solutions given. The post finishes by summarizing the three key takeaways, and makes suggestions for actionable next steps.

Background on accelerated computing

CPUs are designed for processing small volumes of sequential data, whereas PBAs are suited for processing large volumes of parallel data. PBAs can perform some functions, such as some floating-point (FP) calculations, more efficiently than is possible by software running on CPUs. This can result in advantages such as reduced latency, increased throughput, and decreased energy consumption. The three types of PBAs are easily reprogrammable chips such as GPUs, and two types of fixed-function acceleration: field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Fixed or semi-fixed function acceleration is practical when no updates are needed to the data processing logic. FPGAs are reprogrammable, albeit not very easily, whereas ASICs are custom designed and fully fixed for a specific application, and are not reprogrammable. As a general rule, the less user-friendly the approach, the faster it is: in terms of resulting speedups, the approximate order, from fastest to slowest, is programming the hardware directly, then programming against PBA APIs, then programming in an unmanaged language such as C++, then a managed language such as Python. Analysis of publications containing accelerated compute workloads by Zeta-Alpha shows a breakdown of 91.5% GPU PBAs, 4% other PBAs, 4% FPGA, and 0.5% ASICs. This post is focused on the easily reprogrammable PBAs.

The recent history of PBAs begins in 1999, when NVIDIA released its first product expressly marketed as a GPU, designed to accelerate computer graphics and image processing. By 2007, GPUs had become more generalized computing devices, with applications across scientific computing and industry. In 2018, other forms of PBAs became available, and by 2020, PBAs were being widely used for parallel problems, such as the training of NNs. Examples of other PBAs now available include AWS Inferentia and AWS Trainium, Google TPU, and Graphcore IPU. Around this time, industry observers reported that NVIDIA's strategy was pivoting from its traditional gaming and graphics focus toward scientific computing and data analytics.

The union of advances in hardware and ML has led us to the current day. Work by Hinton et al. in 2012 is now widely referred to as ML's “Cambrian Explosion.” Although NNs had been around since the 1960s without delivering on their promise, Hinton noted three key changes. Firstly, they added more layers to their NN, improving its performance. Secondly, there was a massive increase in the volume of labeled data available for training. Thirdly, GPUs made it computationally feasible to process that labeled data. Together, these elements led to the start of a period of dramatic progress in ML, with NNs being redubbed deep learning. In 2017, the landmark paper “Attention is all you need” was published, which laid out a new deep learning architecture based on the transformer. In order to train transformer models on internet-scale data, huge quantities of PBAs were needed. In November 2022, ChatGPT was released, a large language model (LLM) that used the transformer architecture, and is widely credited with starting the current generative AI boom.

Review of the technology

In this section, we review different components of the technology.

Parallel computing

Parallel computing refers to carrying out multiple processes simultaneously, and can be categorized according to the granularity at which parallelism is supported by the hardware: for example, a grid of connected instances, multiple processors within a single instance, multiple cores within a single processor, PBAs, or a combination of these approaches. Parallel computing uses these multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can complete its part of the workload simultaneously. Parallelism is suited to workloads that are repetitive, fixed tasks, involving little conditional branching and often large amounts of data. It also means not all workloads are equally suitable for acceleration.

In parallel computing, the granularity of a task is a measure of the amount of communication overhead between the processing functional units. Granularity is typically split into the categories of fine-grained and coarse-grained. Fine-grained parallelism refers to a workload being split into a large number of small tasks, whereas coarse-grained refers to splitting into a small number of large tasks. The key difference between the two categories is the degree of communication and synchronization required between the processing units. A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, and is typically a component of a process. The multiple threads of a given process may be run concurrently by multithreading, while sharing resources such as memory. An application can achieve parallelism by using multithreading to split data and tasks into parallel subtasks and let the underlying architecture manage how the threads run, either concurrently on one core or in parallel on multiple cores. Here, each thread performs the same operation on different segments of memory so that they can operate in parallel. This, in turn, enables better system utilization and provides faster program execution.
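
As a minimal illustration of coarse-grained data parallelism, the following Python sketch splits a hypothetical workload into independent chunks and processes them on multiple CPU cores; PBAs apply the same decomposition idea at far finer granularity, with thousands of hardware threads. The workload function and chunk count here are illustrative.

```python
# A minimal sketch of coarse-grained data parallelism on a CPU: a hypothetical
# workload is split into independent chunks, each processed by a separate worker.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def partial_sum_of_squares(chunk: np.ndarray) -> float:
    # Each worker runs the same operation on its own segment of the data.
    return float(np.sum(chunk ** 2))


if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=10_000_000)
    chunks = np.array_split(data, 8)  # coarse-grained: a few large tasks

    with ProcessPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))

    total = sum(partials)  # combine the independent partial results
    print(f"Sum of squares: {total:.2f}")
```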

Purpose-built accelerators

Flynn’s taxonomy is a classification of computer architectures helpful in understanding PBAs. Two classifications of relevance are single instruction stream, multiple data streams (SIMD), and the SIMD sub-classification of single instruction, multiple thread (SIMT). SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. SIMT describes processors that are able to operate on data vectors and arrays (as opposed to just scalars), and therefore handle big data workloads efficiently. Each SIMT core has multiple threads that run in parallel, thereby giving true simultaneous parallel hardware-level execution. CPUs have a relatively small number of complex cores and are designed to run a sequence of operations (threads) as fast as possible, and can run a few tens of these threads in parallel. GPUs, in contrast, feature smaller cores and are designed to run thousands of threads in parallel in the SIMT paradigm. It is this design that primarily distinguishes GPUs from CPUs and allows GPUs to excel at regular, dense, numerical, data-flow-dominated workloads.

Suppliers of data center GPUs include NVIDIA, AMD, Intel, and others. The AWS P5 EC2 instance type range is based on the NVIDIA H100 chip, which uses the Hopper architecture. The Hopper H100 GPU (SXM5 variant) architecture includes 8 GPU processing clusters (GPCs), 66 texture processing clusters (TPCs), 2 Streaming Multiprocessors (SMs)/TPC, 528 Tensor cores/GPU, and 128 CUDA cores/SM. Additionally, it features 80 GB HBM3 GPU memory, 900 GBps NVLink GPU-to-GPU interconnect, and a 50 MB L2 cache minimizing HBM3 trips. An NVIDIA GPU is assembled in a hierarchical manner: the GPU contains multiple GPCs, and the role of each GPC is to act as a container holding all the components together. Each GPC has a raster engine for graphics and several TPCs. Inside each TPC is a texture unit, some logic control, and multiple SMs. Inside each SM are multiple CUDA and Tensor cores, and it is here that the compute work happens. The ratio of units GPU:GPC:TPC:SM:CUDA core/Tensor core varies according to release and version. This hierarchical architecture is illustrated in the following figure.

SMs are the fundamental building blocks of an NVIDIA GPU, and consist of CUDA cores, Tensor cores, distributed shared memory, and instructions to support dynamic programming. When a CUDA program is invoked, work is distributed to the multithreaded SMs with available execution capacity. The CUDA core, released in 2007, is a GPU core roughly analogous to a CPU core. Although it's not as powerful as a CPU core, the CUDA core advantage is its ability to be used for large-scale parallel computing. Like a CPU core, each CUDA core still only runs one operation per clock cycle; however, the GPU SIMD architecture enables large numbers of CUDA cores to simultaneously address one data point each. CUDA cores are split by supported precision, meaning that work at multiple precisions can be done in the same clock cycle. The CUDA core is well suited for high-performance computing (HPC) use cases, but is not so well suited for the matrix math found in ML. The Tensor core, released in 2017, is another NVIDIA proprietary GPU core that enables mixed-precision computing, and is designed to support the matrix math of ML. Tensor cores support mixed FP precision matrix math in a computationally efficient manner by treating matrices as primitives and being able to perform multiple operations in one clock cycle. This makes GPUs well suited for data-heavy, matrix math-based, ML training workloads, and real-time inference workloads needing synchronicity at scale. Both use cases require the ability to move data around the chip quickly and controllably.
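
As a minimal sketch, assuming PyTorch on an NVIDIA GPU instance, the following shows the kind of half-precision matrix multiplication that recent hardware dispatches to Tensor cores; the matrix sizes are illustrative.

```python
import torch

assert torch.cuda.is_available(), "this sketch assumes an NVIDIA GPU instance"

# Large FP16 matrices of the kind found in ML workloads; FP16 halves memory traffic.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# On recent NVIDIA GPUs, FP16 matrix multiplication is dispatched to Tensor cores,
# which accumulate at higher precision as described above.
c = a @ b
torch.cuda.synchronize()  # wait for the asynchronous GPU kernel to finish
print(c.shape, c.dtype)   # torch.Size([4096, 4096]) torch.float16
```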

From 2010 onwards, other PBAs have started becoming available to consumers, such as AWS Trainium, Google's TPU, and Graphcore's IPU. While an in-depth review of other PBAs is beyond the scope of this post, the core principle is one of designing a chip from the ground up, based around ML-style workloads. Specifically, ML workloads are typified by irregular and sparse data access patterns. This means there is a requirement to support fine-grained parallelism based on irregular computation with aperiodic memory access patterns. Other PBAs tackle this problem statement in a variety of ways that differ from NVIDIA GPUs, including having cores and supporting architecture complex enough to run completely distinct programs, and decoupling thread data access from the instruction flow by placing distributed memory next to the cores.

AWS accelerator hardware

AWS currently offers a range of 68 Amazon Elastic Compute Cloud (Amazon EC2) instance types for accelerated compute. Examples include F1 Xilinx FPGAs, P5 NVIDIA Hopper H100 GPUs, G4ad AMD Radeon Pro V520 GPUs, DL2q Qualcomm AI 100, DL1 Habana Gaudi, Inf2 powered by Inferentia2, and Trn1 powered by Trainium. In March 2024, AWS announced it will offer the new NVIDIA Blackwell platform, featuring the new GB200 Grace Blackwell chip. Each EC2 instance type has a number of variables associated with it, such as price, chip maker, Regional availability, amount of memory, amount of storage, and network bandwidth.

AWS chips are produced by our own Annapurna Labs team, a chip and software designer, which is a wholly owned subsidiary of Amazon. The Inferentia chip became generally available (GA) in December 2019, followed by Trainium GA in October 2022, and Inferentia2 GA in April 2023. In November 2023, AWS announced the next generation Trainium2 chip. By owning the supply and manufacturing chain, AWS is able to offer high levels of availability of its own chips. The AWS Regions where these chips are available are shown in a later table, with more Regions coming soon. Both Inferentia2 and Trainium use the same basic components, but with differing layouts, accounting for the different workloads they are designed to support. Both chips contain two NeuronCore-v2 cores each, connected by a variable number of NeuronLink-v2 interconnects. Each NeuronCore contains four engines: the first three are a ScalarEngine for scalar calculations, a VectorEngine for vector calculations, and a TensorEngine for matrix calculations. By analogy to an NVIDIA GPU, the first two are comparable to CUDA cores, and the third is equivalent to Tensor cores. The fourth is a C++-programmable GPSIMD engine allowing for custom operations. The silicon architecture of the two chips is very similar, meaning that the same software can be used for both, minimizing changes on the user side, and this similarity can be mapped back to their two roles. In general, the learning phase of ML is typically bounded by the bandwidth associated with moving large volumes of data to and around the chip, whereas the inference phase is typically bounded by memory, not compute. To maximize absolute performance and price-performance, Trainium chips have twice as many NeuronLink-v2 interconnects as Inferentia2, and Trainium instances also contain more chips per instance than Inferentia2 instances. All these differences are implemented at the server level. AWS customers such as Databricks and Anthropic use these chips to train and run their ML models.

The following figures illustrate the chip-level schematic for the architectures of Inferentia2 and Trainium.

The following table shows the metadata of three of the largest accelerated compute instances.

| Instance Name | NVIDIA H100 GPU Chips | Trainium Chips | Inferentia Chips | vCPU Cores | Chip Memory (GiB) | Host Memory (GiB) | Instance Storage (TB) | Instance Bandwidth (Gbps) | EBS Bandwidth (Gbps) | PBA Chip Peer-to-Peer Bandwidth (GBps) |
|---|---|---|---|---|---|---|---|---|---|---|
| p5.48xlarge | 8 | 0 | 0 | 192 | 640 | 2048 | 8 x 3.84 SSD | 3,200 | 80 | 900 (NVSwitch) |
| inf2.48xlarge | 0 | 0 | 12 | 192 | 384 | 768 | EBS only | 100 | 60 | 192 (NeuronLink-v2) |
| trn1n.32xlarge | 0 | 16 | 0 | 128 | 512 | 512 | 4 x 1.9 SSD | 1,600 | 80 | 768 (NeuronLink-v2) |

The following table summarizes performance and cost.

| Instance Name | On-Demand Rate ($/hr) | 3-Yr RI Rate ($/hr) | FP8 TFLOPS | FP16 TFLOPS | FP32 TFLOPS | $/TFLOPS (FP16, theoretical) | Source Reference |
|---|---|---|---|---|---|---|---|
| p5.48xlarge | 98.32 | 43.18 | 16,000 | 8,000 | 8,000 | $5.40 | URL |
| inf2.48xlarge | 12.98 | 5.19 | 2,280 | 2,280 | 570 | $2.28 | URL |
| trn1n.32xlarge | 24.78 | 9.29 | 3,040 | 3,040 | 760 | $3.06 | URL |

The following table summarizes Region availability.

| Instance Name | Number of Supported AWS Regions | Supported AWS Regions | Default Quota Limit |
|---|---|---|---|
| p5.48xlarge | 4 | us-east-2; us-east-1; us-west-2; eu-north-1 | 0 |
| inf2.48xlarge | 13 | us-east-2; us-east-1; us-west-2; ap-south-1; ap-southeast-1; ap-southeast-2; ap-northeast-1; eu-central-1; eu-west-1; eu-west-2; eu-west-3; eu-north-1; sa-east-1 | 0 |
| trn1n.32xlarge | 3 | us-east-2; us-east-1; us-west-2; eu-north-1; ap-northeast-1; ap-south-1; ap-southeast-4 | 0 |

After a user has selected the EC2 instance type, it can then be combined with AWS services designed to support large-scale accelerated computing use cases, including high-bandwidth networking (Elastic Fabric Adapter), virtualization (AWS Nitro Enclaves), hyper-scale clustering (Amazon EC2 UltraClusters), low-latency storage (Amazon FSx for Lustre), and encryption (AWS Key Management Service), while noting not all services are available for all instances in all Regions.

The following figure shows an example of a large-scale deployment of P5 EC2 instances, which includes UltraCluster support for 20,000 H100 GPUs, with non-blocking petabit-scale networking and high-throughput, low-latency storage. Using the same architecture, UltraCluster supports Trainium scaling to over 60,000 chips.

In summary, we see two general trends in the hardware acceleration space. Firstly, improving price-performance to handle increasing data processing volumes and model sizes, coupled with a need to serve more users, more quickly, and at reduced cost. Secondly, improving security of the associated workloads by preventing unauthorized users from being able to access training data, code, or model weights.

Accelerator software

CPUs and GPUs are designed for different types of workloads. However, CPU workloads can run on GPUs, a process called general-purpose computing on graphics processing units (GPGPU). In order to run a CPU workload on a GPU, the work needs to be reformulated in terms of graphics primitives supported by the GPU. This reformulation can be carried out manually, though it is difficult programming, requiring code in a low-level language to map data to graphics primitives, process it, and then map it back. Instead, it is commonly carried out by a GPGPU software framework, allowing the programmer to ignore the underlying graphical concepts and code straightforwardly against the GPU using standard programming languages such as Python. Such frameworks let the programmer express data-parallel operations against GPUs (or other PBAs) without explicitly managing concurrency or threads. Examples of GPGPU frameworks are the vendor-neutral open source OpenCL and the proprietary NVIDIA CUDA.
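
As a minimal sketch of this abstraction, the following uses CuPy, an open source GPGPU array library built on CUDA (used here purely as an illustration and assumed to be installed on a GPU instance), to run NumPy-style code on a GPU with no explicit graphics or thread programming.

```python
import numpy as np
import cupy as cp  # NumPy-compatible array library that executes on NVIDIA GPUs

x_cpu = np.random.default_rng(0).normal(size=10_000_000).astype(np.float32)

x_gpu = cp.asarray(x_cpu)               # copy the host array into GPU memory
norm_gpu = cp.sqrt(cp.sum(x_gpu ** 2))  # elementwise square and reduction run on the GPU
norm = float(cp.asnumpy(norm_gpu))      # copy the scalar result back to the host

print(f"L2 norm computed on the GPU: {norm:.2f}")
```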

For the Amazon PBA chips Inferentia2 and Trainium, the SDK is AWS Neuron. This SDK enables development, profiling, and deployment of workloads onto these PBAs. Neuron has native integrations with third-party ML frameworks such as PyTorch, TensorFlow, and JAX. Additionally, Neuron includes a compiler, a runtime driver, and debug and profiling utilities. The toolset includes neuron-top for real-time visualization of NeuronCore and vCPU utilization, host and device memory usage, and a breakdown of memory allocation; neuron-monitor, which provides the same information in JSON format; and neuron-ls for device discovery and topology information. With Neuron, users can use Inf2 and Trn1n instances with a range of AWS compute services, such as Amazon SageMaker, Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, AWS Batch, and AWS ParallelCluster. The usability, tooling, and integrations of the Neuron SDK have made Amazon PBAs popular with users. For example, over 90% of the top 100 Hugging Face models (of the more than 100,000 AI models now hosted there) run on AWS using Optimum Neuron, which enables Hugging Face transformer models to run natively on Neuron. In summary, the Neuron SDK allows developers to easily parallelize ML algorithms, such as those commonly found in FSI. The following figure illustrates the Neuron software stack.
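
As a minimal sketch of the PyTorch integration, based on the documented torch_neuronx tracing workflow (the model and input shapes here are illustrative placeholders), a model can be compiled ahead of time for the NeuronCores and then invoked like a normal PyTorch module.

```python
import torch
import torch_neuronx  # part of the AWS Neuron SDK, available on Inf2/Trn1 instances

# A placeholder model standing in for a real workload.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
).eval()

example_input = torch.randn(1, 128)

# Trace and compile the model ahead of time for the NeuronCores.
neuron_model = torch_neuronx.trace(model, example_input)

# Inference then runs on the accelerator via the compiled artifact.
with torch.no_grad():
    prediction = neuron_model(example_input)
print(prediction.shape)
```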

The CUDA API and SDK were first released by NVIDIA in 2007. CUDA offers high-level parallel programming concepts that can be compiled to the GPU, giving direct access to the GPU's virtual instruction set and therefore the ability to specify thread-level parallelism. To achieve this, CUDA added one extension to the C language to let users declare functions that could run and compile on the GPU, and a lightweight way to call those functions. The core idea behind CUDA was to remove programmers' barrier to entry for coding against GPUs by allowing use of existing skills and tools as much as possible, while being more user friendly than OpenCL. The CUDA platform includes drivers, runtime kernels, compilers, libraries, and developer tools. This includes a wide range of ML libraries such as cuDNN and NCCL. The CUDA platform is used through compiler directives and extensions to standard languages, such as the Python cuNumeric library. CUDA has been continuously optimized over the years, using its proprietary nature to improve performance on NVIDIA hardware relative to vendor-neutral solutions like OpenCL. Over time, the CUDA programming paradigm and stack has become deeply embedded in all aspects of the ML ecosystem, from academia to open source ML repositories.
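
To make the thread-level parallelism concrete, here is a minimal sketch of a CUDA-style kernel written from Python with Numba's CUDA JIT (an illustrative choice, assumed to be installed alongside an NVIDIA GPU and driver): one thread per array element, in the SIMT style described earlier.

```python
import numpy as np
from numba import cuda


@cuda.jit
def scale_and_shift(x, out, a, b):
    i = cuda.grid(1)            # global thread index
    if i < x.size:              # guard threads that fall beyond the array bounds
        out[i] = a * x[i] + b   # every thread applies the same instruction to its element


x = np.arange(1_000_000, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale_and_shift[blocks, threads_per_block](x, out, 2.0, 1.0)

print(out[:5])  # [1. 3. 5. 7. 9.]
```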

To date, alternative GPU platforms to CUDA have not seen widespread adoption. There are three key reasons for this. Firstly, CUDA has had a head start of well over a decade, and benefits from the network effect of its mature ecosystem, from organizational inertia, and from risk aversion to change. Secondly, migrating CUDA code to a different GPU platform can be technically difficult, given the complexity of the ML models typically being accelerated. Thirdly, CUDA has integrations with major third-party ML libraries, such as TensorFlow and PyTorch.

Despite the central role CUDA plays in the AI/ML community, there is a movement by users to diversify their accelerated workflows towards a Pythonic programming layer to make training more open. A number of such efforts are underway, including projects like Triton and OneAPI, and cloud service features such as Amazon SageMaker Neo. Triton is an open source project led by OpenAI that enables developers to use different acceleration hardware using entirely open source code. Triton uses an intermediate compiler to convert models written in supported frameworks into an intermediate representation that can then be lowered into highly optimized code for PBAs. Triton is therefore a hardware-agnostic convergence layer that hides chip differences.

Soon to be released is the AWS Neuron Kernel Interface (NKI). NKI is a Python-based programming environment designed for the Neuron compiler, which adopts commonly used Triton-like syntax and tile-level semantics. NKI provides customization capabilities to fully optimize performance by enabling users to write custom kernels, bypassing almost all of the AWS compiler layers.

OneAPI is an open source project led by Intel for a unified API across different accelerators, including GPUs, other PBAs, and FPGAs. Intel believes that future competition in this space will be for inference, which, unlike the learning phase, does not carry the same entrenched software dependency. To this end, OneAPI toolkits support CUDA code migration, analysis, and debug tools. Other efforts are building on top of OneAPI; for example, the Unified Acceleration Foundation's (UXL) goal is a new open standard accelerator software ecosystem. UXL consortium members include Intel, Google, and ARM.

Amazon SageMaker is an AWS service providing an ML development environment, where the user can select chip type from the service's fleet of Intel, AMD, NVIDIA, and AWS hardware, offering varied cost-performance-accuracy trade-offs. Amazon contributes to Apache TVM, an open source ML compiler framework for GPUs and PBAs, enabling computations on any hardware backend. SageMaker Neo uses Apache TVM to perform static optimizations on trained models for inference for any given hardware target. Looking to the future, the accelerator software field is likely to evolve towards greater openness and portability; however, given how deeply CUDA is embedded in the ecosystem, this may be slow to happen.

Accelerator supply-demand imbalances

It has been widely reported for the last few years that GPUs are in short supply. Such shortages have led to industry leaders speaking out. For example, Sam Altman said “We’re so short on GPUs the less people use our products the better… we don’t have enough GPUs,” and Elon Musk said “It seems like everyone and their dog is buying GPUs at this point.”

The factors leading to this have been high demand coupled with low supply. High demand has risen from a range of sectors, including crypto mining, gaming, generic data processing, and AI. Omdia Research estimates 49% of GPUs go to the hyperscale clouds (such as AWS or Azure), 27% go to big tech (such as Meta and Tesla), 20% go to GPU clouds (such as Coreweave and Lambda), and 6% go to other companies (such as OpenAI and FSI firms). The State of AI Report gives the size and owners of the largest A100 clusters, the top few being Meta with 21,400, Tesla with 16,000, XTX with 10,000, and Stability AI with 5,408. GPU supply has been limited by factors including a lack of manufacturing competition and capacity at all levels in the supply chain, and restricted supply of base components such as rare metals and circuit boards. Additionally, the rate of manufacturing is slow, with an H100 taking 6 months to make. Socio-political events have also caused delays and issues, such as a COVID backlog and restrictions on the inert gases used in manufacturing, which come from Russia. A final issue impacting supply is that chip makers strategically allocate their supply to meet their long-term business objectives, which may not always align with end-users' needs.

Supported workloads

In order to benefit from hardware acceleration, a workload needs to be parallelizable. An entire branch of science is dedicated to parallelizable problems. In The Landscape of Parallel Computing Research, 13 fields (termed dwarfs) are found to be fundamentally parallelizable, including dense and sparse linear algebra, Monte Carlo methods, and graphical models. The authors also call out a series of fields they term “embarrassingly sequential,” for which the opposite holds. In FSI, one of the main data structures dealt with is the time series, a series of sequential observations. Many time series algorithms have the property that each subsequent observation is dependent on previous observations. This means only some time series workloads can be efficiently computed in parallel. A moving average is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm, as sketched below. Sequential models, such as Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations, also have parallel implementations. In FSI, non-time series workloads are also underpinned by algorithms that can be parallelized. For example, Markowitz portfolio optimization requires the computationally intensive inversion of large covariance matrices, for which GPU implementations exist.
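
A minimal sketch of that parallel formulation: the moving average reduces to a prefix sum (scan), a standard parallel primitive, followed by independent elementwise differences. NumPy is used here only for clarity; on a PBA, both steps map onto parallel kernels.

```python
import numpy as np


def moving_average(x: np.ndarray, window: int) -> np.ndarray:
    # Prefix sum: computable in logarithmic depth on parallel hardware.
    csum = np.cumsum(x, dtype=np.float64)
    csum = np.concatenate(([0.0], csum))
    # Windowed sums are then independent elementwise differences.
    return (csum[window:] - csum[:-window]) / window


prices = np.array([100.0, 101.0, 103.0, 102.0, 105.0, 107.0])
print(moving_average(prices, window=3))  # [101.33 102.   103.33 104.67]
```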

In computer science, a number can be represented with different levels of precision, such as double precision (FP64), single precision (FP32), and half precision (FP16). Different chips support different representations, and different representations are suitable for different use cases. The lower the precision, the less storage is required, and the faster the number is to process for a given amount of computational power. FP64 is used in HPC fields, such as the natural sciences and financial modeling, resulting in minimal rounding errors. FP32 provides a balance between accuracy and speed, is used in applications such as graphics, and is the standard for GPUs. FP16 is used in deep learning, where computational speed is valued and the lower precision won't drastically affect the model's performance. More recently, other number representations have been developed that aim to improve the balance between acceleration and precision, such as OCP Standard FP8, Google BFloat16, and Posits. An example of a mixed representation use case is the updating of model parameters by gradient descent, part of the backpropagation algorithm used in deep learning. Typically this is done using FP32 to reduce rounding errors; however, to reduce memory load, the parameters and gradients can be stored in FP16, meaning there is a conversion requirement. In this case, BFloat16 is a good choice because it prevents float overflow errors while keeping enough precision for the algorithm to work.
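
The following is a minimal sketch, assuming PyTorch and a BF16-capable accelerator, of this mixed-precision pattern: the forward and backward math runs in BFloat16 under autocast, while the FP32 master weights are updated by gradient descent.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(256, 1).to(device)          # parameters stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

x = torch.randn(64, 256, device=device)
y = torch.randn(64, 1, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)  # matmuls run in BF16, avoiding float overflow

loss.backward()                  # gradients flow back through the BF16 graph
optimizer.step()                 # FP32 weight update by gradient descent
optimizer.zero_grad()
print(float(loss))
```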

As lower-precision workloads become more important, hardware and infrastructure trends are changing accordingly. For example, comparing the latest NVIDIA GB200 chip against the previous generation NVIDIA H100 chip, lower representation FP8 performance has increased 505%, but FP64 performance has only increased 265%. Likewise, in the forthcoming Trainium2 chip, the focus has been on lower-bit performance increases, giving a 400% performance increase over the previous generation. Looking to the future, we might expect to see a convergence between HPC and AI workloads, as AI starts to become increasingly important in solving what were traditionally HPC FP64 precision problems.

Accelerator benchmarking

When considering compute services, users benchmark measures such as price-performance, absolute performance, availability, latency, and throughput. Price-performance means how much compute can be done for $1, or what the equivalent dollar cost is for a given number of FP operations. For a perfect system, performance scales linearly with cost as the size of a job increases. A complicating factor when benchmarking compute grids on AWS is that EC2 instances come with a range of system parameters and a grid might contain more than one instance type; therefore, systems are benchmarked at the grid level rather than on a more granular basis. Users often want to complete a job as quickly as possible and at the lowest cost; the constituent details of the system that achieves this aren't as important.

A second benchmarking measure is absolute performance, meaning how quickly a given job can be completed independent of price. Given linear scaling, job completion time can be reduced by simply adding more compute. However, it might be that the job isn't infinitely divisible, and that only a single computational unit is required. In this case, the absolute performance of that computational unit is important. In an earlier section, we provided a table with one performance measure, the $/TFLOPS ratio based on the chip specifications. However, as a rule of thumb, when such theoretical values are compared against experimental values, only around 45% is realized.
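
As a hypothetical worked example of this kind of theoretical ratio, the following reproduces a figure consistent with the p5.48xlarge row of the earlier table and then derates it by the roughly 45% realization rule of thumb; the exact definition used in the table may differ.

```python
# Illustrative price-performance arithmetic based on the earlier table's figures.
ri_rate_per_hour = 43.18       # 3-year reserved instance rate, $/hr
peak_fp16_tflops = 8_000       # theoretical instance-level FP16 TFLOPS

# Dollars per hour for each PFLOP/s of theoretical FP16 throughput.
theoretical_price_perf = ri_rate_per_hour / (peak_fp16_tflops / 1_000)
print(f"Theoretical: ${theoretical_price_perf:.2f} per PFLOP/s-hour")   # ~$5.40

# Rule of thumb: only ~45% of theoretical FLOPS are realized in practice.
realized_price_perf = ri_rate_per_hour / (0.45 * peak_fp16_tflops / 1_000)
print(f"Realized (~45%): ${realized_price_perf:.2f} per PFLOP/s-hour")  # ~$12.00
```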

There are a few different ways to calculate price-performance. The first is to use a standard benchmark, such as LINPACK, HPL-MxP, or MFU (Model FLOPS Utilization). These can run a wide range of calculations that are representative of varying use cases, such as general use, HPC, and mixed HPC and AI workloads. From this, the TFLOP/s at a given FP precision for the system can be measured, along with the dollar cost of running the system. However, it might be that the user has specific use cases in mind. In this case, the best results will come from price-performance measurements on a more representative benchmark.

There are various types of representative benchmark commonly seen. Firstly, the user can use real production data and applications with the hardware being benchmarked. This option gives the most reliable results, but can be difficult to achieve due to operational and compliance hurdles. Secondly, the user can replicate their existing use case with a synthetic data generator, avoiding the challenges of getting production data into new test systems. Thirdly, the user can employ a third-party benchmark for the use case, if one exists. For example, STAC is a company that coordinates an FSI community called the STAC Benchmark Council, which maintains a selection of accelerator benchmarks, including A2, A3, ML, and AI (LLM). A2 is designed for compute-intensive analytic workloads involved in pricing and risk management. Specifically, the A2 workload uses option price discovery by Monte Carlo estimation of Heston-based Greeks for a path-dependent, multi-asset option with early exercise. STAC members can access A2 benchmarking reports, for example for EC2 c5.metal with oneAPI. STAC-ML benchmarks the latency of NN inference—the time from receiving new input data until the model output is computed. STAC-A3 benchmarks the backtesting of trading algorithms to determine how strategies would have performed on historical data. This benchmark supports accelerator parallelism to run many backtesting experiments simultaneously for the same security. For each benchmark, there exists a series of software packages (termed STAC Packs), which are accelerator-API specific. For some of the preceding benchmarks, STAC Packs are maintained by providers such as NVIDIA (CUDA) and Intel (oneAPI).

Some FSI market participants are performing in-house benchmarking at the microarchitecture level, in order to optimize performance as far as possible. Citadel has published microbenchmarks for NVIDIA GPU chips, dissecting the microarchitecture to achieve “bare-metal performance tuning,” noting that peak performance is inaccessible to software written in plain CUDA. Jane Street has looked at performance optimization through functional programming techniques, while PDT Partners has supported work on the Nixpkgs repository of ML packages using CUDA.

Some AWS customers have benchmarked the AWS PBAs against other EC2 instance types. ByteDance, the technology company that runs the video-sharing app TikTok, benchmarked Inf1 against a comparable EC2 GPU instance type. With Inf1, they were able to reduce their inference latency by 25%, and costs by 65%. In a second example, Inf2 is benchmarked against a comparable inference-optimized EC2 instance. The benchmark used is the RoBERTa-Base, a popular model used in natural language processing (NLP) applications, that uses the transformer architecture. In the following figure, on the x-axis we plotted throughput (the number of inferences that are completed in a set period of time), and on the y-axis we plotted latency (the time it takes the deep learning model to provide an output). The figure shows that Inf2 gives higher throughput and lower latency than the comparable EC2 instance type.

In a third benchmark example, Hugging Face benchmarked the trn1.32xlarge instance (16 Trainium chips) against two comparable EC2 instance types. For the first instance type, they ran fine-tuning for the BERT Large model on the full Yelp review dataset, using the BF16 data format with the maximum sequence length supported by the model (512). The benchmark results show the Trainium job is five times faster while being only 30% more expensive, resulting in a “huge improvement in cost-performance.” For the second instance type, they ran three tests: language pretraining with GPT2, token classification with BERT Large, and image classification with the Vision Transformer. These results showed trn1 to be 2–5 times faster and 3–8 times cheaper than the comparable EC2 instance types.

FSI use cases

As with other industry sectors, there are two reasons why FSI uses acceleration. The first is to get a fixed result in the lowest time possible, for example parsing a dataset. The second is to get the best result in a fixed time, for example overnight parameter re-estimation. Use cases for acceleration exist across the FSI, including banking, capital markets, insurance, and payments. However, the most pressing demand comes from capital markets, because acceleration speeds up workloads and time is one of the easiest edges people can get in the financial markets. Put differently, a time advantage in financial services often equates to an informational advantage.

We begin by providing some definitions:

  • Parsing is the process of converting between data formats
  • Analytics is data processing using either deterministic or simple statistical methods
  • ML is the science of learning models from data, using a variety of different methods, and then making decisions and predictions
  • AI is an application able to solve problems using ML

In this section, we review some of the FSI use cases of PBAs. Because many FSI activities can be parallelized, most of what is done in FSI can be sped up with PBAs. This includes most modeling, simulation, and optimization problems; currently in FSI, deep learning is only a small part of the landscape. We identify four classes of FSI use cases and look at applications in each class: parsing financial data, analytics on financial data, ML on financial data, and low-latency applications. To show how these classes relate to each other, the following figure shows a simplified representation of a typical capital markets workflow. In this figure, acceleration categories have been assigned to the workflow steps. However, in reality, every step in the process may be able to benefit from one or more of the defined acceleration categories.

Parsing

A typical capital markets workflow consists of receiving data and then parsing it into a usable form. This data is commonly market data, as output from a trading venue's matching engine, or onward from a market data vendor. Market participants who are receiving either live or historical data feeds need to ingest this data and perform one or more steps, such as parsing the message out of a binary protocol, rebuilding the limit order book (LOB), or combining multiple feeds into a single normalized format. Any of these parsing steps that run in parallel could be sped up relative to sequential processing. To give an idea of scale, the largest financial data feed is the consolidated US equity options feed, termed OPRA. This feed comes from 18 different trading venues, with 1.5 million contracts broadcast across 96 channels, with a supported peak message rate of 400 billion messages per day, equating to approximately 12 TB per day, or 3 PB per year. As well as maintaining real-time feeds, participants need to maintain a historical repository, sometimes spanning several years. Processing of historical repositories is done offline, but it is often a source of major cost. Overall, a large consumer of market data, such as an investment bank, might consume 200 feeds from across public and private trading venues, vendors, and redistributors.

Any point in this data processing pipeline that can be parallelized, can potentially be sped up by acceleration. For example:

  • Trading venues broadcast on channels, which can be groupings of alphabetical tickers or products.
  • On a given channel, update messages for different tickers are broadcast sequentially. These can then be parsed out into unique streams per ticker.
  • For a given LOB, some events might be applicable to individual price levels independently.
  • Historical data is normally (but not always) independent inter-day, meaning that days can be parsed independently.

In GPU Accelerated Data Preparation for Limit Order Book Modeling, the authors describe a GPU pipeline handling data collection, LOB pre-processing, data normalization, and batching into training samples. The authors note their LOB pre-processing relies on the previous LOB state, and must be done sequentially. For LOB building, FPGAs seem to be used more commonly than GPUs because of the fixed nature of the workload; see examples from Xilinx and Algo-Logic. For example code for a build lab, using the AWS FPGA F1 instance type, refer to the following GitHub repo.

An important part of the data pipeline is the production of features, both online and offline. Features (also called alphas, signals, or predictors) are statistical representations of the data, which can then be used in downstream model building. A current trend in the FSI prediction space is the large-scale automation of dataset ingestion, curation, processing, feature extraction, feature combination, and model building. An example of this approach is given by WorldQuant, an algorithmic trading firm. The WSJ reports “a data group scours the globe for interesting and new data sets, including everything from detailed market pricing data to shipping statistics to footfall in stores captured by apps on smartphones”. WorldQuant states “in 2007 we had two data sets—today [2022] we have more than 1,400.” The general idea is that if they can buy, consume, create, and web scrape more data than anyone else, they can create more alphas and find more opportunities. Such an approach is based on performance being proportional to √N, where N is the number of alphas. Therefore, as long as an alpha is not perfectly correlated with another, there is value in adding it to the set. In 2010, WorldQuant was producing several thousand alphas per year; by 2016, it had one million alphas; by 2022, it had multiple millions, with a stated ambition to reach 100 million alphas. Although traditional quant finance mandates the importance of an economic rationale behind an alpha, the data-driven approach is led purely by the patterns in the data. After alphas have been produced, they can be intelligently merged together in a time-variant manner. Examples of signal combination methodologies that can benefit from PBA speedup include Mean Variance Optimization and Bayesian Model Averaging. The same WSJ article states “No one alpha is important. Our edge is putting things together, it’s the implementation…. The idea is that with so many ‘alphas,’ even weak signals can be useful. If counting cars in parking lots next to big box retailers has only a tiny predictive power for those retailers’ stock prices, it can still be used to enhance a bigger prediction if combined with other weak signals. For example, an uptick in cars at Walmart parking lots—itself a relatively weak signal—could combine with similar trends captured by mobile phone apps and credit-card receipts harvested by companies that scan emails to create a more reliable prediction.” The automated process of data ingestion, processing, packaging, combination, and prediction is referred to by WorldQuant as their “alpha factory.”

From examples such as those we've discussed, it seems clear that parallelization (speed-up and scale-up) of such huge data pipelines is potentially an important differentiator. All the way through this pipeline, activities could be accelerated using PBAs. For example, at the signal combination phase, the Shapley value is a metric that can be used to compute the contribution of a given feature to a prediction. Shapley value computation has PBA-acceleration support in the Python XGBoost library, as sketched below.
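
A minimal sketch of that capability, assuming a recent XGBoost build with GPU support; the synthetic data stands in for a real alpha set, and the point is only that the per-feature Shapley contributions are computed on the accelerator.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))            # 50 hypothetical alphas/features
y = X[:, 0] * 0.5 + rng.normal(size=100_000)  # target with one informative feature

dtrain = xgb.DMatrix(X, label=y)
params = {"tree_method": "hist", "device": "cuda", "objective": "reg:squarederror"}
booster = xgb.train(params, dtrain, num_boost_round=100)

# pred_contribs=True returns one Shapley contribution per feature (plus a bias term),
# computed on the GPU when the device is set to CUDA.
shap_values = booster.predict(dtrain, pred_contribs=True)
print(shap_values.shape)  # (100000, 51)
```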

Analytics

In this section, we consider the applicability of accelerator parallelism to analytics workloads. One of the parallelizable dwarfs is Monte Carlo, and for FSI and time series work in general, this is an important method. Monte Carlo is a way to compute expected values by generating random scenarios and then averaging them. By using GPUs, a simulated path can be assigned to each thread, allowing simulation of thousands of paths in parallel.
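
A minimal sketch of this one-path-per-thread idea, using PyTorch on a GPU (an illustrative choice) to price a European call under geometric Brownian motion by simulating millions of terminal prices in parallel and averaging the discounted payoff; the parameters are hypothetical.

```python
import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

s0, k, r, sigma, t = 100.0, 105.0, 0.03, 0.2, 1.0  # spot, strike, rate, vol, maturity
n_paths = 10_000_000

z = torch.randn(n_paths, device=device)             # one random draw per path, in parallel
s_t = s0 * torch.exp((r - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * z)
payoff = torch.clamp(s_t - k, min=0.0)              # call payoff per path
price = math.exp(-r * t) * payoff.mean().item()     # discounted expected value

print(f"Monte Carlo call price: {price:.3f}")
```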

Following the 2008 credit crunch, new regulations require banks to run credit valuation adjustment (CVA) calculations every 24 hours. CVA is an adjustment to a derivative's price as charged by a bank to a counterparty. CVA is one of a family of related valuation adjustments collectively known as xVA, which include debt valuation adjustment (DVA), initial margin valuation adjustment (MVA), capital valuation adjustment (KVA), and funding valuation adjustment (FVA). Because this adjustment calculation can happen over large portfolios of complex, non-linear instruments, closed-form analytical solutions aren't possible, and as such an empirical approximation by a technique such as Monte Carlo is required. The downside of Monte Carlo here is how computationally demanding it is, due to the size of the search space. The advent of this new regulation coincided with the coming of age of GPUs, and as such banks commonly use GPU grids to run their xVA calculations. In XVA principles, nested Monte Carlo strategies, and GPU optimizations, the authors find a nested simulation time of about an hour for a billion scenarios on the bank portfolio, and a GPU speedup of about 100 times relative to CPUs. Rather than develop xVA applications internally, banks often use third-party independent software vendor (ISV) solutions to run their xVA calculations, such as Murex M3 or S&P Global XVA. Banking customers can choose to run such ISV software as a service (SaaS) solutions inside their own AWS accounts, and often on AWS accelerated instances.

A second use of PBAs in FSI Monte Carlo is in option pricing, especially for exotic options whose payoff is sometimes too complex to solve in closed form. The core idea is using a random number generator (RNG) to simulate the stochastic components in a formula and then average the results, leading to the expected value. The more paths that are simulated, the more accurate the result. In Quasi-Monte Carlo methods for calculating derivatives sensitivities on the GPU, the authors find a 200-times speedup over CPUs, and additionally develop a number of refinements to reduce variance, leading to fewer paths needing to be simulated. In High Performance Financial Simulation Using Randomized Quasi-Monte Carlo Methods, the authors survey quasi-Monte Carlo sequences in GPU libraries and review commercial software tools to help migrate Monte Carlo pricing models to GPU. In GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model, the author computes a volatility measure using Hybrid Monte Carlo (HMC) applied to realized stochastic volatility (RSV), parallelized on a GPU, resulting in a 17-times speedup. Finally, in Derivatives Sensitivities Computation under Heston Model on GPU, the authors achieve a 200-times speedup; however, for some Greeks the accuracy of the GPU method is inferior to the CPU.

A third use of PBAs in FSI Monte Carlo is in LOB simulations. We can categorize different types of LOB simulations: replay of the public historical data, replay of the mapped public-private historical data, replay of synthetic LOB data, and replay of a mix of historical and synthetic data to simulate the effects of a feedback loop. For each of these types of simulation, there are multiple ways in which hardware acceleration could occur. For example, for the simple replay case, each accelerator thread could have a different LOB. For the synthetic data case, each thread could have a different version of the same LOB, thereby allowing multiple realizations of a single LOB. In Limit Order Book Simulations: A Review, the authors provide their own simulator classification scheme based on the mathematical modeling technique used—point processes, agent based, deep learning, stochastic differential equations. In JAX-LOB: A GPU-Accelerated limit order book simulator to unlock large scale reinforcement learning for trading, the authors use GPU accelerated training, processing thousands of LOBs in parallel, giving a “notably reduced per message processing time.”

Machine learning

Generative AI is the most topical ML application at this point in time. Generative AI has four main applications: classification, prediction, understanding, and data generation, which in turn map to use cases such as customer experience, knowledge worker productivity, surfacing information and sentiment, and innovation and automation. FSI examples exist for all of these; however, a thorough review of these is beyond the scope of this post. For this post, we remain focused on PBA applicability and look at two of these topics: chatbots and time series prediction.

In 2017, the publication of the paper Attention is all you need resulted in a new wave of interest in ML. The transformer architecture presented in this paper allowed for a highly parallelizable network structure, meaning more data could be processed than before, allowing patterns to be better captured. This has driven impressive real-world performance, as seen in popular public foundation models (FMs) such as OpenAI ChatGPT and Anthropic Claude. These factors in turn have driven new demand for PBAs for training and inference on these models.

FMs, also termed LLMs, or chatbots when text focused, are models typically trained on a broad spectrum of generalized and unlabeled data, and are capable of performing a wide variety of general tasks. An FSI example is the Bridgewater Associates LLM-powered Investment Analyst Assistant, which generates charts, computes financial indicators, and summarizes results. FSI LLMs are reviewed in Large Language Models in Finance: A Survey and A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges. FMs are often used as base models for developing more specialized downstream applications.

PBAs are used in three different types of FM training. Firstly, to train an FM from scratch. In BloombergGPT: A Large Language Model for Finance, the training dataset was 51% financial data from their systems and 49% public data, such as Wikipedia and Pile. SageMaker was used to train and evaluate their FM; specifically, 64 p4d.24xlarge instances, giving a total of 512 A100 GPUs. Also used was SageMaker model parallelism, enabling the automatic distribution of the large model across multiple GPU devices and instances. The authors started with a compute budget of 1.3 million GPU hours, and noted that training took approximately 53 days.

The second training approach is to fine-tune an existing FM. This requires using an FM whose model parameters are exposed, and updating them in light of new data. This approach can be effective when the data corpus differs significantly from the FM training data. Fine-tuning is cheaper and quicker than training an FM from scratch, because the volume of data is likely to be much smaller. As with the larger-scale training from scratch, fine-tuning benefits significantly from hardware acceleration. In an FSI example, Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors fine-tune an FM and find that their approach outperforms standard continual pre-training performance with just 10% of the corpus size and cost, without any degradation on open-domain standard tasks.

The third training approach is to perform Retrieval Augmented Generation (RAG). To equip FMs with up-to-date and proprietary information, organizations use RAG, a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. The two-step workflow consists of ingesting and vectorizing data, followed by runtime orchestration. Although hardware acceleration is less common in RAG applications, search latency is a key component, and as such the inference step of RAG can be hardware optimized. For example, the performance of OpenSearch, a search and vector database service available on AWS, can be improved by using PBAs, with both NVIDIA GPUs and Inferentia being supported.

For these three training approaches, the role of PBAs varies. For processing the huge data volumes of FM building, PBAs are essential. Then, as the training volumes reduce, so does the value-add role of the PBA. Independent of how the model has been trained, PBAs have a key role in LLM inference, again because they are optimized for memory bandwidth and parallelism. The specifics of how to optimally use an accelerator depend on the use case—for example, a paid-for-service chatbot might be latency sensitive, whereas for a free version, a delay of a few milliseconds might be acceptable. If a delay is acceptable, then batching queries together can help make sure a given chip's processing is saturated, giving better dollar usage of the resource, as sketched below. Dollar costs are particularly important in inference because, unlike training, which is a one-time cost, inference is a recurring cost.
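
A minimal sketch of that batching idea, with a placeholder PyTorch model standing in for an LLM: requests arriving within a short window are stacked into a single batch so that one forward pass serves all of them.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device).eval()  # stand-in for a real model

# Individual requests arriving within a short batching window...
requests = [torch.randn(128) for _ in range(32)]

# ...are stacked into one batch so a single kernel launch serves all of them.
batch = torch.stack(requests).to(device)
with torch.no_grad():
    outputs = model(batch)  # one forward pass for 32 requests
print(outputs.shape)        # torch.Size([32, 10])
```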

Using ML for financial time series prediction is nothing new; a large body of public research exists on these methods and applications dating back to the 1970s, and for approximately the last decade, PBAs have been applied to this field. As discussed earlier, most ML approaches can be accelerated with hardware; however, the attention-based architecture using the transformer model is currently the most topical. We consider three areas of FSI application: time series FMs, NNs for securities prediction, and reinforcement learning (RL).

The initial work on LLMs was conducted on text-based models. This was followed by multi-modal models, able to handle images and other data structures. Subsequent to this, publications have started to appear on time series FMs, including Amazon Chronos, Nixtla TimeGEN-1, and Google TimesFM. The behavior of the time series models appears to be similar to that of the language models. For example, in Scaling-laws for Large Time-series Models, the authors observe the models follow the same scaling laws. A review of these models is provided in Foundation Models for Time Series Analysis: A Tutorial and Survey. As with leading LLMs, time series FMs are likely to be successfully trained on large clusters of PBAs. In terms of size, GPT-3 was trained on a cluster of 10,000 V100s. The size of the GPT-4 training cluster is not public, but is speculated to have been trained on a cluster of 10,000–25,000 A100s. This is analogous in size to one algorithmic trading firm’s statement, “our dedicated research cluster contains … 25,000 A/V100 GPUs (and growing fast).”

Looking to the future, one possible outcome might be that time series FMs, trained at huge expense by a few large corporates, become the base models for all financial prediction. Financial services firms then modify these FMs through additional training with private data or their own insights. Examples of private labeled data might be knowledge of which orders and executions in the public feed belonged to them, or similarly which (meta)orders and executions had parent-child relationships.

Although such financial time series FMs trained on PBA clusters may offer enhanced predictive capabilities, they also bring risks. For example, the EU's AI Act, adopted in March 2024, states that if a model has been trained with a total compute power in excess of 10^25 FLOPs, then that model is considered to pose “systemic risk” and is subject to enhanced regulation, including fines of 3% of global turnover; on this basis, Meta announced in June 2024 that they will not be enabling some models inside Europe. This legislation assumes that training compute is a direct proxy for model capabilities. EpochAI provides an analysis of the training compute required for a wide range of FMs; for example, GPT-4 took 2.1 x 10^25 FLOPs to train (exceeding the threshold by a factor of 2.1), whereas BloombergGPT took 2.4 x 10^23 FLOPs (under the threshold by a factor of 0.02). It seems possible that in the future, similar legislation may apply to financial FMs, or even to the PBA clusters themselves, with some market participants choosing not to operate in legislative regimes that are subject to such risks.

Feature engineering plays a key role in building NN models, because features are fed into the NN model. As seen earlier in this post, some participants have generated large numbers of features. Examples of features derived from market time series data include bid-ask spreads, weighted mid-points, imbalance measures, decompositions, liquidity predictions, trends, change-points, and mean-reversions. Together, the features are called the feature space. A transformer assigns more importance to part of the input feature space, even though it might only be a small part of the data. Learning which part of the data is more important than another depends on the context of the features. The true power of FMs in time series prediction is the ability to capture these conditional probabilities (the context) across the feature space. To give a simple example, based on historical data, trends might reduce in strength as they go on, leading to a change-point, and then reversion to the mean. A transformer potentially offers the ability to recognize this pattern and capture the relationship between the features more accurately than other approaches. An informative visualization of this for the textual case is given by the FT article Generative AI exists because of the transformer. In order to build and train such FMs on PBAs, access to high-quality historical data tightly coupled with scalable compute to generate the features is an essential prerequisite.
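To make the idea of feature engineering on market data more concrete, the following is a minimal sketch in Python using pandas. The column names, the rolling window length, and the specific feature definitions are simplified assumptions; production feature pipelines are far larger and typically run on distributed or accelerated compute:

# Minimal sketch: deriving simple market-microstructure features from
# top-of-book data. Column names (bid_px, ask_px, bid_qty, ask_qty) and
# the rolling window are illustrative assumptions.
import pandas as pd

def add_lob_features(quotes: pd.DataFrame, window: int = 100) -> pd.DataFrame:
    out = quotes.copy()
    # Bid-ask spread and simple mid-point
    out["spread"] = out["ask_px"] - out["bid_px"]
    out["mid"] = (out["ask_px"] + out["bid_px"]) / 2
    # Size-weighted mid-point (micro-price style weighting)
    total_qty = out["bid_qty"] + out["ask_qty"]
    out["weighted_mid"] = (
        out["ask_px"] * out["bid_qty"] + out["bid_px"] * out["ask_qty"]
    ) / total_qty
    # Order book imbalance in the range [-1, 1]
    out["imbalance"] = (out["bid_qty"] - out["ask_qty"]) / total_qty
    # Rolling trend and mean-reversion style signals on the mid-point
    rolling_mean = out["mid"].rolling(window).mean()
    out["trend"] = out["mid"].diff(window)
    out["mean_reversion"] = out["mid"] - rolling_mean
    return out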

Prior to the advent of the transformer, NNs had been applied to securities prediction with varying degrees of success. Deep Learning for Limit Order Books uses a cluster of 50 GPUs to predict the sign of the future return by mapping the price levels of the LOB to the visible input layer of a NN, resulting in a trinomial output layer. Conditional on the sign of the return, the magnitude of the return is estimated using regression. Deep Learning Financial Market Data uses raw LOB data pre-processed into discrete, fixed-length features to train a recurrent autoencoder, whose recurrent structure allows learning patterns on different time scales. Inference occurs by generating the decoded LOB and nearest-matching it to the real-time data.

In Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units, the authors benchmark the performance of Graphcore IPUs against an NVIDIA GPU on an encoder-decoder NN model. Because encoder-decoder models rely on recurrent neural layers, they generally suffer from slow training. The authors find that the IPU offers a significant training speedup over the GPU, 694% on average, analogous to the speedup a transformer architecture would provide. In some examples of post-transformer work in this space, Generative AI for End-to-End Limit Order Book Modelling and A Generative Model Of A Limit Order Book Using Recurrent Neural Networks train LLM analogues on historical LOB data, interpreting each LOB event (such as insertions, cancellations, and executions) as a word and predicting the series of events following a given word history. However, the authors find the prediction horizon for LOB dynamics appears to be limited to a few tens of events, possibly because of the high dimensionality of the problem and the presence of long-range correlations in order sign. These results have been improved in the work “Microstructure Modes” — Disentangling the Joint Dynamics of Prices & Order Flow, by down-sampling the data and reducing its dimensionality, allowing identification of stable components.

RL is an ML technique in which an algorithm interacts with a dynamic environment that provides feedback, allowing the algorithm to iteratively optimize a reward metric. Because RL closely mimics how human traders interact with the world, it has various areas of applicability in FSI. In JAX-LOB: A GPU-Accelerated limit order book simulator to unlock large scale reinforcement learning for trading, the authors use GPUs for end-to-end RL training and report a 7-times speedup for RL agent training on a GPU relative to a CPU-based simulation implementation. The authors then apply this to the problem of optimal trade execution. A second FSI application of RL to optimal trade execution has been reported by JPMorgan in an algorithm called LOXM.
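To illustrate the RL paradigm itself, the following is a self-contained sketch of tabular Q-learning on a toy order execution problem. The environment, states, and rewards are deliberately simplified assumptions for intuition only, and bear no relation to the JAX-LOB or LOXM implementations described above:

# Minimal sketch: tabular Q-learning on a toy execution problem.
# The agent must sell its inventory over a fixed horizon; each step it chooses
# how aggressively to trade. The price process and impact are crude assumptions.
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]          # 0 = wait, 1 = sell 1 unit, 2 = sell 2 units
HORIZON, INVENTORY = 10, 10

def step(t, inv, action):
    sold = min(action, inv)
    impact = 0.01 * sold                      # crude linear price impact
    price = 100 + random.gauss(0, 0.05) - impact
    reward = sold * price
    return t + 1, inv - sold, reward

q = defaultdict(float)                        # Q[(time, inventory, action)]
alpha, gamma, epsilon = 0.1, 1.0, 0.1

for episode in range(5000):
    t, inv = 0, INVENTORY
    while t < HORIZON and inv > 0:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: q[(t, inv, x)])
        t2, inv2, r = step(t, inv, a)
        # Standard Q-learning update
        best_next = max(q[(t2, inv2, x)] for x in ACTIONS)
        q[(t, inv, a)] += alpha * (r + gamma * best_next - q[(t, inv, a)])
        t, inv = t2, inv2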

Latency-sensitive, real-time workloads

Being able to transmit, process, and act on data more quickly than others gives an informational advantage. In the financial markets, this is directly equivalent to being able to profit from trading. These real-time, latency-sensitive workloads exist on a spectrum, from the most sensitive to the least sensitive. The specific numbers in the following table are open to debate, but they convey the general idea.

Band | Latency | Application examples
1 | Less than 1 microsecond | Low-latency trading strategy. Tick-to-trade.
2 | 1–4 microseconds | Feed handler. Raw or normalized format.
3 | 40 microseconds | Normalized format and symbology.
4 | 4–200 milliseconds | Consolidated feed. Full tick.
5 | 1 second to daily | Intraday and EOD. Reference, Corp, FI, derivatives.

The most latency-sensitive use cases are typically handled by FPGAs or custom ASICs, which react to incoming network traffic, such as market data, and place the triggering logic directly in the network interface controller. Easily reprogrammable PBAs currently play little to no role in this latency-sensitive work, because their SIMD architecture is designed for parallel processing of large amounts of data and is bottlenecked by the bandwidth of getting data onto the chip.

However, three factors may be driving change in the role hardware acceleration plays in the low-latency space. Firstly, as PBAs mature, some of their previous barriers are being reduced. For example, NVIDIA’s latest NVLink design enables significantly higher bandwidth than previous chip interconnects, meaning that data can get onto the chip far more quickly than before. Comparing the latest NVIDIA GB200 chip against the previous-generation NVIDIA H100, NVLink bandwidth has increased fourfold, from 900 GB/s to 3.6 TB/s.

Secondly, some observers believe the race for speed is shifting to a “race for intelligence.” With approximately only ten major firms competing in the top-tier low-latency space, the barrier to entry seems almost insurmountable for other parties. At some point, low-latency hardware and techniques might slowly diffuse through technology supplier offerings, eventually leveling the playing field, perhaps driven by new regulations.

Thirdly, although FPGAs and ASICs undoubtedly provide the fastest performance, they come at a cost: their developers are hard to hire, the work has long deployment cycles, and it results in a significant maintenance burden, with bugs that are difficult to diagnose and triage. Firms are keen to identify alternatives.

Although the most latency-sensitive work will remain on FPGAs and ASICs, there may be a shift of less latency-sensitive work from FPGAs and ASICs to GPUs and other PBAs as users weigh the trade-off between speed and other factors. In comparison, developers for easily reprogrammable PBAs are simple to hire, the chips are straightforward to code against and maintain, and they allow for relatively rapid innovation. Looking to the future, we may see innovation at the language level, for example through functional programming with array languages such as the Co-dfns project, as well as further innovation at the hardware level, with future chips tightly integrating the best components of today’s FPGAs, GPUs, and CPUs.

Key Takeaways

In this section, we present three key takeaways. Firstly, the global supply-demand ratio for GPUs is low, meaning prices can be high and availability low. This can be a constraining factor for end-user businesses wanting to innovate in this space. AWS helps address this on behalf of its customers in three ways:

  • Through economies of scale, AWS is able to offer significant availability of the PBAs, including GPUs.
  • Through in-house research and development, AWS is able to offer its own PBAs, developed and manufactured in-house, which are not subject to the constraints of the wider market, while also having optimized price-performance.
  • AWS innovates at the software level to improve allocation to the end-user. Therefore, although total capacity might be fixed, by using intelligent allocation algorithms, AWS is better able to meet customers’ needs. For example, Amazon EC2 Capacity Blocks for ML enables guaranteed access to the required PBAs at the point in time they are needed.

The second takeaway is that proprietary software can lock users in to a single supplier and act as a barrier to innovation. In the case of PBAs, chips programmed through proprietary software make it difficult for users to move between chip manufacturers, whereas open source software supports multiple chip manufacturers. Any future supply constraints, such as regional armed conflict, could further exacerbate existing supply-demand imbalances. Although migrating existing legacy workloads from an acceleration chip with proprietary software can be challenging, new greenfield workloads can be built on open source libraries without difficulty. In the FSI space, examples of legacy workloads might include risk calculations, and examples of greenfield workloads might include time series prediction using FMs. In the long term, business leaders need to consider and formulate their strategy for moving away from software lock-in, enabling access to a wider range of acceleration hardware and the cost benefits that can bring.

The final takeaway is that financial services, and capital markets in particular, are subject to constant and evolving competitive pressures. Over time, the industry has seen the race for differentiation move from data access rights, to latency, and now to an increased focus on predictive power. Looking to the future, if the world of financial prediction is based in part on a small number of expensive and complex FMs built and trained by a few large global corporates, where will the differentiation come from? Speculative areas could range from at-scale feature engineering to being able to better handle increased regulatory burdens. Whichever field it comes from, it is certain to include data processing and analytics at its core, and therefore benefit from hardware acceleration.

Conclusion

This post aimed to provide business leaders with a non-technical overview of PBAs and their role within the FSI. With this technology currently being regularly discussed in the mainstream media, it is essential business leaders understand the basis of this technology and its potential future role. Nearly every organization is now looking to a data-centric future, enabled by cloud-based infrastructure and real-time analytics, to support revenue-generating AI and ML use cases. One of the ways organizations will be differentiated in this race will be by making the right strategic decisions about technologies, partners, and approaches. This includes topics such as open source versus closed source, build versus buy, tool complexity and associated ease of use, hiring and retention challenges, and price-performance. Such topics are not just technology decisions within a business, but also cultural and strategic ones.

Business leaders are encouraged to reach out to their AWS point of contact and ask how AWS can help their business win in the long term using PBAs. This might result in a range of outcomes, from a short proof of concept against an existing well-defined business problem, to a written strategy document that can be consumed and debated by peers, to onsite technical workshops and business briefing days. Whatever the outcome, the future of this space is sure to be exciting!

Acknowledgements

I would like to thank the following parties for their kind input and guidance in writing this post: Andrea Rodolico, Alex Kimber, and Shruti Koparkar. Any errors are mine alone.


About the Author

Dr. Hugh Christensen works at Amazon Web Services with a specialization in data analytics. He holds undergraduate and master’s degrees from Oxford University, the latter in computational biophysics, and a PhD in Bayesian inference from Cambridge University. Hugh’s areas of interest include time series data, data strategy, data leadership, and using analytics to drive revenue generation. You can connect with Hugh on LinkedIn.

Read More

Anomaly detection in streaming time series data with online learning using Amazon Managed Service for Apache Flink


Time series data is a distinct category that incorporates time as a fundamental element in its structure. In a time series, data points are collected sequentially, often at regular intervals, and they typically exhibit certain patterns, such as trends, seasonal variations, or cyclical behaviors. Common examples of time series data include sales revenue, system performance data (such as CPU utilization and memory usage), credit card transactions, sensor readings, and user activity analytics.

Time series anomaly detection is the process of identifying unexpected or unusual patterns in data that unfold over time. An anomaly, also known as an outlier, occurs when a data point deviates significantly from an expected pattern.

For some time series, like those with well-defined expected ranges such as machine operating temperatures or CPU usage, a threshold-based approach might suffice. However, in areas like fraud detection and sales, where simple rules fall short due to their inability to catch anomalies across complex relationships, more sophisticated techniques are required to identify unexpected occurrences.

In this post, we demonstrate how to build a robust real-time anomaly detection solution for streaming time series data using Amazon Managed Service for Apache Flink and other AWS managed services.

Solution overview

The following diagram illustrates the core architecture of the Anomaly Detection Stack solution.

This solution employs machine learning (ML) for anomaly detection, and doesn’t require users to have prior AI expertise. It offers an AWS CloudFormation template for straightforward deployment in an AWS account. With the CloudFormation template, you can deploy an application stack with the necessary AWS resources required for detecting anomalies. Setting up one stack creates an application with one anomaly detection task or detector. You can set up multiple such stacks to run them simultaneously, with each one analyzing the data and reporting back the anomalies.

The application, once deployed, constructs an ML model using the Random Cut Forest (RCF) algorithm. It initially sources input time series data from Amazon Managed Streaming for Apache Kafka (Amazon MSK), using this live stream for model training. Post-training, the model continues to process incoming data points from the stream. It evaluates these points against the historical trends of the corresponding time series. The model also generates an initial raw anomaly score while processing and maintains an internal threshold to eliminate noisy data points. Subsequently, the model generates a normalized anomaly score for each data point that the model treats as an anomaly. These scores, ranging from 0–100, indicate the deviation from typical patterns; scores closer to 100 signify higher anomaly levels. You have the flexibility to set a custom threshold on these anomaly scores, allowing you to define what you consider anomalous.
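For intuition on how a streaming scorer with a learning phase, a normalized 0–100 score, and a decision threshold might behave, the following is a simplified stand-in written in Python. It uses a rolling z-score rather than the Random Cut Forest algorithm the managed solution actually uses, so it is for illustration only:

# Simplified stand-in for the solution's scoring behavior: a rolling z-score
# normalized to a 0-100 range, with a user-defined decision threshold.
# This is NOT the Random Cut Forest algorithm used by the managed solution.
from collections import deque
import math

class SimpleAnomalyScorer:
    def __init__(self, window: int = 120, threshold: float = 70.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def score(self, value: float) -> dict:
        if len(self.history) < 10:                 # still "learning"
            self.history.append(value)
            return {"anomalyScore": 0.0, "anomalyDecision": 0, "modelStage": "LEARNING"}
        mean = sum(self.history) / len(self.history)
        var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
        z = abs(value - mean) / math.sqrt(var + 1e-9)
        anomaly_score = min(100.0, 25.0 * z)       # crude mapping of z-score to 0-100
        self.history.append(value)
        return {
            "anomalyScore": anomaly_score,
            "anomalyDecision": int(anomaly_score > self.threshold),
            "modelStage": "INFERENCE",
        }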

This solution uses a CloudFormation template, which takes inputs such as MSK broker endpoint and topics, AWS Identity and Access Management (IAM) roles, and other parameters related to virtual private cloud (VPC) configuration. The template creates the essential resources like the Apache Flink application and Amazon SageMaker real-time endpoint in the customer account.

To request access to this solution, send an email to anomalydetection-support-canvas@amazon.com.

In this post, we outline how you can build an end-to-end solution with the Anomaly Detection Stack. Consider a hypothetical sales scenario where AnyBooks, an on-campus bookstore at a large university, sells various supplies to college students. Due to the timing of class schedules, their seasonality is such that they sell around 20 Item-A units and 30 Item-B units during even hours, and approximately half that during odd hours throughout the day. Recently, there have been some unexplained spikes in the quantity of items sold, and the management team wants to start tracking these quantity anomalies so that they can better plan their staffing and inventory levels.

The following diagram shows the detailed architecture for the end-to-end solution.

In the following sections, we discuss each layer shown in the preceding diagram.

Ingestion

In the ingestion layer, an AWS Lambda function retrieves sales transactions for the current minute from a PostgreSQL transactional database, transforms each record into a JSON message, and publishes it to an input Kafka topic. This Lambda function is configured to run every minute using Amazon EventBridge Scheduler.
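A minimal sketch of such an ingestion function is shown below. It assumes the kafka-python and psycopg2 client libraries, and the table name, topic name, and environment variables are hypothetical; MSK authentication (IAM or TLS) configuration is omitted for brevity:

# Sketch of the ingestion Lambda: read the last minute of sales from
# PostgreSQL and publish each row as JSON to the input Kafka topic.
import json
import os
import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ["MSK_BROKERS"].split(","),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handler(event, context):
    conn = psycopg2.connect(os.environ["PG_DSN"])
    with conn, conn.cursor() as cur:
        cur.execute(
            """SELECT order_time, product_name, quantity
               FROM sales
               WHERE order_time >= now() - interval '1 minute'"""
        )
        for order_time, product_name, quantity in cur.fetchall():
            producer.send(
                os.environ["INPUT_TOPIC"],
                {
                    "timestamp": order_time.isoformat(),
                    "product_name": product_name,
                    "quantity": quantity,
                },
            )
    producer.flush()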

Anomaly detection stack

The Flink application reads raw data from the input MSK topic, trains the model, and begins detecting anomalies, ultimately recording them to the MSK output topic. The following is an example of the output results JSON:

{"detectorName":"canvas-ad-blog-demo-1","measure":"quantity","timeseriesId":"f3c7f14e7a445b79a3a9877dfa02064d56533cc29fb0891945da4512c103e893","anomalyDecisionThreshold":70,"dimensionList":[{"name":"product_name","value":"item-A"}],"aggregatedMeasureValue":14.0,"anomalyScore":0.0,"detectionPeriodStartTime":"2024-08-29 13:35:00","detectionPeriodEndTime":"2024-08-29 13:36:00","processedDataPoints":1261,"anomalyConfidenceScore":80.4674989791107,"anomalyDecision":0,"modelStage":"INFERENCE","expectedValue":0.0}

The following is a brief explanation of the output fields:

  • measure – This represents the metric we are tracking for anomalies. In our case, the measure field is the quantity of sales for Item-A.
  • aggregatedMeasureValue – This represents the aggregated value of the quantity in the time window.
  • timeseriesId – This unique identifier corresponds to a combination of unique values for the dimensions and the metric. In this scenario, it’s the product name, Item-A, within the product_name dimension.
  • anomalyConfidenceScore – As the model evolves through learning and inference, this confidence score will progressively improve.
  • anomalyScore – This field represents the score for anomaly detection. With the anomalyDecisionThreshold set at 70, any value exceeding 70 is considered a potential anomaly.
  • modelStage – When the model is in the learning phase, the anomalyScore is 0.0 and the value of this field is set to LEARNING. After the learning is complete, the value of this field changes to INFERENCE.
  • anomalyDecisionThreshold – The decision threshold is provided as input in the CloudFormation stack. If you determine there are too many false positives, you can increase this threshold to change the sensitivity.
  • anomalyDecision – If the anomalyScore exceeds the anomalyDecisionThreshold, this field is set to 1, indicating an anomaly is detected.

Transform

In the transformation layer, an Amazon Data Firehose stream is configured to consume data from the output Kafka topic and invoke a Lambda function for transformation. The Lambda function flattens the nested JSON data from the Kafka topic. The transformed results are then partitioned by date and stored in an Amazon Simple Storage Service (Amazon S3) bucket in Parquet format. An AWS Glue crawler is used to crawl the data in the Amazon S3 location and catalog it in the AWS Glue Data Catalog, making it ready for querying and analysis.
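The following is a minimal sketch of the flattening logic such a transformation Lambda function might contain. The record envelope follows the standard Firehose transformation contract, and the field names match the sample output shown earlier; the exact transformation used by the solution may differ:

# Sketch of the Firehose transformation Lambda: flatten the nested anomaly
# JSON (in particular dimensionList) into top-level fields.
import base64
import json

def flatten(result: dict) -> dict:
    flat = {k: v for k, v in result.items() if k != "dimensionList"}
    for dim in result.get("dimensionList", []):
        flat[f"dimension_{dim['name']}"] = dim["value"]
    return flat

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        flattened = json.dumps(flatten(payload)) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(flattened.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}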

Visualize

To visualize the data, we’ve created an Amazon QuickSight dashboard that connects to the data in Amazon S3 through the Data Catalog and queries it using Amazon Athena. The dashboard can be refreshed to display the latest detected anomalies, as shown in the following screenshot.

In this example, the darker blue line in the line graph represents the seasonality of the quantity measure for Item-A over time, showing higher values during even hours and lower values during odd hours. The pink line represents the anomaly detection score, plotted on the right Y-axis. The anomaly score approaches 100 when the quantity value significantly deviates from its seasonal pattern. The blue line represents the anomaly threshold, set at 70. When anomalyScore exceeds this threshold, anomalyDecision is set to 1.

The “Number of Timeseries Tracked” KPI displays how many time series the model is currently monitoring. In this case, because we’re tracking two products (Item-A and Item-B), the count is 2. The “Number of Datapoints Processed” KPI shows the total number of data points the model has processed, and the “Anomaly Confidence Score” indicates the confidence level in predicting anomalies. Initially, this score is low, but will approach 100 as the model matures over time.

Notification

Although visualization is valuable for investigating anomalies, data analysts often prefer to receive near real-time notifications for critical anomalies. This is achieved by adding a Lambda function that reads results from the output Kafka topic and analyzes them. If the anomalyScore value exceeds the defined threshold, the function invokes an Amazon Simple Notification Service (Amazon SNS) topic to send email or SMS notifications to a designated list, alerting the team about the anomaly in near real time.
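A minimal sketch of this notification function is shown below. The event shape assumes an MSK event source mapping delivering batches of base64-encoded records, and the SNS topic ARN is a hypothetical environment variable:

# Sketch of the notification Lambda: inspect anomaly results delivered from
# the output Kafka topic and publish an SNS alert when the decision
# threshold has been exceeded.
import base64
import json
import os
import boto3

sns = boto3.client("sns")

def handler(event, context):
    for records in event["records"].values():        # MSK event source batches records by partition
        for record in records:
            result = json.loads(base64.b64decode(record["value"]))
            if result.get("anomalyDecision") == 1:
                sns.publish(
                    TopicArn=os.environ["ALERT_TOPIC_ARN"],
                    Subject=f"Anomaly detected for {result['measure']}",
                    Message=json.dumps(result, indent=2),
                )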

Conclusion

This post demonstrated how to build a robust real-time anomaly detection solution for streaming time series data using Managed Service for Apache Flink and other AWS services. We walked through an end-to-end architecture that ingests data from a source database, passes it through an Apache Flink application that trains an ML model and detects anomalies, and then lands the anomaly data in an S3 data lake. The anomaly scores and decisions are visualized through a QuickSight dashboard connected to the Amazon S3 data using AWS Glue and Athena. Additionally, a Lambda function analyzes the results and sends notifications in near real time.

With AWS managed services like Amazon MSK, Data Firehose, Lambda, and SageMaker, you can rapidly deploy and scale this anomaly detection solution for your own time series use cases. This allows you to automatically identify unexpected behaviors or patterns in your data streams in real time without manual rules or thresholds.

Give this solution a try, and explore how real-time anomaly detection on AWS can unlock insights and optimize operations across your business!


About the Authors

Noah Soprala is a Solutions Architect based out of Dallas. He is a trusted advisor to his customers and helps them build innovative solutions using AWS technologies. Noah has over 20 years of experience in consulting, development, and solution architecture and delivery.

Dan Sinnreich is a Sr. Product Manager for Amazon SageMaker, focused on expanding no-code / low-code services. He is dedicated to making ML and generative AI more accessible and applying them to solve challenging problems. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.

Syed Furqhan is a Senior Software Engineer for AI and ML at AWS. He was part of many AWS service launches like Amazon Lookout for Metrics, Amazon SageMaker, and Amazon Bedrock. Currently, he is focusing on generative AI initiatives as part of Amazon Bedrock Core Systems. He is a clean code advocate and a subject-matter expert on serverless and event-driven architecture. You can follow him on LinkedIn: syedfurqhan

Nirmal Kumar is Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.

Read More

Generative AI-powered technology operations


Technology operations (TechOps) refers to the set of processes and activities involved in managing and maintaining an organization’s IT infrastructure and services. There are several terminologies used with reference to managing information technology operations, including ITOps, SRE, AIOps, DevOps, and SysOps. For the context of this post, we refer to these terminologies as TechOps. This includes tasks such as managing servers, networks, databases, and applications to maintain the reliability, performance, and security of IT systems. However, certain tasks require manual and repetitive efforts such as incident detection and response, analyzing incoming tickets from disparate service providers, finding standard operating procedures for known and unknown issues, and managing support case resolution. In recent years, TechOps has been using AI capabilities—called AIOps—for operational data collection, aggregation, and correlation to generate actionable insights, identify root causes, and more.

This post describes how AWS generative AI solutions (including Amazon Bedrock, Amazon Q Developer, and Amazon Q Business) can further enhance TechOps productivity, reduce time to resolve issues, enhance customer experience, standardize operating procedures, and augment knowledge bases. The ability of generative AI technology to interpret complex situations on a nuanced, case-by-case basis implies that generative AI can solve challenges that other approaches—including traditional artificial intelligence and machine learning (AI/ML)-based pattern matching—couldn’t handle. The following table depicts a few examples of how AWS generative AI services can help with some of the day-to-day TechOps activities.

Amazon Bedrock | Amazon Q Developer | Amazon Q Business
Root cause analysis | Maintenance tasks code generation | Standard operating procedure
Knowledge base creation | Increase productivity and efficiency | Organization policy and procedure
Recurring reporting | – | Customer experience and sentiment analysis
Outbound support case generation | – | Shift handover chatbot
Inbound maintenance notifications formatting | – | –

A typical day in the life of a TechOps team includes issue resolution, root cause analysis, maintenance activities, and updating knowledge bases to provide a positive customer experience. In the following sections, we discuss some of these areas and how generative AI can help enhance TechOps.

Event management

By monitoring systems and analyzing patterns in performance data, an AI model can predict issues before they cause outages or degraded service. When incidents do occur, generative AI models can generate preliminary documentation of the event, including details on impacted systems, potential root causes, and troubleshooting steps. This allows engineers to quickly get up to speed on new incidents and accelerate response efforts.

Generative AI can also generate summary reports of past incidents to help teams identify recurring problems and opportunities for preventative measures. Furthermore, it can help with formatting inbound maintenance notifications from various service providers into a standard format, which can speed up understanding the impact of upcoming maintenance. Similarly, generative AI can automatically generate outbound cases to service providers if it detects an anomaly.

By taking over basic documentation and prediction tasks, generative AI can help infrastructure teams spend less time on repetitive work and more time resolving issues to improve overall system reliability.

To learn more about using Amazon Bedrock for summary tasks, refer to Create summaries of recordings using generative AI with Amazon Bedrock and Amazon Transcribe. To learn how Wiz uses Amazon Bedrock to address security risks, see How Wiz is empowering organizations to remediate security risks faster with Amazon Bedrock. To learn how HappyFox uses Anthropic Claude in Amazon Bedrock, refer to HappyFox Automates Support Agent Responses with Claude in Amazon Bedrock, Increasing Ticket Resolution by 40%.

Knowledge base management

Generative AI has the potential to help engineers automatically create operational documents such as standard operating procedures (SOPs), as well as supplemental documents covering topics such as server hardening, security policies for external IP allow lists, and operating system patching.

Using natural language models trained on large datasets of existing SOPs and similar content, generative AI systems can understand the common structure and language used in these types of documents. Engineers can then provide the system with high-level requirements or parameters for a new procedure, and generative AI can automatically generate a draft document formatted with the appropriate sections, level of detail, and terminology. This allows engineers to spend less time on documentation and more time focused on other engineering tasks. The initial drafts from AI also provide a strong starting point that engineers can refine.

Overall, generative AI offers a more efficient way for engineers to develop standardized procedural content at scale.

To learn how to use Amazon Bedrock to generate product descriptions, see Automating product description generation with Amazon Bedrock. Additionally, refer to How Skyflow creates technical content in days using Amazon Bedrock to learn how Skyflow Inc.—a data privacy company—uses Amazon Bedrock to streamline the creation of technical content, reducing the process from weeks to days while maintaining the highest standards of data privacy and security.

Automation

Generative AI can assist engineers and automate certain tasks that would otherwise require manual work. One area this could help in is script code generation for repetitive automation processes. By training AI models on large datasets of existing code examples for common programming tasks like file operations or system configuration, generative models can learn patterns and syntax.

An Amazon Q customization is a set of elements that enables Amazon Q to provide you with suggestions based on your company’s code base. Engineers can then provide high-level descriptions or specifications of what they need automated, such as “Generate a script to back up and archive files older than 30 days in this directory.” The AI model would be able to produce working code to accomplish this automatically based on its training. This would save engineers considerable time writing and testing scripts for routine jobs, allowing them to focus on more creative and challenging aspects of their work. As generative AI techniques advance, more complex engineering automation may also be achieved.
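As an illustration, the following is the kind of routine automation script such a prompt might produce; the directory paths and retention period are hypothetical, and any generated code should still be reviewed and tested before use:

# Illustrative example: archive files older than 30 days from a source
# directory into a dated tar.gz file, then remove the originals.
# Paths below are hypothetical.
import tarfile
import time
from pathlib import Path

SOURCE_DIR = Path("/var/log/app")            # hypothetical directory to clean up
ARCHIVE_DIR = Path("/var/backups")
MAX_AGE_SECONDS = 30 * 24 * 60 * 60

def archive_old_files() -> Path:
    cutoff = time.time() - MAX_AGE_SECONDS
    old_files = [p for p in SOURCE_DIR.iterdir() if p.is_file() and p.stat().st_mtime < cutoff]
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    archive_path = ARCHIVE_DIR / f"archive-{time.strftime('%Y%m%d')}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in old_files:
            tar.add(path, arcname=path.name)
            path.unlink()                    # remove the original once archived
    return archive_path

if __name__ == "__main__":
    print(f"Archived old files to {archive_old_files()}")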

Refer to Upgrade your Java applications with Amazon Q Code Transformation to learn about the Amazon Q Code Transformation feature. Also, refer to Using Amazon Bedrock Agents to interactively generate infrastructure as code to learn how to configure Amazon Bedrock Agents to generate infrastructure as code. Lastly, refer to TymeX Accelerates Clean Coding by 40% by Implementing Generative AI on AWS to learn how TymeX uses generative AI on AWS.

Customer experience

Generative AI can analyze large volumes of customer service data, like call logs and support tickets, and identify patterns in issues customers frequently report. This insight allows operations teams to proactively address common problems before they severely impact customers. Generative AI assistants can also automate many routine service tasks, freeing up human agents to focus on more complex inquiries that require personalization. With AI assistance, infrastructure services can be restored more quickly when outages occur. This helps make sure operations are more efficient and transparent, directly enhancing the experience for the customers that infrastructure teams aim to support.

Amazon Q Business offers a conversational experience with generative prompts and tasks that can act as a front-line support engineer, answering customer questions and resolving known issues efficiently. The feature can use data from enterprise systems to provide accurate and timely responses, reducing the burden on human engineers and improving customer satisfaction.

With Amazon Bedrock, you can perform sentiment analysis to help analyze customer emotions and provide context to human engineers, enabling them to provide better support and improve customer loyalty, retention, and growth.

Refer to Develop advanced generative AI chat-based assistants by using RAG and ReAct prompting to learn one way to develop generative AI assistants. Refer to Building a Generative AI Contact Center Solution for DoorDash Using Amazon Bedrock, Amazon Connect, and Anthropic’s Claude to learn how DoorDash built a generative AI contact center solution using AWS services. To learn how PGA TOUR built a generative AI virtual assistant, see The journey of PGA TOUR’s generative AI virtual assistant, from concept to development to prototype.

Staff productivity

An around-the-clock infrastructure operations team faces challenges in maintaining staff productivity during off-hours and nights, when the volume of support requests is lower. A generative AI assistant can help improve staff productivity in these periods and streamline the shift-handover process.

The assistant can be trained on historical support conversations to understand and resolve a large percentage of routine queries independently. It can communicate with customers on messaging platforms to provide instant assistance. Simple requests that the assistant can address free up the team to focus on complex issues requiring human expertise. The AI system can escalate any queries it can’t resolve on its own to the on-call staff. This allows the night and weekend crew to work with fewer interruptions. They can work through tasks more efficiently knowing the assistant is handling basic support needs independently. Generative AI-powered contact center solutions can improve an agent’s ability to interact with customers more precisely and speed up issue resolution, increasing overall productivity.

To learn how to automate document and data retrieval for AI assistants, see Automate chatbot for document and data retrieval using Amazon Bedrock Agents and Knowledge Bases. Refer to How LeadSquared accelerated chatbot deployments with generative AI using Amazon Bedrock and Amazon Aurora PostgreSQL to learn how LeadSquared uses Amazon Bedrock and Amazon Aurora PostgreSQL-Compatible Edition to deploy generative AI-powered assistants on their Converse platform, which personalize interactions based on customer-specific training data. This integration reduces customer onboarding costs, minimizes manual effort, and improves chatbot responses, transforming customer support and engagement by providing swift and relevant assistance.

Reporting

Generative AI has the potential to help infrastructure operations teams streamline reporting processes. By using ML algorithms trained on past report examples, a generative AI system can automatically generate draft reports based on incoming data from monitoring systems and other operational tools. This can save teams significant time spent compiling information into standardized report formats. The AI-generated reports could include summary data visualizations, descriptive analyses, and recommendations tailored to each recipient.

Teams would still need to review the drafts for accuracy before finalizing and distributing them. However, having an initial version generated automatically could cut down on routine reporting tasks so engineers have more time for higher-value problem-solving and strategic planning work. The use of AI could help infrastructure teams meet their reporting obligations more efficiently.

Amazon Q in QuickSight is your generative AI assistant that makes it straightforward to build and consume insights. For more information, see Amazon Q is now generally available in Amazon QuickSight, bringing Generative BI capabilities to the entire organization. Also, refer to Anthology uses embedded analytics offered by Amazon QuickSight to democratize decision making for higher education to learn how Anthology is using Amazon Q in QuickSight to offer institutions self-serve options for analytics needs that aren’t directly addressed by the central dashboards.

You can explore more customer stories and case studies at Generative AI Customer Stories to learn how customers are using AWS generative AI services. Refer to Derive meaningful and actionable operational insights from AWS Using Amazon Q Business to learn how to use AWS generative AI services, like Amazon Q Business, with AWS Support cases, AWS Trusted Advisor, and AWS Health data to derive actionable insights based on common patterns, issues, and resolutions while using the AWS recommendations and best practices enabled by support data.

Conclusion

Integrating generative AI into TechOps represents a transformative leap in the management and optimization of IT infrastructure and services. By using AWS generative AI solutions such as Amazon Bedrock, Amazon Q Developer, and Amazon Q Business, organizations can significantly enhance productivity, reduce the time required to resolve issues, and improve overall customer experience. Generative AI’s sophisticated capabilities in predicting and preventing outages, automating documentation, and generating actionable insights from operational data position it as a critical tool for modern TechOps teams.

You can unlock unimagined possibilities with generative AI by using the AWS Generative AI Innovation Center program, which pairs you with AWS science and strategy experts with deep experience in AI/ML and generative AI techniques. To get started, contact your AWS Account Manager. If you don’t have an AWS Account Manager, contact AWS Sales.


About the Authors

Raman Pujani is a Solutions Architect at Amazon Web Services, where he helps customers to accelerate their business transformation journey with AWS. He builds simplified and sustainable solutions for complex business problems with innovative technology. Raman has 25+ years of experience in IT Transformation. Besides work, he enjoys spending time with family, vacationing in the mountains, and music.

Rachanee Singprasong is a Principal Customer Solutions Manager in Strategic Accounts at Amazon Web Services. Her role is focused on enabling customers in their cloud adoption and digital transformation journey. She has a Ph.D. in Operations Research and her passion is to solve complex customer challenges using creative solutions.

Vijay Sivaji is a Senior Technical Account Manager in Strategic Accounts at Amazon Web Services. He helps customers in solving architectural, operational and cost optimization challenges. In his spare time he enjoys playing tennis.

Read More

Optimizing MLOps for Sustainability


Machine learning operations (MLOps) is a set of practices that automate and simplify machine learning (ML) workflows and deployments. What is MLOps provides a detailed description of this concept. As ML workloads become increasingly complex and consume more energy and resources, a growing number of companies are looking for ways to manage both the costs and the carbon footprint associated with these workloads. AWS published Guidance for Optimizing MLOps for Sustainability on AWS to help customers maximize utilization and minimize waste in their ML workloads.

In this blog post, you will learn how to optimize MLOps for sustainability.

There are three main workflows in the overall process for building, deploying and using ML models, as shown in the following figure. The process begins with data preparation, followed by model training and tuning, and then model deployment and management.

Data preparation

The workflow starts with data preparation, which includes four components: your data stream, an Amazon SageMaker Processing job, Amazon SageMaker Feature Store, and an Amazon Simple Storage Service (Amazon S3) bucket for raw data, as shown in the following figure.

Data preparation is essential for model training and is also the first phase in the MLOps lifecycle. Optimizing the artificial intelligence and machine learning (AI/ML) data preparation workload on AWS with sustainability best practices helps reduce the carbon footprint and the cost.

The data preparation process can be complex and energy-intensive because of the vast amount of data processing and computations involved. This leads to substantial resource consumption. There are a few things to consider that can help reduce energy consumption.

Start with the AWS Region you choose for your workload. If possible, choose a Region that has low carbon intensity or where the electricity is attributed to 100% renewable energy sources. In addition, consider storing data and training models in the same Region if possible. This reduces the data movement and latency across the network, optimizing the networking resources required.

Using a serverless architecture can help further reduce resource consumption and remove maintenance overhead by provisioning resources only when required. It’s also important to avoid duplication and re-run of code across teams. Look for services such as Amazon SageMaker Feature Store which helps achieve this goal. Finally, choosing the right storage type for the data used for model training can limit the carbon impact of your workload.

For example, by using S3 One Zone-Infrequent Access to store data that isn’t frequently accessed, such as test data and training data, you can optimize the carbon impact of the data stored. Also, using S3 Intelligent-Tiering can help move the data to more energy-efficient tiers based on access patterns.
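As a minimal sketch of these storage choices using boto3, the following uploads an infrequently accessed test split to S3 One Zone-IA and adds a lifecycle rule that transitions raw data to S3 Intelligent-Tiering after 30 days. The bucket name, prefixes, and transition period are illustrative assumptions:

# Sketch of applying the storage choices described above with boto3.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-ml-training-data"          # hypothetical bucket name

# Store a rarely re-read test split directly in One Zone-IA
s3.upload_file(
    "test_split.parquet", BUCKET, "datasets/test/test_split.parquet",
    ExtraArgs={"StorageClass": "ONEZONE_IA"},
)

# Move everything under datasets/raw/ to Intelligent-Tiering after 30 days
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "raw-data-to-intelligent-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "datasets/raw/"},
            "Transitions": [{"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}],
        }]
    },
)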

Model training and tuning

The second area for you to consider is model training and tuning, shown in the following figure.

While data preparation isn’t unique to AI/ML workloads, the model training and tuning workflow is specific to AI/ML. It’s an important step in making the models functionally useful while also reducing the resources required to run them at scale. There are costs in terms of both operations and sustainability. The good news is that optimization for sustainability also helps to optimizing operations.

For example, SageMaker provides the model parallel library to help efficiently distribute and train models on multiple compute nodes. The library has multiple features that can be combined to more efficiently train models from relatively small parameter sets up to sets with hundreds of billions of parameters. The library can also help use the features of Elastic Fabric Adapter (EFA) supported devices to maximize throughput and minimize latency across nodes. Further optimization is possible using SageMaker Training Compiler to compile deep learning models for training on supported GPU instances. SageMaker Training Compiler converts deep learning models from high-level language representation to hardware-optimized instructions. Hardware-optimized instructions can speed up model training by up to 50% by more efficiently using the GPU memory and using a larger batch size per iteration, all without altering the final trained model.

To reduce the time and energy required to tune a model, SageMaker automatic model tuning (AMT) runs multiple training jobs on a given dataset; it then uses the results to converge on a set of hyperparameter values to create the best performing model for a given metric. There are multiple approaches to the process of searching for the right hyperparameter ranges. For example, Bayesian optimization typically requires 10 times fewer jobs to find the best set of values compared to other methods, reducing the resource usage and carbon footprint of the process.
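The following is a minimal sketch of automatic model tuning with the Bayesian strategy using the SageMaker Python SDK. The container image, execution role, metric definition, hyperparameter ranges, and job counts are illustrative placeholders rather than recommended values:

# Sketch of SageMaker automatic model tuning with the Bayesian strategy.
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

estimator = Estimator(
    image_uri="<training-image-uri>",        # placeholder training container
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    metric_definitions=[{"Name": "validation:auc", "Regex": "validation-auc=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",                     # typically needs far fewer jobs than grid or random search
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://<bucket>/train", "validation": "s3://<bucket>/validation"})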

Right-sizing is another method for managing resource usage and minimizing the environmental impact of your workloads. Amazon SageMaker Debugger helps optimize resource consumption by detecting under-utilization of system resources, identifying training problems, and using built-in rules to monitor and stop training jobs as soon as bugs are detected.

Data pre- and post-processing and model evaluation tasks can be run as Amazon SageMaker Processing jobs. In addition to evaluating the accuracy of your models, processing jobs help you to make informed decisions about the tradeoffs between a model’s accuracy and its carbon footprint. Thus, you can establish performance criteria that support your sustainability goals while meeting your business requirements. SageMaker Processing also provides Amazon CloudWatch logs and metrics that can be used for monitoring and right-sizing jobs based on CPU, memory, GPU, GPU memory, and disk metrics.

Dedicated Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances provide both efficiency and environmental benefits for running your training jobs. These instances use Trainium processors: purpose-built chips designed specifically for deep learning training of models that can exceed 100 billion parameters. Each Trn1 instance provides up to 16 Trainium accelerators, ensuring that jobs will be both efficient and cost optimized. EC2 Trn1 instances offer up to 52% cost-to-train savings compared to comparable EC2 instance types.

Next, you can use governance to share information about the environmental impact of your model. Amazon SageMaker Model Cards provide versioned records documenting various aspects and attributes of your model. This allows you to share the intended uses and assessed carbon impact of a model so that data scientists, ML engineers, and other teams can make informed decisions when choosing and running models.

Model deployment and management

The last area of MLOps is deployment and management, shown in the following figure.

Automating the deployment of ML models provides several sustainability benefits. The deployed model can use a lot of resources when data or code is updated and retrained. You want to ensure that the deployed model is as efficient as possible to reduce the carbon footprint of the workload.

One approach is to use Amazon SageMaker Model Registry. This feature helps improve sustainability and resource optimization by providing a centralized repository for cataloging ML models and reducing redundancy. This approach improves model reusability by allowing existing models to be fine-tuned, rather than training new models from scratch. Consider running your deployment code using AWS CodePipeline to ensure repeatability and version control and optimize resource utilization by running only the necessary stages in the pipeline. This helps your workloads remove the waste associated with manual processes and supports incremental improvements over time.

If your workloads can tolerate latency, consider deploying your model on Amazon SageMaker Asynchronous Inference with auto-scaling groups. This can help minimize idle resources and reduce the impact of load spikes. This also means you pay for compute only when the endpoint is actively handling inference requests. Alternatively, if you don’t need real-time inference, use batch transform. Unlike persistent endpoints, clusters are decommissioned when a batch transform job is complete. Batch transform automatically partitions large datasets and distributes workloads across compute to ensure efficient resource utilization.

To simplify deployment and management and increase resource utilization, use multi-model endpoints instead of separate endpoints for each model. Examples include recommendation systems that would otherwise process text and images on separate endpoints, or deployments that mix PyTorch, Scikit-learn, and TensorFlow models. Automatic scaling can amplify resource optimization for your hosted models: auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload, which helps you avoid cost and consume less energy and resources. If your workload has intermittent or unpredictable traffic with idle periods between traffic peaks and can tolerate cold starts, use Amazon SageMaker Serverless Inference endpoints, which automatically launch and scale compute resources depending on traffic. Optionally, you can use Provisioned Concurrency with Serverless Inference when you have predictable bursts in your traffic.
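As a minimal sketch of the serverless option using the SageMaker Python SDK, the following deploys a model to a serverless endpoint; the container image, model artifact location, memory size, and concurrency settings are illustrative placeholders:

# Sketch of deploying a trained model to a SageMaker serverless endpoint,
# so that compute is provisioned only while requests are being served.
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<inference-image-uri>",       # placeholder inference container
    model_data="s3://<bucket>/model/model.tar.gz",
    role="<execution-role-arn>",
    predictor_cls=Predictor,
)

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    )
)

# The endpoint is billed only while it is actively processing inference requests
result = predictor.predict(b'{"features": [1.2, 3.4, 5.6]}')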

AWS offers a few different options to better utilize your resources and lower emissions when working with inference workloads. AWS Inferentia is designed to deliver high performance at the lowest cost in EC2 instances for your deep learning and generative AI inference applications. AWS Inferentia is built for sustainability and provides up to 50% better performance per watt over comparable EC2 instances. You can further optimize resource utilization by combining AWS Inferentia and Amazon Elastic Inference to attach the right amount of GPU-powered inference acceleration to any EC2 or SageMaker instance type.

After training a model for high accuracy, developers often turn to more expensive large instances with lots of memory and processing power to achieve better throughput. You can reduce resource usage and avoid the need for more powerful instances by using pre-trained models and compiling them into optimized executables that can be hosted in SageMaker or edge devices for inference with Amazon SageMaker Neo.

Monitoring CPU, memory, and GPU resource utilization is critical to optimize model performance and avoid wasted resources. AWS offers a variety of tools that you can use to optimize MLOps for sustainability, such as CloudWatch, Amazon SageMaker Inference Recommender, and Amazon SageMaker Model Monitor. Inference Recommender helps you choose the optimal instance type and configuration for ML models and workloads. You can use Model Monitor to automate drift detection of your ML model in production, and only retrain it when prediction performance drops below predetermined key performance indicators (KPIs). This approach improves operational efficiency and retrains the model based on your business metrics.

Conclusion

Sustainability and ML are redefining how many companies deliver value for their customers. Incorporating sustainability into the design, development and deployment of ML models is a crucial long-term consideration. AWS is investing in the sustainability of the cloud and providing resources to assist customers in transforming their workloads to be more energy efficient. In this post, we have reviewed the Guidance for Optimizing MLOps for Sustainability on AWS, providing service-specific practices to understand and reduce the environmental impact of these workloads. MLOps consists of several distinct phases that can be independently optimized for sustainability. Regular reviews using tools such as AWS Well-Architected Machine Learning Lens help you identify optimization opportunities and provide a mechanism for you to meet your sustainability goals.


About the Authors

Archana Srinivasan is a Senior Technical Account Manager within Enterprise Support at Amazon Web Services (AWS). Archana provides strategic technical guidance for independent software vendors (ISVs) to innovate and operate their workloads efficiently on AWS.

Chris Procunier is a Senior Technical Account Manager at AWS, based out of Washington DC. He has been managing systems and infrastructure for 25 years as an entrepreneur, IT Director and architect. Outside of work Chris is passionate about family, friends, music, cooking and cycling.

Meghana Reddy is a Technical Account Manager at AWS, where she offers strategic technical guidance to Independent Software Vendors (ISVs) for optimizing their workloads on AWS. She is passionate about environmental sustainability and actively promotes sustainable practices within the cloud.

Steven David is a Principal Solutions Architect at Amazon Web Services (AWS). He has over 20 years of experience designing solutions for large enterprises. Through these engagements he has developed deep expertise in application development technologies and methodologies.

Read More

Enabling complex generative AI applications with Amazon Bedrock Agents


In June, I started a series of posts that highlight the key factors that are driving customers to choose Amazon Bedrock. The first covered building generative AI apps securely with Amazon Bedrock, while the second explored building custom generative AI applications with Amazon Bedrock. Now I’d like to take a closer look at Amazon Bedrock Agents, which empowers our customers to build intelligent and context-aware generative AI applications, streamlining complex workflows and delivering natural, conversational user experiences. The advent of large language models (LLMs) has enabled humans to interact with computers using natural language. However, many real-world scenarios demand more than just language comprehension. They involve executing complex multi-step workflows, integrating external data sources, or seamlessly orchestrating diverse AI capabilities and data workflows. In these real-world scenarios, agents can be a game changer, delivering more customized generative AI applications—and transforming the way we interact with and use LLMs.

Answering more complex queries

Amazon Bedrock Agents enables a developer to take a holistic approach in improving scalability, latency, and performance when building generative AI applications. Generative AI solutions that use Amazon Bedrock Agents can handle complex tasks by combining an LLM with other tools. For example, imagine that you are trying to create a generative AI-enabled assistant that helps people plan their vacations. You want it to be able to handle simple questions like “What’s the weather like in Paris next week?” or “How much does it cost to fly to Tokyo in July?” A basic virtual assistant might be able to answer those questions drawing from preprogrammed responses or by searching the Internet. But what if someone asks a more complicated question, like “I want to plan a trip to three countries next summer. Can you suggest a travel itinerary that includes visiting historic landmarks, trying local cuisine, and staying within a budget of $3,000?” That is a harder question because it involves planning, budgeting, and finding information about different destinations.

Using Amazon Bedrock Agents, a developer can quickly build a generative assistant to help answer this more complicated question by combining the LLM’s reasoning with additional tools and resources, such as natively integrated knowledge bases to propose personalized itineraries. It could search for flights, hotels, and tourist attractions by querying travel APIs, and use private data, public information for destinations, and weather—while keeping track of the budget and the traveler’s preferences. To build this agent, you would need an LLM to understand and respond to questions. But you would also need other modules for planning, budgeting, and accessing travel information.
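For developers, calling such an agent from application code is a short piece of work. The following is a minimal sketch using boto3 and the Amazon Bedrock agent runtime; the agent ID, alias ID, and session handling are placeholders that would come from the agent you have configured in Amazon Bedrock:

# Sketch of invoking an Amazon Bedrock agent from application code.
import uuid
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="<agent-id>",                    # placeholder: your agent's ID
    agentAliasId="<agent-alias-id>",         # placeholder: your agent's alias ID
    sessionId=str(uuid.uuid4()),             # reuse the same ID to keep conversational context
    inputText=(
        "I want to plan a trip to three countries next summer with historic "
        "landmarks, local cuisine, and a budget of $3,000. Suggest an itinerary."
    ),
)

# The agent streams its answer back as a series of chunk events
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)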

Agents in action

Our customers are using Amazon Bedrock Agents to build agents—and agent-driven applications—quickly and effectively. Consider Rocket, the fintech company that helps people achieve home ownership and financial freedom:

“Rocket is poised to revolutionize the homeownership journey with AI technology, and agentic AI frameworks are key to our mission. By collaborating with AWS and leveraging Amazon Bedrock Agents, we are enhancing the speed, accuracy, and personalization of our technology-driven communication with clients. This integration, powered by Rocket’s 10 petabytes of data and industry expertise, ensures our clients can navigate complex financial moments with confidence.”

– Shawn Malhotra, CTO of Rocket Companies.

A closer look at how agents work

Unlike LLMs that provide simple lookup or content-generation capabilities, agents integrate various components with an LLM to create an intelligent orchestrator capable of handling sophisticated use cases with nuanced context and specific domain expertise. The following figure outlines the key components of Amazon Bedrock Agents:

The process starts with two parts—the LLM and the orchestration prompt. The LLM—often implemented using models like those in the Anthropic Claude family or Meta Llama models—provides the basic reasoning capabilities. The orchestration prompt is a set of prompts or instructions that guide the LLM when driving the decision-making process.

In the following sections, we discuss the key components of Amazon Bedrock Agents in depth:

Planning: A path from task to goals

The planning component for LLMs entails comprehending tasks and devising multi-step strategies to address a problem and fulfill the user’s need. In Amazon Bedrock Agents, we use chain-of-thought prompting in combination with ReAct in the orchestration prompt to improve an agent’s ability to solve multi-step tasks. In task decomposition, the agent must understand the intricacies of an abstract request. Continuing to explore our travel scenario, if a user wants to book a trip, the agent must recognize that it encompasses transportation, accommodation, reservations for sightseeing attractions, and restaurants. This ability to split up an abstract request, such as planning a trip, into detailed, executable actions, is the essence of planning. However, planning extends beyond the initial formulation of a plan, because during execution, the plan may get dynamically updated. For example, when the agent has completed arranging transportation and progresses to booking accommodation, it may encounter circumstances where no suitable lodging options align with the original arrival date. In such scenarios, the agent must determine whether to broaden the hotel search or revisit alternative booking dates, adapting the plan as it evolves.
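
To make this concrete, the following sketch shows the shape of a ReAct-style thought-action-observation loop for the travel example. It is a simplified illustration only: the tool names are hypothetical, and this is not the actual orchestration prompt that Amazon Bedrock Agents uses internally.

# Illustrative only: a simplified ReAct-style trace for the travel-planning example.
# The tool names (search_flights, search_hotels) are hypothetical, and the real
# Amazon Bedrock Agents orchestration prompt is managed by the service.
react_trace = """
Question: Plan a three-country summer trip under $3,000 with historic landmarks and local cuisine.
Thought: I should decompose the request into transportation, accommodation, and activities.
Action: search_flights[origin=NYC, region=Europe, month=July]
Observation: Round-trip fares to Lisbon, Rome, and Athens average $850.
Thought: Flights take roughly a third of the budget; lodging must stay within about $1,200.
Action: search_hotels[cities=Lisbon;Rome;Athens, budget_per_night=60]
Observation: No availability in Rome for the requested arrival date.
Thought: The plan must adapt - shift the Rome leg by two days or widen the hotel search.
...
"""
print(react_trace)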

Memory: Home for critical information

Agents have both long-term and short-term memory. Short-term memory is detailed and exact. It is relevant to the current conversation and resets when the conversation is over. Long-term memory is episodic and remembers important facts and details in the form of saved summaries. These summaries serve as the memory synopses of previous dialogues. The agent uses this information from the memory store to better solve the current task. The memory store is separate from the LLM, with a dedicated storage and a retrieval component. Developers have the option to customize and control which information is stored (or excluded) in memory. An identity management feature, which associates memory with specific end-users, gives developers the freedom to identify and manage end-users—and enable further development on top of Amazon Bedrock agents’ memory capabilities. The industry-leading memory retention functionality of Amazon Bedrock—launched at the recent AWS New York Summit—allows agents to learn and adapt to each user’s preferences over time, enabling more personalized and efficient experiences across multiple sessions for the same user. It is straightforward to use, allowing users to get started in a single click.
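
As a rough sketch of what this looks like in code, the snippet below enables long-term memory when creating an agent with the boto3 bedrock-agent client and passes a memoryId at invocation time to tie summaries to a specific end-user. The parameter names reflect our reading of the API at the time of writing, and the agent name, role ARN, model ID, alias, and memoryId values are placeholders; verify the details against the current Amazon Bedrock documentation.

import boto3

# Assumption: parameter names follow the boto3 bedrock-agent APIs at the time of writing.
bedrock_agent = boto3.client("bedrock-agent")

agent = bedrock_agent.create_agent(
    agentName="travel-assistant",                                             # hypothetical name
    foundationModel="anthropic.claude-3-sonnet-20240229-v1:0",
    agentResourceRoleArn="arn:aws:iam::111122223333:role/BedrockAgentRole",   # placeholder role
    instruction="You are a travel assistant that plans trips within a budget.",
    memoryConfiguration={
        "enabledMemoryTypes": ["SESSION_SUMMARY"],   # long-term memory stored as session summaries
        "storageDays": 30,                           # retention period for the summaries
    },
)

# At invocation time, memoryId associates stored summaries with a specific end-user.
runtime = boto3.client("bedrock-agent-runtime")
response = runtime.invoke_agent(
    agentId=agent["agent"]["agentId"],
    agentAliasId="TSTALIASID",                       # placeholder alias
    sessionId="session-001",
    memoryId="end-user-42",
    inputText="Book the same kind of hotel I liked last time.",
)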

Communication: Using multiple agents for greater efficiency and effectiveness

Drawing from the powerful combination of the capabilities we’ve described, Amazon Bedrock Agents makes it effortless to build agents that transform one-shot query responders into sophisticated orchestrators capable of tackling complex, multifaceted use cases with remarkable efficiency and adaptability. But what about using multiple agents? LLM-based AI agents can collaborate with one another to improve efficiency in solving complex questions. Today, Amazon Bedrock makes it straightforward for developers to connect them through LangGraph, part of LangChain, the popular open source tool set. The integration of LangGraph into Amazon Bedrock empowers users to take advantage of the strengths of multiple agents seamlessly, fostering a collaborative environment that enhances the overall efficiency and effectiveness of LLM-based systems.

Tool Integration: New tools mean new capabilities

New generations of models such as Anthropic Claude 3.5 Sonnet, Meta Llama 3.1, or Amazon Titan Text Premier are better equipped to use resources. Using these resources requires that developers keep up with ongoing updates and changes, requiring new prompts every time. To reduce this burden, Amazon Bedrock simplifies interfacing with different models, making it effortless to take advantage of all the features a model has to offer. For example, the new code interpretation capability announced at the recent AWS New York Summit allows Amazon Bedrock agents to dynamically generate and run code snippets within a secure, sandboxed environment to address complex tasks like data analysis, visualization, text processing, and equation solving. It also enables agents to process input files in various formats—including CSV, Excel, and JSON—and generate charts from data.

Guardrails: Building securely

Accuracy is critical when dealing with complex queries. Developers can enable Amazon Bedrock Guardrails to help reduce inaccuracies. Guardrails improve the behavior of the applications you’re building, increase accuracy, and help you build responsibly. They can prevent both malicious intent from users and potentially toxic content generated by AI, providing a higher level of safety and privacy protection.

Amplifying and extending the capabilities of generative AI with Amazon Bedrock Agents

Enterprises, startups, ISVs, and systems integrators can take advantage of Amazon Bedrock Agents today because it provides development teams with a comprehensive solution for building and deploying AI applications that can handle complex queries, use private data sources, and adhere to responsible AI practices. Developers can start with tested examples—so-called golden utterances (input prompts) and golden responses (expected outputs). You can continuously evolve agents to fit your top use cases and kickstart your generative AI application development. Agents unlock significant new opportunities to build generative AI applications to truly transform your business. It will be fascinating to see the solutions—and results—that Amazon Bedrock Agents inspires.


About the author

Vasi Philomin is VP of Generative AI at AWS. He leads generative AI efforts, including Amazon Bedrock and Amazon Titan.


Genomics England uses Amazon SageMaker to predict cancer subtypes and patient survival from multi-modal data


This post is co-written with Francisco Azuaje from Genomics England.

Genomics England analyzes sequenced genomes for The National Health Service (NHS) in the United Kingdom, and then equips researchers to use data to advance biological research. As part of its goal to help people live longer, healthier lives, Genomics England is interested in facilitating more accurate identification of cancer subtypes and severity, using machine learning (ML). To explore whether such ML models can perform at higher accuracy when using multiple modalities, such as genomic and imaging data, Genomics England has launched a multi-modal program aimed at enhancing its dataset and has also partnered with the AWS Global Health and Non-profit Go-to-Market (GHN-GTM) Data Science and AWS Professional Services teams to create an automatic cancer sub-typing and survival detection pipeline and explore its accuracy on publicly available data.

In this post, we detail our collaboration in creating two proof of concept (PoC) exercises around multi-modal machine learning for survival analysis and cancer sub-typing, using genomic (gene expression, mutation and copy number variant data) and imaging (histopathology slides) data. We provide insights on interpretability, robustness, and best practices of architecting complex ML workflows on AWS with Amazon SageMaker. These multi-modal pipelines are being used on the Genomics England cancer cohort to enhance our understanding of cancer biomarkers and biology.

1. Data

The PoCs have used the publicly available cancer research data from The Cancer Genome Atlas (TCGA), which contain paired high-throughput genome analysis and diagnostic whole slide images with ground-truth survival outcome and histologic grade labels. Specifically, the PoCs focus on whole slide histopathology images of tissue samples as well as gene expression, copy number variations, and the presence of deleterious genetic variants to perform analysis on two cancer types: Breast cancer (BRCA) and gastrointestinal cancer types (Pan-GI). Table 1 shows the sample sizes for each cancer type.

Table 1. Overview of input data sizes across the different cancer types investigated.

2. Multi-modal machine learning frameworks

The ML pipelines tackling multi-modal subtyping and survival prediction have been built in three phases throughout the PoC exercises. First, a state-of-the-art framework, namely Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE) (Chen et al., 2022) was implemented (Section 2.1). This was followed by the proposal, development, and implementation of a novel architecture based on Hierarchical Extremum Encoding (HEEC) (Section 2.2) by AWS, which aimed to mitigate the limitations of PORPOISE. The final phase improved on the results of HEEC and PORPOISE—both of which have been trained in a supervised fashion—using a foundation model trained in a self-supervised manner, namely Hierarchical Image Pyramid Transformer (HIPT) (Chen et al., 2023).

2.1 Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE)

PORPOISE (Chen et al., 2022) is a multi-modal ML framework that consists of three sub-network components (see Figure 1 at Chen et al., 2022):

  1. CLAM component: an attention-based multiple-instance learning network trained on pre-processed whole slide image (WSI) inputs (CLAM, Lu et al., 2021). CLAM extracts features from image patches of size 256×256 using a pre-trained ResNet50.
  2. A self-normalizing network component for extracting deep molecular features.
  3. A multi-modal fusion layer for integrating feature representations from 1) and 2) by modelling their pairwise interactions. The joint representations obtained from 3) are then used for undertaking the downstream tasks such as survival analysis and cancer-subtyping.

Despite being performant, PORPOISE was observed to yield lower multi-modal performance than the best single modality (imaging) alone when gene expression data was excluded from the genomic features while performing survival analysis for Pan-GI data (Figure 2). A possible explanation is that the model has difficulty dealing with the extremely high-dimensional, sparse genomic data without overfitting.

2.2. Hierarchical Extremum Encoding (HEEC): A novel supervised multi-modal ML framework

To mitigate the limitations of PORPOISE, AWS has developed a novel model structure, HEEC, which is based on three ideas:

  1. Using tree ensembles (LightGBM) to mitigate the sparsity and overfitting issue observed when training PORPOISE (as observed by Grinsztajn et al., 2022, tree-based models tend to overfit less when confronted with high-dimensional data with many largely uninformative features).
  2. Representation construction using a novel encoding scheme (extremum encoding) that preserves spatial relationships and thus interpretability.
  3. Hierarchical learning to allow representations at multiple spatial scales.

Figure 1. Hierarchical Extremum Encoding (HEEC) of pathomic representations.

Figure 1 summarizes the HEEC architecture, starting from the bottom (and clockwise): Every input WSI is cut up into patches of size 4096×4096 and 256×256 pixels in a hierarchical manner, and all stacks of patches are fed through ResNet50 to obtain embedding vectors. Additionally, nucleus-level representations (of size 64×64 pixels) are extracted by graph neural networks (GNNs), allowing local nucleus neighborhoods and their spatial relationships to be taken into account. This is followed by filtering for redundancy: patch embeddings that are important are selected using positive-unlabeled learning, and GNN importance filtering is used for retaining the top nuclei features. The resulting hierarchical embeddings are encoded using extremum encoding: the maxima and minima across the embeddings are taken in each vector entry, resulting in a single vector of maxima and minima per WSI. This encoding scheme keeps exact track of spatial relationships for each entry in the resulting representation vectors, because the model can backtrack each vector entry to a specific patch, and thus to a specific coordinate in the image.
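
The following minimal NumPy sketch illustrates the extremum-encoding idea: given a stack of patch embeddings for one WSI, take the per-dimension maximum and minimum and remember which patch produced each extremum, so every entry of the slide-level vector can be traced back to an image coordinate. Patch counts, embedding sizes, and coordinates are arbitrary toy values, not the actual HEEC implementation.

import numpy as np

# Illustration of extremum encoding with toy numbers, not the actual HEEC code.
rng = np.random.default_rng(0)
n_patches, emb_dim = 500, 2048            # e.g. ResNet50 embeddings of the retained patches
patch_embeddings = rng.normal(size=(n_patches, emb_dim))
patch_coords = rng.integers(0, 40_000, size=(n_patches, 2))   # (x, y) of each patch in the WSI

# Per-dimension maxima and minima -> a single fixed-length vector per slide.
max_idx = patch_embeddings.argmax(axis=0)          # which patch produced each maximum
min_idx = patch_embeddings.argmin(axis=0)
slide_vector = np.concatenate([
    patch_embeddings[max_idx, np.arange(emb_dim)], # maxima
    patch_embeddings[min_idx, np.arange(emb_dim)], # minima
])                                                 # shape: (2 * emb_dim,)

# Spatial traceability: every entry of slide_vector maps back to a patch coordinate,
# which is what makes the representation interpretable for pathologists.
source_coords = np.concatenate([patch_coords[max_idx], patch_coords[min_idx]])
print(slide_vector.shape, source_coords.shape)     # (4096,) (4096, 2)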

On the genomics side, importance filtering is applied based on excluding features that don’t correlate with the prediction target. The remaining features are horizontally appended to the pathology features, and a gradient boosted decision tree classifier (LightGBM) is applied to achieve predictive analysis.
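
A minimal sketch of this fusion step might look like the following: filter genomic features by correlation with the target, concatenate them with the slide-level pathology vectors, and fit a LightGBM classifier. The dimensions, threshold, and random data are illustrative assumptions, not the project's actual settings.

import numpy as np
import lightgbm as lgb

# Toy data standing in for the real cohort; dimensions and threshold are illustrative.
rng = np.random.default_rng(1)
n_samples = 600
pathology_features = rng.normal(size=(n_samples, 4096))   # extremum-encoded WSI vectors
genomic_features = rng.normal(size=(n_samples, 2000))     # high-dimensional, largely uninformative
labels = rng.integers(0, 2, size=n_samples)               # e.g. cancer subtype

# Importance filtering on the genomic side: drop features with negligible
# correlation to the prediction target.
corr = np.abs([np.corrcoef(genomic_features[:, j], labels)[0, 1]
               for j in range(genomic_features.shape[1])])
keep = corr > 0.05                                        # illustrative threshold
fused = np.hstack([pathology_features, genomic_features[:, keep]])

# Gradient boosted trees handle the sparse, high-dimensional fused features
# while tending to overfit less than a fusion network.
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(fused, labels)
print(clf.feature_importances_[:10])   # importances can be traced back to either modality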

HEEC architecture is interpretable out of the box, because HEEC embeddings possess implicit spatial information and the LightGBM model supports feature importance, allowing the filtering of the most important features for accurate prediction and backtracking to their location of origin. This location can be visually highlighted on the histology slide to be presented to expert pathologists for verification. Table 2 and Figure 2 show performance results of PORPOISE and HEEC, which show that HEEC is the only algorithm that outperforms the results of the best-performing single modality by combining multiple modalities.

Table 2. Classification and survival prediction performance of the two implemented multi-modal ML models on TCGA data. *Although Chen et al., 2022 provide some interpretability, the proposed attention visualization heatmaps have been deemed difficult to interpret from the pathologist point of view by Genomics England domain experts.

Figure 2. Comparison of performance (AUC) across individual modalities for survival analysis, when excluding the gene expression data. This matches the setting encountered by Genomics England (GEL) in practice (GEL’s internal dataset has no gene expression data).

2.3. Improvements using foundation models

Despite yielding promising results, PORPOISE and HEEC algorithms use backbone architectures trained using supervised learning (for example, ImageNet pre-trained ResNet50). To further improve performance, a self-supervised learning-based approach, namely Hierarchical Image Pyramid Transformer (HIPT) (Chen et al., 2023), has been investigated in the final stage of the PoC exercises. Note that HIPT is currently limited to the hierarchical self-supervised learning of the imaging modality (WSIs) and further work includes expansion of self-supervised learning for the genomic modality.

HIPT starts by defining a hierarchy of patches composed of non-overlapping regions of size 16×16, 256×256, and 4096×4096 pixels (see Figure 2 at Chen et al., 2023). The lowest-layer features are extracted from the smallest patches (16×16) using a self-supervised learning algorithm based on DINO with a Vision Transformer (ViT) backbone. For each 256×256 region, the lowest-layer features are then aggregated using a global pooling layer. The aggregated features constitute the (new input) features for the middle-level in the hierarchy, where the process of self-supervised learning followed by global pooling is repeated and the aggregated output features form the input features belonging to the 4096×4096 region. These input features go through self-supervised learning one last time, and the final embeddings are obtained using global attention pooling. After pre-training is completed, fine-tuning is employed only on the final layer of the hierarchy (acting on 4096×4096 regions) using multiple instance learning.
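
The short PyTorch sketch below illustrates the aggregation pattern just described, with small linear layers standing in for the DINO-pretrained ViT encoders at each level. It is a conceptual illustration of the 16 -> 256 -> 4096 pixel hierarchy, not the HIPT implementation, and the feature dimensions are assumptions.

import torch
import torch.nn as nn

# Conceptual stand-ins for the DINO-pretrained ViT encoders used at each HIPT level.
# nn.Linear is used purely to show how features flow up the hierarchy; the feature
# dimensions follow the general shape of the method, not its exact configuration.
encode_16, encode_256, encode_4096 = nn.Linear(768, 384), nn.Linear(384, 192), nn.Linear(192, 192)

# One 4096x4096 region contains 16x16 = 256 patches of 256x256 pixels,
# and each 256x256 patch contains 16x16 = 256 tokens of 16x16 pixels.
tokens_16 = torch.randn(256, 256, 768)        # (patches per region, tokens per patch, token dim)

feat_16 = encode_16(tokens_16)                # token-level features from the lowest encoder
feat_256 = encode_256(feat_16.mean(dim=1))    # global-pool tokens -> one feature per 256x256 patch
feat_4096 = encode_4096(feat_256.mean(dim=0, keepdim=True))  # pool patches -> one region embedding
print(feat_4096.shape)                        # torch.Size([1, 192])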

Genomics England investigated whether using HIPT embeddings would be better than using the ImageNet pre-trained ResNet50 encoder, and initial experiments have shown a gain in Harrell’s C-index of approximately 0.05 per cancer type in survival analysis. The embeddings offer other benefits as well, such as being smaller, which means that models train faster and the features have a smaller storage footprint.

3. Architecture on AWS

As part of the PoCs, we built a reference architecture (illustrated in Figure 3) for multi-modal ML using SageMaker, a platform for building, training, and deploying ML models with fully managed infrastructure, tools, and workflows. We aimed to demonstrate some general, reusable patterns that are independent of the specific algorithms:

  • Decouple data pre-processing and feature computation from model training: In our use case, we process the pathology images into numerical feature representations once, we then store the resulting feature vectors in Amazon Simple Storage Service (Amazon S3) and reuse them to train different models. Analogously, we have a second processing branch that processes and extracts features from the genomic data.
  • Decouple model training from inference: As we experiment with different model structures and hyperparameters, we keep track of model versions, hyperparameters, and metrics in SageMaker model registry. We refer to the registry to review our experiments and choose which models to deploy for inference.
  • Wrap long-running computations inside containers and delegate their execution to SageMaker: Any long-running computation benefits from this pattern, whether it’s for data processing, model training, or batch inference. In this way, there’s no need to manage the underlying compute resources for running the containers. Cost is reduced through a pay-as-you-go model (resources are destroyed after a container has finished running) and the architecture is easily scalable to run multiple jobs in parallel.
  • Orchestrate multiple containerized jobs into SageMaker pipelines: We build a pipeline once and run it multiple times with different parametrization (a minimal sketch follows this list). Hence, pipeline invocations can be referred to at a higher level of abstraction, without having to constantly monitor the status of its long-running constituent jobs.
  • Delegate hyperparameter tuning to SageMaker using a hyperparameter tuning job: A tuning job is a family of related training jobs (all managed by SageMaker) that efficiently explore the hyperparameter space. These training jobs take the same input data for training and validation, but each one is run with different hyperparameters for the learning algorithm. Which hyperparameter values to explore at each iteration are automatically chosen by SageMaker.
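
The following minimal sketch illustrates the decoupling and pipeline patterns above using the SageMaker Python SDK: a processing step computes features once, a training step consumes them, and both are wrapped in a pipeline. The container image URIs, script names, and instance types are placeholders rather than the configuration used in the PoCs.

# A minimal sketch of the decoupling and pipeline patterns, assuming hypothetical
# container images and scripts; the real PoC configuration will differ.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Step 1: compute image/genomic features once and persist them to Amazon S3,
# decoupled from any particular model training run.
processor = ScriptProcessor(
    image_uri="<feature-extraction-image-uri>",   # placeholder container image
    command=["python3"],
    role=role,
    instance_type="ml.m5.4xlarge",
    instance_count=1,
)
feature_step = ProcessingStep(
    name="ExtractFeatures",
    processor=processor,
    code="extract_features.py",                   # hypothetical processing script
    outputs=[ProcessingOutput(output_name="features", source="/opt/ml/processing/output")],
)

# Step 2: train a model on the precomputed features.
estimator = Estimator(
    image_uri="<training-image-uri>",             # placeholder container image
    role=role,
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    output_path=f"s3://{bucket}/multimodal/models",
)
train_step = TrainingStep(
    name="TrainMultiModalModel",
    estimator=estimator,
    inputs={"train": TrainingInput(
        s3_data=feature_step.properties.ProcessingOutputConfig.Outputs["features"].S3Output.S3Uri
    )},
)

pipeline = Pipeline(name="multimodal-poc-pipeline", steps=[feature_step, train_step])
# pipeline.upsert(role_arn=role); pipeline.start()   # register (or update) and run the pipeline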

3.1 Separation between development and production environments

In general, we advise doing all development work outside of a production environment, because this minimizes the risk of leaking or corrupting sensitive production data and keeps the production environment free of intermediate data and software artifacts that obfuscate lineage tracking. If data scientists require access to production data during developmental stages, for tasks such as exploratory analysis and modelling work, there are numerous strategies that can be employed to minimize risk. One effective strategy is to employ data masking or synthetic data generation techniques in the testing environment to simulate real-world scenarios without compromising sensitive data. Furthermore, production-level data can be securely moved into an independent environment for analysis. Access controls and permissions can be implemented to restrict the flow of data between environments, maintaining separation and ensuring minimal access rights.
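
As a small illustration of the masking idea (assuming tabular data in pandas; the column names and values below are hypothetical), direct identifiers can be replaced with salted hashes before data is shared outside the production environment:

import hashlib
import pandas as pd

# Hypothetical patient table; column names and values are illustrative only.
df = pd.DataFrame({
    "patient_id": ["4857773456", "9434765919"],
    "tumour_grade": [2, 3],
})

SALT = "replace-with-a-secret-value"   # store outside source control in practice

def mask(value: str) -> str:
    """Deterministically pseudonymize an identifier with a salted SHA-256 hash."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

df["patient_id"] = df["patient_id"].map(mask)
print(df)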

Genomics England has created two separate ML environments for testing and production level interaction with data. Each environment sits in its own isolated AWS account. The test environment mimics the production environment in its data storage strategy, but contains synthetic data void of personally identifiable information (PII) or protected health information (PHI), instead of production-level data. This test environment is used for developing essential infrastructure components and refining best practices in a controlled setting, which can be tested with synthetic data before deploying to production. Strict access controls, including role-based permissions employing principles of least privilege, are implemented in all environments to ensure that only authorized personnel can interact with sensitive data or modify deployed resources.

3.2 Automation with CI/CD pipelines

On a related note, we advise ML developers to use infrastructure-as-code to describe the resources that are deployed in their AWS accounts and to use continuous integration and delivery (CI/CD) pipelines to automate code quality checks, unit testing, and the creation of artifacts, such as container images. Then configure the CI/CD pipelines to automatically deploy the created artifacts into the target AWS accounts, whether they’re for development or for production. These well-established automation techniques minimize errors related to manual deployments and maximize the reproducibility between development and production environments.

Genomics England has investigated the use of CI/CD pipelines for automated deployment of platform resources, as well as automated testing of code.

Figure 3. Overview of the AWS reference architecture employed for multi-modal ML in the cloud

4. Conclusion

Genomics England has a long history of working with genomics data; however, the inclusion of imaging data adds additional complexity and potential. The two PoCs outlined in this post have been essential in launching Genomics England’s efforts in creating a multi-modal environment that facilitates ML development for the purpose of tackling cancer. The implementation of state-of-the-art models in Genomics England’s multi-modal environment and assistance in developing robust practices will ensure that users are maximally enabled in their research.

“At Genomics England, our mission is to realize the enormous potential of genomic and multi-modal information to further precision medicine and push the boundaries to realize the enormous potential of AWS cloud computing in its success.”

– Dr Prabhu Arumugam, Director of Clinical data and imaging, Genomics England

Acknowledgements

The results published in this blog post are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.


About the Authors

Cemre Zor, PhD, is a senior healthcare data scientist at Amazon Web Services. Cemre holds a PhD in theoretical machine learning and postdoctoral experiences on machine learning for computer vision and healthcare. She works with healthcare and life sciences customers globally to support them with machine learning modelling and advanced analytics approaches while tackling real-world healthcare problems.

Tamas Madl, PhD, is a former senior healthcare data scientist and business development lead at Amazon Web Services, with academic as well as industry experience at the intersection between healthcare and machine learning. Tamas helped customers in the Healthcare and Life Science vertical to innovate through the adoption of Machine Learning. He received his PhD in Computer Science from the University of Manchester.

Epameinondas Fritzilas, PhD, is a senior consultant at Amazon Web Services. He works hands-on with customers to design and build solutions for data analytics and AI applications in healthcare. He holds a PhD in bioinformatics and fifteen years of industry experience in the biotech and healthcare sectors.

Lou Warnett is a healthcare data scientist at Amazon Web Services. He assists healthcare and life sciences customers from across the world in tackling some of their most pressing challenges using data science, machine learning and AI, with a particular emphasis more recently on generative AI. Prior to joining AWS, Lou received a master’s in Mathematics and Computing at Imperial College London.

Sam Price is a Professional Services consultant specializing in AI/ML and data analytics at Amazon Web Services. He works closely with public sector customers in healthcare and life sciences to solve challenging problems. When not doing this, Sam enjoys playing guitar and tennis, and seeing his favorite indie bands.

Shreya Ruparelia is a data & AI consultant at Amazon Web Services, specialising in data science and machine learning, with a focus on developing GenAI applications. She collaborates with public sector healthcare organisations to create innovative AI-driven solutions. In her free time, Shreya enjoys activities such as playing tennis, swimming, exploring new countries and taking walks with the family dog, Buddy.

Pablo Nicolas Nunez Polcher, MSc, is a senior solutions architect working for the Public Sector team with Amazon Web Services. Pablo focuses on helping healthcare public sector customers build new, innovative products on AWS in accordance with best practices. He received his M.Sc. in Biological Sciences from Universidad de Buenos Aires. In his spare time, he enjoys cycling and tinkering with ML-enabled embedded devices.

Matthew Howard is the head of Healthcare Data Science and part of the Global Health and Non-Profits team in Amazon Web Services. He focuses on how data, machine learning and artificial intelligence can transform health systems and improve patient outcomes. He leads a team of applied data scientists who work with customers to develop AI-based healthcare solutions. Matthew holds a PhD in Biological Sciences from Imperial College London.

Tom Dyer is a Senior Product Manager at Genomics England. He was previously an Applied Machine Learning Engineer working within the Multimodal squad. His work focussed on building multimodal learning frameworks that allow users to rapidly scale research in the cloud. He also works on developing ML tooling to organise pathology image datasets and explain model predictions on a cohort level.

Samuel Barnett is an applied machine learning engineer with Genomics England working on improving healthcare with machine learning. He is embedded with the Multimodal squad and is part of an ongoing effort to show the value of combining genomic, imaging, and text-based data in machine learning models.

Prabhu Arumugam is the former Director of Clinical Data and Imaging at Genomics England. Having joined the organization in 2019, Prabhu trained in medicine at St. Bartholomew’s and the Royal London. He trained in Histopathology and completed his PhD at The Barts Cancer Institute on pancreatic pathology.

Francisco Azuaje, PhD, is the director of bioinformatics at Genomics England, where he provides cross-cutting leadership in strategy and research with a focus on data science and AI. With a career covering academia, the pharmaceutical industry, and the public sector, he has wide experience leading multidisciplinary teams in solving challenges involving diverse data sources and computational modelling approaches. With his expertise in bioinformatics and applied AI, Dr. Azuaje enables the translation of complex data into insights that can improve patient outcomes.


Align Meta Llama 3 to human preferences with DPO, Amazon SageMaker Studio, and Amazon SageMaker Ground Truth


Large language models (LLMs) have remarkable capabilities. Nevertheless, using them in customer-facing applications often requires tailoring their responses to align with your organization’s values and brand identity. In this post, we demonstrate how to use direct preference optimization (DPO), a technique that allows you to fine-tune an LLM with human preference data, together with Amazon SageMaker Studio and Amazon SageMaker Ground Truth to align the Meta Llama 3 8B Instruct model responses to your organization’s values.

Using SageMaker Studio and SageMaker Ground Truth for DPO

With DPO, you can fine-tune an LLM with human preference data such as ratings or rankings so that it generates outputs that align to end-user expectations. DPO is computationally efficient and helps enhance a model’s helpfulness, honesty, and harmlessness, divert the LLM from addressing specific subjects, and mitigate biases. In this technique, you typically start with selecting an existing or training a new supervised fine-tuned (SFT) model. You use the model to generate responses and you gather human feedback on these responses. After that, you use this feedback to perform DPO fine-tuning and align the model to human preferences.

Whether you are fine-tuning a pre-trained LLM with supervised fine-tuning (SFT) or loading an existing fine-tuned model, you typically need powerful GPUs; the same applies during DPO fine-tuning. With Amazon SageMaker, you can get started quickly and experiment rapidly by using managed Jupyter notebooks equipped with GPU instances. You can quickly get started by creating a JupyterLab space in SageMaker Studio, the integrated development environment (IDE) purpose-built for machine learning (ML), and launching a JupyterLab application that runs on a GPU instance.

Orchestrating the end-to-end data collection workflow and developing an application for annotators to rate or rank model responses for DPO fine-tuning can be time-consuming. SageMaker Ground Truth offers human-in-the-loop capabilities that help you set up workflows, manage annotators, and collect consistent, high-quality feedback.

This post walks you through the steps of using DPO to align an SFT model’s responses to the values of a fictional digital bank called Example Bank. Your notebook runs in a JupyterLab space in SageMaker Studio powered by a single ml.g5.48xlarge instance (8 A10G GPUs). Optionally, you can choose to run this notebook inside a smaller instance type such as ml.g5.12xlarge (4 A10G GPUs) or ml.g6.12xlarge (4 L4 GPUs) with bitsandbytes quantization. You use Meta Llama 3 8B Instruct (the Meta Llama 3 instruction tuned model optimized for dialogue use cases from the Hugging Face Hub) to generate responses, SageMaker Ground Truth to collect preference data, and the DPOTrainer from the HuggingFace TRL library for DPO fine-tuning together with Parameter-Efficient Fine-Tuning (PEFT). You also deploy the aligned model to a SageMaker endpoint for real-time inference. You can use the same approach with other models.

Solution overview

The following diagram illustrates the approach.

The workflow contains the following key steps:

  1. Load the Meta Llama 3 8B Instruct model into SageMaker Studio and generate responses for a curated set of common and toxic questions. The dataset serves as the initial benchmark for the model’s performance.
  2. The generated question-answer pairs are stored in Amazon Simple Storage Service (Amazon S3). These will be presented to the human annotators later so they can rank the model responses.
  3. Create a workflow in SageMaker Ground Truth to gather human preference data for the responses. This involves creating a work team, designing a UI for feedback collection, and setting up a labeling job.
  4. Human annotators interact with the labeling portal to evaluate and rank the model’s responses based on their alignment to the organization’s values.
  5. The collected data is processed to adhere to the DPOTrainer expected format.
  6. Using the Hugging Face TRL library and the DPOTrainer, fine-tune the Llama 3 model using the processed data from the previous step.
  7. Test the fine-tuned model on a holdout evaluation dataset to assess its performance and verify it meets the desired standards.
  8. When you’re satisfied with the model performance, you can deploy it to a SageMaker endpoint for real-time inference at scale.

Prerequisites

To run the solution described in this post, you must have an AWS account set up, along with an AWS Identity and Access Management (IAM) role that grants you the necessary permissions to create and access the solution resources. If you are new to AWS and haven’t created an account yet, refer to Create a standalone AWS account.

To use SageMaker Studio, you need to have a SageMaker domain set up with a user profile that has the necessary permissions to launch the SageMaker Studio application. If you’re new to SageMaker Studio, the Quick Studio setup is the fastest way to get started. With a single click, SageMaker provisions the required domain with default presets, including setting up the user profile, IAM role, IAM authentication, and public internet access. The notebook associated with this post assumes the use of an ml.g5.48xlarge instance type. To review or increase your quota limits, navigate to the AWS Service Quotas console, choose AWS Services in the navigation pane, choose Amazon SageMaker, and refer to the value for Studio JupyterLab Apps running on ml.g5.48xlarge instances.

Request an increase in quota value greater than or equal to 1 for experimentation.

Meta Llama 3 8B Instruct is available under the Llama 3 license. To download the model from Hugging Face, you need an access token. If you don’t already have one, navigate to the Settings page on the Hugging Face website to obtain it.

Make sure that the SageMaker Studio role has the necessary permissions for SageMaker Ground Truth and Amazon S3 access. When you’re working in SageMaker Studio, you’re already using an IAM role, which you’ll need to modify for launching SageMaker Ground Truth labeling jobs. To enable SageMaker Ground Truth functionality, you should attach the AWS managed policy AmazonSageMakerGroundTruthExecution to your SageMaker Studio role. This policy provides the essential permissions for creating and managing labeling jobs.

For Amazon S3 access, scoping permissions to specific buckets and actions enhances security and aligns with best practices. This approach adheres to the principle of least privilege, reducing potential risks associated with overly permissive policies. The following is an example of a restricted Amazon S3 policy that grants only the necessary permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET-NAME>",
                "arn:aws:s3:::<YOUR-BUCKET-NAME>/*"
            ]
        }
    ]
}

To add these policies to your SageMaker Studio role, complete the following steps:

  1. On the IAM console, find and choose your SageMaker Studio role (it usually starts with AmazonSageMaker-ExecutionRole-).
  2. On the Permissions tab, choose Add permissions and then Attach policies.
  3. Search for and attach AmazonSageMakerGroundTruthExecution.
  4. Create and attach the custom Amazon S3 inline policy as shown in the preceding example, if needed.

Remember to follow the principle of least privilege, granting only the permissions necessary for your specific use case. Regularly review your IAM roles and policies to validate their alignment with your security requirements. For more details on IAM policies for SageMaker Ground Truth, refer to Use IAM Managed Policies with Ground Truth.

Set up the notebook and environment

To get started, open SageMaker Studio and create a JupyterLab space. For Instance, choose ml.g5.48xlarge. Run the space, open JupyterLab, and clone the code in the following GitHub repository. You can configure the JupyterLab space to use up to 100 GB in your Amazon Elastic Block Store (Amazon EBS) volume. In addition, the ml.g5 instance family comes with NVMe SSD local storage, which you can use in the JupyterLab application. The NVMe instance store directory is mounted to the application container in /mnt/sagemaker-nvme. For this post, you use the NVMe storage available in the ml.g5.48xlarge instance.

When your space is ready, clone the GitHub repo and open the notebook llama3/rlhf-genai-studio/RLHF-with-Llama3-on-Studio-DPO.ipynb, which contains the solution code. In the pop-up, make sure that the Python 3 kernel is selected.

Let’s go through the notebook. First, import the necessary Python libraries:

import torch
import os
import sagemaker
import boto3
import datetime
from transformers import pipeline
import json
import asyncio
import aiofiles
from datasets import Dataset, load_dataset
from peft import (
    get_peft_model,
    LoraConfig,
    prepare_model_for_kbit_training,
)
import bitsandbytes as bnb
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForSequenceClassification
)
from IPython.core.display import display, HTML

The following line sets the default path where you store temporary artifacts to the location in the NVMe storage:

cache_dir = "/mnt/sagemaker-nvme"

This is local storage, which means that your data will be lost when the JupyterLab application is deleted, restarted, or patched. Alternatively, you can increase your EBS volume of your SageMaker Studio space to greater than or equal to 100 GB to provide sufficient storage for the Meta Llama 3 base model, PEFT adapter, and new merged fine-tuned model.

Load Meta Llama 3 8B Instruct in the notebook

After you have imported the necessary libraries, you can download the Meta Llama 3 8B Instruct model and its associated tokenizers from Hugging Face:

base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    token=hf_access_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    cache_dir=cache_dir
)

model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=hf_access_token,
    cache_dir=cache_dir
)

Collect initial model responses for common and toxic questions

The example_bank_questions.txt file contains a list of common questions received by call centers in financial organizations combined with a list of toxic and off-topic questions.

Before you ask the model to generate answers to these questions, you need to specify the brand and core values of Example Bank. You will include these values in the prompt as context later so the model has the appropriate information it needs to respond.

company_context = """Example Bank is a next-generation digital bank on a mission to revolutionize the banking experience. Founded in 2020, we are committed to leveraging cutting-edge technology to make banking simple, accessible, and transparent for everyone. In Example Bank, we believe that banking should be seamless, intuitive, and tailored to the needs of modern consumers. Our founders, seasoned professionals from the tech and finance industries, set out to create a bank that puts people first, empowering them to take control of their finances with ease. At Example Bank, we envision a world where banking is no longer a chore but a delightful experience. We are dedicated to breaking down barriers and democratizing access to financial services. Our goal is to empower individuals and businesses alike by providing them with the tools and resources they need to thrive in an increasingly digital landscape.
Our values:
- Innovation: We embrace cutting-edge technologies and continuously seek out innovative solutions to deliver the best possible banking experience. We are a digital-only bank, which means we don't have any physical branches. Instead, we offer all of our services online or through our mobile app. This allows us to keep our costs low and pass the savings on to our customers.
- Transparency: We are committed to being direct and honest with our customers. We believe that transparency is key to building trust, and we want our customers to feel confident that they are making informed decisions about their money. That's why we provide clear and concise information about our products and services, and we are always available to answer any questions our customers may have.
- Accessibility: Our services are designed to be inclusive and user-friendly, catering to a diverse range of customers, regardless of their financial backgrounds.
- Security: We prioritize the safety and security of our customers' data and assets, employing state-of-the-art encryption and cybersecurity measures.
In addition to our core values, Example Bank offers a range of innovative financial products and services:
- Loans: Whether you’re looking to buy a home, start a business, or finance a major purchase, our flexible loan options are designed to meet your needs. With competitive interest rates and a simple application process, obtaining a loan has never been easier.
- Credit Cards: Our credit cards come with a host of benefits including cashback rewards, low-interest rates, and no annual fees. Manage your spending effortlessly with real-time notifications and intuitive budgeting tools.
- Mobile Apps: Our user-friendly apps on the Google Play Store and Apple App Store offer a seamless banking experience. From checking balances to transferring funds, our apps ensure you have complete control of your finances at your fingertips.
- Savings and Investments: Grow your wealth with our high-yield savings accounts and a variety of investment options. Our financial advisors are available to help you make informed decisions tailored to your financial goals.
- Customer Support: We provide 24/7 customer support to assist with any inquiries or issues. Our dedicated team is always ready to help, ensuring you receive the best possible service at all times.
At Example Bank, we are committed to enhancing your financial well-being through innovation, transparency, and unparalleled service. Join us today and experience the future of banking.
"""

Now you’re ready to invoke the model. For each question in the file, you construct a prompt that contains the context and the actual question. You send the prompt to the model four times to generate four different outputs and save the results in the llm_responses.json file.

questions = 'example_bank_questions.txt'
llm_responses = os.path.join(sample_files_path, 'llm_responses.json')

from timeit import default_timer as timer
import tqdm.asyncio

async def invoke_model(question, context):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    messages = [
        {"role": "user", "content": f"{context}: {question}"}
    ]

    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    response = pipe(
        messages, 
        max_new_tokens=120, 
        do_sample=True,
        temperature=gl_temperature, 
        top_p=gl_top_p, 
        eos_token_id=terminators
    )[0]['generated_text'][-1]
    return response['content']

async def process_lines(file_path):
    results = []
    context = f"""{company_context} You are a customer service agent for {company_name} Sometimes you are smart with your answers. Answer the following customer question in one or two sentences:
    """
    async with aiofiles.open(file_path, 'r') as file:
        lines = [line async for line in file]
        for line in tqdm.asyncio.tqdm(lines, desc="Processing Question Bank"):
            start = timer()
            responses = await asyncio.gather(*[invoke_model(line, context) for _ in range(4)])
            result = {
                'context': context,
                'question': line.strip(),
                'responses': responses
            }
            end = timer()
            results.append(result)
    return results

results = await process_lines(questions)

with open(llm_responses, 'w') as file:
    json.dump(
        results, 
        file, 
        indent=4
    )

The following is an example entry from llm_responses.json.

Set up the SageMaker Ground Truth labeling job and human preference data

To fine-tune the model using DPO, you need to gather human preference data for the generated responses. SageMaker Ground Truth helps orchestrate the data collection process. It offers customizable labeling workflows and robust workforce management features for ranking tasks. This section shows you how to set up a SageMaker Ground Truth labeling job and invite a human workforce with requisite expertise to review the LLM responses and rank them.

Set up the workforce

A private workforce in SageMaker Ground Truth consists of individuals who are specifically invited to perform data labeling tasks. These individuals can be employees or contractors who have the required expertise to evaluate the model’s responses. Setting up a private workforce helps achieve data security and quality by limiting access to trusted individuals for data labeling.

For this use case, the workforce consists of the group of people who will rank the model responses. You can set up a private workforce using the SageMaker console by creating a private team and inviting members through email. For detailed instructions, refer to Create a Private Workforce (Amazon SageMaker Console).

Create the instruction template

With the instruction template, you can manage the UI and guide human annotators in reviewing model outputs. It needs to clearly present the model responses and provide a straightforward way for the annotators to rank them. Here, you use the text ranking template. This template allows you to display the instructions for the human reviewer and the prompts with the pregenerated LLM responses. The annotator reviews the prompt and responses and ranks the latter based on their alignment to the organization’s brand.

The definition of the template is as follows. The template shows a pane on the left with instructions from the job requester, a prompt at the top, and three LLM responses in the main body. The right side of the UI is where the annotator ranks the responses from most to least preferable.

  <html>
  <head>
    <meta charset="UTF-8" />
    <link rel="stylesheet" href="https://assets.crowd.aws/css/gen-ai-components.css" />
    <link rel="icon" href="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text y=%22.9em%22 font-size=%2290%22>🥇</text></svg>" />
    <title>Text Ranking Tool</title>
    <script src="https://assets.crowd.aws/gen-ai-components.js"></script>
  </head>

  <body>
    <div>
      <crowd-text-ranking
        crowd-form-element-id="crowd-form-submit"
        instructions='Rank the following responses from a language model according to their alignment to the organisation&#39;s brand.'
        ordinal-ranking-dimensions='[{"name":"BrandValue","allowTie":true}]'
        text='{{ task.input.source }}'
        responses='{{ task.input.responses | to_json }}' />
    </div>
    <crowd-form id="crowd-form-submit" style="display: none"></crowd-form>
    <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
  </body>
</html>

The template is saved locally on your Studio JupyterLab space EBS volume as instructions.template in a temporary directory. Then you upload this template file to your designated S3 bucket using s3.upload_file(), placing it in the specified bucket and prefix. This Amazon S3 hosted template will be referenced when you create the SageMaker Ground Truth labeling job, so workers see the correct interface for the text ranking task.

Preprocess the input data

Before you create the labeling job, verify that the input data matches the format expected by SageMaker Ground Truth and is saved as a JSON file in Amazon S3. You can use the prompts and responses in the llm_responses.json file to create the manifest file inp-manifest-trank.json. Each row in the manifest file contains a JSON object (source-responses pair). The previous entry now looks like the following code.

Upload the structured data to the S3 bucket so that it can be ingested by SageMaker Ground Truth.
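
A minimal sketch of these two steps (building the manifest and uploading it) is shown below. It assumes each manifest line needs a source string (the prompt shown to workers) and a responses list, matching the fields referenced by the template (task.input.source and task.input.responses); the S3 key layout is a placeholder, and bucket, prefix, and llm_responses come from earlier in the notebook.

import json
import boto3

# bucket, prefix, and llm_responses are defined earlier in the notebook.
manifest_path = "inp-manifest-trank.json"

# One JSON object per line: the prompt shown to workers ("source") plus the
# candidate answers to rank ("responses"), matching the fields used by the template.
with open(llm_responses) as f_in, open(manifest_path, "w") as f_out:
    for item in json.load(f_in):
        line = {
            "source": item["question"],        # assumption: the question is shown as the prompt
            "responses": item["responses"],    # the four pregenerated LLM answers
        }
        f_out.write(json.dumps(line) + "\n")

# Upload the manifest so the labeling job can read it from Amazon S3.
s3 = boto3.client("s3")
s3.upload_file(manifest_path, bucket, f"{prefix}/input/{manifest_path}")
model_responses_s3_uri = f"s3://{bucket}/{prefix}/input/{manifest_path}"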

Create the labeling job

Now you’re ready to configure and launch the labeling job using the SageMaker API from within the notebook. This involves specifying the work team, UI template, and data stored in the S3 bucket. By setting appropriate parameters such as task time limits and the number of workers per data object, you can run jobs efficiently and effectively. The following code shows how to start the labeling job:

sm_client.create_labeling_job(
    LabelingJobName=labeling_job_name,
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': model_responses_s3_uri
            }
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://{}/{}/output/'.format(bucket,prefix) #Enter S3 URI of Output folder
    },
    RoleArn=role, 
    HumanTaskConfig={
        'WorkteamArn': WORKTEAM_ARN,
        'UiConfig':{
            'UiTemplateS3Uri': UI_TEMPLATE_S3_URI
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-PassThrough',
        'TaskKeywords': [
            'QnA',
        ],
        'TaskTitle': 'Rank LLM responses',
        'TaskDescription': "Rank the responses provided by the LLM",
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 60*30,
        'TaskAvailabilityLifetimeInSeconds': 60*60*24*10,
        'MaxConcurrentTaskCount': 100,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-PassThrough'
        } 
    }
)

As the job is launched, it’s essential to monitor its progress closely, making sure tasks are being distributed and completed as expected.

Gather human feedback through the labeling portal

When the job setup is complete, annotators can log in to the labeling portal and start ranking the model responses.

Workers can first consult the Instructions pane to understand the task, then use the main interface to evaluate and rank the model’s responses according to the given criteria. The following screenshot illustrates the UI.

The human feedback is collected and stored in an S3 bucket. This feedback will be the basis for DPO. With this data, you will fine-tune the Meta Llama 3 model and align its responses with the organization’s values, improving its overall performance.

Align Meta Llama 3 8B Instruct with the DPOTrainer

In this section, we show how to use the preference dataset that you prepared using SageMaker Ground Truth to fine-tune the model using DPO. DPO explicitly optimizes the model’s output based on human evaluations. It aligns the model’s behavior more closely with human expectations and improves its performance on tasks requiring nuanced understanding and contextual appropriateness. By integrating human preferences, DPO enhances the model’s relevance, coherence, and overall effectiveness in generating desired responses.

DPO makes it more straightforward to preference-tune a model in comparison to other popular techniques such as Proximal Policy Optimization (PPO). DPO eliminates the necessity for a separate rewards model, thereby avoiding the cost associated with training it. Additionally, DPO requires significantly less data to achieve performance comparable to PPO.

Fine-tuning a language model using DPO consists of two steps:

  1. Gather a preference dataset with positive and negative selected pairs of generation, given a prompt.
  2. Fine-tune the language model by directly maximizing the log-likelihood of the preference data under the DPO objective (equivalently, minimizing the DPO loss).

To learn more about the DPO algorithm, refer to the original whitepaper, Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (Rafailov et al., 2023).
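
Concretely, the objective minimized during DPO fine-tuning, as given in that paper, is

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\ \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

where \pi_\theta is the policy being fine-tuned, \pi_{\mathrm{ref}} is the frozen reference model, \sigma is the logistic function, \beta controls how far the policy may drift from the reference, and (x, y_w, y_l) is a prompt with its chosen and rejected responses.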

Expected data format

The DPO trainer expects a very specific format for the dataset, which contains sentence pairs where one sentence is a chosen response and the other is a rejected response. This is represented as a Python dictionary with three keys:

  • prompt – Consists of the context prompt given to a model at inference time for text generation
  • chosen – Contains the preferred generated response to the corresponding prompt
  • rejected – Contains the response that is not preferred or should not be the sampled response for the given prompt
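
For instance, a single record in this format might look like the following sketch; the wording is purely illustrative (drawn from the Example Bank scenario) and not taken from the collected dataset:

# Hypothetical example of one DPO training record (keys as expected by the DPOTrainer).
dpo_example = {
    "prompt": "You are a customer service agent for Example Bank. Does Example Bank have physical branches?",
    "chosen": "Example Bank is a digital-only bank, so there are no physical branches; everything is available online or in our mobile app.",
    "rejected": "Yes, you can visit any of our branches during normal business hours.",
}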

The following function definition illustrates how to process the data stored in Amazon S3 to create a DPO dataset with sample pairs and a prompt:

def return_prompt_and_responses(samples, index):
    prompt = f"{samples['context']}nn{samples['question']}"
    chosen_index = response_rankings[index]["responseRankings"].index(1)
    rejected_index = response_rankings[index]["responseRankings"].index(4)

    prompt = {"role": "user", "content": prompt},

    chosen_messages = [
        {"role": "assistant", "content": samples["responses"][chosen_index]},
    ]
    rejected_messages = [
        # {"role": "system", "content": prompt},
        {"role": "assistant", "content": samples["responses"][rejected_index]},
    ]
    
    return {
        "prompt": tokenizer.apply_chat_template(prompt, tokenize=False),
        "chosen": "{}".format(tokenizer.apply_chat_template(chosen_messages, tokenize=False).replace('<|begin_of_text|>', '')),
        "rejected": "{}".format(tokenizer.apply_chat_template(rejected_messages, tokenize=False).replace('<|begin_of_text|>', ''))
    }

Here is an example sentence pair:

You split the DPO trainer dataset into train and test samples using an 80/20 split and save the splits to disk in preparation for DPO fine-tuning:

dataset = prepared_dataset.train_test_split(test_size=0.2)

dataset["train"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "train_dataset.json"), 
    orient="records", 
    index="False"
)

dataset["test"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "test_dataset.json"), 
    orient="records", 
    index="False"

Supervised fine-tuning using DPO

Now that the dataset is formatted for the DPO trainer, you can use the train and test datasets prepared earlier to initiate the DPO model fine-tuning. Meta Llama 3 8B belongs to the category of small language models, but even a model of this size barely fits into a SageMaker ML instance like ml.g5.48xlarge in fp16 or fp32, leaving little room for full fine-tuning. You can use PEFT with DPO to fine-tune Meta Llama 3 8B’s responses based on human preferences. PEFT is a method of fine-tuning that focuses on training only a subset of the pre-trained model’s parameters. This approach involves identifying the most important parameters for the new task and updating only those parameters during training. By doing so, PEFT can significantly reduce the computation required for fine-tuning. See the following code:

# configure PEFT module
peft_config = LoraConfig(
    r=512,
    lora_alpha=1024,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules="all-linear",

For a full list of LoraConfig training arguments, refer to LoRA. At a high level, you need to initialize the DPOTrainer with the following components: the model you want to train, a reference model (ref_model) used to calculate the implicit rewards of the preferred and rejected responses, the beta hyperparameter that controls the balance between the implicit rewards assigned to the preferred and rejected responses, and a dataset containing prompt, chosen, and rejected responses. If ref_model=None, the trainer will create a reference model with the same architecture as the input model to be optimized. See the following code:

from trl import DPOConfig, DPOTrainer

dpo_model_dir = "/path/to/save/dpo/model"

args = DPOConfig(
    output_dir=dpo_model_dir,               # directory to save and repository id
    num_train_epochs=5,                     # number of training epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim = "adamw_torch_fused",            # use fused adamw optimizer
    learning_rate=1e-5,                     # 10x higher LR than QLoRA paper
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.1,                       # warmup ratio based on QLoRA paper
    lr_scheduler_type="cosine",             # use cosine learning rate scheduler
    logging_steps=10,                       
    save_steps=10,                         # when to save checkpoint
    evaluation_strategy="steps",            
    eval_steps=100,
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    push_to_hub=False,                      # push model to hub,
    report_to='tensorboard',
    remove_unused_columns=False
)

dpo_args = {
    "beta": 0.1,                            # The beta factor in DPO loss. Higher beta means less divergence
    "loss_type": "sigmoid"                  # The loss type for DPO.
}

trainer = DPOTrainer(
    model,
    ref_model=None,
    peft_config=peft_config,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=max_seq_length,
    max_prompt_length=prompt_length,
    beta=dpo_args["beta"],
    loss_type=dpo_args["loss_type"],
)

# kick-off model training
trainer.train()

Once you start the training, you can see the status in the notebook.

When model fine-tuning is complete, save the PEFT adapter model to disk and merge it with the base model to create a newly tuned model. You can use the saved model for local inference and validation or deploy it as a SageMaker endpoint after you have gained sufficient confidence in the model’s responses.

peft_output_dir = "/path/to/save/tuned/model/"
print(f"saving peft model to: {peft_output_dir}")
trainer.save_model(output_dir=peft_output_dir)
...
...
merged_model = model.merge_and_unload()
...
...
merged_model.save_pretrained(
    new_dpo_output_dir,
    safe_serialization=True,
    max_shard_size="9GB"
)
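The elided lines will vary with how your notebook is structured; if you are reloading the adapter from disk before the merge_and_unload call shown above, a minimal sketch (assuming the peft library’s AutoPeftModelForCausalLM helper and the peft_output_dir path from the previous cell) could look like this:

import torch
from peft import AutoPeftModelForCausalLM

# Reload the base model with the DPO-trained LoRA adapter attached
# (peft_output_dir is the adapter directory saved in the previous cell)
model = AutoPeftModelForCausalLM.from_pretrained(
    peft_output_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
# model.merge_and_unload() then folds the adapter weights into the base weights, as shown above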

Evaluate the fine-tuned model inside a SageMaker Studio notebook

Before you host your model for inference, verify that its responses are optimized in line with user preferences. You can collect the model’s responses both before and after DPO fine-tuning and compare them side by side, as shown in the following table.

The DPO Model Response column shows the RLHF-aligned model’s response after fine-tuning, and the Rejected Model Response column shows the model’s response to the same input prompt before fine-tuning.
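One way to produce this comparison is to run the same prompts through the base model and the merged DPO model with two transformers text-generation pipelines. The following is a sketch only; the model ID, output directory, and example prompt are assumptions to replace with your own.

from transformers import pipeline

# Assumptions: new_dpo_output_dir holds the merged DPO model from the previous step,
# and base_model_id points at the original Meta Llama 3 8B Instruct weights
base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

base_pipe = pipeline("text-generation", model=base_model_id, device_map="auto")
dpo_pipe = pipeline("text-generation", model=new_dpo_output_dir, device_map="auto")

eval_prompts = [
    "Explain the trade-offs of remote work to a first-time manager.",  # hypothetical example prompt
]

for prompt in eval_prompts:
    messages = [{"role": "user", "content": prompt}]
    before = base_pipe(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    after = dpo_pipe(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    print(f"Prompt: {prompt}\nBefore DPO: {before}\nAfter DPO: {after}\n")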

Deploy the model to a SageMaker endpoint

After you have gained sufficient confidence in your model, you can deploy it to a SageMaker endpoint for real-time inference. SageMaker endpoints are fully managed and provide auto scaling capabilities. For this post, we use DJL Serving to host the fine-tuned, DPO-aligned Meta Llama3 8B model. To learn more about hosting your LLM using DJL Serving, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

To deploy an LLM directly from your SageMaker Studio notebook using DJL Serving, complete the following steps:

  1. Upload model weights and other model artifacts to Amazon S3.
  2. Create a meta-model definition file called serving.properties. This definition file dictates how the DJL Serving container is configured for inference.

engine=DeepSpeed
option.tensor_parallel_degree=1
option.s3url=s3://<MY-TEST-BUCKET>/llama3-dpo-ft/modelweights
option.hf_access_token=hf_xx1234

  3. Create a custom inference file called model.py, which defines custom inference logic:
%%writefile llama3-serving-model/model.py

from djl_python import Input, Output
...

predictor = None


def get_model(properties):

    ...
    return generator


def handle(inputs: Input) -> None:
    ...
    outputs = predictor(message, **generation_kwargs)[0]['generated_text'][-1]
    result = {"outputs": outputs['content']}
    return Output().add(result)
  4. Deploy the DPO fine-tuned model as a SageMaker endpoint:
from sagemaker import image_uris
from sagemaker.model import Model
from datetime import datetime

inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=region,
    version="0.23.0"
)

...

dpo_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=f"ep-{dpo_model.name}",
    container_startup_health_check_timeout=900,
    wait=False, # <-- Set to True, if you would prefer to wait 6-8 minutes for the endpoint to spin up
)
  5. Invoke the hosted model for inference using the sagemaker.Predictor class:
dpo_ft_predictor = sagemaker.Predictor(
    endpoint_name="my_custom_dpo_endpoint",
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)
...
# invoke inference
response = dpo_ft_predictor.predict(
    {
        "inputs": content,
        "parameters": parameters
    }
)
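The content and parameters values passed above are whatever chat payload and generation settings your model.py expects; a hypothetical example (the prompt text and settings are illustrative only) might be:

# Hypothetical request payload; adjust to match the schema handled in model.py
content = [
    {"role": "user", "content": "Summarize the main benefits of preference-based fine-tuning."}
]
parameters = {
    "max_new_tokens": 256,   # cap the length of the generated response
    "temperature": 0.2,      # keep responses relatively deterministic
    "top_p": 0.9,
}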

Clean up

After you complete your tasks in the SageMaker Studio notebook, remember to stop your JupyterLab workspace to prevent incurring additional charges. You can do this by choosing Stop next to your JupyterLab space. Additionally, you have the option to set up lifecycle configuration scripts that will automatically shut down resources when they’re not in use.

If you deployed the model to a SageMaker endpoint, run the following code at the end of the notebook to delete the endpoint:

# delete your endpoint (named to match the deployment step above)
sm_client.delete_endpoint(EndpointName=f"ep-{dpo_model.name}")

Conclusion

Amazon SageMaker offers tools to streamline the process of fine-tuning LLMs to align with human preferences. With SageMaker Studio, you can experiment interactively with different models, questions, and fine-tuning techniques. With SageMaker Ground Truth, you can set up workflows, manage teams, and collect consistent, high-quality human feedback.

In this post, we showed how to enhance the performance of Meta Llama 3 8B Instruct by fine-tuning it using DPO on data collected with SageMaker Ground Truth. To get started, launch SageMaker Studio and run the notebook available in the following GitHub repo. Share your thoughts in the comments section!


About the Authors

Anastasia Tzeveleka is a GenAI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers build foundation models and create scalable generative AI and machine learning solutions using AWS services.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.

Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers build scalable and cost-efficient AI/ML pipelines with Human in the Loop services. In his free time, Sundar loves traveling, sports and enjoying outdoor activities with his family.

Read More

Amazon EC2 P5e instances are generally available

Amazon EC2 P5e instances are generally available

State-of-the-art generative AI models and high performance computing (HPC) applications are driving the need for unprecedented levels of compute. Customers are pushing the boundaries of these technologies to bring higher fidelity products and experiences to market across industries.

The size of large language models (LLMs), as measured by the number of parameters, has grown exponentially in recent years, reflecting a significant trend in the field of AI. Model sizes have increased from billions of parameters to hundreds of billions of parameters within a span of 5 years. As LLMs have grown larger, their performance on a wide range of natural language processing tasks has also improved significantly, but the increased size of LLMs has led to significant computational and resource challenges. Training and deploying these models requires vast amounts of computing power, memory, and storage.

The size of an LLM has a significant impact on the choice of compute needed for inference. Larger LLMs require more GPU memory to store the model parameters and intermediate computations, as well as greater computational power to perform the matrix multiplications and other operations needed for inference. Large LLMs take longer to perform a single inference pass due to this increased computational complexity. This increased compute requirement can lead to higher inference latency, which is a critical factor for applications that require real-time or near real-time responses.
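To make the memory pressure concrete, a rough back-of-the-envelope estimate of the memory needed just to hold the weights is parameters × bytes per parameter. The short sketch below assumes fp16/bf16 weights (2 bytes per parameter) and ignores the KV cache and activations, which add more on top.

# Rough GPU memory needed to hold model weights alone
def weight_memory_gb(num_parameters: float, bytes_per_param: int = 2) -> float:
    return num_parameters * bytes_per_param / 1e9

print(f"70B model:  ~{weight_memory_gb(70e9):.0f} GB")   # ~140 GB of weights
print(f"405B model: ~{weight_memory_gb(405e9):.0f} GB")  # ~810 GB of weights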

HPC customers exhibit similar trends. With the fidelity of HPC customer data collection increasing and datasets reaching exabyte scale, customers are looking for ways to enable faster time to solution across increasingly complex applications.

To address customer needs for high performance and scalability in deep learning, generative AI, and HPC workloads, we are happy to announce the general availability of Amazon Elastic Compute Cloud (Amazon EC2) P5e instances, powered by NVIDIA H200 Tensor Core GPUs. AWS is the first leading cloud provider to offer the H200 GPU in production. Additionally, we are announcing that P5en instances, a network optimized variant of P5e instances, are coming soon.

In this post, we discuss the core capabilities of these instances and the use cases they’re well-suited for, and walk you through an example of how to get started with these instances and carry out inference deployment of Meta Llama 3.1 70B and 405B models on them.

EC2 P5e instances overview

P5e instances are powered by NVIDIA H200 GPUs with 1.7 times more GPU memory capacity and 1.5 times faster GPU memory bandwidth as compared to NVIDIA H100 Tensor Core GPUs featured in P5 instances.

P5e instances incorporate 8 NVIDIA H200 GPUs with 1128 GB of high bandwidth GPU memory, 3rd Gen AMD EPYC processors, 2 TiB of system memory, and 30 TB of local NVMe storage. P5e instances also provide 3,200 Gbps of aggregate network bandwidth with support for GPUDirect RDMA, enabling lower latency and efficient scale-out performance by bypassing the CPU for internode communication.

The following table summarizes the details for the instance.

Instance Size: p5e.48xlarge
vCPUs: 192
Instance Memory: 2 TiB
GPU: 8 x NVIDIA H200
GPU Memory: 1128 GB HBM3e
Network Bandwidth: 3,200 Gbps EFA
GPUDirect RDMA: Yes
GPU Peer to Peer: 900 GB/s NVSwitch
Instance Storage: 8 x 3.84 TB NVMe SSD
EBS Bandwidth: 80 Gbps

EC2 P5en instances coming soon

One of the bottlenecks in GPU-accelerated computing may lie in the communication between CPUs and GPUs. The transfer of data between these two components can be time-consuming, especially for large datasets or workloads that require frequent data exchanges. This challenge can impact a wide range of GPU-accelerated applications, such as deep learning, high performance computing, and real-time data processing. The need to move data between the CPU and GPU can introduce latency and reduce overall efficiency. Additionally, network latency can become an issue for ML workloads on distributed systems, because data needs to be transferred between multiple machines.

EC2 P5en instances, coming soon in 2024, can help solve these challenges. P5en instances pair the NVIDIA H200 GPUs with custom 4th Generation Intel Xeon Scalable processors, enabling PCIe Gen 5 between CPU and GPU. These instances will provide up to four times the bandwidth between CPU and GPU and lower network latency, thereby improving workload performance.

P5e use cases

P5e instances are ideal for training, fine-tuning, and running inference for increasingly complex LLMs and multimodal foundation models (FMs) behind the most demanding and compute-intensive generative AI applications, including question answering, code generation, video and image generation, speech recognition, and more.

Customers deploying LLMs for inference can benefit from using P5e instances, which offer several key advantages that make them an excellent choice for these workloads.

Firstly, the higher memory bandwidth of the H200 GPUs in the P5e instances allows the GPU to fetch and process data from memory more quickly. This translates to reduced inference latency, which is critical for real-time applications like conversational AI systems where users expect near-instant responses. The higher memory bandwidth also enables higher throughput, allowing the GPU to process more inferences per second. Customers deploying the 70-billion-parameter Meta Llama 3.1 model on P5e instances can expect up to 1.87 times higher throughput and up to 40% lower cost compared to using comparable P5 instances (benchmarked with input sequence length 121, output sequence length 5000, batch size 10, vLLM framework).

Secondly, the massive scale of modern LLMs, with hundreds of billions of parameters, requires an immense amount of memory to store the model and intermediate computations during inference. On the standard P5 instances, this would likely necessitate multiple instances to accommodate the memory requirements. However, the P5e instances’ 1.76 times higher GPU memory capacity enables you to scale up by using a single instance to fit the entire model. This avoids the complexity and overhead associated with distributed inference systems, such as data synchronization, communication, and load balancing. Customers deploying the 405-billion-parameter Meta Llama 3.1 model on a single P5e instance can expect up to 1.7 times higher throughput and up to 69% lower cost compared to using two P5 instances (benchmarked with input sequence length 121, output sequence length 50, batch size 10, vLLM framework).

Finally, the higher GPU memory of the P5e instances also enables the use of larger batch sizes during inference for better GPU utilization, resulting in faster inference times and higher overall throughput. This additional memory can be particularly beneficial for customers with high-volume inference requirements.

When optimizing inference throughput and cost, consider adjusting batch size, input/output sequence length, and quantization level, because these parameters can have a substantial impact. Experiment with different configurations to find the optimal balance between performance and cost for your specific use case.
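As an illustration of those knobs, a minimal vLLM sketch for serving the 70B model on a p5e.48xlarge might look like the following; the model ID, parallelism degree, and sampling settings are assumptions to adapt to your own benchmarking.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # example Hugging Face model ID
    tensor_parallel_size=8,        # shard the weights across the 8 H200 GPUs in p5e.48xlarge
    max_num_seqs=10,               # maximum number of sequences batched together
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
)

sampling = SamplingParams(max_tokens=500, temperature=0.7)
outputs = llm.generate(["Explain why GPU memory bandwidth matters for LLM inference."], sampling)
print(outputs[0].outputs[0].text)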

In summary, the combination of higher memory bandwidth, increased GPU memory capacity, and support for larger batch sizes make the P5e instances an excellent choice for customers deploying LLM inference workloads. These instances can deliver significant performance improvements, cost savings, and operational simplicity compared to alternative options.

P5e instances are also well-suited for memory-intensive HPC applications like simulations, pharmaceutical discovery, seismic analysis, weather forecasting, and financial modeling. Customers using dynamic programming (DP) algorithms for applications like genome sequencing or accelerated data analytics can also see further benefit from P5e through support for the DPX instruction set.

Get started with P5e instances

To get started with P5e instances, you can use AWS Deep Learning AMIs (DLAMI), which provide ML practitioners and researchers with the infrastructure and tools to quickly build scalable, secure, distributed ML applications in preconfigured environments. You can also run containerized applications on P5e instances with AWS Deep Learning Containers using libraries for Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).
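As a sketch only, launching a single p5e.48xlarge from a DLAMI with boto3 might look like the following; the AMI ID, key pair, and capacity reservation ID are placeholders, and because P5e capacity is obtained through EC2 Capacity Blocks for ML, you would target your own reservation.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # P5e is currently available in US East (Ohio)

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: use the current DLAMI ID for your Region
    InstanceType="p5e.48xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",             # placeholder key pair
    InstanceMarketOptions={"MarketType": "capacity-block"},
    CapacityReservationSpecification={
        "CapacityReservationTarget": {"CapacityReservationId": "cr-0123456789abcdef0"}  # placeholder
    },
)
print(response["Instances"][0]["InstanceId"])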

P5e instances now available

EC2 P5e instances are now available in the US East (Ohio) AWS Region in the p5e.48xlarge size through Amazon EC2 Capacity Blocks for ML. For more information, refer to Amazon EC2 P5 Instances.


About the authors

Avi Kulkarni is a Senior Specialist focusing on worldwide business development and go-to-market for ML and HPC workloads across both commercial and public sector customers. Previously, he managed partnerships at AWS and led product management for automotive customers at Honeywell, covering electrified, autonomous, and traditional vehicles.

Karthik Venna is a Principal Product Manager at AWS. He leads development of EC2 instances for a wide variety of workloads including deep learning and generative AI.

Khaled Rawashdeh is a Senior Product Manager at AWS. He defines and creates Amazon EC2 accelerated computing instances for the most demanding AI/machine learning workloads. Before joining AWS, he worked for leading companies focused on creating data center software and systems for enterprise customers.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Pavel Belevich is a Senior Applied Scientist in the ML Frameworks team at Amazon Web Services. He applies his research in distributed training and inference of large models to real-life customer needs. Before joining AWS, Pavel worked on the PyTorch Distributed team on various distributed training techniques such as FSDP and pipeline parallelism.

Dr. Maxime Hugues is a Principal WW Specialist Solutions Architect for generative AI at AWS, which he joined in 2020. He holds an M.E. from the French National Engineer School “ISEN-Toulon”, an M.S. degree from the University of Science, and a Ph.D. in Computer Science (2011) from the University of Lille 1. His research mainly focused on programming paradigms, innovative hardware for extreme-scale computers, and HPC/machine learning performance. Prior to joining AWS, he worked as an HPC Research Scientist and tech lead at TotalEnergies.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Read More

Exploring data using AI chat at Domo with Amazon Bedrock

Exploring data using AI chat at Domo with Amazon Bedrock

This post is co-written with Joe Clark from Domo.

Data insights are crucial for businesses to enable data-driven decisions, identify trends, and optimize operations. Traditionally, gaining these insights required skilled analysts using specialized tools, which can make the process slow and less accessible.

Generative artificial intelligence (AI) has revolutionized this by allowing users to interact with data through natural language queries, providing instant insights and visualizations without needing technical expertise. This can democratize data access and speed up analysis.

However, companies can face challenges when using generative AI for data insights, including maintaining data quality, addressing privacy concerns, managing model biases, and integrating AI systems with existing workflows.

Domo is a cloud-centered data experiences innovator that empowers users to make data-driven decisions. Powered by AI and data science, Domo’s user-friendly dashboards and apps make data actionable, driving exponential business impact. Domo connects, transforms, visualizes, and automates data through simple integrations and intelligent automation, strengthening the entire data journey.

In this post, we share how Domo uses Amazon Bedrock to provide a flexible and powerful AI solution.

Domo’s purpose of using generative AI

The Domo enterprise data environment caters to a diverse customer base with varying data-driven requirements. Domo works with organizations that place a strong emphasis on deriving actionable insights from their data assets. Domo’s existing solution already enables these organizations to extract valuable insights through data visualization and analysis. The next step is to provide them with a more intuitive and conversational interface to interact with their data, empowering them to generate meaningful visualizations and reports through natural language interactions.

Domo.AI powered by Amazon Bedrock

Domo.AI simplifies data exploration and analysis by intelligently guiding you at every turn, from data preparation to forecasting to automation. It does this with natural language conversation, contextual and personalized insights with narrative and visual responses, and robust security and governance for a guided risk control experience.

Domo’s AI Service Layer is the foundation of the Domo.AI experience. Domo uses the Domo AI Service Layer with Amazon Bedrock to provide customers with a flexible and powerful AI solution. The AI Service Layer allows Domo to switch between different models provided by Amazon Bedrock for individual tasks and track their performance across key metrics like accuracy, latency, and cost. This enables Domo to optimize model performance through prompt engineering, preprocessing, and postprocessing, and provide contextual information and examples to the AI system. The AI Service Layer and its integration with Amazon Bedrock empower Domo to offer their customers the tools they need to harness AI throughout their organization, from data exploration using natural language-driven AI chat to custom applications and automations powered by a variety of AI models.
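Domo’s AI Service Layer itself isn’t public, but the routing-and-measurement pattern described here can be sketched generically. The following hypothetical example (the task-to-model table and metric handling are illustrative, not Domo’s implementation) routes each task to an Amazon Bedrock model through the Converse API and records latency alongside the response:

import time
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical task-to-model routing table
TASK_MODELS = {
    "sql_generation": "anthropic.claude-3-sonnet-20240229-v1:0",
    "summarization": "anthropic.claude-3-haiku-20240307-v1:0",
}

def run_task(task: str, prompt: str) -> dict:
    model_id = TASK_MODELS[task]
    start = time.time()
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    latency_s = time.time() - start
    # Latency is tracked per call; accuracy and cost would come from offline evaluation and billing data
    return {
        "model": model_id,
        "latency_s": round(latency_s, 2),
        "text": response["output"]["message"]["content"][0]["text"],
    }

print(run_task("summarization", "Summarize this week's support ticket trends in two sentences."))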

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can quickly experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that orchestrate tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you’re already familiar with.

Solution overview

The following diagram illustrates the solution architecture and data flow.

The workflow includes the following steps:

  1. End-users interact with Domo.AI either through their website or mobile app. The end-user request first goes through an AI chat agent. The AI chat agent uses the capability of large language models (LLMs) to interpret user input, determine how to solve the user question or request using available tools, and form a final response. The request goes through guardrails, which are mechanisms and strategies to enforce the responsible, ethical, and safe use of the AI model. This helps make sure the responses generated by the AI chat agent are aligned with the organization’s responsible AI policies and don’t contain inappropriate or harmful content. Domo uses custom business logic to implement safeguards in their generative AI applications that are customized to their customers’ use cases and responsible AI policies.
  2. The Agent Planner component is responsible for orchestrating the various tasks required to fulfill the end-user request. It calls the Amazon Bedrock service through an API to create an execution plan, which involves selecting the appropriate tools and models to retrieve relevant information or perform custom actions. The tools refer to the various capabilities or actions that the AI chat agent can use to gather information and perform tasks. The tools provide the agent with access to data and functionality beyond what is available in the underlying LLM.
  3. The Tool Execution component is the process of invoking the selected tools and integrating their outputs to generate the final response. This allows the agent to go beyond the knowledge contained in the LLM and incorporate up-to-date information or perform domain-specific operations.
  4. As tools are run, user input is used to find semantically relevant information using vector search, or to query private data from sources such as Amazon Redshift using Domo Cloud Amplifier, a native integration with cross-cloud systems that unlocks data products at the speed businesses need them. Vector search is a technique for finding semantically relevant information in unstructured data sources, such as knowledge base articles or other documents. By creating vector embeddings of the content, the AI chat agent can efficiently search for and retrieve the information most relevant to the user’s query, even if the exact phrasing isn’t present in the source material (see the sketch after this list).
  5. Information such as search or query results from Step 3 is returned to the AI chat agent, where either the agent solver component can aggregate results to formulate a final response, or, in the case of more complex queries, the agent can run another tool.
  6. Each of the components in the solution uses the Domo AI Service Layer with Amazon Bedrock for planning and reasoning, for converting user questions into SQL queries, or for creating embeddings for vector search, and returns results as a natural language answer to the user’s question grounded in private customer data.
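As referenced in step 4, the vector search step can be sketched with Amazon Bedrock embeddings; the following minimal example assumes the Amazon Titan Text Embeddings V2 model is enabled in the account and uses cosine similarity over a toy in-memory document list rather than a real vector store.

import json
import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> np.ndarray:
    # Assumption: amazon.titan-embed-text-v2:0 is enabled for this account
    body = json.dumps({"inputText": text})
    response = bedrock_runtime.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    return np.array(json.loads(response["body"].read())["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "Quarterly revenue grew 12% year over year.",
    "Support ticket volume fell sharply in March.",
]
query = "How did revenue change last quarter?"

query_vec = embed(query)
scores = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
print(max(scores))  # the most semantically relevant document, even without exact keyword overlap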

The following video of Domo.AI provides a more detailed overview of the product’s key features and capabilities.

Why Domo chose Amazon Bedrock

Domo chose Amazon Bedrock for the following benefits and features:

  • Model choice – Amazon Bedrock provides Domo with access to a wide range of models, including best-in-class options and those from various providers such as Anthropic, AI21 Labs, Cohere, Meta, and Stability AI. This variety allows Domo to extensively test their services using different models, enabling them to select the most suitable option for each specific use case. As a result, Domo can accelerate their development process and deliver value to their customers more rapidly by taking advantage of this flexibility in model selection and experimentation.
  • Security, compliance, and global infrastructure – Amazon Bedrock addresses crucial security and compliance concerns for Domo and their customers. With Amazon Bedrock, Domo makes sure that data remains within the AWS hosting environment, helping prevent model providers from accessing or training on customer data. With encryption in transit and at rest, along with restricted access to model deployment accounts, Amazon Bedrock provides robust data protection. Additionally, Domo has implemented multiple guardrails with varied control combinations to suit different applications and use cases. Amazon Bedrock offers a single API for inference, which facilitates secure communication between users and the FM. Additionally, the global infrastructure and compliance features of Amazon Bedrock enable Domo to scale and deploy their generative AI applications worldwide while adhering to data privacy laws and best practices.
  • Cost – By using Amazon Bedrock, Domo has achieved significant cost savings, reporting a 50% reduction compared to similar models from other providers. The serverless access to high-quality LLMs eliminates the need for substantial upfront infrastructure investments typically associated with LLMs. This cost-effective approach allows Domo to experiment with and test various models without incurring the hefty expenses usually linked to LLM implementation and maintenance, thereby optimizing their resource allocation and improving overall operational efficiency.

In the following video, Joe Clark, Software Architect at Domo, shares how AWS has been instrumental for Domo in the generative AI space.

Getting started with Amazon Bedrock

With Amazon Bedrock, teams and individuals can immediately start using FMs without having to worry about provisioning infrastructure or setting up and configuring ML frameworks.

Before you get started, verify that your user or role has permission to create or modify Amazon Bedrock resources. For details, see Identity-based policy examples for Amazon Bedrock.

To access the models in Amazon Bedrock, on the Amazon Bedrock console, choose Model access in the navigation pane. Review the EULA and enable the FMs you’d like in your account.

You can start interacting with the FMs through the Amazon Bedrock console playgrounds, the AWS Command Line Interface (AWS CLI), or the AWS SDKs.
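For example, a minimal sketch using the AWS SDK for Python (boto3) and the Amazon Bedrock Converse API is shown below; the model ID is an example, and the corresponding model must already be enabled in your account.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID; use any model enabled in your account
    messages=[{"role": "user", "content": [{"text": "Summarize last quarter's sales trends in three bullet points."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])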

Conclusion

Amazon Bedrock has been instrumental in enhancing data insights and visualization capabilities at Domo through generative AI. By providing flexibility in FM selection, secure access, and a fully managed experience, Amazon Bedrock has enabled Domo to deliver more value to their customers while reducing costs. The service’s security and compliance features have also allowed Domo to serve customers in highly regulated industries. By using Amazon Bedrock, Domo has seen a 50% reduction in cost compared to a similarly performing model from another provider.

If you’re ready to start building your own FM innovation with Amazon Bedrock, refer to Getting started with Amazon Bedrock. To learn more about other intriguing Amazon Bedrock applications, see the Amazon Bedrock section of the AWS Machine Learning Blog.


About the Authors

Joe Clark is a software architect for the Domo Labs team and lead architect for Domo’s AI Service Layer, AI Chat, and Model Management. At Domo, Joe has also led development of features including Jupyter Workspaces, Sandbox, and Code Engine. With 15 years of professional software development experience, he has previously worked on IoT and smart city initiatives.

Aman Tiwari is a General Solutions Architect working with independent software vendors in the data and generative AI vertical at AWS. He helps them design innovative, resilient, and cost-effective solutions using AWS services. He holds a master’s degree in Telecommunications Networks from Northeastern University. Outside of work, he enjoys playing lawn tennis and reading books.

Sindhu Jambunathan is a Senior Solutions Architect at AWS, specializing in supporting ISV customers in the data and generative AI vertical to build scalable, reliable, secure, and cost-effective solutions on AWS. With over 13 years of industry experience, she joined AWS in May 2021 after a successful tenure as a Senior Software Engineer at Microsoft. Sindhu’s diverse background includes engineering roles at Qualcomm and Rockwell Collins, complemented by a Master of Science in Computer Engineering from the University of Florida. Her technical expertise is balanced by a passion for culinary exploration, travel, and outdoor activities.

Mohammad Tahsin is an AI/ML Specialist Solutions Architect at Amazon Web Services. He lives for staying up to date with the latest technologies in AI/ML and helping guide customers to deploy bespoke solutions on AWS. Outside of work, he loves all things gaming, digital art, and cooking.

Read More