NVIDIA Grace Hopper Superchip Sweeps MLPerf Inference Benchmarks

NVIDIA Grace Hopper Superchip Sweeps MLPerf Inference Benchmarks

In its debut on the MLPerf industry benchmarks, the NVIDIA GH200 Grace Hopper Superchip ran all data center inference tests, extending the leading performance of NVIDIA H100 Tensor Core GPUs.

The overall results showed the exceptional performance and versatility of the NVIDIA AI platform from the cloud to the network’s edge.

Separately, NVIDIA announced inference software that will give users leaps in performance, energy efficiency and total cost of ownership.

GH200 Superchips Shine in MLPerf

The GH200 links a Hopper GPU with a Grace CPU in one superchip. The combination provides more memory, bandwidth and the ability to automatically shift power between the CPU and GPU to optimize performance.

Separately, NVIDIA HGX H100 systems that pack eight H100 GPUs delivered the highest throughput on every MLPerf Inference test in this round.

Grace Hopper Superchips and H100 GPUs led across all MLPerf’s data center tests, including inference for computer vision, speech recognition and medical imaging, in addition to the more demanding use cases of recommendation systems and the large language models (LLMs) used in generative AI.

Overall, the results continue NVIDIA’s record of demonstrating performance leadership in AI training and inference in every round since the launch of the MLPerf benchmarks in 2018.

The latest MLPerf round included an updated test of recommendation systems, as well as the first inference benchmark on GPT-J, an LLM with six billion parameters, a rough measure of an AI model’s size.

TensorRT-LLM Supercharges Inference

To cut through complex workloads of every size, NVIDIA developed TensorRT-LLM, generative AI software that optimizes inference. The open-source library — which was not ready in time for August submission to MLPerf — enables customers to more than double the inference performance of their already purchased H100 GPUs at no added cost.

Performance increase using TRT-LLM on H100 GPUs for AI inference

NVIDIA’s internal tests show that using TensorRT-LLM on H100 GPUs provides up to an 8x performance speedup compared to prior generation GPUs running GPT-J 6B without the software.

The software got its start in NVIDIA’s work accelerating and optimizing LLM inference with leading companies including Meta, AnyScale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now part of Databricks), OctoML, Tabnine and Together AI.

MosaicML added features that it needs on top of TensorRT-LLM and integrated them into its existing serving stack. “It’s been an absolute breeze,” said Naveen Rao, vice president of engineering at Databricks.

“TensorRT-LLM is easy-to-use, feature-packed and efficient,” Rao said. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”

TensorRT-LLM is the latest example of continuous innovation on NVIDIA’s full-stack AI platform. These ongoing software advances give users performance that grows over time at no extra cost and is versatile across diverse AI workloads.

L4 Boosts Inference on Mainstream Servers 

In the latest MLPerf benchmarks, NVIDIA L4 GPUs ran the full range of workloads and delivered great performance across the board.

For example, L4 GPUs running in compact, 72W PCIe accelerators delivered up to 6x more performance than CPUs rated for nearly 5x higher power consumption.

In addition, L4 GPUs feature dedicated media engines that, in combination with CUDA software, provide up to 120x speedups for computer vision in NVIDIA’s tests.

L4 GPUs are available from Google Cloud and many system builders, serving customers in industries from consumer internet services to drug discovery.

Performance Boosts at the Edge

Separately, NVIDIA applied a new model compression technology to demonstrate up to a 4.7x performance boost running the BERT LLM on an L4 GPU. The result was in MLPerf’s so-called “open division,” a category for showcasing new capabilities.

The technique is expected to find use across all AI workloads. It can be especially valuable when running models on edge devices constrained by size and power consumption.

In another example of leadership in edge computing, the NVIDIA Jetson Orin system-on-module showed performance increases of up to 84% compared to the prior round in object detection, a computer vision use case common in edge AI and robotics scenarios.

NVIDIA Jetson Orin performance increase on MLPerf inference

The Jetson Orin advance came from software taking advantage of the latest version of the chip’s cores, such as a programmable vision accelerator, an NVIDIA Ampere architecture GPU and a dedicated deep learning accelerator.

Versatile Performance, Broad Ecosystem

The MLPerf benchmarks are transparent and objective, so users can rely on their results to make informed buying decisions. They also cover a wide range of use cases and scenarios, so users know they can get performance that’s both dependable and flexible to deploy.

Partners submitting in this round included cloud service providers Microsoft Azure and Oracle Cloud Infrastructure and system manufacturers ASUS, Connect Tech, Dell Technologies, Fujitsu, GIGABYTE, Hewlett Packard Enterprise, Lenovo, QCT and Supermicro.

Overall, MLPerf is backed by more than 70 organizations, including Alibaba, Arm, Cisco, Google, Harvard University, Intel, Meta, Microsoft and the University of Toronto.

Read a technical blog for more details on how NVIDIA achieved the latest results.

All the software used in NVIDIA’s benchmarks is available from the MLPerf repository, so everyone can get the same world-class results. The optimizations are continuously folded into containers available on the NVIDIA NGC software hub for GPU applications.

Read More

All About Sample-Size Calculations for A/B Testing: Novel Extensions and Practical Guide

While there exists a large amount of literature on the general challenges and best practices for trustworthy online A/B testing, there are limited studies on sample size estimation, which plays a crucial role in trustworthy and efficient A/B testing that ensures the resulting inference has a sufficient power and type I error control. For example, when the sample size is under-estimated the statistical inference, even with the correct analysis methods, will not be able to detect the true significant improvement leading to misinformed and costly decisions. This paper addresses this fundamental…Apple Machine Learning Research

Intelligent Assistant Language Understanding On-device

It has recently become feasible to run personal digital assistants on phones and other personal devices. In this paper, we describe a design for a natural language understanding system that runs on-device. In comparison to a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate. We describe what led to key choices about architecture and technologies. For example, some approaches in the dialog systems literature are difficult to maintain over time in a deployment setting. We hope that sharing learnings from our practical experiences may…Apple Machine Learning Research

Accelerated CPU Inference with PyTorch Inductor using torch.compile

Accelerated CPU Inference with PyTorch Inductor using torch.compile

Story at a Glance

  • Although the PyTorch* Inductor C++/OpenMP* backend has enabled users to take advantage of modern CPU architectures and parallel processing, it has lacked optimizations, resulting in the backend performing worse than eager mode in terms of end-to-end performance.
  • Intel optimized the Inductor backend using a hybrid strategy that classified operations into two categories: Conv/GEMM and non-Conv/GEMM element-wise and reduction ops.
  • For popular deep learning models, this hybrid strategy demonstrates promising performance improvements compared to eager mode and improves the C++/OpenMP backend’s efficiency and reliability for PyTorch models.

Inductor Backend Challenges

The PyTorch Inductor C++/OpenMP backend enables users to take advantage of modern CPU architectures and parallel processing to accelerate computations.

However, during the early stages of its development, the backend lacked some optimizations, which prevented it from fully utilizing the CPU computation capabilities. As a result, for most models the C++/OpenMP backend performed worse than eager mode in terms of end-to-end performance, with 45% of TorchBench, 100% of Hugging Face, and 75% of TIMM models performing worse than eager mode.

In this post, we highlight Intel’s optimizations to the Inductor CPU backend, including the technologies and results.

We optimized the backend by using a hybrid strategy that classified operations into two categories: Conv/GEMM and non-Conv/GEMM element-wise and reduction ops. Post-op fusion and weight prepacking using the oneDNN performance library were utilized to optimize the former, while explicit vectorization in C++ codegen was used to optimize the latter.

This hybrid strategy demonstrated promising performance improvements compared to eager mode, particularly on popular deep learning models such as Inductor Hugging Face, Inductor TorchBench and Inductor TIMM. Overall, Intel’s optimizations improve the C++/OpenMP backend’s efficiency and reliability for PyTorch models.

Figure 1. Performance Speedup Ratio Trend

Figure 1: Performance Speedup Ratio Trend

Performance Status of Intel Hybrid Optimizations

Compared to eager mode with the hybrid optimizations, the C++/OpenMP backend shows promising performance improvements. We measured the performance of the three Inductor benchmark suites—TorchBench, Hugging Face, and TIMM—and the results are as follows. (Note: we publish our performance data twice per week on GitHub.)

Overall, these optimizations help to ensure that the C++/OpenMP backend provides efficient and reliable support for PyTorch models.

Passrate

+----------+------------+-------------+-------------+
| Compiler | torchbench | huggingface | timm_models |
+----------+------------+-------------+-------------+
| inductor | 93%, 56/60 | 96%, 44/46  | 100%, 61/61 |
+----------+------------+-------------+-------------+

Geometric mean speedup (Single-Socket Multi-threads)

+----------+------------+-------------+-------------+
| Compiler | torchbench | huggingface | timm_models |
+----------+------------+-------------+-------------+
| inductor |   1.39x	|	1.20x	|	1.73x	|
+----------+------------+-------------+-------------+

Individual Model Performance

Figure 2. TorchBench FP32 Performance (Single-Socket Multi-threads)

Figure 2: TorchBench FP32 Performance (Single-Socket Multi-threads)

Figure 3. Hugging Face FP32 Performance (Single-Socket Multi-thread)

Figure 3: Hugging Face FP32 Performance (Single-Socket Multi-thread)

Figure 4. TIMM FP32 Performance (Single-Socket Multi-threads)

Figure 4: TIMM FP32 Performance (Single-Socket Multi-threads)

Geometric mean speedup (Single-core Single-thread)

+----------+------------+-------------+-------------+
| Compiler | torchbench | huggingface | timm_models |
+----------+------------+-------------+-------------+
| inductor |   1.29x	|	1.15x	|	1.37x	|
+----------+------------+-------------+-------------+

Figure 5. TorchBench FP32 Performance (Single-Socket Single-thread)

Figure 5: TorchBench FP32 Performance (Single-Socket Single-thread)

Figure 6. Hugging Face FP32 Performance (Single-Socket Single Thread)

Figure 6: Hugging Face FP32 Performance (Single-Socket Single Thread)

Figure 7. TIMM FP32 Performance (Single-Socket Single-thread)

Figure 7: TIMM FP32 Performance (Single-Socket Single-thread)

Technical Deep Dive

Now, let’s take a closer look at the two primary optimizations used in the Inductor C++/OpenMP backend:

  1. weight prepacking and post-operation fusion via oneDNN library
  2. explicit vectorization in Inductor C++ codegen

Weight Prepackaging & Post-op Fusion via oneDNN

Shorthand for Intel® oneAPI Deep Neural Network Library, oneDNN library provides a range of post-op fusions (i.e., fuse convolution and matmal with its consecutive operation) that can benefit popular models. The Intel® Extension for PyTorch has implemented most of these fusions and has achieved significant performance improvements. As a result, we have upstreamed all of these fusions that have been applied in Intel’s PyTorch extension to Inductor, enabling a wider range of models to benefit from these optimizations. We have defined these fusions as operators under the mkldnn namespace. This allows the Python module to invoke these mkldnn operations directly.

Currently, the defined fused operations are as follows. You can find these defined fused operations at RegisterMkldnnOpContextClass.cpp.

  • _linear_pointwise: Fuses Linear and its post-unary element-wise operations
  • _linear_pointwise.binary: Fuses Linear and its post-binary element-wise operations
  • _convolution_pointwise: Fuses Convolution and its post-unary element-wise operations
  • _convolution_pointwise.binary: Fuses Convolution and its post-binary element-wise operations

The detailed fusion patterns are defined in the mkldnn.py file: convolution/linear + sigmoid/hardsigmoid/tanh/hardtanh/hardswish/leaky_relu/gelu/relu/relu6/siluconvolution/linear + add/add_/iadd/sub/sub_

On the Inductor side, we apply these fusions on the FX graph that has been lowered. We have defined mkldnn_fuse_fx as the entry point to apply all the fusions. The code snippet for this is as follows:

def mkldnn_fuse_fx(gm: torch.fx.GraphModule, example_inputs):
    ...
    gm = fuse_unary(gm)
    gm = fuse_binary(gm)
    ...
    if config.cpp.weight_prepack:
        gm = pack_module(gm)
    return gm

In the mkldnn_fuse_fx function, we apply fusion on the FX graph that hasn’t been lowered yet. To fuse convolution/linear and its consecutive elementwise operations, we invoke fuse_unary and fuse_binary as follows:

   gm = fuse_unary(gm)
   gm = fuse_binary(gm)

In addition to the post-op fusion, we apply weight prepacking to improve the Conv/GEMM performance further:

   gm = pack_module(gm)

Weight prepacking involves rearranging the weight tensor in a blocked layout, which:

  • can improve vectorization and cache reuse compared to plain formats like NCHW or NHWC and;
  • can help avoid weight reordering at runtime, which can reduce overhead and improve performance and;
  • increases memory usage as the tradeoff.

For these reasons, we provide config.cpp.weight_prepack flag in Inductor to provide users with more control over this optimization, allowing them to enable it based on their specific needs.

Explicit Vectorization in Inductor C++ Codegen

Vectorization is a key optimization technique that can significantly improve the performance of numerical computations. By utilizing SIMD (Single Instruction, Multiple Data) instructions, vectorization enables multiple computations to be performed simultaneously on a single processor core, which can lead to significant performance improvements.

In the Inductor C++/OpenMP backend, we use Intel® AVX2 and Intel® AVX-512 ISA (Instruction Set Architecture) options for vectorization by leveraging the aten vectorization library to facilitate the implementation. Aten vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It can be extended to support other ISAs easily by adding more VecISA sub-classes. This allows Inductor to easily support other platforms and data types in the future.

Due to differences in platforms, the C++/OpenMP backend of Inductor starts by detecting the CPU features to determine the vectorization bit width at the beginning of code generation. By default, if the machine supports both AVX-512 and AVX2, the backend will choose 512-bit vectorization.

If the hardware supports vectorization, the C++/OpenMP backend first detects if the loop body can be vectorized or not. There are primarily three scenarios that we are not able to generate kernel with vectorization:

  1. Loop body lacks vector intrinsics support, e.g., rand and atomic_add.
  2. Loop body lacks efficient vector intrinsics support, e.g., non-contiguous load/store.
  3. Data types with vectorization not yet supported but work in progress, e.g., integer, double, half, and bfloat16.

To address this issue, the C++/OpenMP backend uses CppVecKernelChecker to detect whether all operations in a particular loop body can be vectorized or not. In general, we classified the operations into two categories by identifying if they depend on the context.

For most elementwise operations such as add, sub, relu, vectorization is straightforward, and their execution does not depend on context.

However, for certain other operations, their semantics are more complex and their execution depends on context through static analysis.

For example, let’s consider the where operation that takes in mask, true_value, and false_value while the mask value is loaded from a uint8 tensor. The fx graph could be as follows:

graph():
    %ops : [#users=9] = placeholder[target=ops]
    %get_index : [#users=1] = call_module[target=get_index](args = (index0,), kwargs = {})
    %load : [#users=1] = call_method[target=load](args = (%ops, arg1_1, %get_index), kwargs = {})
    %to_dtype : [#users=1] = call_method[target=to_dtype](args = (%ops, %load, torch.bool), kwargs = {})
    ...
    %where : [#users=1] = call_method[target=where](args = (%ops, %to_dtype, %to_dtype_2, %to_dtype_3), kwargs = {})

Regarding uint8, it is a general data type and could be used for computation but is not limited to being used as Boolean for mask. Hence, we need to analyze its context statically. In particular, the CppVecKernelChecker will check whether a uint8 tensor is only used by to_dtype and to_dtype is only used by where. If yes, it could be vectorized. Otherwise, it will fall back to the scalar version. The generated code could be as follows:

Scalar Version

auto tmp0 = in_ptr0[i1 + (17*i0)];
auto tmp3 = in_ptr1[i1 + (17*i0)];
auto tmp1 = static_cast<bool>(tmp0);
auto tmp2 = static_cast<float>(-33.0);
auto tmp4 = tmp1 ? tmp2 : tmp3;
tmp5 = std::max(tmp5, tmp4);

Vectorization Version

float g_tmp_buffer_in_ptr0[16] = {0};
// Convert the flag to float for vectorization. 
flag_to_float(in_ptr0 + (16*i1) + (17*i0), g_tmp_buffer_in_ptr0, 16);
auto tmp0 = at::vec::Vectorized<float>::loadu(g_tmp_buffer_in_ptr0);
auto tmp3 = at::vec::Vectorized<float>::loadu(in_ptr1 + (16*i1) + (17*i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(-33.0));
auto tmp4 = decltype(tmp2)::blendv(tmp3, tmp2, tmp1);

In addition to context analysis, the C++/OpenMP backend also incorporates several other vectorization-related optimizations. These include:

  • Tiled kernel implementation for supporting transpose load – cpp.py
  • Data type demotion based on value range – cpp.py
  • Replacement of sleef implementation with oneDNN/oneMKL implementation for optimizing aten vectorization – #94577, #92289, #91613

In summary, we examined vectorization optimization in Inductor C++ backend for FP32 training and inference of 150 benchmark models with 90% of inference kernels and 71% of training kernels being vectorized.

In terms of inference, a total of 28,185 CPP kernels were generated, with 25,579 (90%) of them being vectorized, while the remaining 10% were scalar. As for training, 103,084 kernels were generated, with 73,909 (71%) being vectorized and 29% not vectorized.

The results indicate that the vectorization of inference kernels is quite impressive (there is still some work to be done in training kernels since we just started to work on the training). The remaining non-vectorized kernels are analyzed in different categories, highlighting the next steps to improve vectorization coverage: index-related operations, int64 support, vertical reduction, vectorization with fallback, and more.

In addition, we also optimized the C++/OpenMP backend with other optimizations like buffer-reuse and CppWrapper.

Future Work

The next step, we will continue optimizing the C++/OpenMP backend and extend it to support more data types as the next step. This includes:

  1. Improve vectorization coverage
  2. Support and optimize low precision kernel including BF16, FP16, Quantization
  3. Training optimization
  4. Loop tiling
  5. Autotune
  6. Further fusion optimization of Conv/GEMM kernels.
  7. Explore alternative codegen paths: clang/llvm/triton

Summary

Inductor C++/OpenMP backend is a flexible and efficient backend for the CPU. This blog describes the optimizations used in the C++/OpenMP backend of Inductor for inference and training of three benchmark suites – TorchBench, Hugging

Face and TIMM. The primary optimizations include weight prepacking and post-operation fusion via the oneDNN library, as well as explicit vectorization in Inductor C++ codegen using AVX2 and AVX-512 instructions.

The results show that 90% of inference kernels and 71% of training kernels are vectorized, indicating impressive vectorization for inference and room for improvement in training. In addition, we also applied other optimizations like buffer-reuse and CppWrapper. And we will continuously focus on the future work mentioned above to further improve the performance.

Acknowledgements

The results presented in this blog post are the culmination of a collaborative effort between the Intel PyTorch team and Meta. We would like to express our sincere gratitude to @jansel, @desertfire, and @Chillee for their invaluable contributions and unwavering support throughout the development process. Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here.

Configuration Details

Hardware Details

Item Value
Manufacturer Amazon EC2
Product Name c6i.16xlarge
CPU Model Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Installed Memory 128GB (1x128GB DDR4 3200 MT/s [Unknown])
OS Ubuntu 22.04.2 LTS
Kernel 5.19.0-1022-aws
Microcode 0xd000389
GCC gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
GLIBC ldd (Ubuntu GLIBC 2.35-0ubuntu3.1) 2.35
Binutils GNU ld (GNU Binutils for Ubuntu) 2.38
Python Python 3.10.6
OpenSSL OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022)

Software Details

SW Nightly commit Main commit
Pytorch a977a12 0b1b063
Torchbench / a0848e19
torchaudio 0a652f5 d5b2996
torchtext c4ad5dd 79100a6
torchvision f2009ab b78d98b
torchdata 5cb3e6d f2bfd3d
dynamo_benchmarks fea73cb /

Configuration

  • Intel OpenMP
  • Jemalloc – oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1
  • Single-Socket Multi-threads: #of Instances: 1; Cores/Instance: 32
  • Single-Core Single-thread: #of Instances: 1; Cores/Instance: 1

Read More

Differentially private median and more

Differentially private median and more

Differential privacy (DP) is a rigorous mathematical definition of privacy. DP algorithms are randomized to protect user data by ensuring that the probability of any particular output is nearly unchanged when a data point is added or removed. Therefore, the output of a DP algorithm does not disclose the presence of any one data point. There has been significant progress in both foundational research and adoption of differential privacy with contributions such as the Privacy Sandbox and Google Open Source Library.

ML and data analytics algorithms can often be described as performing multiple basic computation steps on the same dataset. When each such step is differentially private, so is the output, but with multiple steps the overall privacy guarantee deteriorates, a phenomenon known as the cost of composition. Composition theorems bound the increase in privacy loss with the number k of computations: In the general case, the privacy loss increases with the square root of k. This means that we need much stricter privacy guarantees for each step in order to meet our overall privacy guarantee goal. But in that case, we lose utility. One way to improve the privacy vs. utility trade-off is to identify when the use cases admit a tighter privacy analysis than what follows from composition theorems.

Good candidates for such improvement are when each step is applied to a disjoint part (slice) of the dataset. When the slices are selected in a data-independent way, each point affects only one of the k outputs and the privacy guarantees do not deteriorate with k. However, there are applications in which we need to select the slices adaptively (that is, in a way that depends on the output of prior steps). In these cases, a change of a single data point may cascade — changing multiple slices and thus increasing composition cost.

In “Õptimal Differentially Private Learning of Thresholds and Quasi-Concave Optimization”, presented at STOC 2023, we describe a new paradigm that allows for slices to be selected adaptively and yet avoids composition cost. We show that DP algorithms for multiple fundamental aggregation and learning tasks can be expressed in this Reorder-Slice-Compute (RSC) paradigm, gaining significant improvements in utility.

The Reorder-Slice-Compute (RSC) paradigm

An algorithm A falls in the RSC paradigm if it can be expressed in the following general form (see visualization below). The input is a sensitive set D of data points. The algorithm then performs a sequence of k steps as follows:

  1. Select an ordering over data points, a slice size m, and a DP algorithm M. The selection may depend on the output of A in prior steps (and hence is adaptive).
  2. Slice out the (approximately) top m data points according to the order from the dataset D, apply M to the slice, and output the result.
A visualization of three Reorder-Slice-Compute (RSC) steps.

If we analyze the overall privacy loss of an RSC algorithm using DP composition theorems, the privacy guarantee suffers from the expected composition cost, i.e., it deteriorates with the square root of the number of steps k. To eliminate this composition cost, we provide a novel analysis that removes the dependence on k altogether: the overall privacy guarantee is close to that of a single step! The idea behind our tighter analysis is a novel technique that limits the potential cascade of affected steps when a single data point is modified (details in the paper).

Tighter privacy analysis means better utility. The effectiveness of DP algorithms is often stated in terms of the smallest input size (number of data points) that suffices in order to release a correct result that meets the privacy requirements. We describe several problems with algorithms that can be expressed in the RSC paradigm and for which our tighter analysis improved utility.

Private interval point

We start with the following basic aggregation task. The input is a dataset D of n points from an ordered domain X (think of the domain as the natural numbers between 1 and |X|). The goal is to return a point y in X that is in the interval of D, that is between the minimum and the maximum points in D.

The solution to the interval point problem is trivial without the privacy requirement: simply return any point in the dataset D. But this solution is not privacy-preserving as it discloses the presence of a particular datapoint in the input. We can also see that if there is only one point in the dataset, a privacy-preserving solution is not possible, as it must return that point. We can therefore ask the following fundamental question: What is the smallest input size N for which we can solve the private interval point problem?

It is known that N must increase with the domain size |X| and that this dependence is at least the iterated log function log* |X| [1, 2]. On the other hand, the best prior DP algorithm required the input size to be at least (log* |X|)1.5. To close this gap, we designed an RSC algorithm that requires only an order of log* |X| points.

The iterated log function is extremely slow growing: It is the number of times we need to take a logarithm of a value before we reach a value that is equal to or smaller than 1. How did this function naturally come out in the analysis? Each step of the RSC algorithm remapped the domain to a logarithm of its prior size. Therefore there were log* |X| steps in total. The tighter RSC analysis eliminated a square root of the number of steps from the required input size.

Even though the interval point task seems very basic, it captures the essence of the difficulty of private solutions for common aggregation tasks. We next describe two of these tasks and express the required input size to these tasks in terms of N.

Private approximate median

One of these common aggregation tasks is approximate median: The input is a dataset D of n points from an ordered domain X. The goal is to return a point y that is between the ⅓ and ⅔ quantiles of D. That is, at least a third of the points in D are smaller or equal to y and at least a third of the points are larger or equal to y. Note that returning an exact median is not possible with differential privacy, since it discloses the presence of a datapoint. Hence we consider the relaxed requirement of an approximate median (shown below).

We can compute an approximate median by finding an interval point: We slice out the N smallest points and the N largest points and then compute an interval point of the remaining points. The latter must be an approximate median. This works when the dataset size is at least 3N.

An example of a data D over domain X, the set of interval points, and the set of approximate medians.

Private learning of axis-aligned rectangles

For the next task, the input is a set of n labeled data points, where each point x = (x1,….,xd) is a d-dimensional vector over a domain X. Displayed below, the goal is to learn values ai , bi for the axes i=1,…,d that define a d-dimensional rectangle, so that for each example x

  • If x is positively labeled (shown as red plus signs below) then it lies within the rectangle, that is, for all axes i, xi is in the interval [ai ,bi], and
  • If x is negatively labeled (shown as blue minus signs below) then it lies outside the rectangle, that is, for at least one axis i, xi is outside the interval [ai ,bi].
A set of 2-dimensional labeled points and a respective rectangle.

Any DP solution for this problem must be approximate in that the learned rectangle must be allowed to mislabel some data points, with some positively labeled points outside the rectangle or negatively labeled points inside it. This is because an exact solution could be very sensitive to the presence of a particular data point and would not be private. The goal is a DP solution that keeps this necessary number of mislabeled points small.

We first consider the one-dimensional case (d = 1). We are looking for an interval [a,b] that covers all positive points and none of the negative points. We show that we can do this with at most 2N mislabeled points. We focus on the positively labeled points. In the first RSC step we slice out the N smallest points and compute a private interval point as a. We then slice out the N largest points and compute a private interval point as b. The solution [a,b] correctly labels all negatively labeled points and mislabels at most 2N of the positively labeled points. Thus, at most ~2N points are mislabeled in total.

Illustration for d = 1, we slice out N left positive points and compute an interval point a, slice out N right positive points and compute an interval point b.

With d > 1, we iterate over the axes i = 1,….,d and apply the above for the ith coordinates of input points to obtain the values ai , bi. In each iteration, we perform two RSC steps and slice out 2N positively labeled points. In total, we slice out 2dN points and all remaining points were correctly labeled. That is, all negatively-labeled points are outside the final d-dimensional rectangle and all positively-labeled points, except perhaps ~2dN, lie inside the rectangle. Note that this algorithm uses the full flexibility of RSC in that the points are ordered differently by each axis. Since we perform d steps, the RSC analysis shaves off a factor of square root of d from the number of mislabeled points.

Training ML models with adaptive selection of training examples

The training efficiency or performance of ML models can sometimes be improved by selecting training examples in a way that depends on the current state of the model, e.g., self-paced curriculum learning or active learning.

The most common method for private training of ML models is DP-SGD, where noise is added to the gradient update from each minibatch of training examples. Privacy analysis with DP-SGD typically assumes that training examples are randomly partitioned into minibatches. But if we impose a data-dependent selection order on training examples, and further modify the selection criteria k times during training, then analysis through DP composition results in deterioration of the privacy guarantees of a magnitude equal to the square root of k.

Fortunately, example selection with DP-SGD can be naturally expressed in the RSC paradigm: each selection criteria reorders the training examples and each minibatch is a slice (for which we compute a noisy gradient). With RSC analysis, there is no privacy deterioration with k, which brings DP-SGD training with example selection into the practical domain.

Conclusion

The RSC paradigm was introduced in order to tackle an open problem that is primarily of theoretical significance, but turns out to be a versatile tool with the potential to enhance data efficiency in production environments.

Acknowledgments

The work described here was done jointly with Xin Lyu, Jelani Nelson, and Tamas Sarlos.

Read More

NVIDIA Partners with India Giants to Advance AI in World’s Most Populous Nation

NVIDIA Partners with India Giants to Advance AI in World’s Most Populous Nation

The world’s largest democracy is poised to transform itself and the world, embracing AI on an enormous scale.

Speaking with the press Friday in Bengaluru, in the context of announcements from two of India’s largest conglomerates, Reliance Industries Limited and Tata Group, NVIDIA founder and CEO Jensen Huang detailed plans to bring AI technology and skills to address the world’s most populous nation’s greatest challenges.

“I think this is going to be one of the largest AI markets in the world,” said Huang, who was wrapping up a week of high-level meetings across the nation, including with Prime Minister Narendra Modi, leading AI researchers, top business leaders, the press and the country’s 4,000-some NVIDIA employees.

The companies will work together to create an AI computing infrastructure and platforms for developing AI solutions. It will be based on NVIDIA technology like the NVIDIA GH200 Grace Hopper Superchip and NVIDIA DGX Cloud.

GH200 marks a fundamental shift in computing architecture that provides exceptional performance and massive memory bandwidth, while DGX Cloud, an AI supercomputing service in the cloud, makes it easier for enterprises to train their employees in AI technology, access the technology internally and provide generative AI services to customers.

In his exchange with more than a dozen of India’s top tech journalists following the announcement, Huang said computer science expertise is a core competency for India, and that with access to technology and capital India is poised to build AI to be able to solve challenges at home and abroad.

“You have the data, you have the talent,” Huang said. “We are open for business and bring great expertise on building supercomputers.

During the freewheeling back and forth with the media, Hiuang emphasized India’s strength in information technology and the potential for AI to accelerate the development of India’s IT industry.

“IT is one of your natural resources. You produce it at an incredible scale. You’re incredibly good at it. You export it all over the world,” Huang said.

India’s “AI Moment”

Earlier, after meeting with many of the region’s top technology leaders — including startup pioneers, AI proponents, and key players in India’s digital public infrastructure — Huang hailed “India’s moment,” saying the nationis on the cusp of becoming a global AI powerhouse.

NVIDIA CEO Jensen Huang with Nandan Nilekani, founder of Infosys and founding chairman of UIDAI during a meeting with key Indian tech leaders.

While India has well-known technical capabilities — distinguished technical universities, 2,500 engineering colleges and an estimated 1.5 million engineers — many of its 1.4 billion people, located across sprawling metropolitan areas and some 650,000 villages, collectively speaking dozens of languages, have yet to fully benefit from this progress.

Applied in the Indian context, AI can help rural farmers interact via cell phones in their local language to get weather information and crop prices. It can help provide, at a massive scale, expert diagnosis of medical symptoms and imaging scans where doctors may not be immediately available. It can better predict cyclonic storms using decades of atmospheric data, enabling those at risk to more quickly evacuate and find shelter.

Reliance Industries and Tata Communications will build and operate state-of-the-art AI supercomputing data centers based on such technology, utilizing it for internal AI development and infrastructure-as-a-service for India’s AI researchers, companies and burgeoning AI startup ecosystem.

That effort, Huang said, during his conversation with the Indian technology press, promises to be part of a process that will turn India into a beacon for AI technology.

“AI could be built in India, used in India, and exported from India,” Huang said.

Read More

Implement smart document search index with Amazon Textract and Amazon OpenSearch

Implement smart document search index with Amazon Textract and Amazon OpenSearch

For modern companies that deal with enormous volumes of documents such as contracts, invoices, resumes, and reports, efficiently processing and retrieving pertinent data is critical to maintaining a competitive edge. However, traditional methods of storing and searching for documents can be time-consuming and often result in a large effort to find a specific document, especially when they include handwriting. What if there was a way to process documents intelligently and make them searchable in with high accuracy?

This is made possible with Amazon Textract, AWS’s Intelligent Document Processing service, coupled with the fast search capabilities of OpenSearch. In this post, we’ll take you on a journey to rapidly build and deploy a document search indexing solution that helps your organization to better harness and extract insights from documents.

Whether you’re in Human Resources looking for specific clauses in employee contracts, or a financial analyst sifting through a mountain of invoices to extract payment data, this solution is tailored to empower you to access the information you need with unprecedented speed and accuracy.

With the proposed solution, your documents are automatically ingested, their content parsed and subsequently indexed into a highly responsive and scalable OpenSearch index.

We’ll cover how technologies such as Amazon Textract, AWS Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Service can be integrated into a workflow that seamlessly processes documents. Then we dive into indexing this data into OpenSearch and demonstrate the search capabilities that become available at your fingertips.

Whether your organization is taking the first steps into the digital transformation era or is an established giant seeking to turbocharge information retrieval, this guide is your compass to navigating the opportunities that AWS Intelligent Document Processing and OpenSearch offer.

The implementation used in this post utilizes the Amazon Textract IDP CDK constructs – AWS Cloud Development Kit (CDK) components to define infrastructure for Intelligent Document Processing (IDP) workflows – which allow you to build use case specific customizable IDP workflows. The IDP CDK constructs and samples are a collection of components to enable definition of IDP processes on AWS and published to GitHub. The main concepts used are the AWS Cloud Development Kit (CDK) constructs, the actual CDK stacks and AWS Step Functions. The workshop Use machine learning to automate and process documents at scale is a good starting point to learn more about customizing workflows and using the other sample workflows as a base for your own.

Solution overview

In this solution, we focus on indexing documents into an OpenSearch index for quick search-and-retrieval of information and documents. Documents in PDF, TIFF, JPEG or PNG format are put in an Amazon Simple Storage Service (Amazon S3) bucket and subsequently indexed into OpenSearch using this Step Functions workflow.

Step Function workflow

Figure 1: The Step Functions OpenSearch workflow

The OpenSearchWorkflow-Decider looks at the document and verifies that the document is one of the supported mime types (PDF, TIFF, PNG or JPEG). It consists of one AWS Lambda function.

The DocumentSplitter generates maximum of 2500-pages chunk from documents. This means even though Amazon Textract supports documents of up to 3000 pages, you can pass in documents with many more pages and the process still works fine and puts the pages into OpenSearch and creates correct page numbers. The DocumentSplitter is implemented as an AWS Lambda function.

The Map State processes each chunk in parallel.

The TextractAsync task calls Amazon Textract using the asynchronous Application Programming Interface (API) following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and OutputConfig to store the Amazon Textract JSON output to a customer Amazon S3 bucket. It consists of two Amazon Lambda functions: one to submit the document for processing and one getting triggered on the Amazon SNS notification.

Because the TextractAsync task can produce multiple paginated output files, the TextractAsyncToJSON2 process combines them into one JSON file.

The Step Functions context is enriched with information that should also be searchable in the OpenSearch index in the SetMetaData step. The sample implementation adds ORIGIN_FILE_NAME, START_PAGE_NUMBER, and ORIGIN_FILE_URI. You can add any information to enrich the search experience, like information from other backend systems, specific IDs or classification information.

The GenerateOpenSearchBatch takes the generated Amazon Textract output JSON, combines it with the information from the context set by SetMetaData and prepares a file that is optimized for batch import into OpenSearch.

In the OpenSearchPushInvoke, this batch import file is sent into the OpenSearch index and available for search. This AWS Lambda function is connected with the aws-lambda-opensearch construct from the AWS Solutions library using the m6g.large.search instances, OpenSearch version 2.7, and configured the Amazon Elastic Block Service (Amazon EBS) volume size to General Purpose 2 (GP2) with 200 GB. You can change the OpenSearch configuration according to your requirements.

The final TaskOpenSearchMapping step clears the context, which otherwise could exceed the Step Functions Quota of Maximum input or output size for a task, state, or execution.

Prerequisites

To deploy the samples, you need an AWS account , the AWS Cloud Development Kit (AWS CDK), a current Python version and Docker are required. You need permissions to deploy AWS CloudFormation templates, push to the Amazon Elastic Container Registry (Amazon ECR), create Amazon Identity and Access Management (AWS IAM) roles, Amazon Lambda functions, Amazon S3 buckets, Amazon Step Functions, Amazon OpenSearch cluster, and an Amazon Cognito user pool. Make sure your AWS CLI environment is setup with the according permissions.

You can also spin up a AWS Cloud9 instance with AWS CDK, Python and Docker pre-installed to initiate the deployment.

Walkthrough

Deployment

  1. After you set up the prerequisites, you need to first clone the repository:
git clone https://github.com/aws-solutions-library-samples/guidance-for-low-code-intelligent-document-processing-on-aws.git
  1. Then cd into the repository folder and install the dependencies:
cd guidance-for-low-code-intelligent-document-processing-on-aws/

pip install -r requirements.txt
  1. Deploy the OpenSearchWorkflow stack:
cdk deploy OpenSearchWorkflow

The deployment takes around 25 minutes with the default configuration settings from the GitHub samples, and creates a Step Functions workflow, which is invoked when a document is put at an Amazon S3 bucket/prefix and subsequently is processed till the content of the document is indexed in an OpenSearch cluster.

The following is a sample output including useful links and information generated fromcdk deploy OpenSearchWorkflowcommand:

OpenSearchWorkflow.CognitoUserPoolLink = https://us-east-1.console.aws.amazon.com/cognito/v2/idp/user-pools/us-east-1_1234abcdef/users?region=us-east-1
OpenSearchWorkflow.DocumentQueueLink = https://us-east-1.console.aws.amazon.com/sqs/v2/home?region=us-east-1#/queues/https%3A%2F%2Fsqs.us-east-1.amazonaws.com%2F123412341234%2FOpenSearchWorkflow-ExecutionThrottleDocumentQueueABC1234-ABCDEFG1234.fifo
OpenSearchWorkflow.DocumentUploadLocation = s3://opensearchworkflow-opensearchworkflowbucketabcdef1234/uploads/
OpenSearchWorkflow.OpenSearchDashboard = https://search-idp-cdk-opensearch-abcdef1234.us-east-1.es.amazonaws.com/states/_dashboards
OpenSearchWorkflow.OpenSearchLink = https://us-east-1.console.aws.amazon.com/aos/home?region=us-east-1#/opensearch/domains/idp-cdk-opensearch
OpenSearchWorkflow.StepFunctionFlowLink = https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/statemachines/view/arn:aws:states:us-east-1:123412341234:stateMachine:OpenSearchWorkflow12341234

This information is also available in the AWS CloudFormation Console.

When a new document is placed under the OpenSearchWorkflow.DocumentUploadLocation, a new Step Functions workflow is started for this document.

To check the status of this document, the OpenSearchWorkflow.StepFunctionFlowLink provides a link to the list of StepFunction executions in the AWS Management Console, displaying the status of the document processing for each document uploaded to Amazon S3. The tutorial Viewing and debugging executions on the Step Functions console provides an overview of the components and views in the AWS Console.

Testing

  1. First test using a sample file.
aws s3 cp s3://amazon-textract-public-content/idp-cdk-samples/moby-dick-hidden-paystub-and-w2.pdf $(aws cloudformation list-exports --query 'Exports[?Name==`OpenSearchWorkflow-DocumentUploadLocation`].Value' --output text)
  1. After selecting the link to the StepFunction workflow or open the AWS Management Console and going to the Step Functions service page, you can look at the different workflow invocations.
Step Function executions list

Figure 2: The Step Functions executions list

  1. Take a look at the currently running sample document execution, where you can follow the execution of the individual workflow tasks.
One document Step Functions workflow execution

Figure 3: One document Step Functions workflow execution

Search

Once the process finished, we can validate that the document is indexed in the OpenSearch index.

  1. To do so, first we create an Amazon Cognito user. Amazon Cognito is used for Authentication of users against the OpenSearch index. Select the link in the output from the cdk deploy (or look at the AWS CloudFormation output in the AWS Management Console) named OpenSearchWorkflow.CognitoUserPoolLink.
Figure 4: The Cognito user pool

Figure 4: The Cognito user pool

  1. Next, select the Create user button, which directs you to a page to enter a username and a password for accessing the OpenSearch Dashboard.
Figure 5: The Cognito create user dialog

Figure 5: The Cognito create user dialog

  1. After choosing Create user, you can continue to the OpenSearch Dashboard by clicking on the OpenSearchWorkflow.OpenSearchDashboard from the CDK deployment output. Login using the previously created username and password. The first time you login, you have to change the password.
  2. Once being logged in to the OpenSearch Dashboard, select the Stack Management section, followed by Index Patterns to create a search index.
Figure 6: OpenSearch Dashboards Stack Management

Figure 6: OpenSearch Dashboards Stack Management

Figure 7: OpenSearch Index Patterns overview

Figure 7: OpenSearch Index Patterns overview

  1. The default name for the index is papers-index and an index pattern name of papers-index* will match that.
Figure 8: Define the OpenSearch index pattern

Figure 8: Define the OpenSearch index pattern

  1. After clicking Next step, select timestamp as the Time field and Create index pattern.
Figure 9: OpenSearch index pattern time field

Figure 9: OpenSearch index pattern time field

  1. Now, from the menu, select Discover.
Figure 10: OpenSearch Discover

Figure 10: OpenSearch Discover

In most cases ,you need to change the time-span according to your last ingest. The default is 15 minutes and often there was no activity in the last 15 minutes. In this example, it changed to 15 days to visualize the ingest.

Figure 11: OpenSearch timespan change

Figure 11: OpenSearch timespan change

  1. Now you can start to search. A novel was indexed, you can search for any terms like call me Ishmael and see the results.
Figure 12: OpenSearch search term

Figure 12: OpenSearch search term

In this case, the term call me Ishmael appears on page 6 of the document at the given Uniform Resource Identifier (URI), which points to the Amazon S3 location of the file. This makes it faster to identify documents and find information across a large corpus of PDF, TIFF or image documents, compared to manually skipping through them.

Running at scale

In order to estimate scale and duration of an indexing process, the implementation was tested with 93,997 documents and a total sum of 1,583,197 pages (average 16.84 pages/document and the largest file having 3755 pages), which all got indexed into OpenSearch. Processing all files and indexing them into OpenSearch took 5.5 hours in the US East (N. Virginia – us-east-1) region using default Amazon Textract Service Quotas. The graph below shows an initial test at 18:00 followed by the main ingest at 21:00 and all done by 2:30.

Figure 13: OpenSearch indexing overview

Figure 13: OpenSearch indexing overview

For the processing, the tcdk.SFExecutionsStartThrottle was set to an executions_concurrency_threshold=550, which means that concurrent document processing workflows are capped at 550 and excess requests are queued to an Amazon SQS Fist-In-First-Out (FIFO) queue, which is subsequently drained when current workflows finish. The threshold of 550 is based on the Textract Service quota of 600 in the us-east-1 region. Therefore, the queue depth and age of oldest message are metrics worth monitoring.

Figure 14: Amazon SQS monitoring

Figure 14: Amazon SQS monitoring

In this test, all documents were uploaded to Amazon S3 at once, therefore the Approximate Number of Messages Visible has a steep increase and then a slow decline as no new documents are ingested. The Approximate Age Of Oldest Message increases until all messages are processed. The Amazon SQS MessageRetentionPeriod is set to 14 days. For very long running backlog processing that could exceed 14 days processing, start with processing a smaller subset of representative documents and monitor the duration of execution to estimate how many documents you can pass in before exceeding 14 days. The Amazon SQS CloudWatch metrics look similar for a use case of processing a large backlog of documents, which is ingested at once then processed fully. If your use case is a steady flow of documents, both metrics, the Approximate Number of Messages Visible and the Approximate Age Of Oldest Message will be more linear. You can also use the threshold parameter to mix a steady load with backlog processing and allocate capacity according to your processing needs.

Another metrics to monitor is the health of the OpenSearch cluster, which you should setup according to the Opernational best practices for Amazon OpenSearch Service. The default deployment uses m6g.large.search instances.

Figure 15: OpenSearch monitoring

Figure 15: OpenSearch monitoring

Here is a snapshot of the Key Performance Indicators (KPI) for the OpenSearch cluster. No errors, constant indexing data rate and latency.

The Step Functions workflow executions show the state of processing for each individual document. If you see executions in Failed state, then select the details. A good metric to monitor is the AWS CloudWatch Automatic Dashboard for Step Functions, which exposes some of the Step Functions CloudWatch metrics.

Figure 16: Step Functions monitoring executions succeeded

Figure 16: Step Functions monitoring executions succeeded

In this AWS CloudWatch Dashboard graph, you see the successful Step Functions executions over time.

Figure 17: OpenSearch monitoring executions failed

Figure 17: OpenSearch monitoring executions failed

And this one shows the failed executions. These are worth investigating through the AWS Console Step Functions overview.

The following screenshot shows one example of a failed execution due to the origin file being of 0 size, which makes sense because the file has no content and could not be processed. It is important to filter failed processes and visualizes failures, in order for you to go back to the source document and validate the root cause.

Figure 18: Step Functions failed workflow

Figure 18: Step Functions failed workflow

Other failures might include documents that are not of mime type: application/pdf, image/png, image/jpeg, or image/tiff because other document types are not supported by Amazon Textract.

Cost

The total cost of ingesting 1,583,278 pages was split across AWS services used for the implementation. The following list serves as approximate numbers, because your actual cost and processing duration vary depending on the size of documents, the number of pages per document, the density of information in the documents, and the AWS Region. Amazon DynamoDB was consuming $0.55, Amazon S3 $3.33, OpenSearch Service $14.71, Step Functions $17.92, AWS Lambda $28.95, and Amazon Textract $1,849.97. Also, keep in mind that the deployed Amazon OpenSearch Service cluster is billed by the hour and will accumulate higher cost when run over a period of time.

Modifications

Most likely, you want to modify the implementation and customize for your use case and documents. The workshop Use machine learning to automate and process documents at scale presents a good overview on how to manipulate the actual workflows, changing the flow, and adding new components. To add custom fields to the OpenSearch index, look at the SetMetaData task in the workflow using the set-manifest-meta-data-opensearch AWS Lambda function to add meta-data to the context, which will be added as a field to the OpenSearch index. Any meta-data information will become part of the index.

Cleaning up

Delete the example resources if you no longer need them, to avoid incurring future costs using the followind command:

cdk destroy OpenSearchWorkflow

in the same environment as the cdk deploy command. Beware that this removes everything, including the OpenSearch cluster and all documents and the Amazon S3 bucket. If you want to maintain that information, backup your Amazon S3 bucket and create an index snapshot from your OpenSearch cluster. If you processed many files, then you may have to empty the Amazon S3 bucket first using the AWS Management Console (i.e., after you took a backup or synced them to a different bucket if you want to retain the information), because the cleanup function can time out and then destroy the AWS CloudFormation stack.

Conclusion

In this post, we showed you how to deploy a full stack solution to ingest a large number of documents into an OpenSearch index, which are ready to be used for search use cases. The individual components of the implementation were discussed as well as scaling considerations, cost, and modification options. All code is accessible as OpenSource on GitHub as IDP CDK samples and as IDP CDK constructs to build your own solutions from scratch. As a next step you can start to modify the workflow, add information to the documents in the search index and explore the IDP workshop. Please comment below on your experience and ideas to expand the current solution.


About the Author

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions. He joined AWS in 2014, first guiding some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with a focus on computer vision. Currently, he’s obsessed with extracting information from documents.

Read More

Semantic image search for articles using Amazon Rekognition, Amazon SageMaker foundation models, and Amazon OpenSearch Service

Semantic image search for articles using Amazon Rekognition, Amazon SageMaker foundation models, and Amazon OpenSearch Service

Digital publishers are continuously looking for ways to streamline and automate their media workflows in order to generate and publish new content as rapidly as they can.

Publishers can have repositories containing millions of images and in order to save money, they need to be able to reuse these images across articles. Finding the image that best matches an article in repositories of this scale can be a time-consuming, repetitive, manual task that can be automated. It also relies on the images in the repository being tagged correctly, which can also be automated (for a customer success story, refer to Aller Media Finds Success with KeyCore and AWS).

In this post, we demonstrate how to use Amazon Rekognition, Amazon SageMaker JumpStart, and Amazon OpenSearch Service to solve this business problem. Amazon Rekognition makes it easy to add image analysis capability to your applications without any machine learning (ML) expertise and comes with various APIs to fulfil use cases such as object detection, content moderation, face detection and analysis, and text and celebrity recognition, which we use in this example. SageMaker JumpStart is a low-code service that comes with pre-built solutions, example notebooks, and many state-of-the-art, pre-trained models from publicly available sources that are straightforward to deploy with a single click into your AWS account. These models have been packaged to be securely and easily deployable via Amazon SageMaker APIs. The new SageMaker JumpStart Foundation Hub allows you to easily deploy large language models (LLM) and integrate them with your applications. OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch. OpenSearch Service allows you to store vectors and other data types in an index, and offers rich functionality that allows you to search for documents using vectors and measuring the semantical relatedness, which we use in this post.

The end goal of this post is to show how we can surface a set of images that are semantically similar to some text, be that an article or tv synopsis.

The following screenshot shows an example of taking a mini article as your search input, rather than using keywords, and being able to surface semantically similar images.

Overview of solution

The solution is divided into two main sections. First, you extract label and celebrity metadata from the images, using Amazon Rekognition. You then generate an embedding of the metadata using a LLM. You store the celebrity names, and the embedding of the metadata in OpenSearch Service. In the second main section, you have an API to query your OpenSearch Service index for images using OpenSearch’s intelligent search capabilities to find images that are semantically similar to your text.

This solution uses our event-driven services Amazon EventBridge, AWS Step Functions, and AWS Lambda to orchestrate the process of extracting metadata from the images using Amazon Rekognition. Amazon Rekognition will perform two API calls to extract labels and known celebrities from the image.

Amazon Rekognition celebrity detection API, returns a number of elements in the response. For this post, you use the following:

  • Name, Id, and Urls – The celebrity name, a unique Amazon Rekognition ID, and list of URLs such as the celebrity’s IMDb or Wikipedia link for further information.
  • MatchConfidence – A match confidence score that can be used to control API behavior. We recommend applying a suitable threshold to this score in your application to choose your preferred operating point. For example, by setting a threshold of 99%, you can eliminate more false positives but may miss some potential matches.

In your second API call, Amazon Rekognition label detection API, returns a number of elements in the response. You use the following:

  • Name – The name of the detected label
  • Confidence – The level of confidence in the label assigned to a detected object

A key concept in semantic search is embeddings. A word embedding is a numerical representation of a word or group of words, in the form of a vector. When you have many vectors, you can measure the distance between them, and vectors which are close in distance are semantically similar. Therefore, if you generate an embedding of all of your images’ metadata, and then generate an embedding of your text, be that an article or tv synopsis for example, using the same model, you can then find images which are semantically similar to your given text.

There are many models available within SageMaker JumpStart to generate embeddings. For this solution, you use GPT-J 6B Embedding from Hugging Face. It produces high-quality embeddings and has one of the top performance metrics according to Hugging Face’s evaluation results. Amazon Bedrock is another option, still in preview, where you could choose Amazon Titan Text Embeddings model to generate the embeddings.

You use the GPT-J pre-trained model from SageMaker JumpStart to create an embedding of the image metadata and store this as a k-NN vector in your OpenSearch Service index, along with the celebrity name in another field.

The second part of the solution is to return the top 10 images to the user that are semantically similar to their text, be this an article or tv synopsis, including any celebrities if present. When choosing an image to accompany an article, you want the image to resonate with the pertinent points from the article. SageMaker JumpStart hosts many summarization models which can take a long body of text and reduce it to the main points from the original. For the summarization model, you use the AI21 Labs Summarize model. This model provides high-quality recaps of news articles and the source text can contain roughly 10,000 words, which allows the user to summarize the entire article in one go.

To detect if the text contains any names, potentially known celebrities, you use Amazon Comprehend which can extract key entities from a text string. You then filter by the Person entity, which you use as an input search parameter.

Then you take the summarized article and generate an embedding to use as another input search parameter. It’s important to note that you use the same model deployed on the same infrastructure to generate the embedding of the article as you did for the images. You then use Exact k-NN with scoring script so that you can search by two fields: celebrity names and the vector that captured the semantic information of the article. Refer to this post, Amazon OpenSearch Service’s vector database capabilities explained, on the scalability of Score script and how this approach on large indexes may lead to high latencies.

Walkthrough

The following diagram illustrates the solution architecture.

Following the numbered labels:

  1. You upload an image to an Amazon S3 bucket
  2. Amazon EventBridge listens to this event, and then triggers an AWS Step function execution
  3. The Step Function takes the image input, extracts the label and celebrity metadata
  4. The AWS Lambda function takes the image metadata and generates an embedding
  5. The Lambda function then inserts the celebrity name (if present) and the embedding as a k-NN vector into an OpenSearch Service index
  6. Amazon S3 hosts a simple static website, served by an Amazon CloudFront distribution. The front-end user interface (UI) allows you to authenticate with the application using Amazon Cognito to search for images
  7. You submit an article or some text via the UI
  8. Another Lambda function calls Amazon Comprehend to detect any names in the text
  9. The function then summarizes the text to get the pertinent points from the article
  10. The function generates an embedding of the summarized article
  11. The function then searches OpenSearch Service image index for any image matching the celebrity name and the k-nearest neighbors for the vector using cosine similarity
  12. Amazon CloudWatch and AWS X-Ray give you observability into the end to end workflow to alert you of any issues.

Extract and store key image metadata

The Amazon Rekognition DetectLabels and RecognizeCelebrities APIs give you the metadata from your images—text labels you can use to form a sentence to generate an embedding from. The article gives you a text input that you can use to generate an embedding.

Generate and store word embeddings

The following figure demonstrates plotting the vectors of our images in a 2-dimensional space, where for visual aid, we have classified the embeddings by their primary category.

You also generate an embedding of this newly written article, so that you can search OpenSearch Service for the nearest images to the article in this vector space. Using the k-nearest neighbors (k-NN) algorithm, you define how many images to return in your results.

Zoomed in to the preceding figure, the vectors are ranked based on their distance from the article and then return the K-nearest images, where K is 10 in this example.

OpenSearch Service offers the capability to store large vectors in an index, and also offers the functionality to run queries against the index using k-NN, such that you can query with a vector to return the k-nearest documents that have vectors in close distance using various measurements. For this example, we use cosine similarity.

Detect names in the article

You use Amazon Comprehend, an AI natural language processing (NLP) service, to extract key entities from the article. In this example, you use Amazon Comprehend to extract entities and filter by the entity Person, which returns any names that Amazon Comprehend can find in the journalist story, with just a few lines of code:

def get_celebrities(payload):
    response = comprehend_client.detect_entities(
        Text=' '.join(payload["text_inputs"]),
        LanguageCode="en",
    )
    celebrities = ""
    for entity in response["Entities"]:
        if entity["Type"] == "PERSON":
            celebrities += entity["Text"] + " "
    return celebrities

In this example, you upload an image to Amazon Simple Storage Service (Amazon S3), which triggers a workflow where you are extracting metadata from the image including labels and any celebrities. You then transform that extracted metadata into an embedding and store all of this data in OpenSearch Service.

Summarize the article and generate an embedding

Summarizing the article is an important step to make sure that the word embedding is capturing the pertinent points of the article, and therefore returning images that resonate with the theme of the article.

AI21 Labs Summarize model is very simple to use without any prompt and just a few lines of code:

def summarise_article(payload):
    sagemaker_endpoint_summarise = os.environ["SAGEMAKER_ENDPOINT_SUMMARIZE"]
    response = ai21.Summarize.execute(
        source=payload,
        sourceType="TEXT",
        destination=ai21.SageMakerDestination(sagemaker_endpoint_summarise)
    )
    response_summary = response.summary 
    return response_summary

You then use the GPT-J model to generate the embedding

def get_vector(payload_summary):
    sagemaker_endpoint = os.environ["SAGEMAKER_ENDPOINT_VECTOR"]
    response = sm_runtime_client.invoke_endpoint(
        EndpointName=sagemaker_endpoint,
        ContentType="application/json",
        Body=json.dumps(payload_summary).encode("utf-8"),
    )
    response_body = json.loads((response["Body"].read()))
    return response_body["embedding"][0]

You then search OpenSearch Service for your images

The following is an example snippet of that query:

def search_document_celeb_context(person_names, vector):
    results = wr.opensearch.search(
        client=os_client,
        index="images",
        search_body={
            "size": 10,
            "query": {
                "script_score": {
                    "query": {
                        "match": {"celebrities": person_names }
                    },
                    "script": {
                        "lang": "knn",
                        "source": "knn_score",
                        "params": {
                            "field": "image_vector",
                            "query_value": vector,
                            "space_type": "cosinesimil"
                        }
                    }
                }
            }
        },
    )
    return results.drop(columns=["image_vector"]).to_dict()

The architecture contains a simple web app to represent a content management system (CMS).

For an example article, we used the following input:

“Werner Vogels loved travelling around the globe in his Toyota. We see his Toyota come up in many scenes as he drives to go and meet various customers in their home towns.”

None of the images have any metadata with the word “Toyota,” but the semantics of the word “Toyota” are synonymous with cars and driving. Therefore, with this example, we can demonstrate how we can go beyond keyword search and return images that are semantically similar. In the above screenshot of the UI, the caption under the image shows the metadata Amazon Rekognition extracted.

You could include this solution in a larger workflow where you use the metadata you already extracted from your images to start using vector search along with other key terms, such as celebrity names, to return the best resonating images and documents for your search query.

Conclusion

In this post, we showed how you can use Amazon Rekognition, Amazon Comprehend, SageMaker, and OpenSearch Service to extract metadata from your images and then use ML techniques to discover them automatically using celebrity and semantic search. This is particularly important within the publishing industry, where speed matters in getting fresh content out quickly and to multiple platforms.

For more information about working with media assets, refer to Media intelligence just got smarter with Media2Cloud 3.0.


About the Author

Mark Watkins is a Solutions Architect within the Media and Entertainment team, supporting his customers solve many data and ML problems. Away from professional life, he loves spending time with his family and watching his two little ones growing up.

Read More

Improving asset health and grid resilience using machine learning

Improving asset health and grid resilience using machine learning

This post is co-written with Travis Bronson, and Brian L Wilkerson from Duke Energy

Machine learning (ML) is transforming every industry, process, and business, but the path to success is not always straightforward. In this blog post, we demonstrate how Duke Energy, a Fortune 150 company headquartered in Charlotte, NC., collaborated with the AWS Machine Learning Solutions Lab (MLSL) to use computer vision to automate the inspection of wooden utility poles and help prevent power outages, property damage and even injuries.

The electric grid is made up of poles, lines and power plants to generate and deliver electricity to millions of homes and businesses. These utility poles are critical infrastructure components and subject to various environmental factors such as wind, rain and snow, which can cause wear and tear on assets. It’s critical that utility poles are regularly inspected and maintained to prevent failures that can lead to power outages, property damage and even injuries. Most power utility companies, including Duke Energy, use manual visual inspection of utility poles to identifyanomalies related to their transmission and distribution network. But this method can be costlyand time-consuming, and it requires that power transmission lineworkers follow rigorous safety protocols.

Duke Energy has used artificial intelligence in the past to create efficiencies in day-to-day operations to great success. The company has used AI to inspect generation assets and critical infrastructure and has been exploring opportunities to apply AI to the inspection of utility poles as well. Over the course of the AWS Machine Learning Solutions Lab engagement with Duke Energy, the utility progressed its work to automate the detection of anomalies in wood poles using advanced computer vision techniques.

Goals and use case

The goal of this engagement between Duke Energy and the Machine Learning Solutions Lab is to leverage machine learning to inspect hundreds of thousands of high-resolution aerial images to automate the identification and review process of all wood pole-related issues across 33,000 miles of transmission lines. This goal will further help Duke Energy to improve grid resiliency and comply with government regulations by identifying the defects in a timely manner. It will also reduce fuel and labor costs, as well as reduce carbon emissions by minimizing unnecessary truck rolls. Finally, it will also improve safety by minimizing miles driven, poles climbed and physical inspection risks associated with compromising terrain and weather conditions.

In the following sections, we present the key challenges associated with developing robust and efficient models for anomaly detection related to wood utility poles. We also describe the key challenges and suppositions associated with various data preprocessing techniques employed to achieve the desired model performance. Next, we present the key metrics used for evaluating the model performance along with the evaluation of our final models. And finally, we compare various state-of-the-art supervised and unsupervised modeling techniques.

Challenges

One of the key challenges associated with training a model for detecting anomalies using aerial images is the non-uniform image sizes. The following figure shows the distribution of image height and width of a sample data set from Duke Energy. It can be observed that the images have a large amount of variation in terms of size. Similarly, the size of images also pose significant challenges. The size of input images are thousands of pixels wide and thousands of pixels long. This is also not ideal for training a model for identification of the small anomalous regions in the image.

Distribution of image height and width for a sample data set

Distribution of image height and width for a sample data set

Also, the input images contain a large amount of irrelevant background information such as vegetation, cars, farm animals, etc. The background information could result in suboptimal model performance. Based on our assessment, only 5% of the image contains the wood poles and the anomalies are even smaller. This a major challenge for identifying and localizing anomalies in the high-resolution images. The number of anomalies is significantly smaller, compared to the entire data set. There are only 0.12% of anomalous images in the entire data set (i.e., 1.2 anomalies out of 1000 images). Finally, there is no labeled data available for training a supervised machine learning model. Next, we describe how we address these challenges and explain our proposed method.

Solution overview

Modeling techniques

The following figure demonstrates our image processing and anomaly detection pipeline. We first imported the data into Amazon Simple Storage Service (Amazon S3) using Amazon SageMaker Studio. We further employed various data processing techniques to address some of the challenges highlighted above to improve the model performance. After data preprocessing, we employed Amazon Rekognition Custom Labels for data labeling. The labeled data is further used to train supervised ML models such as Vision Transformer, Amazon Lookout for Vision, and AutoGloun for anomaly detection.

Image processing and anomaly detection pipeline

Image processing and anomaly detection pipeline

The following figure demonstrates the detailed overview of our proposed approach that includes the data processing pipeline and various ML algorithms employed for anomaly detection. First, we will describe the steps involved in the data processing pipeline. Next, we will explain the details and intuition related to various modeling techniques employed during this engagement to achieve the desired performance goals.

Data preprocessing

The proposed data preprocessing pipeline includes data standardization, identification of region of interest (ROI), data augmentation, data segmentation, and finally data labeling. The purpose of each step is described below:

Data standardization

The first step in our data processing pipeline includes data standardization. In this step, each image is cropped and divided into non overlapping patches of size 224 X 224 pixels. The goal of this step is to generate patches of uniform sizes that could be further utilized for training a ML model and localizing the anomalies in high resolution images.

Identification of region of interest (ROI)

The input data consists of high-resolution images containing large amount of irrelevant background information (i.e., vegetation, houses, cars, horses, cows, etc.). Our goal is to identify anomalies related to wood poles. In order to identify the ROI (i.e., patches containing the wood pole), we employed Amazon Rekognition custom labeling. We trained an Amazon Rekognition custom label model using 3k labeled images containing both ROI and background images. The goal of the model is to do a binary classification between the ROI and background images. The patches identified as background information are discarded while the crops predicted as ROI are used in the next step. The following figure demonstrates the pipeline that identifies the ROI. We generated a sample of non-overlapping crops of 1,110 wooden images that generated 244,673 crops. We further used these images as input to an Amazon Rekognition custom model that identified 11,356 crops as ROI. Finally, we manually verified each of these 11,356 patches. During the manual inspection, we identified the model was able to correctly predict 10,969 wood patches out of 11,356 as ROI. In other words, the model achieved 96% precision.

Identification of region of interest

Identification of region of interest

Data labeling

During the manual inspection of the images, we also labeled each image with their associated labels. The associated labels of images include wood patch, non-wood patch, non-structure, non-wood patch and finally wood patches with anomalies. The following figure demonstrates the nomenclature of the images using Amazon Rekognition custom labeling.

Data augmentation

Given the limited amount of labeled data that was available for training, we augmented the training data set by making horizontal flips of all of the patches. This had the effective impact of doubling the size of our data set.

Segmentation

We labeled the objects in 600 images (poles, wires, and metal railing) using the bounding box object detection labeling tool in Amazon Rekognition Custom Labels and trained a model to detect the three main objects of interest. We used the trained model to remove the background from all the images, by identifying and extracting the poles in each image, while removing the all other objects as well as the background. The resulting dataset had fewer images than the original data set, as a result of removing all images that don’t contain wood poles. In addition, there was also a false positive image that were removed from the dataset.

Anomaly detection

Next, we use the preprocessed data for training the machine learning model for anomaly detection. We employed three different methods for anomaly detection which includes AWS Managed Machine Learning Services (Amazon Lookout for Vision [L4V], Amazon Rekognition), AutoGluon, and Vision Transformer based self-distillation method.

AWS Services

Amazon Lookout for Vision (L4V)

Amazon Lookout for Vision is a managed AWS service that enables swift training and deployment of ML models and provides anomaly detection capabilities. It requires fully labelled data, which we provided by pointing to the image paths in Amazon S3. Training the model is as a simple as a single API (Application programming interface) call or console button click and L4V takes care of model selection and hyperparameter tuning under the hood.

Amazon Rekognition

Amazon Rekognition is a managed AI/ML service similar to L4V, which hides modelling details and provides many capabilities such as image classification, object detection, custom labelling, and more. It provides the ability to use the built-in models to apply to previously known entities in images (e.g., from ImageNet or other large open datasets). However, we used Amazon Rekognition’s Custom Labels functionality to train the ROI detector, as well as an anomaly detector on the specific images that Duke Energy has. We also used the Amazon Rekognition’s Custom Labels to train a model to put bounding boxes around wood poles in each image.

AutoGloun

AutoGluon is an open-source machine learning technique developed by Amazon. AutoGluon includes a multi-modal component which allows easy training on image data. We used AutoGluon Multi-modal to train models on the labelled image patches to establish a baseline for identifying anomalies.

Vision Transformer

Many of the most exciting new AI breakthroughs have come from two recent innovations: self-supervised learning, which allows machines to learn from random, unlabeled examples; and Transformers, which enable AI models to selectively focus on certain parts of their input and thus reason more effectively. Both methods have been a sustained focus for the Machine learning community, and we’re pleased to share that we used them in this engagement.

In particular, working in collaboration with researchers at Duke Energy, we used pre-trained self-distillation ViT (Vision Transformer) models as feature extractors for the downstream anomaly detection application using Amazon Sagemaker. The pre-trained self-distillation vision transformer models are trained on large amount of training data stored on Amazon S3 in a self-supervised manner using Amazon SageMaker. We leverage the transfer learning capabilities of ViT models pre-trained on large scale datasets (e.g., ImageNet). This helped us achieve a recall of 83% on an evaluation set using only a few thousands of labeled images for training.

Evaluation metrics

The following figure shows the key metrics used to evaluate model performance and its impacts. The key goal of the model is to maximize anomaly detection (i.e. true positives) and minimize the number of false negatives, or times when the anomalies that could lead to outages are beingmisclassified.

Once the anomalies are identified, technicians can address them, preventing future outages and ensuring compliance with government regulations. There’s another benefit to minimizing false positives: you avoid the unnecessary effort of going through images again.

Keeping these metrics in mind, we track the model performance in terms of following metrics, which encapsulates all four metrics defined above.

Precision

The percent of anomalies detected that are actual anomalies for objects of interest. Precision measures how well our algorithm identifies only anomalies. For this use case, high precision means low false alarms (i.e., the algorithm falsely identifies a woodpecker hole while there isn’t any in the image).

Recall

The percent of all anomalies that are recovered for each object of interest. Recall measures how well we identify all anomalies. This set captures some percentage of the full set of anomalies, and that percentage is the recall. For this use case, high recall means that we’re good at catching woodpecker holes when they occur. Recall is therefore the right metric to focus on in this POC because false alarms are at best annoying while missed anomalies could lead to serious consequence if left unattended.

Lower recall can lead to outages and government regulation violations. While lower precision leads to wasted human effort. The primary goal of this engagement is to identify all the anomalies to comply with government regulation and avoid any outage, hence we prioritize improving recall over precision.

Evaluation and model comparison

In the following section, we demonstrate the comparison of various modeling techniques employed during this engagement. We evaluated the performance of two AWS services Amazon Rekognition and Amazon Lookout for Vision. We also evaluated various modeling techniques using AutoGluon. Finally, we compare the performance with state-of-the-art ViT based self-distillation method.

The following figure shows the model improvement for the AutoGluon using different data processing techniques over the period of this engagement. The key observation is as we improve the data quality and quantity the performance of the model in terms of recall improved from below 30% to 78%.

Next, we compare the performance of AutoGluon with AWS services. We also employed various data processing techniques that helped improve the performance. However, the major improvement came from increasing the data quantity and quality. We increase the dataset size from 11 K images in total to 60 K images.

Next, we compare the performance of AutoGluon and AWS services with ViT based method. The following figure demonstrates that ViT-based method, AutoGluon and AWS services performed on par in terms of recall. One key observation is, beyond a certain point, increase in data quality and quantity does not help increase the performance in terms of recall. However, we observe improvements in terms of precision.

Precision versus recall comparison

Amazon AutoGluon Predicted anomalies Predicted normal
Anomalies 15600 4400
Normal 3659 38341

Next, we present the confusion matrix for AutoGluon and Amazon Rekognition and ViT based method using our dataset that contains 62 K samples. Out of 62K samples, 20 K samples are anomalous while remaining 42 K images are normal. It can be observed that ViT based methods captures largest number of anomalies (16,600) followed by Amazon Rekognition (16,000) and Amazon AutoGluon (15600). Similarly, Amazon AutoGluon has least number of false positives (3659 images) followed by Amazon Rekognition (5918) and ViT (15323). These results demonstrates that Amazon Rekognition achieves the highest AUC (area under the curve).

Amazon Rekognition Predicted anomalies Predicted normal
Anomalies 16,000 4000
Normal 5918 36082
ViT                                Predicted anomalies Predicted normal
Anomalies 16,600 3400
Normal 15,323 26,677

Conclusion

In this post, we showed you how the MLSL and Duke Energy teams worked together to develop a computer vision-based solution to automate anomaly detection in wood poles using high resolution images collected via helicopter flights. The proposed solution employed a data processing pipeline to crop the high-resolution image for size standardization. The cropped images are further processed using Amazon Rekognition Custom Labels to identify the region of interest (i.e., crops containing the patches with poles). Amazon Rekognition achieved 96% precision in terms of correctly identifying the patches with poles. The ROI crops are further used for anomaly detection using ViT based self-distillation mdoel AutoGluon and AWS services for anomaly detection. We used a standard data set to evaluate the performance of all three methods. The ViT based model achieved 83% recall and 52% precision. AutoGluon achieved 78% recall and 81% precision. Finally, Amazon Rekognition achieves 80% recall and 73% precision. The goal of using three different methods is to compare the performance of each method with different number of training samples, training time, and deployment time. All these methods take less than 2 hours to train a and deploy using a single A100 GPU instance or managed services on Amazon AWS. Next, steps for further improvement in model performance include adding more training data for improving model precision.

Overall, the end-to-end pipeline proposed in this post help achieve significant improvements in anomaly detection while minimizing operations cost, safety incident, regulatory risks, carbon emissions, and potential power outages.

The solution developed can be employed for other anomaly detection and asset health-related use cases across transmission and distribution networks, including defects in insulators and other equipment. For further assistance in developing and customizing this solution, please feel free to get in touch with the MLSL team.


About the Authors

Travis Bronson is a Lead Artificial Intelligence Specialist with 15 years of experience in technology and 8 years specifically dedicated to artificial intelligence. Over his 5-year tenure at Duke Energy, Travis has advanced the application of AI for digital transformation by bringing unique insights and creative thought leadership to his company’s leading edge. Travis currently leads the AI Core Team, a community of AI practitioners, enthusiasts, and business partners focused on advancing AI outcomes and governance. Travis gained and refined his skills in multiple technological fields, starting in the US Navy and US Government, then transitioning to the private sector after more than a decade of service.

 Brian Wilkerson is an accomplished professional with two decades of experience at Duke Energy. With a degree in computer science, he has spent the past 7 years excelling in the field of Artificial Intelligence. Brian is a co-founder of Duke Energy’s MADlab (Machine Learning, AI and Deep learning team). Hecurrently holds the position of Director of Artificial Intelligence & Transformation at Duke Energy, where he is passionate about delivering business value through the implementation of AI.

Ahsan Ali is an Applied Scientist at the Amazon Generative AI Innovation Center, where he works with customers from different domains to solve their urgent and expensive problems using Generative AI.

Tahin Syed is an Applied Scientist with the Amazon Generative AI Innovation Center, where he works with customers to help realize business outcomes with generative AI solutions. Outside of work, he enjoys trying new food, traveling, and teaching taekwondo.

Dr. Nkechinyere N. Agu is an Applied Scientist in the Generative AI Innovation Center at AWS. Her expertise is in Computer Vision AI/ML methods, Applications of AI/ML to healthcare, as well as the integration of semantic technologies (Knowledge Graphs) in ML solutions. She has a Masters and a Doctorate in Computer Science.

Aldo Arizmendi is a Generative AI Strategist in the AWS Generative AI Innovation Center based out of Austin, Texas. Having received his B.S. in Computer Engineering from the University of Nebraska-Lincoln, over the last 12 years, Mr. Arizmendi has helped hundreds of Fortune 500 companies and start-ups transform their business using advanced analytics, machine learning, and generative AI.

Stacey Jenks is a Principal Analytics Sales Specialist at AWS, with more than two decades of experience in Analytics and AI/ML. Stacey is passionate about diving deep on customer initiatives and driving transformational, measurable business outcomes with data. She is especially enthusiastic about the mark that utilities will make on society, via their path to a greener planet with affordable, reliable, clean energy.

Mehdi Noor is an Applied Science Manager at Generative Ai Innovation Center. With a passion for bridging technology and innovation, he assists AWS customers in unlocking the potential of Generative AI, turning potential challenges into opportunities for rapid experimentation and innovation by focusing on scalable, measurable, and impactful uses of advanced AI technologies, and streamlining the path to production.

Read More

Understanding social biases through the text-to-image generation lens

Understanding social biases through the text-to-image generation lens

This research paper was presented at the Sixth AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) (opens in new tab), a premier forum for discussion on the societal and ethical aspects of artificial intelligence.

The rise of text-to-image (T2I) generation has ushered in a new era of innovation, offering a broad spectrum of possibilities for creators, designers, and the everyday users of productivity software. This technology can transform descriptive text into remarkably realistic visual content, empowering users to enrich their work with vivid illustrative elements. However, beneath this innovation lies a notable concern—the potential inclusion of harmful societal biases.

These T2I models create images from the extensive web data on which they had been trained, and this data often lacks representation of different demographic groups and cultures and can even harbor harmful content. When these societal biases seep into AI-generated content, they perpetuate and amplify pre-existing societal problems, reinforcing them and creating a disconcerting cycle that undermines previous and current mitigation efforts.

Representation of gender, race, and age across occupations and personality traits

To tackle this problem, it is essential to rigorously evaluate these models across a variety of demographic factors and scenarios. In our paper, “Social Biases through the Text-to-Image Generation Lens (opens in new tab),” presented at AIES 2023 (opens in new tab), we conduct a thorough analysis for studying and quantifying common societal biases reflected in generated images. We focus on the portrayal of occupations, personality traits, and everyday situations across representations of gender, age, race, and geographical location. 

Microsoft Research Podcast

Collaborators: Holoportation™ communication technology with Spencer Fowers and Kwame Darko

Spencer Fowers and Kwame Darko break down how the technology behind Holoportation and the telecommunication device being built around it brings patients and doctors together when being in the same room isn’t an easy option and discuss the potential impact of the work.


For example, consider images that reinforce societal biases for the roles of CEO and housekeeper. These professions have been extensively studied as examples of stereotypical gender biases—where predominantly men are CEOs and women are housekeepers. For all such cases, we observed three different perspectives: 

  1. Real-world distribution: Relies on labor statistics, presenting distribution across various dimensions, such as gender, race, and age.
  1. Search engine results: Captures the distribution evident in search engine outcomes, reflecting contemporary portrayals. 
  1. Image generation results: Emphasizes the distribution observed in image generation outputs. 

We tested two T2I generators, DALLE-v2 and Stable Diffusion and compared them with 2022 data from the U.S. Bureau of Labor Statistics and results for a Google image search conducted in 2020, examining how women are represented across five different occupations. Notably, the analysis of generation models revealed a significant setback in representational fairness compared with data sourced from the U.S. Bureau of Labor Statistics (BLS) and a web image search (GIS) based on geographically referenced information. Notably, images generated by DALLE-v2 provide minimal representation of women in the professions of CEO and computer programmer. Conversely, in images generated by Stable Diffusion, women are consistently represented in the roles of nurses and housekeepers 100% of the time. Figure 1 illustrates our findings, and Figure 2 shows examples of images generated to show different occupations. 

A chart showing gender representation in percentage for DALLE-v2, Stable Diffusion, Google Image Search 2020, and BLS data. 
Figure 1. Gender representation for DALLE-v2, Stable Diffusion, Google Image Search 2020, and BLS data. 
Examples of generations for the professions of “computer programmer” and “housekeeper” using the DALL-E v2 and Stable Diffusion models.
Figure 2. A sample of the first four images generated for the professions of “computer programmer” and “housekeeper” using the DALL-E v2 and Stable Diffusion models. Notably, one gender is conspicuously absent across a distribution of 500 generated images. 

Even when using basic prompts like “person” without including an occupation, we observed that models can underrepresent certain demographic groups across age, race, and gender. When we analyzed DALLE-v2 and Stable Diffusion, both offered a limited representation of races other than white across a set of 500 generated images. Furthermore, the DALLE-v2 outputs revealed a remarkable lack of age diversity, with over 80% of the images depicting either adults who appeared to be between the ages 18 and 40 or children. This is illustrated in Figure 3.

Three charts showing gender, race, and age distribution as interpreted by human annotators for DALL-E v2 and Stable Diffusion models.
Figure 3. Gender, race, and age distribution as interpreted by human annotators and automated face processing within the context of image generation for the prompt “person.” 

Our study also examines biases of similar representations across positive and negative personality traits, revealing the subtleties of how these traits are depicted. While individuals of nonwhite races appear linked with positive attributes such as vigor, ambition, striving, and independence, they are also associated with negative traits like detachment, hardheartedness, and conceitedness. 

Representation of geographical locations in everyday scenarios 

Another aspect of bias that we studied pertains to the representation of diverse geographical locations in how models interpret everyday scenarios. We did this using such prompts as “a photo of a birthday party” or “a photo of a library.” Although it is difficult to discern the precise location of a generated photo, distinctions in these representations can still be measured between using a general prompt and a prompt that specifies a location, for example, “a photo of a birthday party in Colombia.” In the paper, we describe this experiment for the two most populous countries in each inhabited continent, considering everyday scenarios centering around events, places, food, institutions, community, and clothing. When models were given a general prompt, overall results indicated that images generated for countries like Nigeria, Papua New Guinea, and Ethiopia had the greatest difference between the prompt and the image, while images generated for Germany, the US, and Russia were the closest aligned to the general prompt. 

Subtle effects of using expanded prompts 

Many bias mitigation techniques rely on expanding the prompt to enrich and diversify the images that models generate. To tackle bias in AI-generated images, we applied prompt engineering (opens in new tab) to increase the likelihood that the image will reflect what’s specified in the prompt. We used prompt expansion, a type of prompt engineering, to add further descriptors to the initial general prompts and guide the model towards unbiased content. An example of prompt expansion would be “a portrait of a female doctor” instead of “a portrait of a doctor.” Our experiments proved that prompt expansion is predominantly effective in creating more specified content in AI-generated images. However, there are also unintended outcomes, particularly in terms of decreased diversity and image quality, as shown in Figure 4. 

Examples of generation output from DALL-E v2 for two prompts: “a portrait of an announcer” and “a portrait of a female announcer.”
Figure 4. Expanded prompts using descriptors like “female” can indeed yield more diverse depictions, but often at the cost of image variety and quality. 

Safeguarding against bias in T2I models

As T2I generation models become increasingly integrated into our digital ecosystems, it is paramount that we remain vigilant to the biases they may inadvertently perpetuate. This research underscores the profound importance of continually evaluating and refining these models. We hope that the outcomes and methodology presented in this study provide valuable insights for evaluating and building new generative models. We would like to emphasize the importance of fostering responsible development and ensuring representational fairness in this process. 

The post Understanding social biases through the text-to-image generation lens appeared first on Microsoft Research.

Read More