Electronics Giants Tap Into Industrial Automation With NVIDIA Metropolis for Factories

The $46 trillion global electronics manufacturing industry spans more than 10 million factories worldwide, where much is at stake in producing defect-free products. To drive product excellence, leading electronics manufacturers are adopting NVIDIA Metropolis for Factories.

More than 50 manufacturing giants and industrial automation providers — including Foxconn Industrial Internet, Pegatron, Quanta, Siemens and Wistron — are implementing Metropolis for Factories, NVIDIA founder and CEO Jensen Huang announced during his keynote address at the COMPUTEX technology conference in Taipei.

NVIDIA Metropolis for Factories is a collection of factory automation workflows that enables industrial technology companies and manufacturers to develop, deploy and manage customized quality-control systems that offer a competitive advantage.

Manufacturers globally spend more than $6 trillion a year in pursuit of quality control, and they apply defect detection on nearly every product line. But manual inspections can’t keep up with the demands.

Many manufacturers have automated optical inspection (AOI) systems that can help, but often these have high false detection rates, requiring labor-intensive and costly secondary manual inspections in an already challenging labor market, reducing their value.

NVIDIA Metropolis for Factories now offers a state-of-the-art AI platform and workflows for the development of incredibly accurate inspection applications such as AOI.

Pegatron Drives AOI With Metropolis for Factories 

Leading manufacturer Pegatron, based in Taipei’s Beitou district, is using NVIDIA Metropolis for Factories on its production lines.

Pegatron manufactures everything from motherboards to smartphones, laptops and game consoles. With a dozen manufacturing facilities handling more than 300 products and more than 5,000 parts per day, Pegatron has a lot of quality control to manage across its product portfolio. Further, frequent product updates require ongoing revisions to its AOI systems.

Pegatron is using the entire Metropolis for Factories workflow to support its printed circuit board (PCB) factories with simulation, robotics and automated production inspection. Metropolis for Factories enables the electronics manufacturing giant to quickly update its defect detection models and achieve 99.8% accuracy on its AOI systems, starting with small datasets.

 

Pegatron uses NVIDIA Isaac Sim, a robotic simulator, to program robotic arms in simulation and to model the performance of its fleets of mobile robots.

Tapping into NVIDIA Omniverse Replicator provides synthetic data generation to simulate defects, helping build massive training datasets with domain randomization and other techniques.

In Metropolis, NVIDIA TAO Toolkit allows Pegatron to access pretrained models and transfer learning to build its highly accurate defect detection models from its enhanced datasets.

The NVIDIA DeepStream software development kit can be used to develop optimized intelligent video applications that handle multiple video, image and audio streams. Using DeepStream, Pegatron was able to achieve a 10x improvement in throughput.

Moreover, Omniverse enables Pegatron to run digital twins of its inspection equipment, so it can simulate future inspection processes, promising increased efficiencies to its production workflow.

It’s also used by Quanta subsidiary Techman Robot, which taps Isaac Sim to optimize the inspection of robots by robots on their manufacturing line.

Metropolis for Factories is helping manufacturers like Pegatron to increase production line throughput, reduce costs and improve production quality.

Growing Partner Ecosystem Supports Metropolis 

Metropolis for Factories can be deployed from the enterprise industrial edge to the cloud, and a large and growing ecosystem of partners is helping bring it to market.

A host of specialists are joining forces on this effort including sensor makers, application partners, inspection equipment makers and integration partners.

Basler, a leading maker of imaging sensors and systems, has partnered with NVIDIA to help developers build AI-enabled inspection systems faster through tighter integration with the NVIDIA DeepStream SDK.

Quantiphi, a Metropolis partner, is working with one of the world’s largest beverage producers to automate inspections of fully packed pallets with GPU-powered vision AI.

Overview and Advantech — both NVIDIA Metropolis partners — are collaborating to build a real-time AI-based inspection system to support industrial inspection, product counting and assembly verification.

Metropolis partners Siemens and Data Monsters are working together to build industrial inspection systems, bringing together Omniverse Replicator synthetic data generation, NVIDIA TAO training, DeepStream runtime and Siemens’ NVIDIA Jetson-powered industrial personal computers.

Learn more about NVIDIA Metropolis for Factories.

NVIDIA Brings New Generative AI Capabilities, Groundbreaking Performance to 100 Million Windows RTX PCs and Workstations

Generative AI is rapidly ushering in a new era of computing for productivity, content creation, gaming and more. Generative AI models and applications — like NVIDIA NeMo and DLSS 3 Frame Generation, Meta LLaMa, ChatGPT, Adobe Firefly and Stable Diffusion — use neural networks to identify patterns and structures within existing data to generate new and original content.

When optimized for GeForce RTX and NVIDIA RTX GPUs, which offer up to 1,400 Tensor TFLOPS for AI inferencing, generative AI models can run up to 5x faster than on competing devices. This is thanks to Tensor Cores — dedicated hardware in RTX GPUs built to accelerate AI calculations — and regular software improvements. Enhancements introduced last week at the Microsoft Build conference doubled performance for generative AI models, such as Stable Diffusion, that take advantage of new DirectML optimizations.

As more AI inferencing happens on local devices, PCs will need powerful yet efficient hardware to support these complex tasks. To meet this need, RTX GPUs will add Max-Q low-power inferencing for AI workloads. The GPU will operate at a fraction of the power for lighter inferencing tasks, while scaling up to unmatched levels of performance for heavy generative AI workloads.

To create new AI applications, developers can now access a complete RTX-accelerated AI development stack running on Windows 11, making it easier to develop, train and deploy advanced AI models. This starts with development and fine-tuning of models with optimized deep learning frameworks available via Windows Subsystem for Linux.

Developers can then move seamlessly to the cloud to train on the same NVIDIA AI stack, which is available from every major cloud service provider. Next, developers can optimize the trained models for fast inferencing with tools like the new Microsoft Olive. And finally, they can deploy their AI-enabled applications and features to an install base of over 100 million RTX PCs and workstations that have been optimized for AI.

“AI will be the single largest driver of innovation for Windows customers in the coming years,” said Pavan Davuluri, corporate vice president of Windows silicon and system integration at Microsoft. “By working in concert with NVIDIA on hardware and software optimizations, we’re equipping developers with a transformative, high-performance, easy-to-deploy experience.”

To date, over 400 RTX AI-accelerated apps and games have been released, with more on the way.

During his keynote address kicking off COMPUTEX 2023, NVIDIA founder and CEO Jensen Huang introduced a new generative AI to support game development, NVIDIA Avatar Cloud Engine (ACE) for Games.

This custom AI model foundry service transforms games by bringing intelligence to non-playable characters through AI-powered natural language interactions. Developers of middleware, tools and games can use ACE for Games to build and deploy customized speech, conversation and animation AI models in their software and games.

Generative AI on RTX, Anywhere

From servers to the cloud to devices, generative AI running on RTX GPUs is everywhere. NVIDIA’s accelerated AI computing is a low-latency, full-stack endeavor. We’ve been optimizing every part of our hardware and software architecture for many years for AI, including fourth-generation Tensor Cores — dedicated AI hardware on RTX GPUs.

Regular driver optimizations ensure peak performance. The most recent NVIDIA driver, combined with Olive-optimized models and updates to DirectML, delivers significant speedups for developers on Windows 11. For example, Stable Diffusion performance is improved by 2x compared to the previous inference times for developers taking advantage of DirectML optimized paths.

And with the latest generation of RTX laptops and mobile workstations built on the NVIDIA Ada Lovelace architecture, users can take generative AI anywhere. Our next-gen mobile platform brings new levels of performance and portability — in form factors as small as 14 inches and as lightweight as about three pounds. Makers like Dell, HP, Lenovo and ASUS are pushing the generative AI era forward, backed by RTX GPUs and Tensor Cores.

“As AI continues to get deployed across industries at an expected annual growth rate of over 37% now through 2030, businesses and consumers will increasingly need the right technology to develop and implement AI, including generative AI. Lenovo is uniquely positioned to empower generative AI spanning from devices to servers to the cloud, having developed products and solutions for AI workloads for years. Our NVIDIA RTX GPU-powered PCs, such as select Lenovo ThinkPad, ThinkStation, ThinkBook, Yoga, Legion and LOQ devices, are enabling the transformative wave of generative AI for better everyday user experiences in saving time, creating content, getting work done, gaming and more.” — Daryl Cromer, vice president and chief technology officer of PCs and Smart Devices at Lenovo

“Generative AI is transformative and a catalyst for future innovation across industries. Together, HP and NVIDIA equip developers with incredible performance, mobility and the reliability needed to run accelerated AI models today, while powering a new era of generative AI.” —  Jim Nottingham, senior vice president and general manager of Z by HP

“Our recent work with NVIDIA on Project Helix centers on making it easier for enterprises to build and deploy trustworthy generative AI on premises. Another step in this historic moment is bringing generative AI to PCs. Think of app developers looking to perfect neural network algorithms while keeping training data and IP under local control. This is what our powerful and scalable Precision workstations with NVIDIA RTX GPUs are designed to do. And as the global leader in workstations, Dell is uniquely positioned to help users securely accelerate AI applications from the edge to the datacenter.” — Ed Ward, president of the client product group at Dell Technologies

“The generative AI era is upon us, requiring immense processing and fully optimized hardware and software. With the NVIDIA AI platform, including NVIDIA Omniverse, which is now preinstalled on many of our products, we are excited to see the AI revolution continue to take shape on ASUS and ROG laptops.” — Galip Fu, director of global consumer marketing at ASUS

Soon, laptops and mobile workstations with RTX GPUs will get the best of both worlds. AI inference-only workloads will be optimized for Tensor Core performance while keeping power consumption of the GPU as low as possible, extending battery life and maintaining a cool, quiet system. The GPU can then dynamically scale up for maximum AI performance when the workload demands it.

Developers can also learn how to optimize their applications end-to-end to take full advantage of GPU-acceleration via the NVIDIA AI for accelerating applications developer site.

NVIDIA CEO Tells NTU Grads to Run, Not Walk — But Be Prepared to Stumble

“You are running for food, or you are running from becoming food. And often times, you can’t tell which. Either way, run.”

NVIDIA founder and CEO Jensen Huang today urged graduates of National Taiwan University to run hard to seize the unprecedented opportunities that AI will present, but embrace the inevitable failures along the way.

Whatever you pursue, he told the 10,000 graduates of the island’s premier university, do it with passion and conviction — and stay humble enough to learn the hard lessons that await.

“Whatever it is, run after it like we did. Run. Don’t walk,” Huang said, having swapped his signature black leather jacket for a black graduation robe, with the school’s plum-blossom emblem highlighting a royal blue, white and aqua collar.

“Remember, either you are running for food; or you are running from becoming food. And often times, you can’t tell which. Either way, run.”

Huang, who moved from Taiwan when he was young, recognized his parents in the audience, and shared three stories of initial failures and retreat. He called them instrumental in helping forge NVIDIA’s character during its three-decade journey from a three-person gaming-graphics startup to a global AI leader worth nearly a trillion dollars.

“I was … successful — until I started NVIDIA,” he said. “At NVIDIA, I experienced failures — great big ones. All humiliating and embarrassing. Many nearly doomed us.”

The first involved a key early contract the company won to help Sega build a gaming console. Rapid changes in the industry forced NVIDIA to give up the contract in a near-death brush with bankruptcy, which Sega’s leadership helped avert.

“Confronting our mistake and, with humility, asking for help saved NVIDIA,” he said.

The second was the decision in 2007 to put CUDA into all the company’s GPUs, enabling them to crunch data in addition to handling 3D graphics. It was an expensive, long-term investment that drew much criticism and didn’t pay off for years, until the chips started being used for machine learning.

“Our market cap hovered just above a billion dollars,” he recalled. “We suffered many years of poor performance. Our shareholders were skeptical of CUDA and preferred we improve profitability.”

The third was the decision in 2010 to charge into the promising mobile-phone market as graphics-rich capabilities were coming into reach. The market quickly commoditized, though, and NVIDIA retreated just as quickly, taking initial heat but opening the door to investing in promising new markets — robotics and self-driving cars.

“Our strategic retreat paid off,” he said. “By leaving the phone market, we opened our minds to invent a new one.”

Huang told grads of the parallels in terms of boundless promise between the world he entered upon graduating four decades ago, on the cusp of the PC revolution, and the brave new age of AI they are entering today.

“For your journey, take along some of my learnings,” he said. Admit mistakes and ask for help; endure pain and suffering to realize your dreams; and make sacrifices to dedicate yourself to a life of purpose.

Foundation models for reasoning on charts

Visual language is the form of communication that relies on pictorial symbols outside of text to convey information. It is ubiquitous in our digital life in the form of iconography, infographics, tables, plots, and charts, extending to the real world in street signs, comic books, food labels, etc. For that reason, having computers better understand this type of media can help with scientific communication and discovery, accessibility, and data transparency.

While computer vision models have made tremendous progress using learning-based solutions since the advent of ImageNet, the focus has been on natural images, where all sorts of tasks, such as classification, visual question answering (VQA), captioning, detection and segmentation, have been defined, studied and in some cases advanced to reach human performance. However, visual language has not garnered a similar level of attention, possibly because of the lack of large-scale training sets in this space. But over the last few years, new academic datasets have been created with the goal of evaluating question answering systems on visual language images, like PlotQA, InfographicsVQA, and ChartQA.

Example from ChartQA. Answering the question requires reading the information and computing the sum and the difference.

Existing models built for these tasks relied on integrating optical character recognition (OCR) information and their coordinates into larger pipelines, but the process is error-prone, slow, and generalizes poorly. These methods were prevalent because existing end-to-end computer vision models based on convolutional neural networks (CNNs) or transformers pre-trained on natural images could not be easily adapted to visual language. But existing models are ill-prepared for the challenges of answering questions on charts, including reading the relative height of bars or the angle of slices in pie charts, understanding axis scales, correctly mapping pictograms to their legend values using colors, sizes and textures, and finally performing numerical operations with the extracted numbers.

In light of these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. MatCha, which stands for math and charts, is a pixels-to-text foundation model (a pre-trained model with built-in inductive biases that can be fine-tuned for multiple applications) trained on two complementary tasks: (a) chart de-rendering and (b) math reasoning. In chart de-rendering, given a plot or chart, the image-to-text model is required to generate its underlying data table or the code used to render it. For math reasoning pre-training, we pick textual numerical reasoning datasets and render the input into images, which the image-to-text model needs to decode for answers. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables. With these methods we surpass the previous state of the art in ChartQA by more than 20% and match the best summarization systems that have 1000 times more parameters. Both papers will be presented at ACL 2023.

Chart de-rendering

Plots and charts are usually generated by an underlying data table and a piece of code. The code defines the overall layout of the figure (e.g., type, direction, color/shape scheme) and the underlying data table establishes the actual numbers and their groupings. Both the data and code are sent to a compiler/rendering engine to create the final image. To understand a chart, one needs to discover the visual patterns in the image and effectively parse and group them to extract the key information. Reversing the plot rendering process demands all such capabilities and can thus serve as an ideal pre-training task.

A chart created from a table in the Airbus A380 Wikipedia page using random plotting options. The pre-training task for MatCha consists of recovering the source table or the source code from the image.

In practice, it is challenging to simultaneously obtain charts, their underlying data tables, and their rendering code. To collect sufficient pre-training data, we independently accumulate [chart, code] and [chart, table] pairs. For [chart, code], we crawl all GitHub IPython notebooks with appropriate licenses and extract blocks with figures. A figure and the code block right before it are saved as a [chart, code] pair. For [chart, table] pairs, we explored two sources. For the first source, synthetic data, we manually write code to convert web-crawled Wikipedia tables from the TaPas codebase to charts. We sampled from and combined several plotting options depending on the column types. In addition, we also add [chart, table] pairs generated in PlotQA to diversify the pre-training corpus. The second source is web-crawled [chart, table] pairs. We directly use the [chart, table] pairs crawled in the ChartQA training set, containing around 20k pairs in total from four websites: Statista, Pew, Our World in Data, and OECD.
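
The synthetic [chart, table] construction described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the paper: the option names in PLOT_TYPES and ORIENTATIONS, the " | " cell separator, and the helper names are all assumptions, the table values are made-up toy numbers, and the actual image rendering step is left out.

```python
import random

# Hypothetical plotting options; the post only says that options were
# sampled and combined depending on the column types.
PLOT_TYPES = ["bar", "line", "pie"]
ORIENTATIONS = ["vertical", "horizontal"]


def sample_plot_options(rng):
    """Pick random rendering options for one synthetic chart."""
    return {
        "type": rng.choice(PLOT_TYPES),
        "orientation": rng.choice(ORIENTATIONS),
    }


def linearize_table(header, rows):
    """Flatten a table into a text target (the separator is an assumption)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def make_pair(header, rows, rng):
    """Return (render_spec, target_text) for one [chart, table] example.

    render_spec would be handed to a plotting library to produce the chart
    image; target_text is what the model is trained to generate from it.
    """
    spec = {"options": sample_plot_options(rng), "header": header, "rows": rows}
    return spec, linearize_table(header, rows)


# Toy table with made-up values, used only to show the output format.
rng = random.Random(0)
spec, target = make_pair(["Year", "Value"], [[2019, 8], [2020, 4]], rng)
print(target)
```

Seeding the random generator keeps the sampled plotting options reproducible, which helps when regenerating a pre-training corpus.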

Math reasoning

We incorporate numerical reasoning knowledge into MatCha by learning math reasoning skills from textual math datasets. We use two existing textual math reasoning datasets, MATH and DROP, for pre-training. MATH is synthetically created, containing two million training examples per module (type) of questions. DROP is a reading-comprehension–style QA dataset where the input is a paragraph context and a question.

To solve questions in DROP, the model needs to read the paragraph, extract relevant numbers and perform numerical computation. We found both datasets to be complementary. MATH contains a large number of questions across different categories, which helps us identify math operations needed to explicitly inject into the model. DROP’s reading-comprehension format resembles the typical QA format wherein models simultaneously perform information extraction and reasoning. In practice, we render inputs of both datasets into images. The model is trained to decode the answer.

To improve the math reasoning skills of MatCha we incorporate examples from MATH and DROP into the pre-training objective, by rendering the input text as images.

End-to-end results

We use a Pix2Struct model backbone, which is an image-to-text transformer tailored for website understanding, and pre-train it with the two tasks described above. We demonstrate the strengths of MatCha by fine-tuning it on several visual language tasks — tasks involving charts and plots for question answering and summarization where no access to the underlying table is possible. MatCha surpasses previous models’ performance by a large margin and also outperforms the previous state of the art, which assumes access to underlying tables.

In the figure below, we first evaluate two baseline models that incorporate information from an OCR pipeline, which until recently was the standard approach for working with charts. The first is based on T5, the second on VisionTaPas. We also compare against PaLI-17B, which is a large (~1000 times larger than the other models) image plus text-to-text transformer trained on a diverse set of tasks but with limited capabilities for reading text and other forms of visual language. Finally, we report the Pix2Struct and MatCha model results.

Experimental results on two chart QA benchmarks ChartQA & PlotQA (using relaxed accuracy) and a chart summarization benchmark chart-to-text (using BLEU4). MatCha surpasses the state of the art by a large margin on QA, compared to larger models, and matches these larger models on summarization.

For QA datasets, we use the official relaxed accuracy metric that allows for small relative errors in numerical outputs. For chart-to-text summarization, we report BLEU scores. MatCha achieves noticeably improved results compared to baselines for question answering, and comparable results to PaLI in summarization, where large size and extensive long text/captioning generation pre-training are advantageous for this kind of long-form text generation.
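
To make the relaxed accuracy idea concrete, here is a minimal sketch. The 5% relative tolerance, the case-insensitive exact match for non-numeric answers, and the function names are assumptions for illustration; the official benchmark scripts may normalize answers differently.

```python
def _to_float(text):
    """Parse a numeric answer, tolerating a trailing % and thousands commas."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None


def relaxed_match(prediction, target, tolerance=0.05):
    """True if a numeric prediction is within a relative tolerance of target.

    Non-numeric answers fall back to a case-insensitive exact match.
    """
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions, targets, tolerance=0.05):
    """Fraction of predictions that match their target under relaxed_match."""
    if not targets:
        return 0.0
    hits = sum(relaxed_match(p, t, tolerance) for p, t in zip(predictions, targets))
    return hits / len(targets)
```

With this definition, a prediction of 104 for a target of 100 counts as correct, while 106 does not.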

Derendering plus large language model chains

While fine-tuned MatCha models are extremely performant for their number of parameters, particularly on extractive tasks, we observed that they could still struggle with end-to-end complex reasoning (e.g., mathematical operations involving large numbers or multiple steps). Thus, we also propose a two-step method to tackle this: 1) a model reads a chart, then outputs the underlying table, 2) a large language model (LLM) reads this output and then tries to answer the question solely based on the textual input.

For the first model, we fine-tuned MatCha solely on the chart-to-table task, increasing the output sequence length to guarantee it could recover all or most of the information in the chart. DePlot is the resulting model. In the second stage, any LLM (such as FlanPaLM or Codex) can be used for the task, and we can rely on the standard methods to increase performance on LLMs, for example chain-of-thought and self-consistency. We also experimented with program-of-thoughts where the model produces executable Python code to offload complex computations.
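
The two-step flow with self-consistency can be sketched as below. Both chart_to_table and llm are hypothetical callables standing in for a DePlot-style model and a sampled LLM call that returns only the final answer string; the prompt wording is also an assumption, not the prompt used in the paper.

```python
from collections import Counter


def build_prompt(table_text, question):
    """Combine the de-rendered table and the question into a text-only prompt."""
    return (
        "Read the table below and answer the question.\n\n"
        f"{table_text}\n\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )


def answer_with_self_consistency(chart_to_table, llm, chart, question, num_samples=5):
    """Step 1: chart -> table text. Step 2: sample the LLM, then majority-vote.

    chart_to_table and llm are placeholders supplied by the caller.
    """
    table_text = chart_to_table(chart)
    prompt = build_prompt(table_text, question)
    answers = [llm(prompt) for _ in range(num_samples)]
    # Self-consistency: keep the answer that the most samples agree on.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```

Because the LLM only ever sees text, the second stage can be swapped for any model without retraining the chart-to-table stage.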

An illustration of the DePlot+LLM method. This is a real example using FlanPaLM and Codex. The blue boxes are input to the LLM and the red boxes contain the answer generated by the LLMs. We highlight some of the key reasoning steps in each answer.

As shown in the example above, the DePlot model in combination with LLMs outperforms fine-tuned models by a significant margin, especially so in the human-sourced portion of ChartQA, where the questions are more natural but demand more difficult reasoning. Furthermore, DePlot+LLM can do so without access to any training data.

We have released the new models and code at our GitHub repo, where you can try them out yourself in Colab. Check out the papers for MatCha and DePlot for more details on the experimental results. We hope that our results can benefit the research community and make the information in charts and plots more accessible to everyone.

Acknowledgements

This work was carried out by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen and Yasemin Altun from our Language Team as part of Fangyu’s internship project. Nigel Collier from Cambridge also was a collaborator. We would like to thank Joshua Howland, Alex Polozov, Shrestha Basu Mallick, Massimo Nicosia and William Cohen for their valuable comments and suggestions.

Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker

Text-to-image generation is a task in which a machine learning (ML) model generates an image from a textual description. The goal is to generate an image that closely matches the description, capturing the details and nuances of the text. This task is challenging because it requires the model to understand the semantics and syntax of the text and to generate photorealistic images. There are many practical applications of text-to-image generation in AI photography, concept art, building architecture, fashion, video games, graphic design, and much more.

Stable Diffusion is a text-to-image model that empowers you to create high-quality images within seconds. When real-time interaction with this type of model is the goal, ensuring a smooth user experience depends on the use of accelerated hardware for inference, such as GPUs or AWS Inferentia2, Amazon’s own ML inference accelerator. The steep costs involved in using GPUs typically require optimizing the utilization of the underlying compute, even more so when you need to deploy different architectures or personalized (fine-tuned) models. Amazon SageMaker multi-model endpoints (MMEs) address this problem by letting you scale to thousands of models behind one endpoint. By using a shared serving container, you can host multiple models in a cost-effective, scalable manner within the same endpoint, and even the same GPU.

In this post, you will learn about Stable Diffusion model architectures, different types of Stable Diffusion models, and techniques to enhance image quality. We also show you how to deploy Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server.

Prompt: portrait of a cute bernese dog, art by elke Vogelsang, 8k ultra realistic, trending on artstation, 4 k

Prompt: architecture design of living room, 8 k ultra-realistic, 4 k, hyperrealistic, focused, extreme details

Prompt: New York skyline at night, 8k, long shot photography, unreal engine 5, cinematic, masterpiece

Stable Diffusion architecture

Stable Diffusion is an open-source text-to-image model that you can use to create images of different styles and content simply by providing a text prompt. In the context of text-to-image generation, a diffusion model is a generative model that you can use to generate high-quality images from textual descriptions. Diffusion models can capture the complex dependencies between the input and output modalities, text and images.

The following diagram shows a high-level architecture of a Stable Diffusion model.

It consists of the following key elements:

  • Text encoder – CLIP is a transformer-based text encoder model that takes input prompt text and converts it into token embeddings that represent each word in the text. CLIP is trained on a dataset of images and their captions, and combines an image encoder with a text encoder.
  • U-Net – A U-Net model takes token embeddings from CLIP along with an array of noisy inputs and produces a denoised output. This happens through a series of iterative steps, where each step processes an input latent tensor and produces a new latent space tensor that better represents the input text.
  • Auto encoder-decoder – This model creates the final images. It takes the final denoised latent output from the U-Net model and converts it into images that represent the text input.

Types of Stable Diffusion models

In this post, we explore the following pre-trained Stable Diffusion models by Stability AI from the Hugging Face model hub.

stable-diffusion-2-1-base

Use this model to generate images based on a text prompt. This is a base version of the model, trained on a subset of the large-scale dataset LAION-5B, mainly with English captions. We use StableDiffusionPipeline from the diffusers library to generate images from text prompts. This model can create images of dimension 512 x 512. It uses the following parameters:

  • prompt – A prompt can be a text word, phrase, sentences, or paragraphs.
  • negative_prompt – You can also pass a negative prompt to exclude specified elements from the image generation process and to enhance the quality of the generated images.
  • guidance_scale – A higher guidance scale results in an image more closely related to the prompt, at the expense of image quality. If specified, it must be a float.
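
As a rough sketch of how these three parameters might be passed to the pipeline, consider the following. The helper names, the default guidance_scale of 7.5, and the choice of float16 on a CUDA device are assumptions for illustration; generate_image needs the diffusers and torch packages plus a GPU, so it is only defined here, not called.

```python
def build_generation_kwargs(prompt, negative_prompt=None, guidance_scale=7.5):
    """Validate and collect the arguments described in the list above."""
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    # guidance_scale must be a float; coerce ints such as 9 to 9.0.
    kwargs = {"prompt": prompt, "guidance_scale": float(guidance_scale)}
    if negative_prompt:
        kwargs["negative_prompt"] = negative_prompt
    return kwargs


def generate_image(kwargs, model_id="stabilityai/stable-diffusion-2-1-base"):
    """Run text-to-image generation (requires diffusers, torch and a GPU)."""
    # Imported lazily so build_generation_kwargs works without the GPU stack.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    return pipe(**kwargs).images[0]


example_kwargs = build_generation_kwargs(
    "New York skyline at night, 8k, long shot photography",
    negative_prompt="blurry, low quality",
    guidance_scale=9,
)
print(example_kwargs)
```

Keeping argument validation separate from the pipeline call makes it easy to check requests before they reach the accelerator.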

stable-diffusion-2-depth

This model is used to generate new images from existing ones while preserving the shape and depth of the objects in the original image. This stable-diffusion-2-depth model is fine-tuned from stable-diffusion-2-base, with an extra input channel to process the (relative) depth prediction. We use StableDiffusionDepth2ImgPipeline from the diffusers library to load the pipeline and generate depth images. The following are the additional parameters specific to the depth model:

  • image – The initial image to condition the generation of new images.
  • num_inference_steps (optional) – The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference. This parameter is modulated by strength.
  • strength (optional) – Conceptually, this indicates how much to transform the reference image. The value must be between 0–1. image is used as a starting point, adding more noise to it the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise will be maximum and the denoising process will run for the full number of iterations specified in num_inference_steps. A value of 1, therefore, essentially ignores image. For more details, refer to the following code.
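
The interaction between strength and num_inference_steps can be illustrated with a small calculation. The formula below reflects how image-to-image pipelines in diffusers are commonly understood to derive the step count; treat it as an assumption rather than a guaranteed description of every pipeline version.

```python
def effective_denoising_steps(num_inference_steps, strength):
    """Number of denoising iterations actually run for a given strength.

    Assumed rule: steps scale linearly with strength and are capped at
    num_inference_steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.0, 0.5, 0.8, 1.0):
    print(strength, effective_denoising_steps(50, strength))
```

Under this rule, a strength of 0.5 with 50 requested steps runs 25 denoising iterations, so lowering strength both preserves more of the input image and shortens inference.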

stable-diffusion-2-inpainting

You can use this model for AI image restoration use cases. You can also use it to create novel designs and images from the prompts and additional arguments. This model is also derived from the base model and has a mask generation strategy. It specifies the mask of the original image to represent segments to be changed and segments to leave unchanged. We use StableDiffusionInpaintPipeline from the diffusers library to apply inpaint changes on the original image. The following additional parameter is specific to the inpainting model:

  • mask_image – An image where the blacked-out portion remains unchanged during image generation and the white portion is replaced.
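The mask convention can be illustrated with a small pixel-level sketch. This is a conceptual simplification under stated assumptions: the real pipeline applies the mask during denoising rather than compositing finished images, and the function name composite_with_mask and the grayscale nested-list images are hypothetical.

```python
def composite_with_mask(original, generated, mask, threshold=128):
    """Keep original pixels where the mask is black, and take generated
    pixels where the mask is white (values at or above threshold)."""
    return [
        [gen if m >= threshold else orig
         for orig, gen, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]


original = [[10, 20], [30, 40]]
generated = [[91, 92], [93, 94]]
mask = [[0, 255], [0, 0]]  # only the top-right pixel is white

print(composite_with_mask(original, generated, mask))
```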

stable-diffusion-x4-upscaler

This model is also derived from the base model, additionally trained on the 10M subset of LAION containing 2048 x 2048 images. As the name implies, it can be used to upscale lower-resolution images to higher resolutions.

Use case overview

For this post, we deploy an AI image service with multiple capabilities, including generating novel images from text, changing the styles of existing images, removing unwanted objects from images, and upscaling low-resolution images to higher resolutions. Using several variations of Stable Diffusion models, you can address all of these use cases within a single SageMaker endpoint. This means that you’ll need to host a large number of models in a performant, scalable, and cost-efficient way. In this post, we show how to deploy multiple Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server. You will learn about the implementation details, optimization techniques, and best practices to work with text-to-image models.

The following table summarizes the Stable Diffusion models that we deploy to a SageMaker MME.

Model Name                                   Model Size in GB
stabilityai/stable-diffusion-2-1-base        2.5
stabilityai/stable-diffusion-2-depth         2.7
stabilityai/stable-diffusion-2-inpainting    2.5
stabilityai/stable-diffusion-x4-upscaler     7

Solution overview

The following steps are involved in deploying Stable Diffusion models to SageMaker MMEs:

  1. Use the Hugging Face hub to download the Stable Diffusion models to a local directory. This will download scheduler, text_encoder, tokenizer, unet, and vae for each Stable Diffusion model into its corresponding local directory. We use the revision="fp16" version of the model.
  2. Set up the NVIDIA Triton model repository, model configurations, and model serving logic model.py. Triton uses these artifacts to serve predictions.
  3. Package the conda environment with additional dependencies and package the model repository to be deployed to the SageMaker MME.
  4. Package the model artifacts in an NVIDIA Triton-specific format and upload model.tar.gz to Amazon Simple Storage Service (Amazon S3). The model will be used for generating images.
  5. Configure a SageMaker model, endpoint configuration, and deploy the SageMaker MME.
  6. Run inference and send prompts to the SageMaker endpoint to generate images using the Stable Diffusion model. We specify the TargetModel variable and invoke different Stable Diffusion models to compare the results visually.

We have published the code to implement this solution architecture in the GitHub repo. Follow the README instructions to get started.

Serve models with an NVIDIA Triton Inference Server Python backend

We use a Triton Python backend to deploy the Stable Diffusion pipeline model to a SageMaker MME. The Python backend lets you serve models written in Python by Triton Inference Server. To use the Python backend, you need to create a Python file model.py that has the following structure:

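import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        return auto_complete_model_config  # optional: complete missing config

    def initialize(self, args):
        pass  # optional: load the model and other state before inference

    def execute(self, requests):
        raise NotImplementedError  # required: one response per request

    def finalize(self):
        pass  # optional: clean up before the model is unloaded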

Every Python backend can implement four main functions in the TritonPythonModel class: auto_complete_config, initialize, execute, and finalize.

initialize is called when the model is being loaded. Implementing initialize is optional. initialize allows you to do any necessary initializations before running inference. In the initialize function, we create a pipeline and load the pipelines using from_pretrained checkpoints. We configure schedulers from the pipeline scheduler config pipe.scheduler.config. Finally, we specify xformers optimizations to enable the xformer memory efficient parameter enable_xformers_memory_efficient_attention. We provide more details on xformers later in this post. You can refer to model.py of each model to understand the different pipeline details. This file can be found in the model repository.

The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function, you are given a list of InferenceRequest objects. We pass the input text prompt to the pipeline to get an image from the model. Images are decoded and the generated image is returned from this function call.

We get the input tensor from the name defined in the model configuration config.pbtxt file. From the inference request, we get prompt, negative_prompt, and gen_args, and decode them. We pass all the arguments to the model pipeline object, then encode the generated image so that it can be returned as a string. You can refer to the config.pbtxt file of each model to understand the different pipeline details. This file can be found in the model repository. Finally, we wrap the generated image in InferenceResponse and return the response.

Implementing finalize is optional. This function allows you to do any cleanups necessary before the model is unloaded from Triton Inference Server.

When working with the Python backend, it’s the user’s responsibility to ensure that the inputs are processed in a batched manner and that responses are sent back accordingly. To achieve this, we recommend following these steps:

  1. Loop through all requests in the requests object to form a batched_input.
  2. Run inference on the batched_input.
  3. Split the results into multiple InferenceResponse objects and concatenate them as the responses.
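The three steps above can be sketched in plain Python. This is a minimal illustration, not the code from the repository: requests are modeled as plain dictionaries with a hypothetical "prompts" key, and infer_fn stands in for the pipeline call. In a real model.py you would read tensors with pb_utils and return InferenceResponse objects instead.

```python
def run_batched(requests, infer_fn):
    """Batch all requests, run inference once, and split the results."""
    # Step 1: gather every prompt into one batch, remembering how many
    # items each request contributed.
    batched_input = []
    sizes = []
    for request in requests:
        prompts = request["prompts"]
        batched_input.extend(prompts)
        sizes.append(len(prompts))

    # Step 2: run inference once on the whole batch.
    batched_output = infer_fn(batched_input)

    # Step 3: slice the outputs back into one response per request,
    # preserving the original request order.
    responses = []
    start = 0
    for size in sizes:
        responses.append({"outputs": batched_output[start:start + size]})
        start += size
    return responses


# Stand-in "model" that upper-cases each prompt.
demo_requests = [{"prompts": ["a pool", "a park"]}, {"prompts": ["a tree"]}]
print(run_batched(demo_requests, lambda batch: [p.upper() for p in batch]))
```

Returning exactly one response per request, in the same order, is what lets the server match responses to callers.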

Refer to the Triton Python backend documentation or Host ML models on Amazon SageMaker using Triton: Python backend for more details.

NVIDIA Triton model repository and configuration

The model repository contains the model serving script, model artifacts and tokenizer artifacts, a packaged conda environment (with dependencies needed for inference), the Triton config file, and the Python script used for inference. The latter is mandatory when you use the Python backend, and you should use the Python file model.py. Let’s explore the configuration file of the inpaint Stable Diffusion model and understand the different options specified:

name: "sd_inpaint"
backend: "python"
max_batch_size: 8
input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "negative_prompt"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
    optional: true
  },
  {
    name: "image"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "mask_image"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "gen_args"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
    optional: true
  }
]
output [
  {
    name: "generated_image"
    data_type: TYPE_STRING    
    dims: [
      -1
    ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/tmp/conda/sd_env.tar.gz"
  }
}

The following table explains the various parameters and values:

Key Details
name It’s not required to include the model configuration name property. In the event that the configuration doesn’t specify the model’s name, it’s presumed to be identical to the name of the model repository directory where the model is stored. However, if a name is provided, it must match the name of the model repository directory where the model is stored. sd_inpaint is the config property name.
backend This specifies the Triton framework to serve model predictions. This is a mandatory parameter. We specify python, because we’ll be using the Triton Python backend to host the Stable Diffusion models.
max_batch_size This indicates the maximum batch size that the model supports for the types of batching that can be exploited by Triton.
input→ prompt Text prompt of type string. Specify -1 to accept dynamic tensor shape.
input→ negative_prompt Negative text prompt of type string. Specify -1 to accept dynamic tensor shape.
input→ mask_image Base64 encoded mask image of type string. Specify -1 to accept dynamic tensor shape.
input→ image Base64 encoded image of type string. Specify -1 to accept dynamic tensor shape.
input→ gen_args JSON encoded additional arguments of type string. Specify -1 to accept dynamic tensor shape.
output→ generated_image Generated image of type string. Specify -1 to accept dynamic tensor shape.
instance_group You can use this setting to place multiple run instances of a model on every GPU or on only certain GPUs. We specify KIND_GPU to make copies of the model on available GPUs.
parameters We set the conda environment path to EXECUTION_ENV_PATH.

For details about the model repository and configurations of other Stable Diffusion models, refer to the code in the GitHub repo. Each directory contains artifacts for the specific Stable Diffusion models.

Package a conda environment and extend the SageMaker Triton container

SageMaker NVIDIA Triton container images don’t contain libraries like transformers, accelerate, and diffusers, which are needed to deploy and serve Stable Diffusion models. However, Triton allows you to bring additional dependencies using conda-pack. Let’s start by creating the conda environment with the necessary dependencies outlined in the environment.yml file and create a tar artifact, sd_env.tar.gz, containing the conda environment with the dependencies installed in it. Use the following YML file to create the conda environment, then create the conda-pack artifact and copy it to the local directory from where it will be uploaded to Amazon S3. Note that we upload the conda artifact as one of the models in the MME and invoke this model to set up the conda environment on the SageMaker hosting ML instance.

%%writefile environment.yml
name: mme_env
dependencies:
  - python=3.8
  - pip
  - pip:
      - numpy
      - torch --extra-index-url https://download.pytorch.org/whl/cu118
      - accelerate
      - transformers
      - diffusers
      - xformers
      - conda-pack

!conda env create -f environment.yml --force

Upload model artifacts to Amazon S3

SageMaker expects the .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. Therefore, we create a tar artifact with content from the Triton model repository. We can use this S3 bucket to host thousands of model artifacts, and the SageMaker MME will use models from this location to dynamically load and serve a large number of models. We store all the Stable Diffusion models in this Amazon S3 location.
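As a rough sketch of the packaging step, the following helper archives one Triton model directory so that the model name is the root of the archive. The function name package_model and the directory names are assumptions for illustration; the layout assumed here (config.pbtxt next to a numbered version folder containing model.py) is the usual Triton repository layout, and the repository for this post may organize things differently.

```python
import tarfile
from pathlib import Path


def package_model(repo_dir, model_name, output_dir):
    """Create <output_dir>/<model_name>.tar.gz from <repo_dir>/<model_name>.

    The archive keeps <model_name>/ as its top-level directory, for example:
        sd_base/config.pbtxt
        sd_base/1/model.py
    """
    archive_path = Path(output_dir) / f"{model_name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(Path(repo_dir) / model_name, arcname=model_name)
    return archive_path
```

Calling package_model("model_repository", "sd_base", ".") would produce sd_base.tar.gz in the current directory, assuming those hypothetical paths exist. The archive can then be uploaded to the S3 prefix that the endpoint reads from, and its file name is what you pass as TargetModel at invocation time.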

Deploy the SageMaker MME

In this section, we walk through the steps to deploy the SageMaker MME by defining container specification, SageMaker model and endpoint configurations.

Define the serving container

In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker will create the endpoint with the MME container specifications. We set the container with an image that supports deploying MMEs with GPU. See Supported algorithms, frameworks, and instances for more details.

All the model artifacts are stored in the Amazon S3 location that ModelDataUrl points to:

container = {"Image": mme_triton_image_uri, 
             "ModelDataUrl": model_data_url, 
             "Mode": "MultiModel"}

Create an MME object

We use the SageMaker Boto3 client to create the model using the create_model API. We pass the container definition to the create model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, 
    ExecutionRoleArn=role, 
    PrimaryContainer=container
)

Define configurations for the MME

Create an MME configuration using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (we use the same instance type that we are using to host our SageMaker notebook). We recommend configuring your endpoints with at least two instances for real-life use cases. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

Create an MME

Use the preceding endpoint configuration to create a new SageMaker endpoint and wait for the deployment to finish:

create_endpoint_response = sm_client.create_endpoint(
                EndpointName=endpoint_name, 
                EndpointConfigName=endpoint_config_name
)

The status will change to InService when the deployment is successful.
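One way to wait for the deployment is to poll the endpoint status, as in the following sketch. It assumes sm_client is the SageMaker Boto3 client used above and relies on its describe_endpoint call; the helper name wait_for_endpoint, the polling interval, and the timeout are illustrative choices rather than values from this post. Boto3 also provides a built-in waiter for the same purpose.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll the endpoint until it is InService, fails, or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        time.sleep(poll_seconds)
```

Raising on Failed instead of continuing to poll surfaces deployment problems early rather than after the full timeout.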

Generate images using different versions of Stable Diffusion models

Let’s start by invoking the base model with a prompt and getting the generated image. We pass the inputs to the base model with prompt, negative_prompt, and gen_args as a dictionary. We set the data type and shape of each input item in the dictionary and pass it as input to the model.

inputs = dict(prompt = "Infinity pool on top of a high rise overlooking Central Park",
             negative_prompt = "blur,low detail, low quality",
             gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=8))
)
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_base.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])

Prompt: Infinity pool on top of a high rise overlooking Central Park
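Because every invocation in this post builds the same request structure, it can be convenient to factor it into small helpers. The sketch below is one possible refactoring; the names build_payload and first_output are hypothetical and do not come from the repository. The request and response shapes mirror the ones used in the preceding code.

```python
import json


def build_payload(inputs):
    """Wrap each named input as a [1, 1] BYTES tensor."""
    return {
        "inputs": [
            {"name": name, "shape": [1, 1], "datatype": "BYTES", "data": [data]}
            for name, data in inputs.items()
        ]
    }


def first_output(response_body):
    """Extract the first data item of the first output from a JSON response body."""
    return json.loads(response_body)["outputs"][0]["data"][0]


inputs = dict(
    prompt="Infinity pool on top of a high rise overlooking Central Park",
    gen_args=json.dumps(dict(num_inference_steps=50, guidance_scale=8)),
)
body = json.dumps(build_payload(inputs))
print(body[:80])
```

The serialized body can be passed as Body to invoke_endpoint, and first_output can be applied to the decoded response before calling decode_image.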

Working with this image, we can modify it with the versatile Stable Diffusion depth model. For example, we can change the style of the image to an oil painting, or change the setting from Central Park to Yellowstone National Park simply by passing the original image along with a prompt describing the changes we would like to see.

We invoke the depth model by specifying sd_depth.tar.gz in the TargetModel of the invoke_endpoint function call. In the outputs, notice how the orientation of the original image is preserved, but for one example, the NYC buildings have been transformed into rock formations of the same shape.

inputs = dict(prompt = "highly detailed oil painting of an inifinity pool overlooking central park",
              image=image,
              gen_args = json.dumps(dict(num_inference_steps=50, strength=0.9))
              )
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_depth.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
print("original image")
display(original_image)
print("generated image")
display(decode_image(output[0]["data"][0]))

Images: original image, oil painting, Yellowstone Park.

Another useful model is Stable Diffusion inpainting, which we can use to remove certain parts of the image. Let’s say you want to remove the tree in the following example image. We can do so by invoking the inpaint model sd_inpaint.tar.gz. To remove the tree, we need to pass a mask_image, which indicates which regions of the image should be retained and which should be filled in. The black pixel portion of the mask image indicates the regions that should remain unchanged, and the white pixels indicate what should be replaced.

image = encode_image(original_image).decode("utf8")
mask_image = encode_image(Image.open("sample_images/bertrand-gabioud-mask.png")).decode("utf8")
inputs = dict(prompt = "building, facade, paint, windows",
              image=image,
              mask_image=mask_image,
              negative_prompt = "tree, obstruction, sky, clouds",
              gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=10))
              )
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_inpaint.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])

Images: original image, mask image, inpaint image.

In our final example, we downsize the original image that was generated earlier from its 512 x 512 resolution to 128 x 128. We then invoke the Stable Diffusion upscaler model to upscale the image back to 512 x 512. We use the same prompt to upscale the image as what we used to generate the initial image. While not necessary, providing a prompt that describes the image helps guide the upscaling process and should lead to better results.

low_res_image = output_image.resize((128, 128))
inputs = dict(prompt = "Infinity pool on top of a high rise overlooking Central Park",
             image=encode_image(low_res_image).decode("utf8")
)

payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}

response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_upscale.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
upscaled_image = decode_image(output[0]["data"][0])

Images: low-resolution image, upscaled image.

Although the upscaled image is not as detailed as the original, it’s a marked improvement over the low-resolution one.

Optimize for memory and speed

The xformers library speeds up image generation and lowers VRAM usage. This optimization is only available for NVIDIA GPUs. We use xformers for memory-efficient attention: when the enable_xformers_memory_efficient_attention option is enabled, you should observe lower GPU memory usage and a potential speedup at inference time.

Clean up

Follow the instructions in the clean up section of the notebook to delete the resources provisioned as part of this post and avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details on the cost of the inference instances.

Conclusion

In this post, we discussed Stable Diffusion models and how you can deploy different versions of Stable Diffusion models cost-effectively using SageMaker multi-model endpoints. You can use this approach to build a creator image generation and editing tool. Check out the code samples in the GitHub repo to get started and let us know about the cool generative AI tool that you build.


About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and architecture to build and deploy ML applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.


Barkour: Benchmarking animal-level agility with quadruped robots

Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and efficiently, agility allows legged robots to move through complex environments that are otherwise difficult to traverse. Researchers at Google have been pursuing agility for multiple years and across various form factors. Yet, while researchers have enabled robots to hike or jump over some obstacles, there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. In contrast, benchmarks are driving forces behind the development of machine learning, such as ImageNet for computer vision, and OpenAI Gym for reinforcement learning (RL).

In “Barkour: Benchmarking Animal-level Agility with Quadruped Robots”, we introduce the Barkour agility benchmark for quadruped robots, along with a Transformer-based generalist locomotion policy. Inspired by dog agility competitions, a legged robot must sequentially display a variety of skills, including moving in different directions, traversing uneven terrains, and jumping over obstacles within a limited timeframe to successfully complete the benchmark. By providing a diverse and challenging obstacle course, the Barkour benchmark encourages researchers to develop locomotion controllers that move fast in a controllable and versatile way. Furthermore, by tying the performance metric to real dog performance, we provide an intuitive metric to understand the robot performance with respect to their animal counterparts.

We invited a handful of dooglers to try the obstacle course to ensure that our agility objectives were realistic and challenging. Small dogs complete the obstacle course in approximately 10s, whereas our robot’s typical performance hovers around 20s.

Barkour benchmark

The Barkour scoring system uses per-obstacle and overall course target times based on the target speed of small dogs in novice agility competitions (about 1.7 m/s). Barkour scores range from 0 to 1, with 1 corresponding to the robot successfully traversing all the obstacles along the course within the allotted time of approximately 10 seconds, the average time needed for a similar-sized dog to traverse the course. The robot receives penalties for skipping, failing obstacles, or moving too slowly.
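To illustrate the shape of such a score, here is a minimal sketch. The post does not state the penalty sizes, so obstacle_penalty and overtime_penalty_per_s below are placeholder values chosen purely for illustration, and the function name barkour_score is hypothetical. Only the 0 to 1 range, the roughly 10-second target, and the three penalty causes come from the description above.

```python
def barkour_score(failed_or_skipped, run_time_s, target_time_s=10.0,
                  obstacle_penalty=0.1, overtime_penalty_per_s=0.01):
    """Start from 1.0 and subtract penalties (illustrative values only).

    failed_or_skipped: number of obstacles that were skipped or failed.
    run_time_s: total time the robot needed for the course, in seconds.
    """
    overtime_s = max(0.0, run_time_s - target_time_s)
    score = (1.0
             - obstacle_penalty * failed_or_skipped
             - overtime_penalty_per_s * overtime_s)
    # Keep the score inside the documented 0 to 1 range.
    return max(0.0, min(1.0, score))


print(barkour_score(failed_or_skipped=0, run_time_s=9.5))
print(barkour_score(failed_or_skipped=1, run_time_s=20.0))
```

A clean run within the target time scores 1, and any failed obstacle or overtime lowers the score under these assumed penalties.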

Our standard course consists of four unique obstacles in a 5m x 5m area. This is a denser and smaller setup than a typical dog competition to allow for easy deployment in a robotics lab. Beginning at the start table, the robot needs to weave through a set of poles, climb an A-frame, clear a 0.5m broad jump and then step onto the end table. We chose this subset of obstacles because they test a diverse set of skills while keeping the setup within a small footprint. As is the case for real dog agility competitions, the Barkour benchmark can be easily adapted to a larger course area and may incorporate a variable number of obstacles and course configurations.

Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables. The intuitive scoring mechanism, inspired by dog agility competitions, balances speed, agility and performance and can be easily modified to incorporate other types of obstacles or course configurations.

Learning agile locomotion skills

The Barkour benchmark features a diverse set of obstacles and a delayed reward system, which pose a significant challenge when training a single policy that can complete the entire obstacle course. So in order to set a strong performance baseline and demonstrate the effectiveness of the benchmark for robotic agility research, we adopt a student-teacher framework combined with a zero-shot sim-to-real approach. First, we train individual specialist locomotion skills (teacher) for different obstacles using on-policy RL methods. In particular, we leverage recent advances in large-scale parallel simulation to equip the robot with individual skills, including walking, slope climbing, and jumping policies.

Next, we train a single policy (student) that performs all the skills and transitions in between by using a student-teacher framework, based on the specialist skills we previously trained. We use simulation rollouts to create datasets of state-action pairs for each one of the specialist skills. This dataset is then distilled into a single Transformer-based generalist locomotion policy, which can handle various terrains and adjust the robot’s gait based on the perceived environment and the robot’s state.

During deployment, we pair the locomotion transformer policy that is capable of performing multiple skills with a navigation controller that provides velocity commands based on the robot’s position. Our trained policy controls the robot based on the robot’s surroundings represented as an elevation map, velocity commands, and on-board sensory information provided by the robot.

Deployment pipeline for the locomotion transformer architecture. At deployment time, a high-level navigation controller guides the real robot through the obstacle course by sending commands to the locomotion transformer policy.

Robustness and repeatability are difficult to achieve when we aim for peak performance and maximum speed. Sometimes, the robot might fail when overcoming an obstacle in an agile way. To handle failures we train a recovery policy that quickly gets the robot back on its feet, allowing it to continue the episode.

Evaluation

We evaluate the Transformer-based generalist locomotion policy using custom-built quadruped robots and show that by optimizing for the proposed benchmark, we obtain agile, robust, and versatile skills for our robot in the real world. We further provide analysis for various design choices in our system and their impact on the system performance.

Model of the custom-built robots used for evaluation.

We deploy both the specialist and generalist policies to hardware (zero-shot sim-to-real). The robot’s target trajectory is provided by a set of waypoints along the various obstacles. In the case of the specialist policies, we switch between specialist policies by using a hand-tuned policy switching mechanism that selects the most suitable policy given the robot’s position.

Typical performance of our agile locomotion policies on the Barkour benchmark. Our custom-built quadruped robot robustly navigates the terrain’s obstacles by leveraging various skills learned using RL in simulation.

We find that very often our policies can handle unexpected events or even hardware degradation resulting in good average performance, but failures are still possible. As illustrated in the image below, in case of failures, our recovery policy quickly gets the robot back on its feet, allowing it to continue the episode. By combining the recovery policy with a simple walk-back-to-start policy, we are able to run repeated experiments with minimal human intervention to measure the robustness.

Qualitative example of robustness and recovery behaviors. The robot trips and rolls over after heading down the A-frame. This triggers the recovery policy, which enables the robot to get back up and continue the course.

We find that across a large number of evaluations, the single generalist locomotion transformer policy and the specialist policies with the policy switching mechanism achieve similar performance. The locomotion transformer policy has a slightly lower average Barkour score, but exhibits smoother transitions between behaviors and gaits.

Measuring robustness of the different policies across a large number of runs on the Barkour benchmark.

Histogram of the agility scores for the locomotion transformer policy. The highest scores shown in blue (0.75 – 0.9) represent the runs where the robot successfully completes all obstacles.

Conclusion

We believe that developing a benchmark for legged robotics is an important first step in quantifying progress toward animal-level agility. To establish a strong baseline, we investigated a zero-shot sim-to-real approach, taking advantage of large-scale parallel simulation and recent advancements in training Transformer-based architectures. Our findings demonstrate that Barkour is a challenging benchmark that can be easily customized, and that our learning-based method for solving the benchmark provides a quadruped robot with a single low-level policy that can perform a variety of agile low-level skills.

Acknowledgments

The authors of this post are now part of Google DeepMind. We would like to thank our co-authors at Google DeepMind and our collaborators at Google Research: Wenhao Yu, J. Chase Kew, Tingnan Zhang, Daniel Freeman, Kuang-Hei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Yuheng Kuang, Edward Lee, Linda Luu, Ofir Nachum, Ken Oslund, Jason Powell, Diego Reyes, Francesco Romano, Feresteh Sadeghi, Ron Sloat, Baruch Tabanpour, Daniel Zheng, Michael Neunert, Raia Hadsell, Nicolas Heess, Francesco Nori, Jeff Seto, Carolina Parada, Vikas Sindhwani, Vincent Vanhoucke, and Jie Tan. We would also like to thank Marissa Giustina, Ben Jyenis, Gus Kouretas, Nubby Lee, James Lubin, Sherry Moore, Thinh Nguyen, Krista Reymann, Satoshi Kataoka, Trish Blazina, and the members of the robotics team at Google DeepMind for their contributions to the project.


Attend our first Developer Summit on Recommendation Systems

Posted by Wei Wei, Developer Advocate

Register for the Summit here!

Recommendation systems are everywhere. They power our favorite websites, apps, and services, helping us find the things we enjoy. But how do modern recommenders work? What are the key components and how do they fit together? How can we make them even better?

Since we launched our recommendation system landing page last year, we have heard a lot of positive feedback from our developer community. While many developers find the new consolidated page very useful to get started with our suite of products, they are also eager to learn more about how to best leverage them to build powerful in-house recommenders for their own business needs.

This is why we are very excited to announce our first-ever Developer Summit on Recommendation Systems (registration is open now). This event will be held online on Jun 9, 2023, 10AM – 12PM US Pacific Time, and it will bring together many of the Google engineers who authored our suite of products to share their insights and expertise in recommendation systems. At this summit, we will not only cover specific products (such as TensorFlow Recommenders, TensorFlow Ranking, and TensorFlow Agents) and share ideas on augmenting recommenders with Large Language Models (LLMs), but also discuss Google’s cutting-edge recommendation system research (e.g., generative retrieval using generative AI techniques).

This Developer Summit is the perfect event for anyone who wants to learn more about recommendation systems. Whether you’re just getting started or a seasoned practitioner in this exciting domain, you’re sure to find something valuable at this event.

We look forward to (virtually) meeting you there!

Differentially private clustering for large-scale datasets

Clustering is a central problem in unsupervised machine learning (ML) with many applications across domains in both industry and academic research more broadly. At its core, clustering consists of the following problem: given a set of data elements, the goal is to partition the data elements into groups such that similar objects are in the same group, while dissimilar objects are in different groups. This problem has been studied in math, computer science, operations research and statistics for more than 60 years in its myriad variants. Two common forms of clustering are metric clustering, in which the elements are points in a metric space, like in the k-means problem, and graph clustering, where the elements are nodes of a graph whose edges represent similarity among them.

In the k-means clustering problem, we are given a set of points in a metric space with the objective to identify k representative points, called centers (here depicted as triangles), so as to minimize the sum of the squared distances from each point to its closest center. Source, rights: CC-BY-SA-4.0
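
The k-means objective described in the caption can be written down in a few lines. The following is a minimal sketch in plain Python, assuming Euclidean distance; the function names and the example points are illustrative and not taken from any released code. It only evaluates the cost of a given set of centers and does not search for good centers.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum, over all points, of the squared distance to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Illustrative data: two well-separated groups and one center per group.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmeans_cost(points, centers)
```

A k-means algorithm looks for the k centers that make this cost as small as possible; using a single center for the same points would give a much larger cost.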

Despite the extensive literature on algorithm design for clustering, few practical works have focused on rigorously protecting the user’s privacy during clustering. When clustering is applied to personal data (e.g., the queries a user has made), it is necessary to consider the privacy implications of using a clustering solution in a real system and how much information the output solution reveals about the input data.

To ensure privacy in a rigorous sense, one solution is to develop differentially private (DP) clustering algorithms. These algorithms ensure that the output of the clustering does not reveal private information about a specific data element (e.g., whether a user has made a given query) or sensitive data about the input graph (e.g., a relationship in a social network). Given the importance of privacy protections in unsupervised machine learning, in recent years Google has invested in research on theory and practice of differentially private metric or graph clustering, and differential privacy in a variety of contexts, e.g., heatmaps or tools to design DP algorithms.
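
For reference, the standard textbook definition of differential privacy (not spelled out in this post) can be stated as follows. A randomized algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private if, for every pair of neighboring inputs $D$ and $D'$ and every set $S$ of possible outputs:

```latex
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{A}(D') \in S] + \delta
```

What counts as "neighboring" depends on the setting: for metric clustering it is typically two datasets that differ in a single data element, and for graph clustering with edge privacy it is two graphs that differ in a single edge.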

Today we are excited to announce two important updates: 1) a new differentially-private algorithm for hierarchical graph clustering, which we’ll be presenting at ICML 2023, and 2) the open-source release of the code of a scalable differentially-private k-means algorithm. This code brings differentially private k-means clustering to large scale datasets using distributed computing. Here, we will also discuss our work on clustering technology for a recent launch in the health domain for informing public health authorities.

Differentially private hierarchical clustering

Hierarchical clustering is a popular clustering approach that consists of recursively partitioning a dataset into clusters at an increasingly finer granularity. A well known example of hierarchical clustering is the phylogenetic tree in biology in which all life on Earth is partitioned into finer and finer groups (e.g., kingdom, phylum, class, order, etc.). A hierarchical clustering algorithm receives as input a graph representing the similarity of entities and learns such recursive partitions in an unsupervised way. Yet at the time of our research no algorithm was known to compute hierarchical clustering of a graph with edge privacy, i.e., preserving the privacy of the vertex interactions.

In “Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees”, we consider how well the problem can be approximated in a DP context and establish firm upper and lower bounds on the privacy guarantee. We design an approximation algorithm (the first of its kind) with a polynomial running time that achieves both an additive error that scales with the number of nodes n (of order n^2.5) and a multiplicative approximation of O(log^(1/2) n), with the multiplicative error identical to the non-private setting. We further provide a new lower bound on the additive error (of order n^2) for any private algorithm (irrespective of its running time) and provide an exponential-time algorithm that matches this lower bound. Moreover, our paper includes a beyond-worst-case analysis focusing on the hierarchical stochastic block model, a standard random graph model that exhibits a natural hierarchical clustering structure, and introduces a private algorithm that returns a solution with an additive cost over the optimum that is negligible for larger and larger graphs, again matching the non-private state-of-the-art approaches. We believe this work expands the understanding of privacy preserving algorithms on graph data and will enable new applications in such settings.

Large-scale differentially private clustering

We now switch gears and discuss our work on metric space clustering. Most prior work in DP metric clustering has focused on improving the approximation guarantees of the algorithms on the k-means objective, leaving scalability questions out of the picture. Indeed, it is not clear how efficient non-private algorithms such as k-means++ or k-means// can be made differentially private without drastically sacrificing either the approximation guarantees or the scalability. On the other hand, both scalability and privacy are of primary importance at Google. For this reason, we recently published multiple papers that address the problem of designing efficient differentially private algorithms for clustering that can scale to massive datasets. Our goal is, moreover, to offer scalability to large-scale input datasets, even when the target number of centers, k, is large.

We work in the massively parallel computation (MPC) model, which is a computation model representative of modern distributed computation architectures. The model consists of several machines, each holding only part of the input data, that work together with the goal of solving a global problem while minimizing the amount of communication between machines. We present a differentially private constant factor approximation algorithm for k-means that only requires a constant number of rounds of synchronization. Our algorithm builds upon our previous work on the problem (with code available here), which was the first differentially-private clustering algorithm with provable approximation guarantees that can work in the MPC model.

The DP constant factor approximation algorithm drastically improves on the previous work using a two-phase approach. In the initial phase, it computes a crude approximation to “seed” the second phase, which consists of a more sophisticated distributed algorithm. Equipped with the first-phase approximation, the second phase relies on results from the coreset literature to subsample a relevant set of input points and find a good differentially private clustering solution for these sampled points. We then prove that this solution generalizes with approximately the same guarantee to the entire input.

Vaccination search insights via DP clustering

We then apply these advances in differentially private clustering to real-world applications. One example is the use of our differentially private clustering solution to publish COVID vaccine-related queries while providing strong privacy protections for the users.

The goal of Vaccination Search Insights (VSI) is to help public health decision makers (health authorities, government agencies and nonprofits) identify and respond to communities’ information needs regarding COVID vaccines. In order to achieve this, the tool allows users to explore at different geolocation granularities (zip-code, county and state level in the U.S.) the top themes searched by users regarding COVID queries. In particular, the tool visualizes statistics on trending queries rising in interest in a given locale and time.

Screenshot of the output of the tool. Displayed on the left, the top searches related to Covid vaccines during the period Oct 10-16 2022. On the right, the queries that have had rising importance during the same period and compared to the previous week.

To help identify the themes of the trending searches, the tool clusters the search queries based on their semantic similarity. This is done by applying a custom-designed k-means–based algorithm run over search data that has been anonymized using the DP Gaussian mechanism to add noise and remove low-count queries (thus resulting in a differentially private clustering). The method ensures strong differential privacy guarantees for the protection of the user data.
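
The "add noise and remove low-count queries" step can be illustrated with a small sketch. This is not the VSI implementation: the function name, the `sigma` and `threshold` parameters, and the example counts are all hypothetical, and the sketch makes no privacy claim by itself. In a real system the noise scale and the threshold have to be calibrated to the sensitivity of the counts and to the target privacy parameters.

```python
import random
from typing import Dict, Optional


def noisy_thresholded_counts(
    counts: Dict[str, int],
    sigma: float,
    threshold: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Add Gaussian noise to each count and keep only queries whose noisy
    count reaches the threshold. sigma and threshold are placeholders that
    must be calibrated to the desired privacy guarantee."""
    rng = rng or random.Random()
    released = {}
    for query, count in counts.items():
        noisy = count + rng.gauss(0.0, sigma)
        if noisy >= threshold:
            released[query] = noisy
    return released


# Hypothetical per-query counts; the rare query is likely to be dropped.
example_counts = {"vaccine side effects": 1200, "very rare query": 2}
released = noisy_thresholded_counts(
    example_counts, sigma=5.0, threshold=50.0, rng=random.Random(0)
)
```

Dropping low-count queries matters because a query issued by only a handful of users could otherwise reveal that those users exist; the clustering then runs only on the noisy counts that survive.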

This tool provided fine-grained data on COVID vaccine perception in the population at unprecedented scales of granularity, something that is especially relevant to understanding the needs of the marginalized communities disproportionately affected by COVID. This project highlights the impact of our investment in research in differential privacy and unsupervised ML methods. We are looking to other important areas where we can apply these clustering techniques to help guide decision making around global health challenges, like search queries on climate change–related challenges such as air quality or extreme heat.

Acknowledgements

We thank our co-authors Silvio Lattanzi, Vahab Mirrokni, Andres Munoz Medina, Shyam Narayanan, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii, Peilin Zhong, and our colleagues from the Health AI team who made the VSI launch possible: Shailesh Bavadekar, Adam Boulanger, Tague Griffith, Mansi Kansal, Chaitanya Kamath, Akim Kumok, Yael Mayer, Tomer Shekel, Megan Shum, Charlotte Stanton, Mimi Sun, Swapnil Vispute, and Mark Young.

For more information on the Graph Mining team (part of Algorithm and Optimization) visit our pages.

Google Research at I/O 2023

Wednesday, May 10th was an exciting day for the Google Research community as we watched the results of months and years of our foundational and applied work get announced on the Google I/O stage. With the quick pace of announcements on stage, it can be difficult to convey the substantial effort and unique innovations that underlie the technologies we presented. So today, we’re excited to reveal more about the research efforts behind some of the many exciting announcements at this year’s I/O.


PaLM 2

Our next-generation large language model (LLM), PaLM 2, is built on advances in compute-optimal scaling, scaled instruction fine-tuning and an improved dataset mixture. By fine-tuning and instruction-tuning the model for different purposes, we have been able to integrate state-of-the-art capabilities into over 25 Google products and features, where it is already helping to inform, assist and delight users. For example:

  • Bard is an early experiment that lets you collaborate with generative AI and helps to boost productivity, accelerate ideas and fuel curiosity. It builds on advances in deep learning efficiency and leverages reinforcement learning from human feedback to provide more relevant responses and increase the model’s ability to follow instructions. Bard is now available in 180 countries, where users can interact with it in English, Japanese and Korean, and thanks to the multilingual capabilities afforded by PaLM 2, support for 40 languages is coming soon.
  • With Search Generative Experience we’re taking more of the work out of searching, so you’ll be able to understand a topic faster, uncover new viewpoints and insights, and get things done more easily. As part of this experiment, you’ll see an AI-powered snapshot of key information to consider, with links to dig deeper.
  • MakerSuite is an easy-to-use prototyping environment for the PaLM API, powered by PaLM 2. In fact, internal user engagement with early prototypes of MakerSuite accelerated the development of our PaLM 2 model itself. MakerSuite grew out of research focused on prompting tools, or tools explicitly designed for customizing and controlling LLMs. This line of research includes PromptMaker (precursor to MakerSuite), and AI Chains and PromptChainer (one of the first research efforts demonstrating the utility of LLM chaining).
  • Project Tailwind also made use of early research prototypes of MakerSuite to develop features to help writers and researchers explore ideas and improve their prose; its AI-first notebook prototype used PaLM 2 to allow users to ask questions of the model grounded in documents they define.
  • Med-PaLM 2 is our state-of-the-art medical LLM, built on PaLM 2. Med-PaLM 2 achieved 86.5% performance on U.S. Medical Licensing Exam–style questions, illustrating its exciting potential for health. We’re now exploring multimodal capabilities to synthesize inputs like X-rays.
  • Codey is a version of PaLM 2 fine-tuned on source code to function as a developer assistant. It supports a broad range of Code AI features, including code completions, code explanation, bug fixing, source code migration, error explanations, and more. Codey is available through our trusted tester program via IDEs (Colab, Android Studio, Duet AI for Cloud, Firebase) and via a 3P-facing API.

Perhaps even more exciting for developers, we have opened up the PaLM APIs & MakerSuite to provide the community opportunities to innovate using this groundbreaking technology.

PaLM 2 has advanced coding capabilities that enable it to find code errors and make suggestions in a number of different languages.

Imagen

Our Imagen family of image generation and editing models builds on advances in large Transformer-based language models and diffusion models. This family of models is being incorporated into multiple Google products, including:

  • Image generation in Google Slides and Android’s Generative AI wallpaper are powered by our text-to-image generation features.
  • Google Cloud’s Vertex AI enables image generation, image editing, image upscaling and fine-tuning to help enterprise customers meet their business needs.
  • I/O Flip, a digital take on a classic card game, features Google developer mascots on cards that were entirely AI generated. This game showcased a fine-tuning technique called DreamBooth for adapting pre-trained image generation models. Using just a handful of images as inputs for fine-tuning, it allows users to generate personalized images in minutes. With DreamBooth, users can synthesize a subject in diverse scenes, poses, views, and lighting conditions that don’t appear in the reference images.
    I/O Flip presents custom card decks designed using DreamBooth.

Phenaki

Phenaki, Google’s Transformer-based text-to-video generation model, was featured in the I/O pre-show. Phenaki is a model that can synthesize realistic videos from textual prompt sequences by leveraging two main components: an encoder-decoder model that compresses videos to discrete embeddings and a transformer model that translates text embeddings to video tokens.

ARCore and the Scene Semantic API

Among the new features of ARCore announced by the AR team at I/O, the Scene Semantic API can recognize pixel-wise semantics in an outdoor scene. This helps users create custom AR experiences based on the features in the surrounding area. This API is powered by an outdoor semantic segmentation model that leverages our recent work on the DeepLab architecture and an egocentric outdoor scene understanding dataset. The latest ARCore release also includes an improved monocular depth model that provides higher accuracy in outdoor scenes.

Scene Semantics API uses a DeepLab-based semantic segmentation model to provide accurate pixel-wise labels in an outdoor scene.

Chirp

Chirp is Google’s family of state-of-the-art Universal Speech Models trained on 12 million hours of speech to enable automatic speech recognition (ASR) for 100+ languages. The models can perform ASR on under-resourced languages, such as Amharic, Cebuano, and Assamese, in addition to widely spoken languages like English and Mandarin. Chirp is able to cover such a wide variety of languages by leveraging self-supervised learning on an unlabeled multilingual dataset with fine-tuning on a smaller set of labeled data. Chirp is now available in the Google Cloud Speech-to-Text API, allowing users to perform inference on the model through a simple interface. You can get started with Chirp here.

MusicLM

At I/O, we launched MusicLM, a text-to-music model that generates 20 seconds of music from a text prompt. You can try it yourself on AI Test Kitchen, or see it featured during the I/O preshow, where electronic musician and composer Dan Deacon used MusicLM in his performance.

MusicLM, which consists of models powered by AudioLM and MuLAN, can make music (from text, humming, images or video) and musical accompaniments to singing. AudioLM generates high quality audio with long-term consistency. It maps audio to a sequence of discrete tokens and casts audio generation as a language modeling task. To synthesize longer outputs efficiently, it uses a novel approach we’ve developed called SoundStorm.

Universal Translator dubbing

Our dubbing efforts leverage dozens of ML technologies to translate the full expressive range of video content, making videos accessible to audiences across the world. These technologies have been used to dub videos across a variety of products and content types, including educational content, advertising campaigns, and creator content, with more to come. We use deep learning technology to achieve voice preservation and lip matching and enable high-quality video translation. We’ve built this product to include human review for quality, safety checks to help prevent misuse, and we make it accessible only to authorized partners.

AI for global societal good

We are applying our AI technologies to solve some of the biggest global challenges, like mitigating climate change, adapting to a warming planet and improving human health and wellbeing. For example:

  • Traffic engineers use our Green Light recommendations to reduce stop-and-go traffic at intersections and improve the flow of traffic in cities from Bangalore to Rio de Janeiro and Hamburg. Green Light models each intersection, analyzing traffic patterns to develop recommendations that make traffic lights more efficient — for example, by better synchronizing timing between adjacent lights, or adjusting the “green time” for a given street and direction.
  • We’ve also expanded global coverage on the Flood Hub to 80 countries, as part of our efforts to predict riverine floods and alert people who are about to be impacted before disaster strikes. Our flood forecasting efforts rely on hydrological models informed by satellite observations, weather forecasts and in-situ measurements.

Technologies for inclusive and fair ML applications

With our continued investment in AI technologies, we are emphasizing responsible AI development with the goal of making our models and tools useful and impactful while also ensuring fairness, safety and alignment with our AI Principles. Some of these efforts were highlighted at I/O, including:

  • The release of the Monk Skin Tone Examples (MST-E) Dataset to help practitioners gain a deeper understanding of the MST scale and train human annotators for more consistent, inclusive, and meaningful skin tone annotations. You can read more about this and other developments on our website. This is an advancement on the open source release of the Monk Skin Tone (MST) Scale we launched last year to enable developers to build products that are more inclusive and that better represent their diverse users.
  • A new Kaggle competition (open until August 10th) in which the ML community is tasked with creating a model that can quickly and accurately identify American Sign Language (ASL) fingerspelling (where each letter of a word is spelled out in ASL rapidly using a single hand, rather than using the specific signs for entire words) and translate it into written text. Learn more about the fingerspelling Kaggle competition, which features a song from Sean Forbes, a deaf musician and rapper. We also showcased at I/O how the winning algorithm from the prior year’s competition powers PopSign, an ASL learning app for parents of deaf or hard of hearing children created by Georgia Tech and Rochester Institute of Technology (RIT).

Building the future of AI together

It’s inspiring to be part of a community of so many talented individuals who are leading the way in developing state-of-the-art technologies, responsible AI approaches and exciting user experiences. We are in the midst of a period of incredible and transformative change for AI. Stay tuned for more updates about the ways in which the Google Research community is boldly exploring the frontiers of these technologies and using them responsibly to benefit people’s lives around the world. We hope you’re as excited as we are about the future of AI technologies and we invite you to engage with our teams through the references, sites and tools that we’ve highlighted here.

Build a powerful question answering bot with Amazon SageMaker, Amazon OpenSearch Service, Streamlit, and LangChain

One of the most common applications of generative AI and large language models (LLMs) in an enterprise environment is answering questions based on the enterprise’s knowledge corpus. Amazon Lex provides the framework for building AI-based chatbots. Pre-trained foundation models (FMs) perform well at natural language understanding (NLU) tasks such as summarization, text generation and question answering on a broad variety of topics, but they either struggle to provide accurate (without hallucinations) answers or completely fail at answering questions about content that they haven’t seen as part of their training data. Furthermore, FMs are trained with a point-in-time snapshot of data and have no inherent ability to access fresh data at inference time; without this ability they might provide responses that are potentially incorrect or inadequate.

A commonly used approach to address this problem is to use a technique called Retrieval Augmented Generation (RAG). In the RAG-based approach we convert the user question into vector embeddings using an LLM and then do a similarity search for these embeddings in a pre-populated vector database holding the embeddings for the enterprise knowledge corpus. A small number of similar documents (typically three) is added as context along with the user question to the “prompt” provided to another LLM and then that LLM generates an answer to the user question using information provided as context in the prompt. RAG models were introduced by Lewis et al. in 2020 as a model where parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. To understand the overall structure of a RAG-based approach, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.
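
The retrieval and prompt-construction steps of RAG can be sketched without any external services. In the following minimal example, `embed` is a toy word-count stand-in for a real embedding model, the in-memory list stands in for a vector database, and the prompt wording is made up; none of these are the components used later in this post. The final call to a text-generation LLM is left out.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: List[str], k: int = 3) -> List[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: List[str]) -> str:
    """Combine the retrieved documents and the question into one prompt."""
    context = "\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


docs = [
    "SageMaker hosts machine learning models",
    "OpenSearch stores vector embeddings",
    "Streamlit builds web frontends",
]
question = "Where are embeddings stored?"
prompt = build_prompt(question, retrieve(question, docs, k=1))
```

In a real deployment the prompt would then be sent to the text-generation model, which answers using the retrieved context rather than only its training data.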

In this post we provide a step-by-step guide with all the building blocks for creating an enterprise-ready RAG application such as a question answering bot. We use a combination of different AWS services, open-source foundation models (FLAN-T5 XXL for text generation and GPT-J-6B for embeddings) and packages such as LangChain for interfacing with all the components and Streamlit for building the bot frontend.

We provide an AWS CloudFormation template to stand up all the resources required for building this solution. We then demonstrate how to use LangChain for tying everything together:

  • Interfacing with LLMs hosted on Amazon SageMaker.
  • Chunking of knowledge base documents.
  • Ingesting document embeddings into Amazon OpenSearch Service.
  • Implementing the question answering task.

We can use the same architecture to swap the open-source models with the Amazon Titan models. After Amazon Bedrock launches, we will publish a follow-up post showing how to implement similar generative AI applications using Amazon Bedrock, so stay tuned.

Solution overview

We use the SageMaker docs as the knowledge corpus for this post. We convert the HTML pages on this site into smaller overlapping chunks of information (to retain some context continuity between chunks), convert these chunks into embeddings using the gpt-j-6b model, and store the embeddings in OpenSearch Service. We implement the RAG functionality inside an AWS Lambda function, with Amazon API Gateway routing all requests to the Lambda function. We implement a chatbot application in Streamlit that invokes the function via the API Gateway, and the function does a similarity search in the OpenSearch Service index for the embeddings of the user question. The matching documents (chunks) are added to the prompt as context by the Lambda function, and then the function uses the flan-t5-xxl model deployed as a SageMaker endpoint to generate an answer to the user question. All the code for this post is available in the GitHub repo.
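
To make the idea of overlapping chunks concrete, here is a minimal character-based sketch. The post itself uses a LangChain text splitter for this step; the `chunk_text` function, its parameters, and the tiny example string below are illustrative stand-ins only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters, where each
    chunk repeats the last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Because consecutive chunks share some text, a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing slightly more embeddings.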

The following figure represents the high-level architecture of the proposed solution.

Figure 1: Architecture

Step-by-step explanation:

  1. The User provides a question via the Streamlit web application.
  2. The Streamlit application invokes the API Gateway endpoint REST API.
  3. The API Gateway invokes the Lambda function.
  4. The function invokes the SageMaker endpoint to convert the user question into embeddings.
  5. The function invokes an OpenSearch Service API to find documents similar to the user question.
  6. The function creates a “prompt” with the user query and the “similar documents” as context and asks the SageMaker endpoint to generate a response.
  7. The response is provided from the function to the API Gateway.
  8. The API Gateway provides the response to the Streamlit application.
  9. The User is able to view the response on the Streamlit application.
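
Step 2 of the flow above, where the frontend calls the REST API, might look roughly like the following sketch. The endpoint URL and the `question` field in the JSON payload are hypothetical placeholders, not the actual API contract of this solution; see the GitHub repo for the real request format. The request object is only constructed here, not sent.

```python
import json
import urllib.request


def build_question_request(api_endpoint: str, question: str) -> urllib.request.Request:
    """Build a POST request carrying the user question as JSON.
    The payload schema ({"question": ...}) is an assumption for illustration."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=api_endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint; in practice this would be the API Gateway URL.
request = build_question_request("https://example.invalid/ask", "What is SageMaker?")
```

Passing the request to `urllib.request.urlopen` would send it; the response body would then be displayed by the frontend (steps 8 and 9).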

As illustrated in the architecture diagram, we use the following AWS services:

In terms of open-source packages used in this solution, we use LangChain for interfacing with OpenSearch Service and SageMaker, and FastAPI for implementing the REST API interface in the Lambda.

The workflow for instantiating the solution presented in this post in your own AWS account is as follows:

  1. Run the CloudFormation template provided with this post in your account. This will create all the necessary infrastructure resources needed for this solution:
    • SageMaker endpoints for the LLMs
    • OpenSearch Service cluster
    • API Gateway
    • Lambda function
    • SageMaker Notebook
    • IAM roles
  2. Run the data_ingestion_to_vectordb.ipynb notebook in the SageMaker notebook to ingest data from SageMaker docs into an OpenSearch Service index.
  3. Run the Streamlit application on a terminal in Studio and open the URL for the application in a new browser tab.
  4. Ask your questions about SageMaker via the chat interface provided by the Streamlit app and view the responses generated by the LLM.

These steps are discussed in detail in the following sections.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with LLMs, OpenSearch Service and SageMaker.

We need access to accelerated instances (GPUs) for hosting the LLMs. This solution uses one instance each of ml.g5.12xlarge and ml.g5.24xlarge; you can check the availability of these instances in your AWS account and request these instances as needed via a Service Quota increase request as shown in the following screenshot.

Figure 2: Service Quota Increase Request

Use AWS CloudFormation to create the solution stack

We use AWS CloudFormation to create a SageMaker notebook called aws-llm-apps-blog and an IAM role called LLMAppsBlogIAMRole. Choose Launch Stack for the Region you want to deploy resources to. All parameters needed by the CloudFormation template have default values already filled in, except for the OpenSearch Service password, which you have to provide. Make a note of the OpenSearch Service username and password; we use those in subsequent steps. This template takes about 15 minutes to complete.

AWS Region Link
us-east-1
us-west-2
eu-west-1
ap-northeast-1

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the values for OpenSearchDomainEndpoint and LLMAppAPIEndpoint. We use those in the subsequent steps.
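
If you prefer to read these values programmatically rather than from the console, the stack outputs can be pulled out of a CloudFormation `describe_stacks` response. The helper below is a small sketch that works on the response dictionary; the sample response and its values are made-up placeholders, and only the two output keys come from this post.

```python
from typing import Any, Dict


def stack_outputs(describe_stacks_response: Dict[str, Any]) -> Dict[str, str]:
    """Map OutputKey -> OutputValue for the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# With boto3 installed and credentials configured, the response would come from:
#   boto3.client("cloudformation").describe_stacks(StackName="<your stack name>")
# Placeholder response with the same shape, using made-up values:
sample_response = {
    "Stacks": [
        {
            "Outputs": [
                {"OutputKey": "OpenSearchDomainEndpoint", "OutputValue": "search-example.invalid"},
                {"OutputKey": "LLMAppAPIEndpoint", "OutputValue": "https://api-example.invalid/prod"},
            ]
        }
    ]
}
outputs = stack_outputs(sample_response)
```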

Figure 3: CloudFormation Stack Outputs

Ingest the data into OpenSearch Service

To ingest the data, complete the following steps:

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Select the notebook aws-llm-apps-blog and choose Open JupyterLab.

    Figure 4: Open JupyterLab

  3. Choose data_ingestion_to_vectordb.ipynb to open it in JupyterLab. This notebook will ingest the SageMaker docs to an OpenSearch Service index called llm_apps_workshop_embeddings.

    Figure 5: Open Data Ingestion Notebook

  4. When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook. This will download the dataset locally into the notebook and then ingest it into the OpenSearch Service index. This notebook takes about 20 minutes to run. The notebook also ingests the data into another vector database called FAISS. The FAISS index files are saved locally and then uploaded to Amazon Simple Storage Service (S3) so that they can optionally be used by the Lambda function as an illustration of using an alternate vector database.

    Figure 6: Notebook Run All Cells

Now we’re ready to split the documents into chunks, which can then be converted into embeddings to be ingested into OpenSearch. We use the LangChain RecursiveCharacterTextSplitter class to chunk the documents and then use the LangChain SagemakerEndpointEmbeddingsJumpStart class to convert these chunks into embeddings using the gpt-j-6b LLM. We store the embeddings in OpenSearch Service via the LangChain OpenSearchVectorSearch class. We package this code into Python scripts that are provided to the SageMaker Processing Job via a custom container. See the data_ingestion_to_vectordb.ipynb notebook for the full code.

  1. Create a custom container, then install in it the LangChain and opensearch-py Python packages.
  2. Upload this container image to Amazon Elastic Container Registry (ECR).
  3. We use the SageMaker ScriptProcessor class to create a SageMaker Processing job that will run on multiple nodes.
    • The data files available in Amazon S3 are automatically distributed across the SageMaker Processing job instances by setting s3_data_distribution_type='ShardedByS3Key' as part of the ProcessingInput provided to the processing job.
    • Each node processes a subset of the files, which brings down the overall time required to ingest the data into OpenSearch Service.
    • Each node also uses Python multiprocessing to parallelize the file processing internally. Therefore, there are two levels of parallelization: one at the cluster level, where the work (files) is distributed across individual nodes, and another at the node level, where the files assigned to a node are split between multiple processes running on that node.
      # set up the ScriptProcessor with the above parameters
      processor = ScriptProcessor(base_job_name=base_job_name,
                                  image_uri=image_uri,
                                  role=aws_role,
                                  instance_type=instance_type,
                                  instance_count=instance_count,
                                  command=["python3"],
                                  tags=tags)
      
      # setup input from S3, note the ShardedByS3Key, this ensures that 
      # each instance gets a random and equal subset of the files in S3.
      inputs = [ProcessingInput(source=f"s3://{bucket}/{app_name}/{DOMAIN}",
                                destination='/opt/ml/processing/input_data',
                                s3_data_distribution_type='ShardedByS3Key',
                                s3_data_type='S3Prefix')]
      
      
      logger.info(f"creating an opensearch index with name={opensearch_index}")
      # ready to run the processing job
      st = time.time()
      processor.run(code="container/load_data_into_opensearch.py",
                    inputs=inputs,
                    outputs=[],
                    arguments=["--opensearch-cluster-domain", opensearch_domain_endpoint,
                              "--opensearch-secretid", os_creds_secretid_in_secrets_manager,
                              "--opensearch-index-name", opensearch_index,
                              "--aws-region", aws_region,
                              "--embeddings-model-endpoint-name", embeddings_model_endpoint_name,
                              "--chunk-size-for-doc-split", str(CHUNK_SIZE_FOR_DOC_SPLIT),
                              "--chunk-overlap-for-doc-split", str(CHUNK_OVERLAP_FOR_DOC_SPLIT),
                              "--input-data-dir", "/opt/ml/processing/input_data",
                              "--create-index-hint-file", CREATE_OS_INDEX_HINT_FILE,
                              "--process-count", "2"])

  4. Close the notebook after all cells run without any error. Your data is now available in OpenSearch Service. Enter the following URL in your browser’s address bar to get a count of documents in the llm_apps_workshop_embeddings index, using the OpenSearch Service domain endpoint from the CloudFormation stack outputs. You’ll be prompted for the OpenSearch Service username and password; these are available from the CloudFormation stack.
    https://your-opensearch-domain-endpoint/llm_apps_workshop_embeddings/_count

The browser window should show an output similar to the following. This output shows that 5,667 documents were ingested into the llm_apps_workshop_embeddings index.

    {"count":5667,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
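
The two levels of parallelization used by the processing job in step 3 can be sketched in plain Python. This is an illustration only: the round-robin assignment is an assumed stand-in for how ShardedByS3Key spreads objects across instances (SageMaker performs that step itself, before the script starts), and shard_files, process_file, process_shard and the file names are hypothetical.

```python
from multiprocessing import Pool
from typing import List


def shard_files(files: List[str], instance_count: int) -> List[List[str]]:
    """Level 1 (cluster): give each instance a disjoint subset of the files."""
    shards: List[List[str]] = [[] for _ in range(instance_count)]
    for i, name in enumerate(sorted(files)):
        shards[i % instance_count].append(name)
    return shards


def process_file(name: str) -> str:
    """Placeholder for the real per-file work (chunk, embed, index)."""
    return name.upper()


def process_shard(shard: List[str], process_count: int) -> List[str]:
    """Level 2 (node): split one instance's files across local processes."""
    with Pool(processes=process_count) as pool:
        return pool.map(process_file, shard)
```

With five input files and instance_count=2, shard_files gives one node three files and the other two. Each node would then call process_shard on its own shard, with process_count playing the role of the --process-count argument passed to the script above.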

Run the Streamlit application in Studio

Now we’re ready to run the Streamlit web application for our question answering bot. This application allows the user to ask a question and then fetches the answer via the /llm/rag REST API endpoint provided by the Lambda function.
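
As an illustration of what such a client call might look like, the following sketch builds a POST request for the /llm/rag endpoint using only the Python standard library. The field names q, max_matching_docs and verbose are inferred from the attributes that the Lambda handler reads later in this post; the helper name build_rag_request, the default values, the example endpoint, and the assumption that /llm/rag is appended directly to the base URL are all hypothetical and may differ from what webapp.py actually does.

```python
import json
import urllib.request


def build_rag_request(api_endpoint: str, question: str,
                      max_matching_docs: int = 3,
                      verbose: bool = False) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the RAG endpoint."""
    payload = {
        "q": question,
        "max_matching_docs": max_matching_docs,
        "verbose": verbose,
    }
    return urllib.request.Request(
        url=api_endpoint.rstrip("/") + "/llm/rag",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical endpoint; use the LLMAppAPIEndpoint stack output instead.
    req = build_rag_request("https://api.example.com/prod", "What is SageMaker?")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req).read()
```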

Studio provides a convenient platform to host the Streamlit web application. The following steps describe how to run the Streamlit app on Studio. Alternatively, you could follow the same procedure to run the app on your laptop.

  1. Open Studio and then open a new terminal.
  2. Run the following commands on the terminal to clone the code repository for this post and install the Python packages needed by the application:
    git clone https://github.com/aws-samples/llm-apps-workshop
    cd llm-apps-workshop/blogs/rag/app
    pip install -r requirements.txt

  3. The API Gateway endpoint URL that is available from the CloudFormation stack output needs to be set in the webapp.py file; the sed command below does this. Replace replace-with-LLMAppAPIEndpoint-value-from-cloudformation-stack-outputs in the shell commands with the value of the LLMAppAPIEndpoint field from the CloudFormation stack output, and then run the commands to start a Streamlit app on Studio.
    
    EP=replace-with-LLMAppAPIEndpoint-value-from-cloudformation-stack-outputs
    # replace __API_GW_ENDPOINT__ with the output from the CloudFormation stack
    sed -i "s|__API_GW_ENDPOINT__|$EP|g" webapp.py
    streamlit run webapp.py

  4. When the application runs successfully, you’ll see an output similar to the following (the IP addresses you see will be different from the ones shown in this example). Note the port number (typically 8501) from the output; you will use it as part of the URL for the app in the next step.
    sagemaker-user@studio$ streamlit run webapp.py 
    
    Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.
    
    You can now view your Streamlit app in your browser.
    
    Network URL: http://169.255.255.2:8501
    External URL: http://52.4.240.77:8501

  5. You can access the app in a new browser tab using a URL that is similar to your Studio domain URL. For example, if your Studio URL is https://d-randomidentifier.studio.us-east-1.sagemaker.aws/jupyter/default/lab? then the URL for your Streamlit app will be https://d-randomidentifier.studio.us-east-1.sagemaker.aws/jupyter/default/proxy/8501/webapp (notice that lab is replaced with proxy/8501/webapp). If the port number noted in the previous step is different from 8501 then use that instead of 8501 in the URL for the Streamlit app.
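
The URL rewrite in step 5 can also be expressed as a small helper. This is a sketch under the assumption that the Studio URL path ends in /lab, as in the example above; the function name streamlit_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def streamlit_url(studio_url: str, port: int = 8501) -> str:
    """Derive the Studio proxy URL for a Streamlit app from the JupyterLab URL."""
    parts = urlsplit(studio_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/lab"):
        raise ValueError("expected a Studio URL whose path ends in /lab")
    # Replace the trailing "lab" segment with "proxy/<port>/webapp".
    new_path = path[: -len("lab")] + f"proxy/{port}/webapp"
    # Drop the query string and fragment; keep scheme, host and path.
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


if __name__ == "__main__":
    print(streamlit_url(
        "https://d-randomidentifier.studio.us-east-1.sagemaker.aws/jupyter/default/lab?"
    ))
```

Pass the port number noted in step 4 as the port argument if it differs from 8501.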

The following screenshot shows the app with a couple of user questions.

Streamlit app

A closer look at the RAG implementation in the Lambda function

Now that we have the application working end to end, let’s take a closer look at the Lambda function. The Lambda function uses FastAPI to implement the REST API for RAG and the Mangum package to wrap the API with a handler that we package and deploy in the function. We use API Gateway to route all incoming requests to invoke the function, and we handle the routing internally within our application.

The following code snippet shows how we find documents in the OpenSearch index that are similar to the user question and then create a prompt by combining the question and the similar documents. This prompt is then provided to the LLM for generating an answer to the user question.

@router.post("/rag")
async def rag_handler(req: Request) -> Dict[str, Any]:
    # dump the received request for debugging purposes
    logger.info(f"req={req}")

    # initialize vector db and SageMaker Endpoint
    _init(req)

    # Use the vector db to find similar documents to the query
    # the vector db call would automatically convert the query text
    # into embeddings
    docs = _vector_db.similarity_search(req.q, k=req.max_matching_docs)
    logger.info(f"here are the {req.max_matching_docs} closest matching docs to the query=\"{req.q}\"")
    for d in docs:
        logger.info(f"---------")
        logger.info(d)
        logger.info(f"---------")

    # now that we have the matching docs, let's pack them as a context
    # into the prompt and ask the LLM to generate a response
    prompt_template = """Answer based on context:\n\n{context}\n\n{question}"""

    prompt = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    logger.info(f"prompt sent to llm = \"{prompt}\"")
    chain = load_qa_chain(llm=_sm_llm, prompt=prompt)
    answer = chain({"input_documents": docs, "question": req.q}, return_only_outputs=True)['output_text']
    logger.info(f"answer received from llm,\nquestion: \"{req.q}\"\nanswer: \"{answer}\"")
    resp = {'question': req.q, 'answer': answer}
    if req.verbose is True:
        resp['docs'] = docs

    return resp
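
To see what the LLM ends up receiving, the following plain-Python sketch reproduces the prompt assembly by hand. It assumes the chain behaves like a simple "stuff" strategy, that is, the text of the retrieved documents is joined into a single context string and substituted into the template. The helper name build_prompt, the separator, and the sample passages are assumptions made for illustration.

```python
from typing import List

PROMPT_TEMPLATE = "Answer based on context:\n\n{context}\n\n{question}"


def build_prompt(doc_texts: List[str], question: str,
                 separator: str = "\n\n") -> str:
    """Join the retrieved passages into one context and fill the template."""
    context = separator.join(doc_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    docs = [
        "SageMaker is a managed machine learning service.",
        "Endpoints serve models for real-time inference.",
    ]
    print(build_prompt(docs, "What is a SageMaker endpoint?"))
```

Because every retrieved passage is placed into one prompt, the number of matching documents and the chunk size together determine how close the prompt comes to the context limit of the LLM.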

Clean up

To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack as shown in the following screenshot.

Delete CloudFormation stack

Figure 7: Cleaning Up

Conclusion

In this post, we showed how to create an enterprise-ready RAG solution using a combination of AWS services, open-source LLMs, and open-source Python packages.

We encourage you to learn more by exploring JumpStart, Amazon Titan models, Amazon Bedrock, and OpenSearch Service and building a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, leave a comment.


About the Authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.
