Accounting for past imaging studies: Enhancing radiology AI and reporting

Accounting for past imaging studies: Enhancing radiology AI and reporting

The use of self-supervision from image-text pairs has been a key enabler in the development of scalable and flexible vision-language AI models in not only general domains but also in biomedical domains such as radiology. The goal in the radiology setting is to produce rich training signals without requiring manual labels so the models can learn to accurately recognize and locate findings in the images and relate them to content in radiology reports.

Radiologists use radiology reports to describe imaging findings and offer a clinical diagnosis or a range of possible diagnoses, all of which can be influenced by considering the findings on previous imaging studies. In fact, comparisons with previous images are crucial for radiologists to make informed decisions. These comparisons can provide valuable context for determining whether a condition is a new concern or improving, deteriorating, or stable if an existing condition and can inform more appropriate treatment recommendations. Despite the importance of comparisons, current AI solutions for radiology often fall short in aligning images with report data because of the lack of access to prior scans. Current AI solutions also typically fail to account for the chronological progression of disease or imaging findings often present in biomedical datasets. This can lead to ambiguity in the model training process and can be risky in downstream applications such as automated report generation, where models may make up temporal content without access to past medical scans. In short, this limits the real-world applicability of such AI models to empower caregivers and augment existing workflows.

In our previous work, we demonstrated that multimodal self-supervised learning of radiology images and reports can yield significant performance improvement in downstream applications of machine learning models, such as detecting the presence of medical conditions and localizing these findings within the images. In our latest study, which is being presented at the 2023 IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), we propose BioViL-T, a self-supervised training framework that further increases the data efficiency of this learning paradigm by leveraging the temporal structure present in biomedical datasets. This approach enables the incorporation of temporal information and has the potential to perform complementary self-supervision without the need for additional data, resulting in improved predictive performance.

Our proposed approach can handle missing or spatially misaligned images and can potentially scale to process a large number of prior images. By leveraging the existing temporal structure available in datasets, BioViL-T achieves state-of-the-art results on several downstream benchmarks. We’ve made both our models and source code open source, allowing for a comprehensive exploration and validation of the results discussed in our study. We’ve also released a new multimodal temporal benchmark dataset, MS-CXR-T, to support further research into longitudinal modeling of medical images and text data.

SPOTLIGHT: AI focus area

AI and Microsoft Research

Learn more about the breadth of AI research at Microsoft

Connecting the data points

Solving for the static case in vision-language processing—that is, learning with pairs of single images and captions—is a natural first step in advancing the field. So it’s not surprising that current biomedical vision-language processing work has largely focused on tasks that are dependent on features or abnormalities present at a single point in time—what is a patient’s current condition, and what is a likely diagnosis?—treating image-text pairs such as x-rays and corresponding reports in today’s datasets as independent data points. When prior imaging findings are referenced in reports, that information is often ignored or removed in the training process. Further, a lack of publicly available datasets containing longitudinal series of imaging examinations and reports has further challenged the incorporation of temporal information into medical imaging benchmarks.

Thanks to our early and close collaboration with practicing radiologists and our long-standing work with Nuance, a leading provider of AI solutions in the radiology space that was acquired by Microsoft in 2022, we’ve been able to better understand clinician workflow in the radiological imaging setting. That includes how radiology data is created, what its different components are, and how routinely radiologists refer to prior studies in the context of interpreting medical images. With these insights, we were able to identify temporal alignment of text across multiple images as a clinically significant research problem. To ground, or associate, report information such as “pleural effusion has improved compared to previous study” with the imaging modality requires access to the prior imaging study. We were able to tackle this challenge without gathering additional data or annotations.

As an innovative solution, we leveraged the metadata from de-identified public datasets like MIMIC-CXR. This metadata preserves the original order and intervals of studies, allowing us to connect various images over time and observe disease progression. Developing more data efficient and smart solutions in the healthcare space, where data sources are scarce, is important if we want to develop meaningful AI solutions.

An animated flowchart of BioViL-T. Arrows direct from a prior chest x-ray and current chest x-ray  through boxes labeled “CNN” to image embeddings, illustrated by a purple cube and a brown cube, respectively, representing relevant spatial and temporal features. An arrow points from these features through a box labeled “Vision Transformer Blocks” to a “difference embedding,” represented by a blue cube. A curly bracket pointing to a brown and blue cube labeled “image features” indicates the aggregation of the current image embedding and the difference embedding. Arrows from the “image features” cube and from an extract from a radiology report point to a text model, represented by box labeled “CXR-BERT.”
Figure 1: The proposed self-supervised training framework BioViL-T leverages pairs of radiology reports and sequences of medical images. The training scheme does not require manual expert labels and can scale to a large amount of radiology data to pretrain image and text models required for downstream clinical applications.

Addressing the challenges of longitudinal analysis

With current and prior images now available for comparison, the question became, how can a model reason about images coming from different time points? Radiological imaging, especially with planar techniques like radiographs, may show noticeable variation. This can be influenced by factors such as the patient’s posture during capture and the positioning of the device. Notably, these variations become more pronounced when images are taken with longer time gaps in between. To manage variations, current approaches to longitudinal analysis, largely used for fully supervised learning of image models only, require extensive preprocessing, such as image registration, a technique that attempts to align multiple images taken at different times from different viewpoints. In addition to better managing image variation, we wanted a framework that could be applied to cases in which prior images weren’t relevant or available and the task involved only one image.

We designed BioViL-T with these challenges in mind. Its main components are a multi-image encoder, consisting of both a vision transformer and a convolutional neural network (CNN), and a text encoder. As illustrated in Figure 1, in the multi-image encoder, each input image is first encoded with the CNN model to independently extract findings, such as opacities, present in each medical scan. Here, the CNN counteracts the large data demands of transformer-based architectures through its efficiency in extracting lower-level semantic features.

At the next stage, the features across time points are matched and compared in the vision transformer block, then aggregated into a single joint representation incorporating both current and historical radiological information. It’s important to note that the transformer architecture can adapt to either single- or multi-image scenarios, thereby better handling situations in which past images are unavailable, such as when there’s no relevant image history. Additionally, a cross-attention mechanism across image regions reduces the need for extensive preprocessing, addressing potential variations across images.

In the final stage, the multi-image encoder is jointly trained with the text encoder to match the image representations with their text counterparts using masked modeling and contrastive supervision techniques. To improve text representations and model supervision, we utilize the domain-specific text encoder CXR-BERT-general, which is pretrained on clinical text corpora and built on a clinical vocabulary.

Two chest x-rays side-by-side animated with bounding boxes and attention maps on the affected area of the lung.
Figure 2: Example of current (left) and prior (right) chest x-ray scans. The attention maps computed within the vision transformer show (in purple) how the model interprets disease progression by focusing on these image regions. In this example, the airspace disease seen in the left lung lobe has improved since the prior acquisition.

Grounded model prediction

In our work, we found that linking multiple images during pretraining makes for both better language and vision representations, enabling the AI model to better associate information present in both the text and the images. This means that when given a radiology report of a chest x-ray, for example, with the description “increased opacities in the left lower lung compared with prior examination,” a model can more accurately identify, locate, and compare findings, such as opacities. This improved alignment between data modalities is crucial because it allows the model to provide more accurate and relevant insights, such as identifying abnormalities in medical images, generating more accurate diagnostic reports, or tracking the progression of a disease over time.

Two findings were particularly insightful for us during our experimentation with BioViL-T:

  • Today’s language-generating AI models are often trained by masking portions of text and then prompting them to fill in the blanks as a means of encouraging the models to account for context in outputting a prediction. We extended the traditional masked language modeling (MLM) approach to be guided by multi-image context, essentially making the approach multimodal. This, in return, helped us better analyze whether BioViL-T was learning a progression based on provided images or making a random prediction of the masked words based solely on the text context. We gave the model radiology images and reports with progression-related language, such as “improving,” masked. An example input would be “pleural effusion has been [MASKED] since yesterday.” We then tasked the model with predicting the missing word(s) based on single and multi-image inputs. When provided with a single image, the model was unsuccessful in completing the task; however, when provided with a current and prior image, performance improved, demonstrating that the model is basing its prediction on the prior image.
  • Additionally, we found that training on prior images decreases instances of the generative AI model producing ungrounded outputs that seem plausible but are factually incorrect, in this case, when there’s a lack of information. Prior work into radiology report generation utilizes single input images, resulting in the model potentially outputting text that describes progression without having access to past scans. This severely limits the potential adoption of AI solutions in a high-stakes domain such as healthcare. A decrease in ungrounded outputs, however, could enable automated report generation or assistive writing in the future, which could potentially help reduce administrative duties and ease burnout in the healthcare community. Note that these models aren’t intended for any clinical use at the moment, but they’re important proof points to assess the capabilities of healthcare AI.

Moving longitudinal analysis forward

Through our relationships with practicing radiologists and Nuance, we were able to identify and concentrate on a clinically important research problem, finding that accounting for patient history matters if we want to develop AI solutions with value. To help the research community advance longitudinal analysis, we’ve released a new benchmark dataset. MS-CXR-T, which was curated by a board-certified radiologist, consists of current-prior image pairs of chest x-rays labeled with a state of progression for the temporal image classification task and pairs of sentences about disease progression that are either contradictory or capture the same assessment but are phrased differently for the sentence similarity task.

We focused on chest x-rays and lung diseases, but we see our work as having the potential to be extended into other medical imaging settings where analyzing images over time plays an important part in clinician decision-making, such as scenarios involving MRI or CT scans. However far the reach, it’s vital to ensure that models such as BioViL-T generalize well across different population groups and under the various conditions in which medical images are captured. This important part of the journey requires extensive benchmarking of models on unseen datasets. These datasets should widely vary in terms of acquisition settings, patient demographics, and disease prevalence. Another aspect of this work we look forward to exploring and monitoring is the potential role of general foundation models like GPT-4 in domain-specific foundation model training and the benefits of pairing larger foundation models with smaller specialized models such as BioViL-T.

To learn more and to access our text and image models and source code, visit the BioViL-T Hugging Face page and GitHub.

Acknowledgments

We’d like to thank our co-authors: Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Pérez-García, Maximilian Ilse, Daniel C. Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, Anton Schwaighofer, Maria Wetscherek, and Aditya Nori. We’d also like to thank Hoifung Poon, Melanie Bernhardt, Melissa Bristow, and Naoto Usuyama for their valuable technical feedback and Hannah Richardson for assisting with compliance reviews.

MEDICAL DEVICE DISCLAIMER

BioViL-T was developed for research purposes and is not designed, intended, or made available as a medical device and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment.

The post Accounting for past imaging studies: Enhancing radiology AI and reporting appeared first on Microsoft Research.

Read More

How Forethought saves over 66% in costs for generative AI models using Amazon SageMaker

How Forethought saves over 66% in costs for generative AI models using Amazon SageMaker

This post is co-written with Jad Chamoun, Director of Engineering at Forethought Technologies, Inc. and Salina Wu, Senior ML Engineer at Forethought Technologies, Inc.

Forethought is a leading generative AI suite for customer service. At the core of its suite is the innovative SupportGPT™ technology which uses machine learning to transform the customer support lifecycle—increasing deflection, improving CSAT, and boosting agent productivity. SupportGPT™ leverages state-of-the-art Information Retrieval (IR) systems and large language models (LLMs) to power over 30 million customer interactions annually.

SupportGPT’s primary use case is enhancing the quality and efficiency of customer support interactions and operations. By using state-of-the-art IR systems powered by embeddings and ranking models, SupportGPT can quickly search for relevant information, delivering accurate and concise answers to customer queries. Forethought uses per-customer fine-tuned models to detect customer intents in order to solve customer interactions. The integration of large language models helps humanize the interaction with automated agents, creating a more engaging and satisfying support experience.

SupportGPT also assists customer support agents by offering autocomplete suggestions and crafting appropriate responses to customer tickets that align with the company’s based on previous replies. By using advanced language models, agents can address customers’ concerns faster and more accurately, resulting in higher customer satisfaction.

Additionally, SupportGPT’s architecture enables detecting gaps in support knowledge bases, which helps agents provide more accurate information to customers. Once these gaps are identified, SupportGPT can automatically generate articles and other content to fill these knowledge voids, ensuring the support knowledge base remains customer-centric and up to date.

In this post, we share how Forethought uses Amazon SageMaker multi-model endpoints in generative AI use cases to save over 66% in cost.

Infrastructure challenges

To help bring these capabilities to market, Forethought efficiently scales its ML workloads and provides hyper-personalized solutions tailored to each customer’s specific use case. This hyper-personalization is achieved through fine-tuning embedding models and classifiers on customer data, ensuring accurate information retrieval results and domain knowledge that caters to each client’s unique needs. The customized autocomplete models are also fine-tuned on customer data to further enhance the accuracy and relevance of the responses generated.

One of the significant challenges in AI processing is the efficient utilization of hardware resources such as GPUs. To tackle this challenge, Forethought uses SageMaker multi-model endpoints (MMEs) to run multiple AI models on a single inference endpoint and scale. Because the hyper-personalization of models requires unique models to be trained and deployed, the number of models scales linearly with the number of clients, which can become costly.

To achieve the right balance of performance for real-time inference and cost, Forethought chose to use SageMaker MMEs, which support GPU acceleration. SageMaker MMEs enable Forethought to deliver high-performance, scalable, and cost-effective solutions with subsecond latency, addressing multiple customer support scenarios at scale.

SageMaker and Forethought

SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker MMEs provide a scalable and cost-effective solution for deploying a large number of models for real-time inference. MMEs use a shared serving container and a fleet of resources that can use accelerated instances such as GPUs to host all of your models. This reduces hosting costs by maximizing endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading and unloading models in memory and scaling them based on the endpoint’s traffic patterns. In addition, all SageMaker real-time endpoints benefit from built-in capabilities to manage and monitor models, such as including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).

As Forethought grew to host hundreds of models that also required GPU resources, we saw an opportunity to create a more cost-effective, reliable, and manageable architecture through SageMaker MMEs. Prior to migrating to SageMaker MMEs, our models were deployed on Kubernetes on Amazon Elastic Kubernetes Service (Amazon EKS). Although Amazon EKS provided management capabilities, it was immediately apparent that we were managing infrastructure that wasn’t specifically tailored for inference. Forethought had to manage model inference on Amazon EKS ourselves, which was a burden on engineering efficiency. For example, in order to share expensive GPU resources between multiple models, we were responsible for allocating rigid memory fractions to models that were specified during deployment. We wanted to address the following key problems with our existing infrastructure:

  • High cost – To ensure that each model had enough resources, we would be very conservative in how many models to fit per instance. This resulted in much higher costs for model hosting than necessary.
  • Low reliability – Despite being conservative in our memory allocation, not all models have the same requirements, and occasionally some models would throw out of memory (OOM) errors.
  • Inefficient management – We had to manage different deployment manifests for each type of model (such as classifiers, embeddings, and autocomplete), which was time-consuming and error-prone. We also had to maintain the logic to determine the memory allocation for different model types.

Ultimately, we needed an inference platform to take on the heavy lifting of managing our models at runtime to improve the cost, reliability, and the management of serving our models. SageMaker MMEs allowed us to address these needs.

Through its smart and dynamic model loading and unloading, and its scaling capabilities, SageMaker MMEs provided a significantly less expensive and more reliable solution for hosting our models. We are now able to fit many more models per instance and don’t have to worry about OOM errors because SageMaker MMEs handle loading and unloading models dynamically. In addition, deployments are now as simple as calling Boto3 SageMaker APIs and attaching the proper auto scaling policies.

The following diagram illustrates our legacy architecture.

To begin our migration to SageMaker MMEs, we identified the best use cases for MMEs and which of our models would benefit the most from this change. MMEs are best used for the following:

  • Models that are expected to have low latency but can withstand a cold start time (when it’s first loaded in)
  • Models that are called often and consistently
  • Models that need partial GPU resources
  • Models that share common requirements and inference logic

We identified our embeddings models and autocomplete language models as the best candidates for our migration. To organize these models under MMEs, we would create one MME per model type, or task, one for our embeddings models, and another for autocomplete language models.

We already had an API layer on top of our models for model management and inference. Our task at hand was to rework how this API was deploying and handling inference on models under the hood with SageMaker, with minimal changes to how clients and product teams interacted with the API. We also needed to package our models and custom inference logic to be compatible with NVIDIA Triton Inference Server using SageMaker MMEs.

The following diagram illustrates our new architecture.

Custom inference logic

Before migrating to SageMaker, Forethought’s custom inference code (preprocessing and postprocessing) ran in the API layer when a model was invoked. The objective was to transfer this functionality to the model itself to clarify the separation of responsibilities, modularize and simplify their code, and reduce the load on the API.

Embeddings

Forethought’s embedding models consist of two PyTorch model artifacts, and the inference request determines which model to call. Each model requires preprocessed text as input. The main challenges were integrating a preprocessing step and accommodating two model artifacts per model definition. To address the need for multiple steps in the inference logic, Forethought developed a Triton ensemble model with two steps: a Python backend preprocessing process and a PyTorch backend model call. Ensemble models allow for defining and ordering steps in the inference logic, with each step represented by a Triton model of any backend type. To ensure compatibility with the Triton PyTorch backend, the existing model artifacts were converted to TorchScript format. Separate Triton models were created for each model definition, and Forethought’s API layer was responsible for determining the appropriate TargetModel to invoke based on the incoming request.

Autocomplete

The autocomplete models (sequence to sequence) presented a distinct set of requirements. Specifically, we needed to enable the capability to loop through multiple model calls and cache substantial inputs for each call, all while maintaining low latency. Additionally, these models necessitated both preprocessing and postprocessing steps. To address these requirements and achieve the desired flexibility, Forethought developed autocomplete MME models utilizing the Triton Python backend, which offers the advantage of writing the model as Python code.

Benchmarking

After the Triton model shapes were determined, we deployed models to staging endpoints and conducted resource and performance benchmarking. Our main goal was to determine the latency for cold start vs in-memory models, and how latency was affected by request size and concurrency. We also wanted to know how many models could fit on each instance, how many models would cause the instances to scale up with our auto scaling policy, and how quickly the scale-up would happen. In keeping with the instance types we were already using, we did our benchmarking with ml.g4dn.xlarge and ml.g4dn.2xlarge instances.

Results

The following table summarizes our results.

Request Size Cold Start Latency Cached Inference Latency Concurrent Latency (5 requests)
Small (30 tokens) 12.7 seconds 0.03 seconds 0.12 seconds
Medium (250 tokens) 12.7 seconds 0.05 seconds 0.12 seconds
Large (550 tokens) 12.7 seconds 0.13 seconds 0.12 seconds

Noticeably, the latency for cold start requests is significantly higher than the latency for cached inference requests. This is because the model needs to be loaded from disk or Amazon Simple Storage Service (Amazon S3) when a cold start request is made. The latency for concurrent requests is also higher than the latency for single requests. This is because the model needs to be shared between concurrent requests, which can lead to contention.

The following table compares the latency of the legacy models and the SageMaker models.

Request Size Legacy Models SageMaker Models
Small (30 tokens) 0.74 seconds 0.24 seconds
Medium (250 tokens) 0.74 seconds 0.24 seconds
Large (550 tokens) 0.80 seconds 0.32 seconds

Overall, the SageMaker models are a better choice for hosting autocomplete models than the legacy models. They offer lower latency, scalability, reliability, and security.

Resource usage

In our quest to determine the optimal number of models that could fit on each instance, we conducted a series of tests. Our experiment involved loading models into our endpoints using an ml.g4dn.xlarge instance type, without any auto scaling policy.

These particular instances offer 15.5 GB of memory, and we aimed to achieve approximately 80% GPU memory usage per instance. Considering the size of each encoder model artifact, we managed to find the optimal number of Triton encoders to load on an instance to reach our targeted GPU memory usage. Furthermore, given that each of our embeddings models corresponds to two Triton encoder models, we were able to house a set number of embeddings models per instance. As a result, we calculated the total number of instances required to serve all our embeddings models. This experimentation has been crucial in optimizing our resource usage and enhancing the efficiency of our models.

We conducted similar benchmarking for our autocomplete models. These models were around 292.0 MB each. As we tested how many models would fit on a single ml.g4dn.xlarge instance, we noticed that we were only able to fit four models before our instance started unloading models, despite the models having a small size. Our main concerns were:

  • Cause for CPU memory utilization spiking
  • Cause for models getting unloaded when we tried to load in one more model instead of just the least recently used (LRU) model

We were able to pinpoint the root cause of the memory utilization spike coming from initializing our CUDA runtime environment in our Python model, which was necessary to move our models and data on and off the GPU device. CUDA loads many external dependencies into CPU memory when the runtime is initialized. Because the Triton PyTorch backend handles and abstracts away moving data on and off the GPU device, we didn’t run into this issue for our embedding models. To address this, we tried using ml.g4dn.2xlarge instances, which had the same amount of GPU memory but twice as much CPU memory. In addition, we added several minor optimizations in our Python backend code, including deleting tensors after use, emptying the cache, disabling gradients, and garbage collecting. With the larger instance type, we were able to fit 10 models per instance, and the CPU and GPU memory utilization became much more aligned.

The following diagram illustrates this architecture.

Auto scaling

We attached auto scaling policies to both our embeddings and autocomplete MMEs. Our policy for our embeddings endpoint targeted 80% average GPU memory utilization using custom metrics. Our autocomplete models saw a pattern of high traffic during business hours and minimal traffic overnight. Because of this, we created an auto scaling policy based on InvocationsPerInstance so that we could scale according to the traffic patterns, saving on cost without sacrificing reliability. Based on our resource usage benchmarking, we configured our scaling policies with a target of 225 InvocationsPerInstance.

Deploy logic and pipeline

Creating an MME on SageMaker is straightforward and similar to creating any other endpoint on SageMaker. After the endpoint is created, adding additional models to the endpoint is as simple as moving the model artifact to the S3 path that the endpoint targets; at this point, we can make inference requests to our new model.

We defined logic that would take in model metadata, format the endpoint deterministically based on the metadata, and check whether the endpoint existed. If it didn’t, we create the endpoint and add the Triton model artifact to the S3 patch for the endpoint (also deterministically formatted). For example, if the model metadata indicated that it is an autocomplete model, it would create an endpoint for auto-complete models and an associated S3 path for auto-complete model artifacts. If the endpoint existed, we would copy the model artifact to the S3 path.

Now that we had our model shapes for our MME models and the functionality for deploying our models to MME, we needed a way to automate the deployment. Our users must specify which model they want to deploy; we handle packaging and deployment of the model. The custom inference code packaged with the model is versioned and pushed to Amazon S3; in the packaging step, we pull the inference code according to the version specified (or the latest version) and use YAML files that indicate the file structures of the Triton models.

One requirement for us was that all of our MME models would be loaded into memory to avoid any cold start latency during production inference requests to load in models. To achieve this, we provision enough resources to fit all our models (according to the preceding benchmarking) and call every model in our MME at an hourly cadence.

The following diagram illustrates the model deployment pipeline.

The following diagram illustrates the model warm-up pipeline.

Model invocation

Our existing API layer provides an abstraction for callers to make inference on all of our ML models. This meant we only had to add functionality to the API layer to call the SageMaker MME with the correct target model depending on the inference request, without any changes to the calling code. The SageMaker inference code takes the inference request, formats the Triton inputs defined in our Triton models, and invokes the MMEs using Boto3.

Cost benefits

Forethought made significant strides in reducing model hosting costs and mitigating model OOM errors, thanks to the migration to SageMaker MMEs. Before this change, ml.g4dn.xlarge instances running in Amazon EKS. With the transition to MMEs, we discovered it could house 12 embeddings models per instance while achieving 80% GPU memory utilization. This led to a significant decline in our monthly expenses. To put it in perspective, we realized a cost saving of up to 80%. Moreover, to manage higher traffic, we considered scaling up the replicas. Assuming a scenario where we employ three replicas, we found that our cost savings would still be substantial even under these conditions, hovering around 43%.

The journey with SageMaker MMEs has proven financially beneficial, reducing our expenses while ensuring optimal model performance. Previously, our autocomplete language models were deployed in Amazon EKS, necessitating a varying number of ml.g4dn.xlarge instances based on the memory allocation per model. This resulted in a considerable monthly cost. However, with our recent migration to SageMaker MMEs, we’ve been able to reduce these costs substantially. We now host all our models on ml.g4dn.2xlarge instances, giving us the ability to pack models more efficiently. This has significantly trimmed our monthly expenses, and we’ve now realized cost savings in the 66–74% range. This move has demonstrated how efficient resource utilization can lead to significant financial savings using SageMaker MMEs.

Conclusion

In this post, we reviewed how Forethought uses SageMaker multi-model endpoints to decrease cost for real-time inference. SageMaker takes on the undifferentiated heavy lifting, so Forethought can increase engineering efficiency. It also allows Forethought to dramatically lower the cost for real-time inference while maintaining the performance needed for the business-critical operations. By doing so, Forethought is able to provide a differentiated offering for their customers using hyper-personalized models. Use SageMaker MME to host your models at scale and reduce hosting costs by improving endpoint utilization. It also reduces deployment overhead because Amazon SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint. You can find code samples on hosting multiple models using SageMaker MME on GitHub.


About the Authors

Jad Chamoun is a Director of Core Engineering at Forethought. His team focuses on platform engineering covering Data Engineering, Machine Learning Infrastructure, and Cloud Infrastructure.  You can find him on LinkedIn.

Salina Wu is a Sr. Machine Learning Infrastructure engineer at Forethought.ai. She works closely with the Machine Learning team to build and maintain their end-to-end training, serving, and data infrastructures. She is particularly motivated by introducing new ways to improve efficiency and reduce cost across the ML space. When not at work, Salina enjoys surfing, pottery, and being in nature.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In h is spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.You can find him on LinkedIn.

Sunil Padmanabhan is a Startup Solutions Architect at AWS. As a former startup founder and CTO, he is passionate about machine learning and focuses on helping startups leverage AI/ML for their business outcomes and design and deploy ML/AI solutions at scale.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Read More

Reinventing the data experience: Use generative AI and modern data architecture to unlock insights

Reinventing the data experience: Use generative AI and modern data architecture to unlock insights

Implementing a modern data architecture provides a scalable method to integrate data from disparate sources. By organizing data by business domains instead of infrastructure, each domain can choose tools that suit their needs. Organizations can maximize the value of their modern data architecture with generative AI solutions while innovating continuously.

The natural language capabilities allow non-technical users to query data through conversational English rather than complex SQL. However, realizing the full benefits requires overcoming some challenges. The AI and language models must identify the appropriate data sources, generate effective SQL queries, and produce coherent responses with embedded results at scale. They also need a user interface for natural language questions.

Overall, implementing a modern data architecture and generative AI techniques with AWS is a promising approach for gleaning and disseminating key insights from diverse, expansive data at an enterprise scale. The latest offering for generative AI from AWS is Amazon Bedrock, which is a fully managed service and the easiest way to build and scale generative AI applications with foundation models. AWS also offers foundation models through Amazon SageMaker JumpStart as Amazon SageMaker endpoints. The combination of large language models (LLMs), including the ease of integration that Amazon Bedrock offers, and a scalable, domain-oriented data infrastructure positions this as an intelligent method of tapping into the abundant information held in various analytics databases and data lakes.

In the post, we showcase a scenario where a company has deployed a modern data architecture with data residing on multiple databases and APIs such as legal data on Amazon Simple Storage Service (Amazon S3), human resources on Amazon Relational Database Service (Amazon RDS), sales and marketing on Amazon Redshift, financial market data on a third-party data warehouse solution on Snowflake, and product data as an API. This implementation aims to enhance the productivity of the enterprise’s business analytics, product owners, and business domain experts. All this achieved through the use of generative AI in this domain mesh architecture, which enables the company to achieve its business objectives more efficiently. This solution has the option to include LLMs from JumpStart as a SageMaker endpoint as well as third-party models. We provide the enterprise users with a medium of asking fact-based questions without having an underlying knowledge of data channels, thereby abstracting the complexities of writing simple to complex SQL queries.

Solution overview

A modern data architecture on AWS applies artificial intelligence and natural language processing to query multiple analytics databases. By using services such as Amazon Redshift, Amazon RDS, Snowflake, Amazon Athena, and AWS Glue, it creates a scalable solution to integrate data from various sources. Using LangChain, a powerful library for working with LLMs, including foundation models from Amazon Bedrock and JumpStart in Amazon SageMaker Studio notebooks, a system is built where users can ask business questions in natural English and receive answers with data drawn from the relevant databases.

The following diagram illustrates the architecture.

The hybrid architecture uses multiple databases and LLMs, with foundation models from Amazon Bedrock and JumpStart for data source identification, SQL generation, and text generation with results.

The following diagram illustrates the specific workflow steps for our solution.

The steps are follows:

  1. A business user provides an English question prompt.
  2. An AWS Glue crawler is scheduled to run at frequent intervals to extract metadata from databases and create table definitions in the AWS Glue Data Catalog. The Data Catalog is input to Chain Sequence 1 (see the preceding diagram).
  3. LangChain, a tool to work with LLMs and prompts, is used in Studio notebooks. LangChain requires an LLM to be defined. As part of Chain Sequence 1, the prompt and Data Catalog metadata are passed to an LLM, hosted on a SageMaker endpoint, to identify the relevant database and table using LangChain.
  4. The prompt and identified database and table are passed to Chain Sequence 2.
  5. LangChain establishes a connection to the database and runs the SQL query to get the results.
  6. The results are passed to the LLM to generate an English answer with the data.
  7. The user receives an English answer to their prompt, querying data from different databases.

This following sections explain some of the key steps with associated code. To dive deeper into the solution and code for all steps shown here, refer to the GitHub repo. The following diagram shows the sequence of steps followed:

Prerequisites

You can use any databases that are compatible with SQLAlchemy to generate responses from LLMs and LangChain. However, these databases must have their metadata registered with the AWS Glue Data Catalog. Additionally, you will need to have access to LLMs through either JumpStart or API keys.

Connect to databases using SQLAlchemy

LangChain uses SQLAlchemy to connect to SQL databases. We initialize LangChain’s SQLDatabase function by creating an engine and establishing a connection for each data source. The following is a sample of how to connect to an Amazon Aurora MySQL-Compatible Edition serverless database and include only the employees table:

#connect to AWS Aurora MySQL
cluster_arn = <cluster_arn>
secret_arn = <secret_arn>
engine_rds=create_engine('mysql+auroradataapi://:@/employees',echo=True,
  connect_args=dict(aurora_cluster_arn=cluster_arn, secret_arn=secret_arn))
dbrds = SQLDatabase(engine_rds, include_tables=['employees'])

Next, we build prompts used by Chain Sequence 1 to identify the database and the table name based on the user question.

Generate dynamic prompt templates

We use the AWS Glue Data Catalog, which is designed to store and manage metadata information, to identify the source of data for a user query and build prompts for Chain Sequence 1, as detailed in the following steps:

  1. We build a Data Catalog by crawling through the metadata of multiple data sources using the JDBC connection used in the demonstration.
  2. With the Boto3 library, we build a consolidated view of the Data Catalog from multiple data sources. The following is a sample on how to get the metadata of the employees table from the Data Catalog for the Aurora MySQL database:
 #retrieve metadata from glue data catalog
  glue_tables_rds = glue_client.get_tables(DatabaseName=<database_name>, MaxResults=1000)
    for table in glue_tables_rds['TableList']:
        for column in table['StorageDescriptor']['Columns']:
             columns_str=columns_str+'n'+('rdsmysql|employees|'+table['Name']+"|"+column['Name'])

A consolidated Data Catalog has details on the data source, such as schema, table names, and column names. The following is a sample of the output of the consolidated Data Catalog:

database|schema|table|column_names
redshift|tickit|tickit_sales|listid
rdsmysql|employees|employees|emp_no
....
s3|none|claims|policy_id
  1. We pass the consolidated Data Catalog to the prompt template and define the prompts used by LangChain:
prompt_template = """
From the table below, find the database (in column database) which will contain the data (in corresponding column_names) to answer the question {query} n
"""+glue_catalog +""" Give your answer as database == n Also,give your answer as database.table =="""

Chain Sequence 1: Detect source metadata for the user query using LangChain and an LLM

We pass the prompt template generated in the previous step to the prompt, along with the user query to the LangChain model, to find the best data source to answer the question. LangChain uses the LLM model of our choice to detect source metadata.

Use the following code to use an LLM from JumpStart or third-party models:

#define your LLM model here
llm = <LLM>
#pass prompt template and user query to the prompt
PROMPT = PromptTemplate(template=prompt_template, input_variables=["query"])
# define llm chain
llm_chain = LLMChain(prompt=PROMPT, llm=llm)
#run the query and save to generated texts
generated_texts = llm_chain.run(query)

The generated text contains information such as the database and table names against which the user query is run. For example, for the user query “Name all employees with birth date this month,” generated_text has the information database == rdsmysql and database.table == rdsmysql.employees.

Next, we pass the details of the human resources domain, Aurora MySQL database, and employees table to Chain Sequence 2.

Chain Sequence 2: Retrieve responses from the data sources to answer the user query

Next, we run LangChain’s SQL database chain to convert text to SQL and implicitly run the generated SQL against the database to retrieve the database results in a simple readable language.

We start with defining a prompt template that instructs the LLM to generate SQL in a syntactically correct dialect and then run it against the database:

_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Only use the following tables:
{table_info}
If someone asks for the sales, they really mean the tickit.sales table.
Question: {input}"""
#define the prompt
PROMPT = PromptTemplate( input_variables=["input", "table_info", "dialect"], template=_DEFAULT_TEMPLATE)

Finally, we pass the LLM, database connection, and prompt to the SQL database chain and run the SQL query:

db_chain = SQLDatabaseChain.from_llm(llm, db, prompt=PROMPT)
response=db_chain.run(query)

For example, for the user query “Name all employees with birth date this month,” the answer is as follows:

Question: Name all employees with birth date this month

SELECT * FROM employees WHERE MONTH(birth_date) = MONTH(CURRENT_DATE());

User Response:
The employees with birthdays this month are:
Christian Koblick
Tzvetan Zielinski

Clean up

After you run the modern data architecture with generative AI, make sure to clean up any resources that won’t be utilized. Shut down and delete the databases used (Amazon Redshift, Amazon RDS, Snowflake). In addition, delete the data in Amazon S3 and stop any Studio notebook instances to not incur any further charges. If you used JumpStart to deploy an LLM as a SageMaker real-time endpoint, delete endpoint through either the SageMaker console or Studio.

Conclusion

In this post, we integrated a modern data architecture with generative AI and LLMs within SageMaker. This solution uses various text-to-text foundation models from JumpStart as well as third-party models. This hybrid approach identifies data sources, writes SQL queries, and generates responses with query results. It uses Amazon Redshift, Amazon RDS, Snowflake, and LLMs. To improve the solution, you could add more databases, a UI for English queries, prompt engineering, and data tools. This could become an intelligent, unified way to get insights from multiple data stores. To dive deeper into the solution and the code shown in this post, check out the GitHub repo . Also, refer to Amazon Bedrock for use cases on generative AI, foundation models, and large language models.


Appendix

Example prompts

Domain Database/API Prompt SQL (Generated by LLM) Output
Sales & Marketing Amazon RedShift How many ticket sales are there? SELECT COUNT(*) AS total_sales
FROM tickit.sales;
There are 172,456 ticket sales.
Sales & Marketing Amazon RedShift What was the total commission for the ticket sales in the year 2008? SELECT SUM(commission) AS total_commission
FROM tickit.sales
WHERE EXTRACT(YEAR FROM saletime) = 2008
The total commission for ticket sales in the year 2008 was $16,614,814.65.
Legal S3 How many frauds happened in the year 2023? SELECT count(*)
FROM claims
WHERE extract(year from write_time) = 2023 AND fraud = 1;
There were 164 fraud claims in 2023.
Legal S3 How many policies were claimed this year? SELECT count(*)
FROM claims;
There were 5000 claims made this year.
Human Resources Amazon Aurora MySQL Name all employees with birth date this month SELECT * FROM employees
WHERE MONTH(birth_date) = MONTH(CURRENT_DATE());
The employees with birthdays this month are:
Christian Koblick
Tzvetan Zielinski
Kazuhito Cappelletti
Yinghua Dredge
Human Resources Amazon Aurora MySQL How many employees were hired before 1990? SELECT COUNT(*) AS 'Number of employees hired before 1990'
FROM employees
WHERE hire_date < '1990-01-01'
29 employees were hired before 1990.
Finance and Investments Snowflake Which stock performed the best and the worst in May of 2013? SELECT name, MAX(close) AS max_close, MIN(close) AS min_close
FROM all_stocks_5yr
WHERE date BETWEEN '2013-05-01' AND '2013-05-31'
GROUP BY name
ORDER BY max_close DESC, min_close ASC
The stock that performed the best in May 2013 was AnySock1 (ASTOCK1) with a maximum closing price of $842.50. The stock that performed the worst was AnySock2 (ASTOCK2) with a minimum closing price of $3.22.
Finance and Investments Snowflake What is the average volume stocks traded in July of 2013? SELECT AVG(volume) AS average_volume
FROM all_stocks_5yr
WHERE date BETWEEN '2013-07-01' AND '2013-07-31'
The average volume of stocks traded in July 2013 was 4,374,177
Product – Weather API What is the weather like right now in New York City in degrees Fahrenheit?

About the Authors

Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. Sovik has published articles and holds a patent in ML model monitoring. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Read More

Enabling delightful user experiences via predictive models of human attention

Enabling delightful user experiences via predictive models of human attention

People have the remarkable ability to take in a tremendous amount of information (estimated to be ~1010 bits/s entering the retina) and selectively attend to a few task-relevant and interesting regions for further processing (e.g., memory, comprehension, action). Modeling human attention (the result of which is often called a saliency model) has therefore been of interest across the fields of neuroscience, psychology, human-computer interaction (HCI) and computer vision. The ability to predict which regions are likely to attract attention has numerous important applications in areas like graphics, photography, image compression and processing, and the measurement of visual quality.

We’ve previously discussed the possibility of accelerating eye movement research using machine learning and smartphone-based gaze estimation, which earlier required specialized hardware costing up to $30,000 per unit. Related research includes “Look to Speak”, which helps users with accessibility needs (e.g., people with ALS) to communicate with their eyes, and the recently published “Differentially private heatmaps” technique to compute heatmaps, like those for attention, while protecting users’ privacy.

In this blog, we present two papers (one from CVPR 2022, and one just accepted to CVPR 2023) that highlight our recent research in the area of human attention modeling: “Deep Saliency Prior for Reducing Visual Distraction” and “Learning from Unique Perspectives: User-aware Saliency Modeling”, together with recent research on saliency driven progressive loading for image compression (1, 2). We showcase how predictive models of human attention can enable delightful user experiences such as image editing to minimize visual clutter, distraction or artifacts, image compression for faster loading of webpages or apps, and guiding ML models towards more intuitive human-like interpretation and model performance. We focus on image editing and image compression, and discuss recent advances in modeling in the context of these applications.

Attention-guided image editing

Human attention models usually take an image as input (e.g., a natural image or a screenshot of a webpage), and predict a heatmap as output. The predicted heatmap on the image is evaluated against ground-truth attention data, which are typically collected by an eye tracker or approximated via mouse hovering/clicking. Previous models leveraged handcrafted features for visual clues, like color/brightness contrast, edges, and shape, while more recent approaches automatically learn discriminative features based on deep neural networks, from convolutional and recurrent neural networks to more recent vision transformer networks.

In “Deep Saliency Prior for Reducing Visual Distraction” (more information on this project site), we leverage deep saliency models for dramatic yet visually realistic edits, which can significantly change an observer’s attention to different image regions. For example, removing distracting objects in the background can reduce clutter in photos, leading to increased user satisfaction. Similarly, in video conferencing, reducing clutter in the background may increase focus on the main speaker (example demo here).

To explore what types of editing effects can be achieved and how these affect viewers’ attention, we developed an optimization framework for guiding visual attention in images using a differentiable, predictive saliency model. Our method employs a state-of-the-art deep saliency model. Given an input image and a binary mask representing the distractor regions, pixels within the mask will be edited under the guidance of the predictive saliency model such that the saliency within the masked region is reduced. To make sure the edited image is natural and realistic, we carefully choose four image editing operators: two standard image editing operations, namely recolorization and image warping (shift); and two learned operators (we do not define the editing operation explicitly), namely a multi-layer convolution filter, and a generative model (GAN).

With those operators, our framework can produce a variety of powerful effects, with examples in the figure below, including recoloring, inpainting, camouflage, object editing or insertion, and facial attribute editing. Importantly, all these effects are driven solely by the single, pre-trained saliency model, without any additional supervision or training. Note that our goal is not to compete with dedicated methods for producing each effect, but rather to demonstrate how multiple editing operations can be guided by the knowledge embedded within deep saliency models.

Examples of reducing visual distractions, guided by the saliency model with several operators. The distractor region is marked on top of the saliency map (red border) in each example.

Enriching experiences with user-aware saliency modeling

Prior research assumes a single saliency model for the whole population. However, human attention varies between individuals — while the detection of salient clues is fairly consistent, their order, interpretation, and gaze distributions can differ substantially. This offers opportunities to create personalized user experiences for individuals or groups. In “Learning from Unique Perspectives: User-aware Saliency Modeling”, we introduce a user-aware saliency model, the first that can predict attention for one user, a group of users, and the general population, with a single model.

As shown in the figure below, core to the model is the combination of each participant’s visual preferences with a per-user attention map and adaptive user masks. This requires per-user attention annotations to be available in the training data, e.g., the OSIE mobile gaze dataset for natural images; FiWI and WebSaliency datasets for web pages. Instead of predicting a single saliency map representing attention of all users, this model predicts per-user attention maps to encode individuals’ attention patterns. Further, the model adopts a user mask (a binary vector with the size equal to the number of participants) to indicate the presence of participants in the current sample, which makes it possible to select a group of participants and combine their preferences into a single heatmap.

An overview of the user aware saliency model framework. The example image is from OSIE image set.

During inference, the user mask allows making predictions for any combination of participants. In the following figure, the first two rows are attention predictions for two different groups of participants (with three people in each group) on an image. A conventional attention prediction model will predict identical attention heatmaps. Our model can distinguish the two groups (e.g., the second group pays less attention to the face and more attention to the food than the first). Similarly, the last two rows are predictions on a webpage for two distinctive participants, with our model showing different preferences (e.g., the second participant pays more attention to the left region than the first).

Predicted attention vs. ground truth (GT). EML-Net: predictions from a state-of-the-art model, which will have the same predictions for the two participants/groups. Ours: predictions from our proposed user aware saliency model, which can predict the unique preference of each participant/group correctly. The first image is from OSIE image set, and the second is from FiWI.

Progressive image decoding centered on salient features

Besides image editing, human attention models can also improve users’ browsing experience. One of the most frustrating and annoying user experiences while browsing is waiting for web pages with images to load, especially in conditions with low network connectivity. One way to improve the user experience in such cases is with progressive decoding of images, which decodes and displays increasingly higher-resolution image sections as data are downloaded, until the full-resolution image is ready. Progressive decoding usually proceeds in a sequential order (e.g., left to right, top to bottom). With a predictive attention model (1, 2), we can instead decode images based on saliency, making it possible to send the data necessary to display details of the most salient regions first. For example, in a portrait, bytes for the face can be prioritized over those for the out-of-focus background. Consequently, users perceive better image quality earlier and experience significantly reduced wait times. More details can be found in our open source blog posts (post 1, post 2). Thus, predictive attention models can help with image compression and faster loading of web pages with images, improve rendering for large images and streaming/VR applications.

Conclusion

We’ve shown how predictive models of human attention can enable delightful user experiences via applications such as image editing that can reduce clutter, distractions or artifacts in images or photos for users, and progressive image decoding that can greatly reduce the perceived waiting time for users while images are fully rendered. Our user-aware saliency model can further personalize the above applications for individual users or groups, enabling richer and more unique experiences.

Another interesting direction for predictive attention models is whether they can help improve robustness of computer vision models in tasks such as object classification or detection. For example, in “Teacher-generated spatial-attention labels boost robustness and accuracy of contrastive models”, we show that a predictive human attention model can guide contrastive learning models to achieve better representation and improve the accuracy/robustness of classification tasks (on the ImageNet and ImageNet-C datasets). Further research in this direction could enable applications such as using radiologist’s attention on medical images to improve health screening or diagnosis, or using human attention in complex driving scenarios to guide autonomous driving systems.

Acknowledgements

This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, and cross-functional contributors. We’d like to thank all the co-authors of the papers/research, including Kfir Aberman, Gamaleldin F. Elsayed, Moritz Firsching, Shi Chen, Nachiappan Valliappan, Yushi Yao, Chang Ye, Yossi Gandelsman, Inbar Mosseri, David E. Jacobes, Yael Pritch, Shaolei Shen, and Xinyu Ye. We also want to thank team members Oscar Ramirez, Venky Ramachandran and Tim Fujita for their help. Finally, we thank Vidhya Navalpakkam for her technical leadership in initiating and overseeing this body of work.

Read More

How BrainPad fosters internal knowledge sharing with Amazon Kendra

How BrainPad fosters internal knowledge sharing with Amazon Kendra

This is a guest post by Dr. Naoki Okada, Lead Data Scientist at BrainPad Inc.

Founded in 2004, BrainPad Inc. is a pioneering partner in the field of data utilization, helping companies create business and improve their management through the use of data. To date, BrainPad has helped more than 1,300 companies, primarily industry leaders. BrainPad has the advantage of providing a one-stop service from formulating a data utilization strategy to proof of concept and implementation. BrainPad’s unique style is to work together with clients to solve problems on the ground, such as data that isn’t being collected due to a siloed organizational structure or data that exists but isn’t organized.

This post discusses how to structure internal knowledge sharing using Amazon Kendra and AWS Lambda and how Amazon Kendra solves the obstacles around knowledge sharing many companies face. We summarize BrainPad’s efforts in four key areas:

  • What are the knowledge sharing problems that many companies face?
  • Why did we choose Amazon Kendra?
  • How did we implement the knowledge sharing system?
  • Even if a tool is useful, it is meaningless if it is not used. How did we overcome the barrier to adoption?

Knowledge sharing problems that many companies face

make-transfer-use

Many companies achieve their results by dividing their work into different areas. Each of these activities generates new ideas every day. This knowledge is accumulated on an individual basis. If this knowledge can be shared among people and organizations, synergies in related work can be created, and the efficiency and quality of work will increase dramatically. This is the power of knowledge sharing.

However, there are many common barriers to knowledge sharing:

  • Few people are proactively involved, and the process can’t be sustained for long due to busy schedules.
  • Knowledge is scattered across multiple media, such as internal wikis and PDFs, making it difficult to find the information you need.
  • No one enters knowledge into the knowledge consolidation system. The system will not be widely used because of its poor searchability.

Our company faced a similar situation. The fundamental problem with knowledge sharing is that although most employees have a strong need to obtain knowledge, they have little motivation to share their own knowledge at a cost. Changing employee behavior for the sole purpose of knowledge sharing is not easy.

In addition, each employee or department has its own preferred method of accumulating knowledge, and trying to force unification won’t lead to motivation or performance in knowledge sharing. This is a headache for management, who wants to consolidate knowledge, while those in the field want to have knowledge in a decentralized way.

At our company, Amazon Kendra is the cloud service that has solved these problems.

Why we chose Amazon Kendra

collection-search-cost

Amazon Kendra is a cloud service that allows us to search for internal information from a common interface. In other words, it is a search engine that specializes in internal information. In this section, we discuss the three key reasons why we chose Amazon Kendra.

Easy aggregation of knowledge

As mentioned in the previous section, knowledge, even when it exists, tends to be scattered across multiple media. In our case, it was scattered across our internal wiki and various document files. Amazon Kendra provides powerful connectors for this situation. We can easily import documents from a variety of media, including groupware, wikis, Microsoft PowerPoint files, PDFs, and more, without any hassle.

This means that employees don’t have to change the way they store knowledge in order to share it. Although knowledge aggregation can be achieved temporarily, it’s very costly to maintain. The ability to automate this was a very desirable factor for us.

Great searchability

There are a lot of groupware and wikis out there that excel at information input. However, they often have weaknesses in information output (searchability). This is especially true for Japanese search. For example, in English, word-level matching provides a reasonable level of searchability. In Japanese, however, word extraction is more difficult, and there are cases where matching is done by separating words by an appropriate number of characters. If a search for “Tokyo-to (東京都)” is separated by two characters, “Tokyo (東京)” and “Kyoto (京都),” it will be difficult to find the knowledge you are looking for.

Amazon Kendra offers great searchability through machine learning. In addition to traditional keyword searches such as “technology trends,” natural language searches such as “I want information on new technology initiatives” can greatly enhance the user experience. The ability to search appropriately for collected information is the second reason we chose Amazon Kendra.

Low cost of ownership

IT tools that specialize in knowledge aggregation and retrieval are called enterprise search systems. One problem with implementing these systems is the cost. For an organization with several hundred employees, operating costs can exceed 10 million yen per year. This is not a cheap way to start a knowledge sharing initiative.

Amazon Kendra is offered at a much lower cost than most enterprise search systems. As mentioned earlier, knowledge sharing initiatives are not easy to implement. We wanted to start small, and Amazon Kendra’s low cost of ownership was a key factor in our decision.

In addition, Amazon Kendra’s ease of implementation and flexibility are also great advantages for us. The next section summarizes an example of our implementation.

How we implemented the knowledge sharing system

amazon-kendra

Implementation is not an exaggerated development process; it can be done without code by following the Amazon Kendra processing flow. Here are five key points in the implementation process:

  • Data source (accumulating knowledge) – Each department and employee of our company frequently held internal study sessions, and through these activities, knowledge was accumulated in multiple media, such as wikis and various types of storage. At that time, it was easy to review the information from the study sessions later. However, in order to extract knowledge about a specific area or technology, it was necessary to review each medium in detail, which was not very convenient.
  • Connectors (aggregating knowledge) – With the connector functionality in Amazon Kendra, we were able to link knowledge scattered throughout the company into Amazon Kendra and achieve cross-sectional searchability. In addition, the connector is loaded through a restricted account, allowing for a security-conscious implementation.
  • Search engine (finding information) – Because Amazon Kendra has a search page for usability testing, we were able to quickly test the usability of the search engine immediately after loading documents to see what kind of knowledge could be found. This was very helpful in solidifying the image of the launch.
  • Search UI (search page for users) – Amazon Kendra has a feature called Experience Builder that exposes the search screen to users. This feature can be implemented with no code, which was very helpful in getting feedback during the test deployment. In addition to Experience Builder, Amazon Kendra also supports Python and React.js API implementations, so we can eventually provide customized search pages to our employees to improve their experience.
  • Analytics (monitoring usage trends) – An enterprise search system is only valuable if a lot of people are using it. Amazon Kendra has the ability to monitor how many searches are being performed and for what terms. We use this feature to track usage trends.

We also have some Q&A related to our implementation:

  • What were some of the challenges in gathering internal knowledge? We had to start by collecting the knowledge that each department and employee had, but not necessarily in a place that could be directly connected to Amazon Kendra.
  • How did we benefit from Amazon Kendra? We had tried to share knowledge many times in the past, but had often failed. The reasons were information aggregation, searchability, operational costs, and implementation costs. Amazon Kendra has features that solve these problems, and we successfully launched it within about 3 months of conception. Now we can use Amazon Kendra to find solutions to tasks that previously required the knowledge of individuals or departments as the collective knowledge of the entire organization.
  • How did you evaluate the searchability of the system, and what did you do to improve it? First, we had many employees interact with the system and get feedback. One problem that arose at the beginning of the implementation was that there was a scattering of information that had little value as knowledge. This was because some of the data sources contained information from internal blog posts, for example. We are continually working to improve the user experience by selecting the right data sources.

As mentioned earlier, by using Amazon Kendra, we were able to overcome many implementation hurdles at minimal cost. However, the biggest challenge with this type of tool is the adoption barrier that comes after implementation. The next section provides an example of how we overcame this hurdle.

How we overcame the barrier to adoption

chatbot-architecture

Have you ever seen a tool that you spent a lot of effort, time, and money implementing become obsolete without widespread use? No matter how good the functionality is at solving problems, it will not be effective if people are not using it.

One of the initiatives we took with the launch of Amazon Kendra was to provide a chatbot. In other words, when you ask a question in a chat tool, you get a response with the appropriate knowledge. Because all of our telecommuting employees use a chat tool on a daily basis, using chatbots is much more compatible than having them open a new search screen in their browsers.

To implement this chatbot, we use Lambda, a service that allows us to run serverless, event-driven programs. Specifically, the following workflow is implemented:

  1. A user posts a question to the chatbot with a mention.
  2. The chatbot issues an event to Lambda.
  3. A Lambda function detects the event and searches Amazon Kendra for the question.
  4. The Lambda function posts the search results to the chat tool.
  5. The user views the search results.

This process takes only a few seconds and provides a high-quality user experience for knowledge discovery. The majority of employees were exposed to the knowledge sharing mechanism through the chatbot, and there is no doubt that the chatbot contributed to the diffusion of the mechanism. And because there are some areas that can’t be covered by the chatbot alone, we have also asked them to use the customized search screen in conjunction with the chatbot to provide an even better user experience.

Conclusion

In this post, we presented a case study of Amazon Kendra for knowledge sharing and an example of a chatbot implementation using Lambda to propagate the mechanism. We look forward to seeing Amazon Kendra take another leap forward as large-scale language models continue to evolve.

If you are interested in trying out Amazon Kendra, check out Enhancing enterprise search with Amazon Kendra. BrainPad can also help you with internal knowledge sharing and document exploitation using generative AI. Please contact us for more information.


About the Author

dr-naoki-okada

Dr. Naoki Okada is a Lead Data Scientist at BrainPad Inc. With his cross-functional experience in business, analytics, and engineering, he supports a wide range of clients from building up DX organizations to leveraging data in unexplored areas.

Read More

AWS Inferentia2 builds on AWS Inferentia1 by delivering 4x higher throughput and 10x lower latency

AWS Inferentia2 builds on AWS Inferentia1 by delivering 4x higher throughput and 10x lower latency

The size of the machine learning (ML) models––large language models (LLMs) and foundation models (FMs)––is growing fast year-over-year, and these models need faster and more powerful accelerators, especially for generative AI. AWS Inferentia2 was designed from the ground up to deliver higher performance while lowering the cost of LLMs and generative AI inference.

In this post, we show how the second generation of AWS Inferentia builds on the capabilities introduced with AWS Inferentia1 and meets the unique demands of deploying and running LLMs and FMs.

The first generation of AWS Inferentia, a purpose-built accelerator launched in 2019, is optimized to accelerate deep learning inference. AWS Inferentia helped ML users reduce their inference costs and improve their prediction throughput and latency. With AWS Inferentia1, customers saw up to 2.3x higher throughput and up to 70% lower cost per inference than comparable inference-optimized Amazon Elastic Compute Cloud (Amazon EC2) instances.

AWS Inferentia2, featured in the new Amazon EC2 Inf2 instances and supported in Amazon SageMaker, is optimized for large-scale generative AI inference and is the first inference focused instance from AWS that is optimized for distributed inference, with high-speed, low-latency connectivity between accelerators.

You can now efficiently deploy a 175-billion-parameter model for inference across multiple accelerators on a single Inf2 instance without requiring expensive training instances. Until now, customers who had large models could only use instances that were built for training, but this is a waste of resources––given that they’re more expensive, consume more energy, and their workload doesn’t make use of all the available resources (such as faster networking and storage). With AWS Inferentia2, you can achieve 4 times higher throughput and up to 10 times lower latency compared to AWS Inferentia1. Also, the second generation of AWS Inferentia adds enhanced support for more data types, custom operators, dynamic tensors, and more.

AWS Inferentia2 has 4 times more memory capacity, 16.4 times higher memory bandwidth than AWS Inferentia1, and native support for sharding large models across multiple accelerators. The accelerators use NeuronLink and Neuron Collective Communication to maximize the speed of data transfer between them or between an accelerator and the network adapter. AWS Inferentia2 is better suited for larger models, which require sharding across multiple accelerators, although AWS Inferentia1 is still a great option for smaller models because it provides better price-performance compared to alternatives.

Architecture evolution

To compare both generations of AWS Inferentia, let’s review the architecture of AWS Inferentia1. It has four NeuronCores v1 per chip, shown in the following diagram.

Specifications per chip:

  • Compute – Four cores delivering in total 128 INT8 TOPS and 64FP16/BF16 TFLOPS
  • Memory – 8 GB of DRAM (50 GB/sec of bandwidth), shared by all four cores
  • NeuronLink – Link between cores for sharding models across two or more cores

Let’s look at how AWS Inferentia2 is organized. Each AWS Inferentia2 chip has two upgraded cores based on the NeuronCore-v2 architecture. Like AWS Inferentia1, you can run different models on each NeuronCore or combine multiple cores to shard big models.

Specifications per chip:

  • Compute – Two cores delivering in total 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS
  • Memory – 32 GB of HBM, shared by both cores
  • NeuronLink – Link between chips (384 GB/sec per device) for sharding models across two or more cores

NeuronCore-v2 has a modular design with four independent engines:

  • ScalarEngine (3 times faster than v1) – Operates on floating point numbers––1600 (BF16/FP16) FLOPS
  • VectorEngine (10 times faster than v1) – Operates on vectors of numbers with single operation for computations such as normalization, pooling, and others.
  • TensorEngine (6 times faster than v1) – Performs tensor computations such as Conv, Reshape, Transpose, and others.
  • GPSIMD-Engine – Has eight fully programmable 512-bit wide general-purpose processors for you to create your custom operators with standard PyTorch custom C++ operators API. This is a new feature, introduced in NeuronCore-v2.

AWS Inferentia2 NeuronCore-v2 is faster and more optimized. Also, it’s capable of accelerating different types and sizes of models, ranging from simple models such as ResNet 50 to large language models or foundation models with billions of parameters such as GPT-3 (175 billion parameters). AWS Inferentia2 also has a larger and faster internal memory, when compared to AWS Inferentia1, as shown in the following table.

Chip Neuron Cores Memory Type Memory Size Memory Bandwidth
AWS Inferentia x4 (v1) DDR4 8GB 50GB/S
AWS Inferentia 2 x2 (v2) HBM 32GB 820GB/S

The memory you find in AWS Inferentia2 is the type High-Bandwidth Memory (HBM) type. Each AWS Inferentia2 chip has 32 GB and that can be combined with other chips to distribute very large models using NeuronLink (device-to-device interconnect). An inf2.48xlarge, for instance, has 12 AWS Inferentia2 accelerators with a total of 384 GB of accelerated memory. The speed of AWS Inferentia2 memory is 16.4 times faster than AWS Inferentia1, as shown in the previous table.

Other features

AWS Inferentia2 offers the following additional features:

  • Hardware supported – cFP8 (new, configurable FP8), FP16, BF16, TF32, FP32, INT8, INT16 and INT32. For more information, refer to Data Types.
  • Lazy Tensor inference – We discuss Lazy Tensor inference later in this post.
  • Custom operators – Developers can use standard PyTorch custom operators programming interfaces to use the Custom C++ Operators feature. A custom operator is composed of low-level primitives available in the Tensor Factory Functions and accelerated by GPSIMD-Engine.
  • Control-flow (coming soon) – This is for native programming language control flow inside the model to eventually preprocess and postprocess data from one layer to another.
  • Dynamic-shapes (coming soon) – This is useful when your model changes the shape of the output of any internal layer dynamically. For instance: a filter which reduces the output tensor size or shape inside the model, based on the input data.

Accelerating models on AWS Inferentia1 and AWS Inferentia2

The AWS Neuron SDK is used for compiling and running your model. It is natively integrated with PyTorch and TensorFlow. That way, you don’t need to run an additional tool. Use your original code, written in one of these ML frameworks, and with a few lines of code changes, you’re good to go with AWS Inferentia.

Let’s look at how to compile and run a model on AWS Inferentia1 and AWS Inferentia2 using PyTorch.

Load a pre-trained model (ResNet 50) from torchvision

Load a pre-trained model and run it one time to warm it up:

import torch
import torchvision

model = torchvision.models.resnet50(weights='IMAGENET1K_V1').eval().cpu()
x = torch.rand(1,3,224,224).float().cpu() # dummy input
y = model(x) # warmup model

Trace and deploy the accelerated model on Inferentia1

To trace the model to AWS Inferentia, import torch_neuron and invoke the tracing function. Keep in mind that the model needs to be PyTorch Jit traceable to work.

At the end of the tracing process, save the model as a normal PyTorch model. Compile the model one time and load it back as many times as you need. The Neuron SDK runtime is already integrated to PyTorch and is responsible for sending the operators to the AWS Inferentia1 chip automatically to accelerate your model.

In your inference code, you always need to import torch_neuron to activate the integrated runtime.

You can pass additional parameters to the compiler to customize the way it optimizes the model or to enable special features such as neuron-pipeline-cores. Shard your model across multiple cores to increase throughput.

import torch_neuron

# Tracing the model using AWS NeuronSDK
neuron_model = torch_neuron.trace(model,x) # trace model to Inferentia
# Saving for future use
neuron_model.save('neuron_resnet50.pt')

# Next time you don't need to trace the model again
# Just load it and AWS NeuronSDK will send it to Inferentia automatically
neuron_model = torch.jit.load('neuron_resnet50.pt')

# accelerated inference on Inferentia
y = neuron_model(x)

Tracing and deploying the accelerated model on Inferentia2

For AWS Inferentia2, the process is similar. The only difference is the package you import ends with x: torch_neuronx. The Neuron SDK takes care of the compilation and running of the model for you transparently. You can also pass additional parameters to the compiler to fine-tune the operation or activate specific functionalities.

import torch_neuronx

# Tracing the model using NeuronSDK
neuron_model = torch_neuronx.trace(model,x) # trace model to Inferentia
# Saving for future use
neuron_model.save('neuron_resnet50.pt')

# Next time you don't need to trace the model again
# Just load it and NeuronSDK will send it to Inferentia automatically
neuron_model = torch.jit.load('neuron_resnet50.pt')

# accelerated inference on Inferentia
y = neuron_model(x)

AWS Inferentia2 also offers a second approach for running a model called Lazy Tensor inference. In this mode, you don’t trace or compile the model previously; instead, the compiler runs on the fly every time you run your code. It isn’t recommended for production, given that traced mode has many advantages over Lazy Tensor inference. However, if you’re still developing your model and need to test it faster, Lazy Tensor inference can be a good alternative. Here’s how to compile and run a model using Lazy Tensor:

import torch
import torchvision
import torch_neuronx
import torch_xla.core.xla_model as xm

device = xm.xla_device() # Create XLA device
model = torchvision.models.resnet50(weights='IMAGENET1K_V1').eval().cpu()
model.to(device)

x = torch.rand((1,3,224,224), device=device) # dummy input
with torch.no_grad():
  y = model(x)
  xm.mark_step() # Compilation occurs here

Now that you’re familiar with AWS Inferentia2, a good next step is to get started with PyTorch or Tensorflow and learn how to set up a dev environment and run tutorials and examples. Also, check the AWS Neuron Samples GitHub repo, where you can find multiple examples of how to prepare models to run on Inf2, Inf1, and Trn1.

Summary of feature comparison between AWS Inferentia1 and AWS Inferentia2

The AWS Inferentia2 compiler is XLA-based, and AWS is part of OpenXLA initiative. This is the biggest difference over AWS Inferentia1, and that’s relevant because PyTorch, TensorFlow, and JAX have native XLA integrations. XLA brings many performance improvements, given that it optimizes the graph to compute the results in a single kernel launch. It fuses together successive tensor operations and outputs optimal machine code for accelerating model runs on AWS Inferentia2. Other parts of the Neuron SDK were also improved in AWS Inferentia2, while keeping the user experience as simple as possible while tracing and running models. The following table shows the features available in both versions of the compiler and runtime.

Feature torch-neuron torch-neuronx
Tensorboard Yes Yes
Supported Instances Inf1 Inf2 & Trn1
Inference Support Yes Yes
Training Support No Yes
Architecture NeuronCore-v1 NeuronCore-v2
Trace API torch_neuron.trace() torch_neuronx.trace()
Distributed inference NeuronCore Pipeline Collective Communications
IR GraphDef HLO
Compiler neuron-cc neuronx-cc
Monitoring neuron-monitor / monitor-top neuron-monitor / monitor-top

For a more detailed comparison between torch-neuron (Inf1) and torch-neuronx (Inf2), refer to Comparison of torch-neuron (Inf1) versus torch-neuronx (Inf2 & Trn1) for Inference.

Model Serving

After tracing a model to deploy to Inf2, you have many deployment options. You can run real-time predictions or batch predictions in different ways. Inf2 is available because EC2 instances are natively integrated to other AWS services that make use of Deep Learning Containers (DLCs) such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker.

AWS Inferentia2 is compatible with the most popular deployment technologies. Here are a list of some of the options you have for deploying models using AWS Inferentia2:

  • SageMaker – Fully managed service to prepare data and build, train, and deploy ML models
  • TorchServe – PyTorch integrated deployment mechanism
  • TensorFlow Serving – TensorFlow integrated deployment mechanism
  • Deep Java Library – Open-source Java mechanism for model deployment and training
  • Triton – NVIDIA open-source service for model deployment

Benchmark

The following table highlights the improvements AWS Inferentia2 brings over AWS Inferentia1. Specifically, we measure latency (how fast the model can make a prediction using each accelerator), throughput (how many inferences per second), and cost per inference (how much each inference costs in US dollars). The lower the latency in milliseconds and costs in US dollars, the better. The higher the throughput the better.

Two models were used in this process––both large language models: ELECTRA large discriminator and BERT large uncased. PyTorch (1.13.1) and Hugging Face transformers (v4.7.0), the main libraries used in this experiment, ran on Python 3.8. After compiling the models for batch size = 1 and 10 (using the code from the previous section as a reference), each model was warmed up (invoked one time to initialize the context) and then invoked 10 times in a row. The following table shows average numbers collected in this simple benchmark.

Model Name Batch Size Sentence Length Latency (ms) Improvements Inf2 over Inf1 (x Times) Throughput (Inferences per Second) Cost per Inference (EC2 us-east-1) **
Inf1 Inf2 Inf1 Inf2 Inf1 Inf2
ElectraLargeDiscriminator 1 256 35.7 8.31 4.30 28.01 120.34 $0.0000023 $0.0000018
ElectraLargeDiscriminator 10 256 343.7 72.9 4.71 2.91 13.72 $0.0000022 $0.0000015
BertLargeUncased 1 128 28.2 3.1 9.10 35.46 322.58 $0.0000018 $0.0000007
BertLargeUncased 10 128 121.1 23.6 5.13 8.26 42.37 $0.0000008 $0.0000005

* c6a.8xlarge with 32 AMD Epyc 7313 CPU was used in this benchmark.

**EC2 Public pricing in us-east-1 on April 20: inf2.xlarge: $0.7582/hr; inf1.xlarge: $0.228/hr. Cost per inference considers the cost per element in a batch. (Cost per inference equals the total cost of model invocation/batch size.)

For additional information about training and inference performance, refer to Trn1/Trn1n Performance.

Conclusion

AWS Inferentia2 is a powerful technology designed for improving performance and reducing costs of deep learning model inference. More performant than AWS Inferentia1, it offers up to 4 times higher throughput, up to 10 times lower latency, and up to 50% better performance/watt than other comparable inference-optimized EC2 instances. In the end, you pay less, have a faster application, and meet your sustainability goals.

It’s simple and straightforward to migrate your inference code to AWS Inferentia2, which also supports a broader variety of models, including large language models and foundation models for generative AI.

You can get started by following the AWS Neuron SDK documentation to set up a development environment and start your accelerated deep learning project. To help you get started, Hugging Face has added Neuron support to their Optimum library, which optimizes models for faster training and inference, and they have many examples tasks ready to run on Inf2. Also, check our Deploy large language models on AWS Inferentia2 using large model inference containers to learn about deploying LLMs to AWS Inferentia2 using model inference containers. For additional examples, see the AWS Neuron Samples GitHub repo.


About the authors

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers creating AI/ML solutions which solve their business challenges using AWS. He has been working on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest for robotics.

Read More

Deploy Falcon-40B with large model inference DLCs on Amazon SageMaker

Deploy Falcon-40B with large model inference DLCs on Amazon SageMaker

Last week, Technology Innovation Institute (TII) launched TII Falcon LLM, an open-source foundational large language model (LLM). Trained on 1 trillion tokens with Amazon SageMaker, Falcon boasts top-notch performance (#1 on the Hugging Face leaderboard at time of writing) while being comparatively lightweight and less expensive to host than other LLMs such as llama-65B. In this post, we demonstrate how to deploy Falcon for applications like language understanding and automated writing assistance using large model inference deep learning containers on SageMaker.

The Falcon has landed on SageMaker

TII is the applied research organization within Abu Dhabi’s Advanced Technology Research Council; its team of scientists, researchers, and engineers is dedicated to the discovery of transformative technologies and development of scientific breakthroughs that will future-proof our society. Earlier this year, TII set out to train a state-of-the-art, open-source LLM and used the infrastructure, tooling, and expertise of SageMaker to get the job done (to learn more about how this model was trained on SageMaker, refer to Technology Innovation Institute trains the state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker). The result of this effort is TII Falcon LLM.

Trained on 1 trillion tokens, Falcon boasts top-notch performance against the Eleuther AI Language Model Evaluation Harness and is currently #1 on the Hugging Face leaderboard for accuracy. The model is available in two different sizes—Falcon-40B and Falcon-7B—and can be used for state-of-the-art performance in applications such as language understanding, conversational experiences, and automated writing assistance. This post will help you get started with deploying Falcon on SageMaker for high-accuracy inference in these types of domains.

SageMaker large model inference DLCs simplify LLM hosting

Hosting LLMs such as Falcon-40B and Falcon-7B can be challenging. Larger models are often more accurate because they include billions of parameters, but their size can also result in slower inference latency or worse throughput. Hosting an LLM can require more GPU memory and optimized kernels to achieve acceptable performance. To further complicate things, although smaller models such as Falcon-7B can generally fit on a single GPU such as an NVIDIA A10G instance that powers AWS G5 instance types, larger models like Falcon-40B cannot. When this happens, strategies such as tensor parallelism must be used to shard that larger model into multiple pieces and take advantage of the memory of multiple GPUs. Legacy hosting solutions used for smaller models typically don’t offer this type of functionality, adding to the difficulty.

SageMaker large model inference (LMI) deep learning containers (DLCs) can help. LMI DLCs are a complete end-to-end solution for hosting LLMs like Falcon-40B. At the front end, they include a high-performance model server (DJL Serving) designed for large model inference with features such as token streaming and automatic model replication within an instance to increase throughput. On the backend, LMI DLCs also include several high-performance model parallel engines, such as DeepSpeed and FasterTransformer, that can shard and manage model parameters across multiple GPUs. These engines also include optimized kernels for popular transformer models, which can accelerate inference by up to three times faster. With LMI DLCs, you simply need to create a configuration file to get started with LLM hosting on SageMaker. To learn more about SageMaker LMI DLCs, refer to Model parallelism and large model inference and our list of available images. You can also check out our previous post about hosting Bloom-175B on SageMaker using LMI DLCs.

Solution overview

This post walks you through how to host Falcon-40B using DeepSpeed on SageMaker using LMI DLCs. Falcon-40B requires that we use multiple A10 GPUs, whereas Falcon-7B only requires a single GPU. We have also prepared examples you can reference to host Falcon-40B and Falcon-7B using both DeepSpeed and Accelerate. You can find our code examples on GitHub.

This example can be run in SageMaker notebook instances or Amazon SageMaker Studio notebooks. For hosting Falcon-40B using LMI and DeepSpeed, we need to use an ml.g5.24xlarge instance. These instances provide 4x NVIDIA A10G GPUs, which each support 96 GiB of GPU memory. In addition, the host provides 96 vCPUs and 384 GiB of host memory. The LMI container will help address much of the undifferentiated heavy lifting associated with hosting LLMs, including downloading the model and partitioning the model artifact so that its comprising parameters can be spread across multiple GPUs.

Quotas for SageMaker machine learning (ML) instances can vary between accounts. If you receive an error indicating you’ve exceeded your quota for g5.24xlarge instances while following this post, you can increase the limit through the Service Quotas console.

Notebook walkthrough

To begin, we start by installing and importing the necessary dependencies for our example. We use the Boto3 SDK as well as the SageMaker SDK. Note that we use Amazon Simple Storage Service (Amazon S3) to store the model artifacts that we need for SageMaker and LMI to use, so we set up an S3 prefix variable accordingly. See the following code:

import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix_deepspeed = "hf-large-model-djl-/code_falcon40b/deepspeed"  # folder within bucket where code artifact will go
region = sess._region_name
account_id = sess.account_id()
s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")
jinja_env = jinja2.Environment()

We then create a local folder for our workspace to store our model artifacts:

!mkdir -p code_falcon40b_deepspeed

We first create a serving.properties configuration file in the local directory we created. This serving.properties file indicates to the LMI container and the front-end DJL Serving library which model parallelization and inference optimization engine we want to use. You can find the configuration options for both DeepSpeed and Hugging Face Accelerate in Configurations and settings. Here, note that we set the option.model_id parameter to define which Hugging Face model to pull from. SageMaker makes working with Hugging Face models simple, and this one line is all you need. In addition, we set option.tensor_parallel_degree to a value of 4 because we have four GPUs on our ml.g5.24xlarge instance. This parameter defines how many partitions of the model to create and distribute. Note that if we had used a larger instance with eight GPUs, such as ml.g5.48xlarge, and still set a value of 4, then LMI would automatically create two replicas of the model (two replicas spread across four GPUs each). See the following code:

%%writefile ./code_falcon40b_deepspeed/serving.properties
engine=Python
#to deploy falcon-40b-instruct set the model_id value to 'tiiuae/falcon-40b-instruct'
option.model_id=tiiuae/falcon-40b
option.tensor_parallel_degree=4
#option.s3url = {{s3url}}

You can also swap out tiiuae/falcon-40b with tiiuae/falcon-40b-instruct if it suits your needs better.

We also include a requirements.txt file that you can specify to install packages that you require:

%%writefile ./code_falcon40b_deepspeed/requirements.txt
einops
torch==2.0.1

The last thing we need is the model.py file that will be used with your model:

%%writefile ./code_falcon40b_deepspeed/model.py
from djl_python import Input, Output
import os
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from typing import Any, Dict, Tuple
import warnings

predictor = None


def get_model(properties):
    model_name = properties["model_id"]
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline(
        task="text-generation", model=model, tokenizer=tokenizer, device_map="auto"
    )
    return generator


def handle(inputs: Input) -> None:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())
    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None
    data = inputs.get_as_json()
    text = data["text"]
    text_length = data["text_length"]
    outputs = predictor(text, do_sample=True, min_length=text_length, max_length=text_length)
    result = {"outputs": outputs}
    return Output().add_as_json(result)

That’s it! At this point, we have created all the artifacts you will need deploy Falcon-40B with DeepSpeed! We package the directory into a *.tar.gz file and upload it to Amazon S3. Note that the actual model has not been downloaded or packaged into this file. The LMI container will download the model for you from Hugging Face directly. You also have the option to target an S3 bucket if you would like your own copy of the model in a location that will be more performant to download. LMI also includes optimization for downloading from Amazon S3 with high performance. See the following code:

s3_code_artifact_deepspeed= sess.upload_data("model.tar.gz", bucket, s3_code_prefix_deepspeed)
print(f"S3 Code or Model tar for deepspeed uploaded to --- > {s3_code_artifact_deepspeed}")

All that is left to do at this point is to define the container we want to use and create a model object:

inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.22.1-deepspeed0.8.3-cu118"
)
model_name_acc = name_from_base(f"falcon40b-model-ds")
create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_deepspeed},
)
model_arn = create_model_response["ModelArn"]

Then we create an endpoint configuration and create the endpoint:


endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.24xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)
endpoint_config_response

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Configuration items to keep in mind for successful hosting

An important consideration for large model hosting is ensuring there is adequate time for model download from Hugging Face. In our tests, the Falcon-40B took about 90 minutes to download onto the instance. A key set of configurations to allow for this are ContainerStartupHealthCheckTimeoutInSeconds and ModelDataDownloadTimeoutInSeconds. Make sure the SageMaker endpoint configuration has a value of 3600 for each of these. Additionally, it’s much easier to download from Amazon S3 instead of the original model zoo using the LMI containers that are specially designed for LLMS that use the S5cmd utility, which cuts the model download time to around 10 minutes.

You can monitor the status of the endpoint by calling DescribeEndpoint, which will tell you when everything is complete. Your endpoint is now ready to respond to inference requests! Because LMI handles the model partitioning and orchestration for you, each request will be processed using all 4 GPUs available on our ml.g5.12xlarge instance. This allows us to host LLMs and increase performance if you scale GPU accelerators horizontally. See the following code:

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"text": "What is the purpose of life?", "text_length": 150}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

If you are done and would like to delete the endpoint configuration, endpoint, and model object, you can run the following commands:

sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

This code we referenced in this post can be found in the complete notebook on GitHub.

Conclusion

SageMaker Hosting and the LMI DLC makes it easy for you to host LLMs like Falcon-40B. It takes on the undifferentiated heavy lifting in orchestrating what is required to host models across multiple GPUs and provides configurable options to suit your needs. In addition, using Hugging Face models becomes very straightforward, with built-in support for these models.

In this post, we showed how you can use SageMaker to host the Falcon-40B model using DeepSpeed. In addition, we provided examples in GitHub to host Falcon-40B using Accelerate, and the smaller Falcon-7B models. We encourage you to give this a try on SageMaker with LMI and get hands-on with the best-performing publicly available LLM to date!


About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In h is spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.You can find him on LinkedIn.

Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Evandro Franco is an AI/ML Specialist Solutions Architect working on Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 15 years working with technology, from software development, infrastructure, serverless, to machine learning.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Read More

Rendered.ai Integrates NVIDIA Omniverse for Synthetic Data Generation

Rendered.ai Integrates NVIDIA Omniverse for Synthetic Data Generation

Rendered.ai is easing AI training for developers, data scientists and others with its platform-as-a-service for synthetic data generation, or SDG.

Training computer vision AI models requires massive, high-quality, diverse and unbiased datasets. These can be challenging and costly to obtain, especially with increasing demands both of and for AI.

The Rendered.ai platform-as-a-service helps to solve this issue by generating physically accurate synthetic data — data that’s created from 3D simulations — to train computer vision models.

“Real-world data often can’t capture all of the possible scenarios and edge cases necessary to generalize an AI model, which is where SDG becomes key for AI and machine learning engineers,” said Nathan Kundtz, founder and CEO of Rendered.ai, which is based in Bellevue, Wash., a Seattle suburb.

A member of the NVIDIA Inception program for cutting-edge startups, Rendered.ai has now integrated into its platform NVIDIA Omniverse Replicator, a core extension of the Omniverse platform for developing and operating industrial metaverse applications.

Omniverse Replicator enables developers to generate labeled synthetic data for many such applications, including visual inspection, robotics and autonomous driving. It’s built on open standards for 3D workflows, including Universal Scene Description (“OpenUSD”), Material Definition Language (MDL) and PhysX.

Synthetic images generated with Rendered.ai have been used to model landscapes and vegetation for virtual worlds, detect objects in satellite imagery, and even test the viability of human oocytes, or egg cells.

Synthetic imagery generated using Omniverse Replicator. Image courtesy of Rendered.ai.

With Rendered.ai tapping into the RTX-accelerated functionalities of Omniverse Replicator — such as ray tracing, domain randomization and multi-sensor simulation — computer vision engineers, data scientists and other users can quickly and easily generate synthetic data through a simple web interface in the cloud.

“The data that we have to train AI is really the dominant factor on the AI’s performance,” Kundtz said. “Integrating Omniverse Replicator into Rendered.ai will enable new levels of ease and efficiency for users tapping synthetic data to train bigger, better AI models for applications across industries.”

Rendered.ai will demonstrate its platform integration with Omniverse Replicator at the Conference on Computer Vision and Pattern Recognition (CVPR), running June 18-22 in Vancouver, Canada.

Synthetic Data Generation in the Cloud

Rendered.ai, now available through AWS Marketplace, brings to the cloud a collaborative web interface for developers and teams to design SDG applications that can be easily configured by computer vision engineers and data scientists.

It’s a one-stop shop for people to share workspaces containing SDG datasets, tasks, graphs and more — all through a web browser.

A view of the Rendered.ai platform-as-a-service, available on a web browser. Image courtesy of Rendered.ai.

Omniverse Replicator tutorials, examples and other 3D assets can now be easily tapped in a Rendered.ai channel running on AWS infrastructure. This gives developers a starting point for building their own SDG capabilities in the cloud. And the process will be especially seamless for Omniverse Replicator users who are already familiar with these tools.

Rendered.ai focuses on offering physically accurate synthetic data through the platform, as this enables introducing new information to AI systems based on how physical processes work in the real world, according to Kundtz.

“In the future, every company using AI is going to have a synthetic data engineering team for physics-based modeling and domain-specific AI-model building,” he added. “It’s often not good enough to simulate something once, as complex AI tasks require training on large batches of data covering diverse scenarios — this is where synthetic data becomes key.”

Hear more from Kundtz on the NVIDIA AI Podcast.

Learn more about Omniverse Replicator, and visit Rendered.ai at CVPR booth 1125 to see a demo of this platform integration.

Explore the new cloud capability on the Rendered.ai site and sign up for a 30-day trial of the Rendered.ai Developer account using the content code “OVDEMO” — which will enable access to the new Rendered.ai Channel for Omniverse.

Featured image courtesy of Rendered.ai.

Read More