Improve performance of Falcon models with Amazon SageMaker


What is the optimal framework and configuration for hosting large language models (LLMs) for generative AI applications that generate text? Despite the abundance of options for serving LLMs, this is a hard question to answer due to the size of the models, varying model architectures, performance requirements of applications, and more. The Amazon SageMaker Large Model Inference (LMI) container makes it straightforward to serve LLMs by bringing together a host of different frameworks and techniques that optimize the deployment of LLMs. The LMI container has a powerful serving stack called DJL Serving that is agnostic to the underlying LLM. It provides system-level configuration parameters that can be tuned to extract the best performance of the hosting infrastructure for a given LLM. It also supports recent optimizations like continuous batching, also known as iterative batching or rolling batching, which provides significant improvements in throughput.

In an earlier post, we showed how you can use the LMI container to deploy the Falcon family of models on SageMaker. In this post, we demonstrate how to improve the throughput and latency of serving Falcon-40B with techniques like continuous batching. We also provide an intuitive understanding of configuration parameters provided by the SageMaker LMI container that can help you find the best configuration for your real-world application.

Fundamentals of text-generative inference for LLMs

Let’s first look at a few fundamentals on how to perform inference for LLMs for text generation.

Forward pass, activations, and the KV cache

Given an input sequence of tokens, it is run in a forward pass across all the layers of the LLM (such as Falcon) to generate the next token. A forward pass refers to the process of input data being passed through a neural network to produce an output. In the case of text generation, the forward pass involves feeding an initial seed or context into the language model and generating the next character or token in the sequence. To generate a sequence of text, the process is often done iteratively, meaning it is repeated for each step or position in the output sequence. At each iteration, the model generates the next character or token, which becomes part of the generated text, and this process continues until the desired length of text is generated.

Text generation with language models like Falcon or GPT is autoregressive. This means that the model generates one token at a time while conditioning on the previously generated tokens. In other words, at each iteration, the model takes the previously generated text as input and predicts the next token based on that context. As mentioned in vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, in this autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate the next tokens. These cached key and value tensors are often referred to as the KV cache.
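To make the role of the KV cache concrete, the following is a minimal sketch of greedy autoregressive decoding with the Hugging Face Transformers API (not the LMI container's internal code). The model ID and generation length are illustrative, and Falcon checkpoints may require trust_remote_code on older Transformers versions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # illustrative; any causal LM with KV caching works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

input_ids = tokenizer("Amazon SageMaker LMI makes it easy to", return_tensors="pt")["input_ids"]
past_key_values = None  # this becomes the KV cache
generated = []

with torch.no_grad():
    for _ in range(32):  # generate 32 new tokens greedily
        outputs = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values  # cached K/V tensors; grow by one step per iteration
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        input_ids = next_token  # only the newest token is fed into the next forward pass

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))

The first forward pass over the full prompt is the expensive step; every later iteration reuses the cached tensors and only computes attention for the newest token.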

Prefill and decode phases

In an autoregressive decoding process, like the one used in text generation with language models such as Falcon, there are typically two main phases: the prefill phase and the decode phase. These phases are crucial for generating coherent and contextually relevant text.

The prefill phase includes the following:

  • Initial context – The prefill phase begins with an initial context or seed text provided by the user. This initial context can be a sentence, a phrase, or even just a single word. It sets the starting point for text generation and provides context for what comes next.
  • Model conditioning – The provided context is used to condition the language model. The model takes this context as input and generates the next token (word or character) in the sequence based on its understanding of the context.
  • Token generation – The model generates one token at a time, predicting what should come next in the text. This token is appended to the context, effectively extending it.
  • Iterative process – The process of generating tokens is repeated iteratively. At each step, the model generates a token while considering the updated context, which now includes the tokens generated in previous steps.

The prefill phase continues until a predetermined stopping condition is met. This condition can be a maximum length for the generated text, a specific token that signals the end of the text, or any other criteria set by the user or the application.

The decode phase includes the following:

  • Completion – After the prefill phase, you have a partially generated text that may be incomplete or cut off at some point. The decode phase is responsible for completing the text to make it coherent and grammatically correct.
  • Continuation from the last token – In the decode phase, the model starts from the last token generated during the prefill phase. It uses this token as the initial context and generates the next token to continue the text.
  • Iterative completion – Like in the prefill phase, the process of generating tokens is again iterative. The model generates one token at a time, conditioning on the preceding tokens in the sequence.
  • Stopping condition – The decode phase also has a stopping condition, which might be the same as in the prefill phase, such as reaching a maximum length or encountering an end-of-text token. When this condition is met, the generation process stops.

The combination of the prefill and decode phases allows autoregressive models to generate text that builds on an initial context and produces coherent, contextually relevant, and contextually consistent sequences of text.

Refer to A Distributed Serving System for Transformer-Based Generative Models for a detailed explanation of the process.

Optimizing throughput using dynamic batching

So far, we’ve only talked about a single input. In practice, we expect to deal with multiple requests coming in randomly from the application clients for inference concurrently or in a staggered fashion. In the traditional way, basic batching can be used to increase the throughput and the utilization of the computing resources of the GPU. Batching is effectively combining the numerical representations of more than one request in a batch and performing parallel runs of the autoregressive forward passes. This intelligent batching is done at the serving side. SageMaker LMI’s DJLServing server can be configured to batch together multiple requests to process them in parallel by setting the following parameters in serving.properties:

  • max_batch_delay = 100 – The maximum delay for batch aggregation in milliseconds. The default value is 100 milliseconds.
  • batch_size = 32 – The dynamic batch size. The default is 1.

This means that DJLServing queues up requests for up to 100 milliseconds or until the number of queued requests reaches the specified batch_size, at which point the batch is scheduled to run on the backend for inference. This is known as dynamic batching. It’s dynamic because the batch size may change across batches depending on how many requests were added in that time window. However, because requests can have different characteristics (for example, some requests might have 20 tokens of input and 500 tokens of output, whereas others might be reversed, with 500 tokens of input but only 20 of output), some requests might complete processing faster than others in the same batch. This can result in underutilization of the GPU while waiting for all in-flight requests in the batch to complete their decode stage, even if there are additional requests waiting to be processed in the queue. The following diagram illustrates this process.


Dynamic Batching Visual – notice the idle windows at the end of Request 2 and 3

Optimizing throughput using continuous batching

With continuous batching, also known as iterative or rolling batching, we take advantage of the differences between the prefill and decode stages. To activate continuous batching, DJLServing provides the following additional configurations in serving.properties:

  • engine=MPI – We encourage you to use the MPI engine for continuous batching.
  • option.rolling_batch=auto or lmi-dist – We recommend using auto because it will automatically pick the most appropriate rolling batch algorithm along with other optimizations in the future.
  • option.max_rolling_batch_size=32 – This limits the number of concurrent requests. The default is 32.

With continuous batching, the serving stack (DJLServing) doesn’t wait for all in-flight requests in a batch to complete their decode stage. Rather, at logical breaks (at the end of one iteration of the decode stage), it pulls in additional requests that are waiting in the queue while the current batch is still processing (hence the name rolling batch). It does this check for pending requests at the end of each iteration of the decode stage. Remember, for each request, we need to run the prefill stage followed by the sequential decode stage. Because we can process all the tokens from the initial prompt of a request in parallel for its prefill stage, anytime a new request is pulled in, we temporarily pause the decode stage of the in-flight requests in the batch: we save their KV cache and activations in memory and run the prefill stage of the new requests.

The size of this cache can be configured with the following option:
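# illustrative value (this post's benchmarks use 1024); size it according to the GPU memory available for the KV cache
option.max_rolling_batch_prefill_tokens=1024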

When the prefill is complete, we combine the new requests and the old paused requests in a new rolling batch, which can proceed with their decode stage in parallel. Note that the old paused requests can continue their decode stage where they left off and the new requests will start from their first new token.


Continuous or Iterative Batching Visual – notice that the idle times are replaced with follow-on requests

You might have already realized that continuous batching is similar to how we naturally parallelize tasks in our daily lives. We have messages, emails, and phone notifications (potentially new requests) coming in at random times (analogous to multiple requests coming in a random, staggered fashion for GPUs). This is all happening while we go about completing our in-flight tasks: composing emails, coding, participating in meetings (analogous to the currently processing tasks on the GPUs). At logical breaks, we pause our in-flight tasks and check our notifications to decide if some action is required on our part, and if there is, we add it to our in-flight tasks (a real-life rolling batch) or put it on a to-do list (the queue).

Putting it all together: How to think about memory utilization of GPUs

It’s recommended to load test your model to see which configuration is the most cost-effective for your business use case. To build an understanding, let’s visualize the memory footprint of the GPUs as the model is loaded and as successive requests are processed in a rolling batch. For this post, let’s assume we are loading the Falcon-40B model onto one of the G5 instance types, which come with NVIDIA A10G GPUs, each with 24 GB of memory. A similar understanding applies to the p3, p4, and p5 instance types, which come with the V100, A100, and H100 GPU series.

The following is the overview of getting an approximate value of total memory required to serve Falcon-40B:

  • Model size = Number of model parameters (40 billion for Falcon-40B) x 4 bytes per parameter (for FP32) = 160 GB
  • Approximate total memory required to load Falcon-40B for inference = Model size (160 GB) + KV cache (attention cache) (approximately 20 GB) + Additional memory overhead from ML frameworks (approximately 2 GB)

Memory Visual – Understanding the memory footprint of a loaded Falcon-40B model

For Falcon-40B, if we compress the model by casting it to the bfloat16 (2 bytes) data type, the model size becomes approximately 80 GB. As you can see, this is still larger than the memory of a single accelerator device, so we need to adopt a model partitioning (sharding) technique using tensor parallelism (TP) and shard the model across multiple accelerator devices. Let’s assume that we have chosen g5.24xlarge, which has 4 A10G GPU devices. If we configure DJLServing (serving.properties) with the following, we can expect the 80 GB of model weights to be divided equally across all 4 GPUs:
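option.tensor_parallel_degree=4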

With tensor_parallel_degree set to 4, about 20 GB of each 24 GB GPU (approximately 84%) is already utilized before a single request has been processed. The remaining 16% of GPU memory is used for the KV cache of incoming requests. It’s possible that, for your business scenario with its latency and throughput requirements, 2–3 GB of remaining memory is more than enough. If not, you can increase the instance size to g5.48xlarge, which has 8 GPUs, and set tensor_parallel_degree to 8. In that case, only approximately 10 GB of each GPU’s 24 GB is utilized for model weights, and about 60% of each GPU remains for activations and the KV cache. Intuitively, this configuration may allow us to achieve a higher throughput. Additionally, because we have a larger buffer now, we can increase the max_rolling_batch_prefill_tokens and max_rolling_batch_size parameters to further optimize throughput. Together, these two parameters control the preallocations of the activation prefills and KV cache for the model. Larger values for these two parameters correlate with higher throughput, assuming you have enough buffer for the KV cache in GPU memory.
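As a rough back-of-the-envelope check of these numbers, the following sketch reproduces the per-GPU arithmetic used in this post (the approximately 2 GB framework overhead and the KV cache itself are left out, so the figures are approximations):

PARAMS_BILLION = 40      # Falcon-40B
BYTES_PER_PARAM = 2      # fp16/bfloat16
GPU_MEMORY_GB = 24       # NVIDIA A10G

model_size_gb = PARAMS_BILLION * BYTES_PER_PARAM  # ~80 GB of weights

for tp_degree in (4, 8):  # g5.24xlarge vs. g5.48xlarge
    weights_per_gpu = model_size_gb / tp_degree
    free_gb = GPU_MEMORY_GB - weights_per_gpu
    print(f"tensor_parallel_degree={tp_degree}: ~{weights_per_gpu:.0f} GB of weights per GPU, "
          f"~{free_gb:.0f} GB ({free_gb / GPU_MEMORY_GB:.0%}) left for KV cache and activations")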

Continuous batching with PagedAttention

PagedAttention is a new optimization algorithm developed by UC Berkeley that improves the continuous batching process by allowing the attention cache (KV cache) to be non-contiguous by allocating memory in fixed-size pages or blocks. This is inspired by virtual memory and paging concepts used by operating systems.

As per the vLLM paper, the attention cache of each sequence of tokens is partitioned into blocks and mapped to physical blocks through a block table. During the computation of attention, a PagedAttention kernel can use the block table to efficiently fetch the blocks from physical memory. This results in a significant reduction of memory waste and allows for larger batch size, increased GPU utilization, and higher throughput.
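The block-table idea can be illustrated with a toy sketch (purely conceptual, not vLLM's implementation): the logical KV cache blocks of a sequence are mapped to whatever physical blocks happen to be free, so the cache no longer needs to be contiguous.

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative)

class BlockTable:
    """Maps a sequence's logical KV cache blocks to non-contiguous physical blocks."""
    def __init__(self, free_physical_blocks):
        self.free = list(free_physical_blocks)
        self.logical_to_physical = []  # index = logical block, value = physical block

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the current one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free.pop())

table = BlockTable(free_physical_blocks=range(100))
for t in range(40):  # 40 tokens of a sequence occupy 3 blocks, possibly scattered in memory
    table.append_token(t)
print(table.logical_to_physical)

Because blocks are allocated on demand and freed when a request finishes, memory that would otherwise be reserved for the longest possible sequence can be shared across requests, which is what enables the larger batch sizes.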

Performance comparison

To ensure effective load testing of your deployment configuration, it’s recommended to begin by considering the business scenario and clearly defining the characteristics of the input and output for the LLM-based application. For instance, if you are working on a call center summarization use case, the input could consist of larger text, such as a 500-token chat transcript between a customer service agent and a customer, but the output might be relatively smaller, around 100 tokens, representing a summary of the transcript. On the other hand, if you’re working on a code generation scenario, the input could be as short as 15 tokens, like “write an efficient implementation in Python for describing all EC2 resources, including pagination,” but the output could be much larger, reaching 500 tokens. It’s also important to consider whether achieving lower latency or maximizing throughput is the top priority for your specific scenario.

After gaining a comprehensive understanding of the business scenario, you can analyze and determine the optimal configuration for your hosting environment. In this context, the hosting environment encompasses various key elements, including the instance type and other configuration parameters such as tensor_parallel_degree, max_rolling_batch_size, max_rolling_batch_prefill_tokens, and more. Our objective is to identify the most effective setup to support our response time, throughput, and model output quality requirements.

In our analysis, we benchmarked the performance to illustrate the benefits of continuous batching over traditional dynamic batching. We used the serving.properties configurations detailed below for dynamic batching and iterative batching, using an LMI container on SageMaker.

Dynamic Batching:

engine=Python
option.model_id=tiiuae/falcon-40b
option.tensor_parallel_degree=8
option.dtype=fp16
batch_size=4
max_batch_delay=100
option.trust_remote_code=true

Continuous Batching:

engine=MPI
option.model_id={{s3_url}}
option.trust_remote_code=true
option.tensor_parallel_degree=8
option.max_rolling_batch_size=32
option.rolling_batch=auto
option.dtype=fp16
option.max_rolling_batch_prefill_tokens=1024
option.paged_attention=False

Continuous Batching with PagedAttention:

engine=MPI
option.model_id={{s3_url}}
option.trust_remote_code=true
option.tensor_parallel_degree=8
option.max_rolling_batch_size=32
option.rolling_batch=auto
option.dtype=fp16
option.max_rolling_batch_prefill_tokens=1024
option.paged_attention=True

These configurations were benchmarked for Falcon-40B with the FP16 data type deployed on ml.g5.48xlarge in two scenarios that represent real-world applications:

  • A small number of input tokens with a large number of tokens being generated – In this scenario, the number of input tokens was fixed at 32 and 128 new tokens were generated:

    Batching Strategy                         Throughput (tokens/sec)   Latency p90 (secs)
    Dynamic Batching                          5.53                      58.34
    Continuous Batching                       56.04                     4.74
    Continuous Batching with PagedAttention   59.18                     4.76

  • A large input with a small number of tokens being generated – Here, we fix the number of input tokens at 256 and prompt the LLM to summarize the input to 32 tokens:

    Batching Strategy                         Throughput (tokens/sec)   Latency p90 (secs)
    Dynamic Batching                          19.96                     59.31
    Continuous Batching                       46.69                     3.88
    Continuous Batching with PagedAttention   44.75                     2.67

We can see that continuous batching with PagedAttention provides roughly 10 times higher throughput in scenario 1 and 2.3 times higher in scenario 2 compared to dynamic batching on SageMaker using the LMI container.

Conclusion

In this post, we looked at how LLMs use memory and explained how continuous batching improves the throughput using an LMI container on SageMaker. We demonstrated the benefits of continuous batching for Falcon-40B using an LMI container on SageMaker by showing benchmark results. You can find the code on the GitHub repo.


About the Authors

Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or watching sports.

Abhi Sodhani holds the position of Senior AI/ML Solutions Architect at AWS, where he specializes in offering technical expertise and guidance on Generative AI and ML solutions to customers. His primary focus is to assist Digital Native Businesses in realizing the full potential of Generative AI and ML technologies, enabling them to achieve their business objectives effectively. Beyond his professional endeavors, Abhi exhibits a strong passion for intellectual pursuits such as reading, as well as engaging in activities that promote physical and mental well-being, such as yoga, meditation.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.


Index your web crawled content using the new Web Crawler for Amazon Kendra


Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to provide you with a fully managed experience and simplify the process of indexing your content from a variety of data sources in the enterprise.

One such unstructured data repository is websites, both internal and external. Sites may need to be crawled to create news feeds, analyze language use, or create bots to answer questions based on the website data.

We’re excited to announce that you can now use the new Amazon Kendra Web Crawler to search for answers from content stored in internal and external websites or create chatbots. In this post, we show how to index information stored in websites and use the intelligent search in Amazon Kendra to search for answers from content stored in internal and external websites. In addition, the ML-powered intelligent search can accurately get answers for your questions from unstructured documents with natural language narrative content, for which keyword search is not very effective.

The Web Crawler offers the following new features:

  • Support for Basic, NTLM/Kerberos, Form, and SAML authentication
  • The ability to specify 100 seed URLs and store connection configuration in Amazon Simple Storage Service (Amazon S3)
  • Support for a web and internet proxy with the ability to provide proxy credentials
  • Support for crawling dynamic content, such as a website containing JavaScript
  • Field mapping and regex filtering features

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a crawled website using the Amazon Kendra Web Crawler. The solution consists of the following steps:

  1. Choose an authentication mechanism for the website (if required) and store the details in AWS Secrets Manager.
  2. Create an Amazon Kendra index.
  3. Create a Web Crawler data source V2 via the Amazon Kendra console.
  4. Run a sample query to test the solution.

Prerequisites

To try out the Amazon Kendra Web Crawler, you need the following:

Gather authentication details

For protected and secure websites, the following authentication types and standards are supported:

  • Basic
  • NTLM/Kerberos
  • Form authentication
  • SAML

You need the authentication information when you set up the data source.

For basic or NTLM authentication, you need to provide your Secrets Manager secret, user name, and password.
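If you prefer to create the secret programmatically, a minimal boto3 sketch follows. The secret name and the JSON key names (userName, password) are assumptions for illustration; confirm the exact keys required for your authentication type in the Amazon Kendra documentation.

import json
import boto3

secrets = boto3.client("secretsmanager")
secrets.create_secret(
    Name="AmazonKendra-webcrawler-basic-auth",  # hypothetical secret name
    SecretString=json.dumps({"userName": "crawler-user", "password": "example-password"}),
)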

Form and SAML authentication require additional information, as shown in the following screenshot. Some of the fields, like User name button XPath, are optional and depend on whether the site you are crawling uses a button after entering the user name. Also note that you will need to know how to determine the XPath of the user name and password fields and the submit buttons.


Create an Amazon Kendra index

To create an Amazon Kendra index, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.
  2. For Index name, enter a name for the index (for example, Web Crawler).
  3. Enter an optional description.
  4. For Role name, enter an IAM role name.
  5. Configure optional encryption settings and tags.
  6. Choose Next.
  7. In the Configure user access control section, leave the settings at their defaults and choose Next.
  8. For Provisioning editions, select Developer edition and choose Next.
  9. On the review page, choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
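If you prefer to create the index programmatically instead of via the console, a minimal boto3 sketch looks like the following (the role ARN is a placeholder):

import boto3

kendra = boto3.client("kendra")
response = kendra.create_index(
    Name="Web Crawler",
    Edition="DEVELOPER_EDITION",
    RoleArn="arn:aws:iam::123456789012:role/KendraIndexRole",  # placeholder IAM role
    Description="Index for web crawled content",
)
print(response["Id"])  # the index ID, needed when creating the data source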


Create an Amazon Kendra Web Crawler data source

Complete the following steps to create your data source:

  1. On the Amazon Kendra console, choose Data sources in the navigation pane.
  2. Locate the WebCrawler connector V2.0 tile and choose Add connector.
  3. For Data source name, enter a name (for example, crawl-fda).
  4. Enter an optional description.
  5. Choose Next.
  6. In the Source section, select Source URL and enter a URL. For this post, we use https://www.fda.gov/ as an example source URL.
  7. In the Authentication section, choose the appropriate authentication based on the site that you want to crawl. For this post, we select No authentication because it’s a public site and doesn’t need authentication.
  8. In the Web proxy section, you can specify a Secrets Manager secret (if required).
    1. Choose Create and Add New Secret.
    2. Enter the authentication details that you gathered previously.
    3. Choose Save.
  9. In the IAM role section, choose Create a new role and enter a name (for example, AmazonKendra-Web Crawler-datasource-role).
  10. Choose Next.
  11. In the Sync scope section, configure your sync settings based on the site you are crawling. For this post, we leave all the default settings.
  12. For Sync mode, choose how you want to update your index. For this post, we select Full sync.
  13. For Sync run schedule, choose Run on demand.
  14. Choose Next.
  15. Optionally, you can set field mappings. For this post, we keep the defaults for now.

Mapping fields is a useful exercise where you can substitute field names to values that are user-friendly and that fit in your organization’s vocabulary.

  1. Choose Next.
  2. Choose Add data source.
  3. To sync the data source, choose Sync now on the data source details page.
  4. Wait for the sync to complete.

Example of an authenticated website

If you want to crawl a site that has authentication, then in the Authentication section in the previous steps, you need to specify the authentication details. The following is an example if you selected Form authentication.

  1. In the Source section, select Source URL and enter a URL. For this example, we use https://accounts.autodesk.com.
  2. In the Authentication section, select Form authentication.
  3. In the Web proxy section, specify your Secrets Manager secret. This is required for any option other than No authentication.
    1. Choose Create and Add New Secret.
    2. Enter the authentication details that you gathered previously.
    3. Choose Save.


Test the solution

Now that you have ingested the content from the site into your Amazon Kendra index, you can test some queries.

  1. Go to your index and choose Search indexed content.
  2. Enter a sample search query and test out your search results (your query will vary based on the contents of the site you crawled and the query entered).

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from the site you crawled.
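The same search can also be issued programmatically. A minimal boto3 sketch follows; the index ID and query text are placeholders:

import boto3

kendra = boto3.client("kendra")
result = kendra.query(
    IndexId="0123abcd-0000-0000-0000-000000000000",  # placeholder index ID
    QueryText="What does the FDA say about food labeling?",
)
for item in result["ResultItems"][:3]:
    title = item.get("DocumentTitle", {}).get("Text", "")
    print(item["Type"], "-", title)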

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra Web Crawler V2, delete that data source.

Conclusion

With the new Amazon Kendra Web Crawler V2, organizations can crawl any website that is public or behind authentication and use it for intelligent search powered by Amazon Kendra.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.


About the Author

Jiten Dedhia is a Sr. Solutions Architect with over 20 years of experience in the software industry. He has worked with global financial services clients, providing them advice on modernizing by using services provided by AWS.

Gunwant Walbe is a Software Development Engineer at Amazon Web Services. He is an avid learner and keen to adopt new technologies. He develops complex business applications, and Java is his primary language of choice.


Take the Wheel: NVIDIA NeMo SteerLM Lets Companies Customize a Model’s Responses During Inference


Developers have a new AI-powered steering wheel to help them hug the road while they drive powerful large language models (LLMs) to their desired locations.

NVIDIA NeMo SteerLM lets companies define knobs to dial in a model’s responses as it’s running in production, a process called inference. Unlike current methods for customizing an LLM, it lets a single training run create one model that can serve dozens or even hundreds of use cases, saving time and money.

NVIDIA researchers created SteerLM to teach AI models what users care about, like road signs to follow in their particular use cases or markets. These user-defined attributes can gauge nearly anything — for example, the degree of helpfulness or humor in the model’s responses.

One Model, Many Uses

The result is a new level of flexibility.

With SteerLM, users define all the attributes they want and embed them in a single model. Then they can choose the combination they need for a given use case while the model is running.

For example, a custom model can now be tuned during inference to the unique needs of, say, an accounting, sales or engineering department or a vertical market.

The method also enables a continuous improvement cycle. Responses from a custom model can serve as data for a future training run that dials the model into new levels of usefulness.

Saving Time and Money

To date, fitting a generative AI model to the needs of a specific application has been the equivalent of rebuilding an engine’s transmission. Developers had to painstakingly label datasets, write lots of new code, adjust the hyperparameters under the hood of the neural network and retrain the model several times.

SteerLM replaces those complex, time-consuming processes with three simple steps:

  • Using a basic set of prompts, responses, and desired attributes, customize an AI model that predicts how those attributes will perform.
  • Automatically generate a dataset using this model.
  • Train the model on that dataset using standard supervised fine-tuning techniques.
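Conceptually, the last step amounts to supervised fine-tuning on samples whose prompts are annotated with the desired attribute values, so the model learns to condition its responses on them. The sketch below is purely illustrative; the attribute names, value scale, and prompt format are assumptions, not the exact format NeMo or SteerLM uses.

# Purely illustrative attribute-conditioned fine-tuning sample.
def build_sample(prompt, response, attributes):
    attr_str = ",".join(f"{k}:{v}" for k, v in attributes.items())
    return {"input": f"{prompt}\n[attributes] {attr_str}", "target": response}

sample = build_sample(
    prompt="Explain what a GPU does.",
    response="A GPU is a processor designed for highly parallel workloads ...",
    attributes={"helpfulness": 9, "humor": 1, "verbosity": 4},
)
print(sample["input"])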

Many Enterprise Use Cases

Developers can adapt SteerLM to nearly any enterprise use case that requires generating text.

With SteerLM, a company might produce a single chatbot it can tailor in real time to customers’ changing attitudes, demographics or circumstances in the many vertical markets or geographies it serves.

SteerLM also enables a single LLM to act as a flexible writing co-pilot for an entire corporation.

For example, lawyers can modify their model during inference to adopt a formal style for their legal communications. Or marketing staff can dial in a more conversational style for their audience.

Game On With SteerLM

To show the potential of SteerLM, NVIDIA demonstrated it on one of its classic applications: gaming.

Today, some games pack dozens of non-playable characters — characters that the player can’t control — which mechanically repeat prerecorded text, regardless of the user or situation.

SteerLM makes these characters come alive, responding with more personality and emotion to players’ prompts. It’s a tool game developers can use to unlock unique new experiences for every player.

The Genesis of SteerLM

The concept behind the new method arrived unexpectedly.

“I woke up early one morning with this idea, so I jumped up and wrote it down,” recalled Yi Dong, an applied research scientist at NVIDIA who initiated the work on SteerLM.

While building a prototype, he realized a popular model-conditioning technique could also be part of the method. Once all the pieces came together and his experiment worked, the team helped articulate the method in four simple steps.

It’s the latest advance in model customization, a hot area in AI research.

“It’s a challenging field, a kind of holy grail for making AI more closely reflect a human perspective — and I love a new challenge,” said the researcher, who earned a Ph.D. in computational neuroscience at Johns Hopkins University, then worked on machine learning algorithms in finance before joining NVIDIA.

Get Hands on the Wheel

SteerLM is available as open-source software for developers to try out today. They can also get details on how to experiment with a Llama-2-13b model customized using the SteerLM method.

For users who want full enterprise security and support, SteerLM will be integrated into NVIDIA NeMo, a rich framework for building, customizing and deploying large generative AI models.

The SteerLM method works on all models supported on NeMo, including popular community-built pretrained LLMs such as Llama-2 and BLOOM.

Read a technical blog to learn more about SteerLM.




ML Model Server Resource Saving – Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance

Reviewers: Yunsang Ju (Naver GplaceAI Leader), Min Jean Cho (Intel), Jing Xu (Intel), Mark Saroufim (Meta)

Intro

Here, we share our experience in moving AI workloads from our GPU servers to our Intel CPU servers without any performance or quality degradation, saving annual costs of approximately 340,000 U.S. dollars (refer to the Conclusion) in the process.

We aim to provide value to our consumers by serving various AI models that enhance the Online to Offline (O2O) experience. With the ongoing growth in demand for new models and the limited availability of high-cost GPU resources, we needed to transition relatively lightweight AI models from GPU servers to Intel CPU servers to reduce resource consumption. With the same setup, however, the CPU servers initially delivered performance (RPS, inference time, and so on) that was tens of times worse. We applied various engineering techniques and lightened the models to solve this problem, and we were able to successfully transition to the Intel CPU servers with the same or better performance than the GPU servers with just a threefold scale-out.

For a more detailed introduction about our team, please refer to the Introduction to NAVER Place AI Development Team.

As I’ll mention again later, Grokking PyTorch Intel CPU Performance From First Principles, written by Intel and PyTorch, was a great help throughout this work.

Problem Definition

1: Service Architecture


Simplified service architecture (Image Source: NAVER GplaceAI)

To facilitate understanding, a brief introduction to our service architecture is provided. CPU-intensive tasks such as preprocessing the input to tensor format (then forwarded to the model) and post-processing inference results into human-readable output (e.g., natural language and image formats) are performed on the App Server (FastAPI). The Model Server (TorchServe) exclusively handles inference operations. For stable operation of the service, the following actions need to be performed with sufficient throughput and low latency.

The specific processing sequence is as follows:

  • The client submits a request to the app server via the Traefik gateway.
  • The app server preprocesses the input by performing actions such as resizing and transforming, converting it into a Torch tensor, and then calls the model server (a rough sketch of this hop follows the list).
  • The model server performs inference and returns the features to the app server.
  • The app server converts the features into a human-readable format through post-processing and returns them to the client.
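The following sketch shows roughly what the app-server-to-model-server hop looks like. The host name, model name, and preprocessing details are illustrative (and, unlike our production code, it posts the resized image bytes rather than a serialized tensor for brevity); the /predictions/<model_name> route is TorchServe's standard inference endpoint.

import io

import requests
from PIL import Image

MODEL_SERVER = "http://model-server:8080"  # TorchServe inference address (placeholder)

def score_image(image_bytes: bytes) -> dict:
    # Preprocessing on the app server: resize before sending to the model server.
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    # TorchServe serves registered models under /predictions/<model_name>.
    response = requests.post(
        f"{MODEL_SERVER}/predictions/image-scoring",
        data=buffer.getvalue(),
        headers={"Content-Type": "application/octet-stream"},
    )
    response.raise_for_status()
    return response.json()  # features/score, post-processed by the app server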

2:  Throughput and Latency Measurement


Comparison of Image Scoring Models

With all other conditions remaining the same, we deployed on three times as many CPU server pods, yet, notably, the RPS (requests per second) and response time deteriorated by more than tenfold. While it was not surprising that CPU inference performance is inferior to GPUs, the situation was challenging. Given the goal of maintaining performance within limited resources, an approximately 10 to 20 times performance improvement was necessary, barring any additional scaling.

3: Challenges From a Throughput Perspective

Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /predictions/image-scoring                                                        37     0(0.00%) |   9031    4043   28985   8200 |    1.00        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                        37     0(0.00%) |   9031    4043   28985   8200 |    1.00        0.00

One of the first steps TorchServe users might take to improve throughput is to increase the number of workers in TorchServe. This approach is effective on GPU servers because workloads are processed in parallel, with memory usage increasing only linearly as workers scale. However, we experienced worse performance when increasing the number of workers. Identifying the cause of this performance degradation on CPU servers required further investigation.

4: Challenges From a Latency Perspective

Our primary concern was latency. Throughput improvement is normally achievable when a system’s implementation is faithful to scale-out principles, except for perhaps very rare worst-case scenarios. However, in the case of the Image Scoring model example, even performing a single inference took more than 1 second, and as the request volume increased, latency increased to as much as 4 seconds. It was a situation where the timeout criteria to satisfy the client could not be met even with a single inference.

Proposed Solutions

Improvements were needed from both an ML and an engineering perspective. It was essential to fundamentally reduce the inference time on the CPU and to identify the causes of performance degradation when applying configurations that generally enhance performance, in order to find the optimal configuration values. To accomplish this, we collaborated with MLE professionals to concurrently pursue two tasks: model lightweighting without compromising performance, and identifying the optimal configuration for peak performance. Using these approaches, we were able to effectively transition workload handling to our CPU servers.

1: Resolving Low RPS from an Engineering Perspective

First, the reason for the performance degradation even after increasing the number of workers was a front-end bound caused by logical threads in GEMM operations. Generally, increasing the number of workers is expected to improve performance through increased parallelism; if performance decreases instead, one can infer a corresponding trade-off is at play.

CPU + GPU (Image Source: Nvidia)

As many are aware, the reason model inference performance on CPUs is inferior to GPUs lies in the difference in hardware design, particularly in terms of multi-threading capabilities. Diving deeper, model inference is fundamentally a repetition of GEMM (General Matrix Multiply) operations, and these GEMM operations are executed independently in “fused-multiply-add” (FMA) or “dot-product” (DP) execution units. If the GEMM operation becomes a bottleneck on the CPU, increasing parallelism might actually result in decreased performance. While researching the problem we found relevant information within the PyTorch documentation.

While two logical threads run GEMM at the same time, they will be sharing the same core resources causing front-end bound

This information highlighted that logical threads could cause a bottleneck in CPU GEMM operations, which helped us intuitively understand why performance decreased when increasing the worker count: the default number of torch threads corresponds to the number of physical CPU cores.

root@test-pod:/# lscpu
  …
Thread(s) per core: 2
Core(s) per socket: 12
  …
root@test-pod:/# python
>>> import torch
>>> print(torch.get_num_threads())
24

When worker_num increases, the total thread count increases by the product of the physical core count and the worker number, so logical threads end up being used. To improve performance, the total number of threads per worker was adjusted to align with the physical core count. Below, it can be observed that RPS increased approximately threefold, from 2.1 to 6.3, when worker_num was increased to 4 and the total thread count was aligned with the number of physical cores.

Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /predictions/image-scoring                                                       265     0(0.00%) |   3154    1885    4008   3200 |    6.30        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                       265     0(0.00%) |   3154    1885    4008   3200 |    6.30        0.00
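One way to make the per-worker thread count equal physical_cores / worker_num is to set it explicitly with torch.set_num_threads, for example in the model handler. This is an illustration of the adjustment described above, not necessarily the exact mechanism we used.

import torch

AVAILABLE_CORES = 24  # physical cores, or the pod CPU limit under Kubernetes (see Cautionary Note 1 below)
WORKER_NUM = 4

# Cap intra-op threads so that worker_num * threads_per_worker == available cores.
torch.set_num_threads(AVAILABLE_CORES // WORKER_NUM)
print(torch.get_num_threads())  # 6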

Cautionary Note 1: Our team uses Kubernetes to maintain our deployments, so we had to adjust the thread count according to the CPU resource limit of the pod, rather than the physical core count of the node that can be checked using the lscpu command. (Setting the torch thread of each worker to 8/4 = 2, or 24/4 = 6, resulted in performance degradation.)

Cautionary Note 2: Since the torch thread setting for each worker can only be configured as an integer, it's advisable to set a CPU limit that is divisible by the worker_num in order to adequately utilize the CPU.

For example, core=8 and worker_num=3: int(8/worker_num) = 2, so 2*worker_num/8 = 75% of the CPU is utilized.

For example, core=8 and worker_num=4: int(8/worker_num) = 2, so 2*worker_num/8 = 100% of the CPU is utilized.

We also analyzed the model containers to see why we got a mere threefold improvement in performance despite a four times increase in the number of workers. Various resources were monitored, and among them, the core utilization rate was identified as the underlying cause.


Even when the total thread count was adjusted to match the CPU (2nd Generation Intel(R) Xeon(R) Silver 4214) limit (8 cores), there were instances where computations were executed on logical cores. Because there are 24 physical cores, the cores numbered 25 to 48 are classified as logical cores. Confining thread execution solely to physical cores seemed to offer the potential for further performance enhancement. The reference to this solution could be found in the source document mentioned in the PyTorch-geometric article that warned about CPU GEMM bottlenecks.

As per the instructions in the document, Intel provides Intel® Extension for PyTorch, with which we can simply pin cores to specific sockets. The application method is also very simple: add the following settings to the TorchServe config.properties file (we used intel_extension_for_pytorch==1.13.0).

ipex_enable=true
cpu_launcher_enable=true

Two-socket configuration (Image Source: PyTorch)

Beyond the removal of logical threads through socket pinning, there is an additional effect: eliminating the overhead of cross-socket cache access. Because the CPU comprises more than one socket, when threads scheduled on socket 1 are rescheduled on socket 2, accesses to socket 1's cache go over the Intel Ultra Path Interconnect (UPI). UPI access to a remote cache is more than twice as slow as local cache access, resulting in more bottlenecks. With threads pinned to socket units by the oneAPI-powered Intel® Extension for PyTorch, we observed RPS handling increase by up to four times compared to when the bottleneck existed.

Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /predictions/image-scoring                                                       131     0(0.00%) |   3456    1412    6813   3100 |    7.90        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                       131     0(0.00%) |   3456    1412    6813   3100 |    7.90        0.00

Cautionary Note 1: Intel® Extension for PyTorch is specialized in neural network (referred to as “nn” hereafter) inference optimization, so the performance improvement from additional techniques outside nn might be minimal. Indeed, in the instance of the image scoring system highlighted as an example, where svr (support vector regression) is applied post-inference, the performance enhancement was confined to a 4-fold increase. However, for a purely nn inference model such as the food recognition model, a performance boost of 7-fold (2.5rps -> 17.5rps) was detected.

Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /predictions/food-classification                                                 446     0(0.00%) |   1113     249    1804   1200 |   17.50        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                       446     0(0.00%) |   1113     249    1804   1200 |   17.50        0.00

Cautionary Note 2: Applying Intel® Extension for PyTorch requires torchserve version 0.6.1 or higher. Since our team was using version 0.6.0, there was an issue where socket pinning was not functioning correctly. Currently, we have made modifications to the guide document, specifying the required version.

Within WorkerLifeCycle.java, multi-worker pinning is not supported in 0.6.0 and below (ninstance is hardcoded to 1)

// 0.6.0 version

public ArrayList<String> launcherArgsToList() {
   ArrayList<String> arrlist = new ArrayList<String>();
   arrlist.add("-m");
   arrlist.add("intel_extension_for_pytorch.cpu.launch");
   arrlist.add(" — ninstance");
   arrlist.add("1");
   if (launcherArgs != null && launcherArgs.length() > 1) {
     String[] argarray = launcherArgs.split(" ");
     for (int i = 0; i < argarray.length; i++) {
       arrlist.add(argarray[i]);
     }
   }
   return arrlist;
 }
// master version

if (this.numWorker > 1) {
   argl.add(" — ninstances");
   argl.add(String.valueOf(this.numWorker));
   argl.add(" — instance_idx");
   argl.add(String.valueOf(this.currNumRunningWorkers));
 }

2: Addressing Slow Latency Through Model Lightweighting

We also streamlined our model using knowledge distillation (commonly abbreviated as KD) to further reduce latency. As is widely known, KD is a technique where knowledge from a larger network (teacher network) is transferred to a smaller, lightweight network (student network) that is less resource intensive and can be more readily deployed. For more detailed information, please refer to the paper where this concept was initially introduced, titled Distilling the Knowledge in a Neural Network.
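For reference, the classic formulation distills the teacher's softened output distribution into the student with a temperature-scaled KL divergence term on top of the usual supervised loss. The sketch below shows that classic loss only; as described next, our production setup used a correlation-based variant instead.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    # Soft targets from the teacher, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Usual hard-label cross-entropy on the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard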


There is a variety of KD techniques available, and because we were primarily focused on minimizing accuracy loss, we adopted the approach from the paper Knowledge Distillation from A Stronger Teacher, published in 2022. The concept is straightforward: unlike conventional distillation that uses only the model's output probability values, the chosen approach has the student network learn the correlations between classes in the teacher network. When put into actual application, we observed an effective reduction in the model's size while maintaining high accuracy. The following are the outcomes of our experimentation with this knowledge distillation technique on several candidate student models, where selections were made based on the maintained level of accuracy.

table of services

For the image scoring system, additional measures were taken to reduce the input size. Considering that a CPU-based ML technique, SVR (Support Vector Regression), was previously used (2-stage: CNN + SVR), even when this was streamlined into a 1-stage model, significant speed advantages were not observed in CPU inference. For the streamlining to be meaningful, the input size of the student model during inference needed to be reduced further. Consequently, experiments were conducted with the input size reduced from 384x384 to 224x224.

Further simplifying the transformations, the 2-stage (CNN + SVR) approach was unified into a 1-stage model with a larger ConvNext, and then KD was applied using the lightweight EfficientNet to resolve the accuracy trade-off. During the experiments, we encountered a problem where changing the image resize to 224 worsened the MAE from 0.4007 to 0.4296. Due to the reduced input size, various preprocessing techniques applied to the original training images (such as Affine, RandomRotate90, Blur, OneOf [GridDistortion, OpticalDistortion, ElasticTransform], VerticalFlip) had a counterproductive effect. By adopting these measures, effective training of the student was achieved, and the MAE improved by 25% compared to the previous model (0.518 to 0.3876).

Validation

1: Final Performance Measurement

The following shows the final performance improvements using CPU servers, on the three models mentioned throughout this article.

# Food photo classifier (pod 3): 2.5rps -> 84rps

Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /predictions/food-classification                                                2341     0(0.00%) |    208     130     508    200 |   84.50        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      2341     0(0.00%) |    208     130     508    200 |   84.50        0.00

# Image scoring (pod 3): 2.1rps -> 62rps

Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /predictions/image-scoring                                                      1298     0(0.00%) |    323      99     607    370 |   61.90        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1298     0(0.00%) |    323      99     607    370 |   61.90        0.00

# receipt classifier(pod 3) : 20rps -> 111.8rps
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /predictions/receipt-classification                                             4024     0(0.00%) |    266     133    2211    200 |   111.8        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      4020     0(0.00%) |    266     133    2211    200 |   111.8        0.00

2:  Traffic Mirroring

As previously mentioned, our team’s service architecture employs the tool “traefik” as a gateway in front of the app server, as briefly introduced at the beginning of the article. For final validation, the mirroring feature of this traefik gateway was utilized to mirror traffic from production to staging for a month of validation before applying it to production, which is now operational.

Details regarding mirroring are beyond the scope of this topic and hence omitted. For those interested, kindly refer to the document at https://doc.traefik.io/traefik/routing/services/#mirroring-service.

In Conclusion

This concludes the discussion of transitioning from GPU model servers to CPU servers while maintaining service quality. Through this effort, our team was able to free up 15 GPUs each in South Korea and Japan, resulting in annual cost savings of approximately 340,000 U.S. dollars. Although we directly purchase and use GPUs within NAVER, we calculated a rough cost reduction based on AWS EC2 instances that stably support T4 GPUs.

instance sizes

Calculation: 1.306 (1-year reserved instance effective hourly cost) * 24 (hours) * 365 (days) * 15 (number of GPUs) * 2 (KR + JP)

These secured GPUs will be harnessed to further advance and enhance our team’s AI services, delivering exceptional service experiences. We sincerely appreciate your encouragement and anticipation.:)



New – No-code generative AI capabilities now available in Amazon SageMaker Canvas


Launched in 2021, Amazon SageMaker Canvas is a visual, point-and-click service that allows business analysts and citizen data scientists to use ready-to-use machine learning (ML) models and build custom ML models to generate accurate predictions without the need to write any code. Ready-to-use models enable you to derive immediate insights from text, image, and document data (such as sentiment analysis, document processing, or object detection in images). Custom models allow you to build predictive models for use cases such as demand forecasting, customer churn, and defect detection in manufacturing.

We are excited to announce that SageMaker Canvas is expanding its support of ready-to-use models to include foundation models (FMs), enabling you to use generative AI to generate and summarize content. You can use natural language with a conversational chat interface to perform tasks such as creating narratives, reports, and blog posts; answering questions; summarizing notes and articles; and explaining concepts, without writing a single line of code. Your data is not used to improve the base models, is not shared with third-party model providers, and stays entirely within your secure AWS environment.

SageMaker Canvas allows you to access a variety of FMs, including Amazon Bedrock models (such as Claude 2 from Anthropic and Jurassic-2 from AI21 Labs) and publicly available Amazon SageMaker JumpStart models (including Falcon-7B-Instruct, Falcon-40B-Instruct, and MPT-7B-Instruct). You may use a single model or up to three models to compare model responses side by side. In SageMaker Canvas, Amazon Bedrock models are always active, allowing you to use them instantly. SageMaker JumpStart models can be started and deployed in your AWS account on demand and are automatically shut down after two hours of inactivity.

Let’s explore how to use the generative AI capabilities of SageMaker Canvas. For this post, we work with a fictitious enterprise customer support use case as an example.

Prerequisites

Complete the following prerequisite steps:

  1. Create an AWS account.
  2. Set up SageMaker Canvas and optionally configure it to use a VPC without internet access.
  3. Set up model access in Amazon Bedrock.
  4. Request service quota increases for g5.12xlarge and g5.2xlarge, if required, in your Region. These instances are required to host the SageMaker JumpStart model endpoints. Other instances may be selected based on availability.

Handling customer complaints

Let’s say that you’re a customer support analyst who handles complaints for a bicycle company. When receiving a customer complaint, you can use SageMaker Canvas to analyze the complaint and generate a personalized response to the customer. To do so, complete the following steps:

  1. On the SageMaker console, choose Canvas in the navigation pane.
  2. Choose your domain and user profile and choose Open Canvas to open the SageMaker Canvas application.

SageMaker Canvas is also accessible using single sign-on or other existing identity providers (IdPs) without having to first access the SageMaker console.

  1. Choose Generate, extract and summarize content to open the chat console.
  2. With the Claude 2 model selected, enter your instructions to retrieve the customer sentiment for the provided complaint and press Enter.
  3. You may want to know the specific problems with the bicycle, especially if it’s a long complaint, so ask the model to list them. Note that you don’t have to repost the complaint because SageMaker Canvas stores the context for your chat.

Now that you understand the customer’s problem, you can send them a response that includes a link to the company’s feedback form.

  1. In the input window, request a response to the customer complaint.
  2. If you want to generate another response from the FM, choose the refresh icon in the response section.

The original response and all new responses are paginated within the response section. Note that the new response is different from the original response. You can choose the copy icon in the response section to copy the response to an email or document, as required.

  1. You can also modify the model’s response by requesting specific changes. For example, let’s ask the model to add a $50 gift card offer to the email response.

Comparing model responses

You can compare the model responses from multiple models (up to three). Let’s compare two Amazon Bedrock models (Claude 2 and Jurassic-2 Ultra) with a SageMaker JumpStart model (Falcon-7B-Instruct) to evaluate and find the best model for your use case:

  1. Choose New chat to open a chat interface.
  2. On the model drop-down menu, choose Start up another model.
  3. On the Foundation models page, under Amazon SageMaker JumpStart models, choose Falcon-7B-Instruct and in the right pane, choose Start up model.

The model will take around 10 minutes to start.

  1. On the Foundation models page, confirm that the Falcon-7B-Instruct model is active before proceeding to the next step.
  2. Choose New chat to open a chat interface.
  3. Choose Compare to display a drop-down menu for the second model, then choose Compare again to display a drop-down menu for the third model.
  4. Choose the Falcon-7B-Instruct model on the first drop-down menu, Claude 2 on the second drop-down menu, and Jurassic-2 Ultra on the third drop-down menu.
  5. Enter your instructions in the chat input box and press Enter.

You will see responses from all three models.

Clean up

Any SageMaker JumpStart models started from SageMaker Canvas will be automatically shut down after 2 hours of inactivity. If you want to shut down these models sooner to save costs, follow the instructions in this section. Note that Amazon Bedrock models are not deployed in your account, so there is no need to shut these down.

  1. To shut down the Falcon-7B-Instruct SageMaker JumpStart model, you can choose from two methods:
    1. On the results comparison page, choose the Falcon-7B-Instruct model’s options menu (three dots), then choose Shut down model.
    2. Alternatively, choose New chat, and on the model drop-down menu, choose Start up another model. Then, on the Foundation models page, under Amazon SageMaker JumpStart models, choose Falcon-7B-Instruct and in the right pane, choose Shut down model.
  2. Choose Log out in the left pane to log out of the SageMaker Canvas application to stop the consumption of SageMaker Canvas workspace instance hours and release all resources used by the workspace instance.

Conclusion

In this post, you learned how to use SageMaker Canvas to generate text with ready-to-use models from Amazon Bedrock and SageMaker JumpStart. You used the Claude 2 model to analyze the sentiment of a customer complaint, ask questions, and generate a response without a single line of code. You also started a publicly available model and compared responses from three models.

For Amazon Bedrock models, you are charged based on the volume of input tokens and output tokens as per the Amazon Bedrock pricing page. Because SageMaker JumpStart models are deployed on SageMaker instances, you are charged for the duration of usage based on the instance type as per the Amazon SageMaker pricing page.

SageMaker Canvas continues to democratize AI with a no-code visual, interactive workspace that allows business analysts to build ML models that address a wide variety of use cases. Try out the new generative AI capabilities in SageMaker Canvas today! These capabilities are available in all Regions where Amazon Bedrock or SageMaker JumpStart are available.


About the Authors

Anand Iyer has been a Principal Solutions Architect at AWS since 2016. Anand has helped global healthcare, financial services, and telecommunications clients architect and implement enterprise software solutions using AWS and hybrid cloud technologies. He has an MS in Computer Science from Louisiana State University Baton Rouge, and an MBA from USC Marshall School of Business, Los Angeles. He is AWS certified in the areas of Security, Solutions Architecture, and DevOps Engineering.

Gavin Satur is a Principal Solutions Architect at Amazon Web Services. He works with enterprise customers to build strategic, well-architected solutions and is passionate about automation. Outside of work, he enjoys family time, tennis, cooking, and traveling.

Gunjan Jain is an AWS Solutions Architect in SoCal and primarily works with large financial services companies. He helps with cloud adoption, cloud optimization, and adopting best practices for being Well-Architected on the cloud.

Harpreet Dhanoa, a seasoned Senior Solutions Architect at AWS, has a strong background in designing and building scalable distributed systems. He is passionate about machine learning, observability, and analytics. He enjoys helping large-scale customers build their cloud enterprise strategy and transform their business in AWS. In his free time, Harpreet enjoys playing basketball with his two sons and spending time with his family.

Read More

Whisper models for automatic speech recognition now available in Amazon SageMaker JumpStart

Today, we’re excited to announce that the OpenAI Whisper foundation model is available for customers using Amazon SageMaker JumpStart. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680 thousand hours of labelled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. SageMaker JumpStart is the machine learning (ML) hub of SageMaker that provides access to foundation models in addition to built-in algorithms and end-to-end solution templates to help you quickly get started with ML.

You can also perform ASR using Amazon Transcribe, a fully managed and continuously trained automatic speech recognition service.

In this post, we show you how to deploy the OpenAI Whisper model and invoke the model to transcribe and translate audio.

The OpenAI Whisper model uses the huggingface-pytorch-inference container. As a SageMaker JumpStart model hub customer, you can use ASR without having to maintain the model script outside of the SageMaker SDK. SageMaker JumpStart models also improve security posture with endpoints that enable network isolation.

Foundation models in SageMaker

SageMaker JumpStart provides access to a range of models from popular model hubs including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained on billions of parameters and can be adapted to a wide category of use cases, such as text summarization, generating digital art, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.

You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using pre-built notebooks from model providers. Because the models are hosted and deployed on AWS, you can trust that your data, whether used for evaluating the model or using it at scale, won’t be shared with third parties.

OpenAI Whisper foundation models

Whisper is a pre-trained model for ASR and speech translation. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford and others from OpenAI. The original code can be found in this GitHub repository.

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680 thousand hours of labelled speech data annotated using large-scale weak supervision. Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning.

The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.

Whisper checkpoints come in five configurations of varying model sizes. The smallest four are trained on either English-only or multilingual data. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints are available on the Hugging Face hub. The checkpoints are summarized in the following table with links to the models on the hub:

Model name          Number of parameters    Multilingual
whisper-tiny        39 M                    Yes
whisper-base        74 M                    Yes
whisper-small       244 M                   Yes
whisper-medium      769 M                   Yes
whisper-large       1550 M                  Yes
whisper-large-v2    1550 M                  Yes

Let’s explore how you can use Whisper models in SageMaker JumpStart.

OpenAI Whisper foundation models WER and latency comparison

The word error rate (WER) for different OpenAI Whisper models based on the LibriSpeech test-clean is shown in the following table.  WER is a common metric for the performance of a speech recognition or machine translation system. It measures the difference between the reference text (the ground truth or the correct transcription) and the output of an ASR system in terms of the number of errors, including substitutions, insertions, and deletions that are needed to transform the ASR output into the reference text. These numbers have been taken from the Hugging Face website.

Model               WER (percent)
whisper-tiny        7.54
whisper-base        5.08
whisper-small       3.43
whisper-medium      2.9
whisper-large       3
whisper-large-v2    3
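
To make the metric concrete, here is a small illustration of how WER could be computed with the open-source jiwer package (an assumption for illustration only; it is not part of the Whisper or SageMaker tooling described in this post):

# pip install jiwer
import jiwer

reference = "we are living in very exciting times with machine learning"
hypothesis = "we are living in very exciting times with machine lighting"   # one substituted word

# 1 substitution out of 10 reference words -> WER = 0.1 (10 percent)
print(jiwer.wer(reference, hypothesis))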

For this post, we took the following audio file and compared the latency of speech recognition across the different Whisper models. Latency is the amount of time from the moment that a user sends a request until the time that your application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same audio file with the model hosted on an ml.g5.2xlarge instance.

Model               Average latency (s)    Model output
whisper-tiny        0.43                   We are living in very exciting times with machine lighting. The speed of ML model development will really actually increase. But you won’t get to that end state that we won in the next coming years. Unless we actually make these models more accessible to everybody.
whisper-base        0.49                   We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we won in the next coming years. Unless we actually make these models more accessible to everybody.
whisper-small       0.84                   We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.
whisper-medium      1.5                    We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.
whisper-large       1.96                   We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.
whisper-large-v2    1.98                   We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.
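
The post doesn’t include its benchmarking code, but a measurement like this could be approximated as follows (a minimal sketch, assuming a deployed predictor and a 16 kHz .wav payload as shown in the walkthrough below):

import time

# Send the same audio file 100 times and average the end-to-end request latency
latencies = []
for _ in range(100):
    start = time.perf_counter()
    predictor.predict(wav_file_read)
    latencies.append(time.perf_counter() - start)

print(f"Average latency: {sum(latencies) / len(latencies):.2f} s")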

Solution walkthrough

You can deploy Whisper models using the Amazon SageMaker console or an Amazon SageMaker notebook. In this post, we demonstrate how to deploy the Whisper model using the SageMaker Studio console or a SageMaker notebook and then use the deployed model for speech recognition and language translation. The code used in this post can be found in this GitHub notebook.

Let’s expand each step in detail.

Deploy Whisper from the console

  1. To get started with SageMaker JumpStart, open the Amazon SageMaker Studio console and go to the launch page of SageMaker JumpStart and select Get Started with JumpStart.
  2. To choose a Whisper model, you can either use the tabs at the top or use the search box at the top right as shown in the following screenshot. For this example, use the search box on the top right and enter Whisper, and then select the appropriate Whisper model from the dropdown menu.
  3. After you select the Whisper model, you can use the console to deploy the model. You can select an instance for deployment or use the default.

Deploy the foundation model from a SageMaker notebook

To deploy the model from a notebook and then use the deployed model to solve different tasks, complete the following steps:

  1. Set up
  2. Select a model
  3. Retrieve artifacts and deploy an endpoint
  4. Use deployed model for ASR
  5. Use deployed model for language translation
  6. Clean up the endpoint

Set up

This notebook was tested on an ml.t3.medium instance in SageMaker Studio with the Python 3 (data science) kernel and in an Amazon SageMaker Notebook instance with the conda_python3 kernel.

%pip install --upgrade sagemaker --quiet

Select a pre-trained model

Set up a SageMaker Session using Boto3, and then select the model ID that you want to deploy.

model_id = "huggingface-asr-whisper-large-v2"
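
The session setup mentioned above might look like the following (a minimal sketch assuming default AWS credentials and Region; JumpStartModel also creates a default session if you don’t pass one explicitly):

import boto3
import sagemaker

# SageMaker session backed by a Boto3 session (default credentials and Region assumed)
boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)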

Retrieve artifacts and deploy an endpoint

Using SageMaker, you can perform inference on the pre-trained model, even without fine-tuning it first on a new dataset. To host the pre-trained model, create an instance of the JumpStartModel class and deploy it. The following code uses the default instance ml.g5.2xlarge for the inference endpoint of a whisper-large-v2 model. You can deploy the model on other instance types by passing instance_type to the JumpStartModel class. The deployment might take a few minutes.

# Deploy the model

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer

my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy()

Automatic speech recognition

Next, you read the sample audio file, sample1.wav, from a SageMaker JumpStart public Amazon Simple Storage Service (Amazon S3) location and pass it to the predictor for speech recognition. You can replace this sample file with any other .wav file, but make sure the file is sampled at 16 kHz because that is required by the automatic speech recognition models. The input audio file must be less than 30 seconds.

import boto3
from sagemaker.jumpstart import utils

# The .wav file must be sampled at 16 kHz (required by the automatic speech recognition models),
# so resample it if needed. The input audio file must be less than 30 seconds.
s3_bucket = utils.get_jumpstart_content_bucket(boto3.Session().region_name)
key_prefix = "training-datasets/asr_notebook_data"
input_audio_file_name = "sample1.wav"

s3_client = boto3.client("s3")
s3_client.download_file(s3_bucket, f"{key_prefix}/{input_audio_file_name}", input_audio_file_name)

with open(input_audio_file_name, "rb") as file:
    wav_file_read = file.read()

# If you receive a client error (413), check the payload size to the endpoint.
# Payloads for SageMaker invoke endpoint requests are limited to about 5 MB.
response = predictor.predict(wav_file_read)
print(response["text"])

This model supports many parameters when performing inference. They include:

  • max_length: The maximum length of the generated output. If specified, it must be a positive integer.
  • language and task: Specify the output language and the task. The model supports the tasks of transcription and translation.
  • max_new_tokens: The maximum number of tokens to generate.
  • num_return_sequences: The number of output sequences returned. If specified, it must be a positive integer.
  • num_beams: The number of beams used in beam search. If specified, it must be an integer greater than or equal to num_return_sequences.
  • no_repeat_ngram_size: The model ensures that a sequence of words of no_repeat_ngram_size isn’t repeated in the output sequence. If specified, it must be a positive integer greater than 1.
  • temperature: This controls the randomness in the output. A higher temperature results in an output sequence with lower-probability words, and a lower temperature results in an output sequence with higher-probability words. As temperature approaches 0, decoding becomes greedy. If specified, it must be a positive float.
  • early_stopping: If True, text generation is finished when all beam hypotheses reach the end-of-sentence token. If specified, it must be Boolean.
  • do_sample: If True, the next word is sampled according to its likelihood. If specified, it must be Boolean.
  • top_k: In each step of text generation, sample from only the top_k most likely words. If specified, it must be a positive integer.
  • top_p: In each step of text generation, sample from the smallest possible set of words with cumulative probability top_p. If specified, it must be a float between 0 and 1.

You can specify any subset of the preceding parameters when invoking an endpoint. Next, we show you an example of how to invoke an endpoint with these arguments.
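
For instance, the following is a minimal sketch that passes a few of these parameters alongside the hex-encoded audio, mirroring the JSON payload format used in the translation example that follows (the exact set of accepted parameters may vary by model version):

from sagemaker.serializers import JSONSerializer

# Hex-encode the 16 kHz .wav bytes and add a subset of the generation parameters
payload = {
    "audio_input": wav_file_read.hex(),
    "max_new_tokens": 128,
    "num_beams": 4,
    "num_return_sequences": 1,
    "temperature": 0.7,
    "do_sample": True,
}

predictor.serializer = JSONSerializer()
predictor.content_type = "application/json"

response = predictor.predict(payload)
print(response["text"])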

Language translation

To showcase language translation using Whisper models, use the following audio file in French and translate it to English. The file must be sampled at 16 kHz (as required by the ASR models), so make sure to resample files if required and make sure your samples don’t exceed 30 seconds.

  1. Download sample_french1.wav from the SageMaker JumpStart public S3 location so it can be passed in the payload for translation by the Whisper model.

    input_audio_file_name = "sample_french1.wav"
    
    s3_client.download_file(s3_bucket, f"{key_prefix}/{input_audio_file_name}", input_audio_file_name)

  2. Set the task parameter as translate and language as French to force the Whisper model to perform speech translation.
    with open(input_audio_file_name, "rb") as file:
        wav_file_read = file.read()
    
    payload = {"audio_input": wav_file_read.hex(), "language": "french", "task": "translate"}
    
    predictor.serializer = JSONSerializer()
    predictor.content_type = "application/json"

  3. Use predictor to predict the translation of the language. If you receive client error (error 413), check the payload size to the endpoint. Payloads for SageMaker invoke endpoint requests are limited to about 5 MB.
    response = predictor.predict(payload)
    print(response["text"])

  4. The text output translated to English from the French audio file follows:
    [' Welcome to JPBSystem. We have more than 150 employees and 90% of sales. We have developed about 15 patents.']

Clean up

After you’ve tested the endpoint, delete the SageMaker inference endpoint and delete the model to avoid incurring charges.
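
Using the predictor object created during deployment, a minimal cleanup sketch with the SageMaker Python SDK looks like the following:

# Delete the model and the endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()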

Conclusion

In this post, we showed you how to test and use OpenAI Whisper models to build interesting applications using Amazon SageMaker. Try out the foundation model in SageMaker today and let us know your feedback!

This guidance is for informational purposes only. You should still perform your own independent assessment and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the third-party model referenced in this guidance. AWS has no control or authority over the third-party model referenced in this guidance and does not make any representations or warranties that the third-party model is secure, virus-free, operational, or compatible with your production environment and standards. AWS does not make any representations, warranties, or guarantees that any information in this guidance will result in a particular outcome or result.


About the authors

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his masters from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Rachna Chadha is a Principal Solutions Architect, AI/ML, in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Read More

Reinventing a cloud-native federated learning architecture on AWS

Machine learning (ML), especially deep learning, requires a large amount of data for improving model performance. Customers often need to train a model with data from different regions, organizations, or AWS accounts. It is challenging to centralize such data for ML due to privacy requirements, high cost of data transfer, or operational complexity.

Federated learning (FL) is a distributed ML approach that trains ML models on distributed datasets. The goal of FL is to improve the accuracy of ML models by using more data, while preserving the privacy and the locality of distributed datasets. FL increases the amount of data available for training ML models, especially data associated with rare and new events, resulting in a more general ML model. Existing partner open-source FL solutions on AWS include FedML and NVIDIA FLARE. These open-source packages are deployed in the cloud by running in virtual machines, without using the cloud-native services available on AWS.

In this blog, you will learn to build a cloud-native FL architecture on AWS. By using infrastructure as code (IaC) tools on AWS, you can deploy FL architectures with ease. Also, a cloud-native architecture takes full advantage of a variety of AWS services with proven security and operational excellence, thereby simplifying the development of FL.

We first discuss different approaches and challenges of FL. We then demonstrate how to build a cloud-native FL architecture on AWS. The sample code to build this architecture is available on GitHub. We use the AWS Cloud Development Kit (AWS CDK) to deploy the architecture with one-click deployment. The sample code demos a scenario where the server and all clients belong to the same organization (the same AWS account), but their datasets cannot be centralized due to data localization requirements. The sample code supports horizontal and synchronous FL for training neural network models. The ML framework used at FL clients is TensorFlow.

Overview of federated learning

FL typically involves a central FL server and a group of clients. Clients are compute nodes that perform local training. In an FL training round, the central server first sends a common global model to a group of clients. Clients train the global model with local data, then provide local models back to the server. The server aggregates the local models into a new global model, then starts a new training round. There may be tens of training rounds until the global model converges or until the number of training rounds reaches a threshold. Therefore, FL exchanges ML models between the central FL server and clients, without moving training data to a central location.

There are two major categories of FL depending on the client type: cross-device and cross-silo. Cross-device FL trains a common global model by keeping all the training data locally on a large number of devices, such as mobile phones or IoT devices, with limited and unstable network connections. Therefore, the design of cross-device FL needs to consider frequent joining and dropout of FL clients.

Cross-silo FL trains a global model on datasets distributed at different organizations and geo-distributed data centers. These datasets are prohibited from moving out of organizations and data center regions due to data protection regulations, operational challenges (such as data duplication and synchronization), or high costs. In contrast with cross-device FL, cross-silo FL assumes that organizations or data centers have reliable network connections, powerful computing resources, and addressable datasets.

FL has been applied to various industries, such as finance, healthcare, medicine, and telecommunications, where privacy preservation is critical or data localization is required. FL has been used to train a global model for financial crime detection among multiple financial institutions. The global model outperforms models trained with only local datasets by 20%. In healthcare, FL has been used to predict mortality of hospitalized patients based on electronic health records from multiple hospitals. The global model predicting mortality outperforms local models at all participating hospitals. FL has also been used for brain tumor segmentation. The global models for brain tumor segmentation perform similarly to the model trained by collecting distributed datasets at a central location. In telecommunications, FL can be applied to edge computing, wireless spectrum management, and 5G core networks.

There are many other ways to classify FL:

  • Horizontal or vertical – Depending on the partition of features in distributed datasets, FL can be classified as horizontal or vertical. In horizontal FL, all distributed datasets have the same set of features. In vertical FL, datasets have different groups of features, requiring additional communication patterns to align samples based on overlapped features.
  • Synchronous or asynchronous – Depending on the aggregation strategy at an FL server, FL can be classified as synchronous or asynchronous. A synchronous FL server aggregates local models from a selected set of clients into a global model. An asynchronous FL server immediately updates the global model after a local model is received from a client, thereby reducing the waiting time and improving training efficiency.
  • Hub-and-spoke or peer-to-peer – The typical FL topology is hub-and-spoke, where a central FL server coordinates a set of clients. Another FL topology is peer-to-peer without any centralized FL server, where FL clients aggregate information from neighboring clients to learn a model.

Challenges in FL

You can address the following challenges using algorithms running at FL servers and clients in a common FL architecture:

  • Data heterogeneity – FL clients’ local data can vary (i.e., data heterogeneity) due to particular geographic locations, organizations, or time windows. Data heterogeneity impacts the accuracy of global models, leading to more training iterations and longer training time. Many solutions have been proposed to mitigate the impact of data heterogeneity, such as optimization algorithms, partial data sharing among clients, and domain adaptation.
  • Privacy preservation – Local and global models may leak private information via an adversarial attack. Many privacy preservation approaches have been proposed for FL. A secure aggregation approach can be used to preserve the privacy of local models exchanged between FL servers and clients. Local and global differential privacy approaches bound the privacy loss by adding noise to local or global models, which provides a controlled trade-off between privacy and model accuracy. Depending on the privacy requirements, combinations of different privacy preservation approaches can be used.
  • Federated analytics – Federated analytics provides statistical measurements of distributed datasets without violating privacy requirements. Federated analytics is important not only for data analysis across distributed datasets before training, but also for model monitoring at inference.

Despite these challenges of FL algorithms, it is critical to build a secure architecture that provides end-to-end FL operations. One important challenge to building such an architecture is to enable ease of deployment. The architecture must coordinate FL servers and clients for FL model building, training, and deployment, including continuous integration and continuous delivery (CI/CD) among clients, traceability, and authentication and access control for FL servers and clients. These features are similar to centralized ML operations (ML Ops), but are more challenging to implement because more parties are involved. The architecture also needs to be flexible to implement different FL topologies and synchronous or asynchronous aggregation.

Solution overview

We propose a cloud-native FL architecture on AWS, as shown in the following diagram. The architecture includes a central FL server and two FL clients. In reality, the number of FL clients can reach hundreds for cross-silo clients. The FL server must be on the AWS Cloud because it consists of a suite of microservices offered on the cloud. The FL clients can be on AWS or on the customer premises. The FL clients host their own local dataset and have their own IT and ML system for training ML models.

During FL model training, the FL server and a group of clients exchange ML models. That is, the clients download a global ML model from the server, perform local training, and upload local models to the server. The server downloads the local models and aggregates them into a new global model. This model exchange procedure is a single FL training round. The FL training round repeats until the global model reaches a given accuracy or the number of training rounds reaches a threshold.

Figure 1 – A cloud-native FL architecture for model training between an FL server and FL clients.

Prerequisites

To implement this solution, you need an AWS account to launch the services for a central FL server and the two clients. On-premises FL clients need to install the AWS Command Line Interface (AWS CLI), which allows access to the AWS services at the FL server, including Amazon Simple Queue Service (Amazon SQS), Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.

Federated learning steps

In this section, we walk through the proposed architecture in Figure 1. At the FL server, the AWS Step Functions state machine runs a workflow as shown in Figure 2, which executes Steps 0, 1, and 5 from Figure 1. The state machine initiates the AWS services at the server (Step 0) and iterates FL training rounds. For each training round, the state machine sends out an Amazon Simple Notification Service (Amazon SNS) notification to the topic global_model_ready, along with a task token (Step 1). The state machine then pauses and waits for a callback with the task token. There are SQS queues subscribing to the global_model_ready topic. Each SQS queue corresponds to an FL client and queues the notifications sent from the server to the client.

Figure 2 – The workflow at the Step Functions state machine.

Each client keeps pulling messages from its assigned SQS queue. When a global_model_ready notification is received, the client downloads a global model from Amazon S3 (Step 2) and starts local training (Step 3). Local training generates a local model. The client then uploads the local model to Amazon S3 and writes the local model information, along with the received task token, to the DynamoDB table (Step 4).

We implement the FL model registry using Amazon S3 and DynamoDB. We use Amazon S3 to store the global and local models. We use a DynamoDB table to store local model information because it can differ between FL algorithms, which requires the flexible schema supported by a DynamoDB table.

We also enable a DynamoDB stream to trigger a Lambda function, so that whenever a record is written into the DynamoDB table (when a new local model is received), a Lambda function is triggered to check whether all required local models have been collected (Step 5). If so, the Lambda function runs the aggregation function to aggregate the local models into a new global model. The resulting global model is written to Amazon S3. The function also sends a callback, along with the task token retrieved from the DynamoDB table, to the Step Functions state machine. The state machine then determines whether the FL training should continue with a new training round or stop based on a condition, for example, the number of training rounds reaching a threshold.

Each FL client uses the following sample code to interact with the FL server. If you want to customize the local training at your FL clients, the localTraining() function can be modified as long as the returned values are local_model_name and local_model_info for uploading to the FL server. You can select any ML framework for training local models at FL clients as long as all clients use the same ML framework.

# Step 2: receive notifications and model file name from its SQS queue
client.receiveNotificationsFromServer(sqs_region, client_queue_name)

# Step 3: download a global model and train locally
local_model_name, local_model_info = client.localTraining(global_model_name, s3_fl_model_registry)

# Step 4: upload the local model and local model info to the FL server
client.uploadToFLServer(s3_fl_model_registry, local_model_name, dynamodb_table_model_info, local_model_info)
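
For intuition, the notification handling inside receiveNotificationsFromServer() boils down to standard SQS long polling. The following is a rough boto3 sketch (an illustration only, not the repository’s implementation), reusing the sqs_region and client_queue_name values from the snippet above:

import json
import boto3

# Poll the client's assigned SQS queue for global_model_ready notifications (Step 2)
sqs = boto3.client("sqs", region_name=sqs_region)
queue_url = sqs.get_queue_url(QueueName=client_queue_name)["QueueUrl"]

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,          # long polling
)
for message in response.get("Messages", []):
    body = json.loads(message["Body"])            # SNS envelope delivered to the queue
    notification = json.loads(body["Message"])    # payload published by the state machine (assumed JSON)
    # ... download the global model from Amazon S3 and start local training ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])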

The Lambda function for running the aggregation function at the server has the following sample code. If you want to customize the aggregation algorithm, you need to modify the fedAvg() function and the output.

# Step 5: aggregate local models in the Lambda function
import json
import os
import boto3

def lambda_handler(event, context):
	# obtain task_name from the event triggered by the DynamoDB Stream
	task_name = event['Records'][0]['dynamodb']['Keys']['taskName']['S']

	# retrieve transactions from the DynamoDB table
	transactions = readFromFLServerTaskTable(os.environ['TASKS_TABLE_NAME'], task_name)

	# read local model info from required clients 
	# token is a call back token from the Step Functions state machine
	local_model_info, round_id, token = receiveUpdatedModelsFromClients(transactions, task_name)

	# fedAvg function aggregates local models into a global model and stores the global model in S3
	global_model_name, avg_train_acc, avg_test_acc, avg_train_loss, avg_test_loss = fedAvg(local_model_info, round_id)

	# output sent to the Step Function state machine
	output = {'taskName': task_name, 'roundId': str(round_id), 'trainAcc': str(avg_train_acc), 'testAcc': str(avg_test_acc), 'trainLoss': str(avg_train_loss), 'testLoss': str(avg_test_loss), 'weightsFile': str(global_model_name)}

	# send call back to the Step Functions state machine to report that the task identified by the token successfully completed
	step_client = boto3.client('stepfunctions')
	out_str = json.dumps(output)
	step_client.send_task_success(taskToken=token, output=out_str)
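
For intuition about what such an aggregation function does, the following is a minimal FedAvg-style sketch (not the repository’s fedAvg() implementation), assuming each local model is a list of NumPy weight arrays and each client reports its local dataset size:

import numpy as np

def fed_avg_weights(client_weights, client_sizes):
    # client_weights: one entry per client, each a list of per-layer NumPy arrays
    # client_sizes:   local dataset sizes, used as aggregation weights
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    averaged = []
    for layer in range(num_layers):
        averaged.append(sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes)))
    return averaged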

This architecture has two innovative designs. First, the FL server uses serverless services, such as Step Functions and Lambda. Therefore, no computing instance is kept running for the FL server, which minimizes the computing cost. Second, FL clients pull messages from their assigned SQS queues and upload or download models and info to or from services at the FL server. This design avoids the FL server directly accessing resources at the clients, which is critical to provide private and flexible IT and ML environments (on premises or on the AWS Cloud) to FL clients.

Advantages of being cloud-native

This architecture is cloud-native and provides end-to-end transparency by using AWS services with proven security and operational excellence. For example, you can have cross-account clients to assume roles to access the resource at the FL server. For on-premises clients, the AWS CLI and AWS SDK for Python (Boto3) at clients automatically provide secure network connections between the FL server and clients. For clients on the AWS Cloud, you can use AWS PrivateLink and AWS services with data encryption in transit and at rest for data protection. You can use Amazon Cognito and AWS Identity and Access Management (IAM) for the authentication and access control of FL servers and clients. For deploying the trained global model, you can use ML Ops capabilities in Amazon SageMaker.

The cloud-native architecture also enables integration with customized ML frameworks and federated learning algorithms and protocols. For example, you can select a ML framework for training local models at FL clients and customize different aggregation algorithms as scripts running in Lambda functions at the server. Also, you can modify the workflows in Step Functions to accommodate different communication protocols between the server and clients.

Another advantage of the cloud-native architecture is the ease of deployment by using IaC tools offered for the cloud. You can use the AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation for one-click deployment.

Conclusion

New privacy laws continue to be implemented worldwide, and technology infrastructures are rapidly expanding across multiple regions and extending to network edges. Federated learning helps cloud customers use distributed datasets to train accurate ML models in a privacy-preserving manner. Federated learning also supports data localization and potentially saves costs, because it does not require large amounts of raw data to be moved or shared.

You can start experimenting and building cloud-native federated learning architectures for your use cases. You can customize the architecture to support various ML frameworks, such as TensorFlow or PyTorch. You can also customize it to support different FL algorithms, including asynchronous federated learning, aggregation algorithms, and differential privacy algorithms. You can enable this architecture with FL Ops functionalities using ML Ops capabilities in Amazon SageMaker.


About the Authors

Qiong (Jo) Zhang, PhD, is a Senior Partner SA at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI.  She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.


Parker Newton is an applied scientist in AWS Cryptography. He received his Ph.D. in cryptography from U.C. Riverside, specializing in lattice-based cryptography and the complexity of computational learning problems. He is currently working at AWS in secure computation and privacy, designing cryptographic protocols to enable customers to securely run workloads in the cloud while preserving the privacy of their data.

Olivia Choudhury, PhD, is a Senior Partner SA at AWS. She helps partners, in the Healthcare and Life Sciences domain, design, develop, and scale state-of-the-art solutions leveraging AWS. She has a background in genomics, healthcare analytics, federated learning, and privacy-preserving machine learning. Outside of work, she plays board games, paints landscapes, and collects manga.

Gang Fu  is a Healthcare Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over ten years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.

Kris is a renowned leader in machine learning and generative AI, with a career spanning Goldman Sachs, consulting for major banks, and successful ventures like Foglight and SiteRock. He founded Indigo Capital Management and co-founded adaptiveARC, focusing on green energy tech. Kris also supports non-profits aiding assault victims and disadvantaged youth.

Bill Horne is a General Manager in AWS Cryptography. He leads the Cryptographic Computing Program, consisting of a team of applied scientists and engineers who are solving customer problems using emerging technologies like secure multiparty computation and homomorphic encryption. Prior to joining AWS in 2020 he was the VP and General Manager of Intertrust Secure Systems and was the Director of Security Research at Hewlett-Packard Enterprise. He is the author of 60 peer reviewed publications in the areas of security and machine learning, and holds 50 granted patents and 58 patents pending.

Read More

MAXimum AI Performance: Latest Adobe Updates Accelerated by NVIDIA GPUs Improve Workflows for Millions of Creatives

Generative AI is helping creatives across many industries bring ideas to life at unprecedented speed.

This technology will be on display at Adobe MAX, running through Thursday, Oct. 12, in person and virtually.

Adobe is putting the power of generative AI into the hands of creators with the release of Adobe Firefly. Using NVIDIA GPUs, Adobe is bringing new opportunities to artists and other creators looking to accelerate generative AI — unleashing generative AI enhancements for millions of users. Firefly is now available as a standalone app and integrated with other Adobe apps.

Recent updates to Adobe’s most popular apps — including for Adobe Premiere Pro, Lightroom, After Effects and Substance 3D Stager, Modeler and Sampler — bring new AI features to creators. And GeForce RTX and NVIDIA RTX GPUs help accelerate these apps and AI effects, providing massive time savings.

Video editors can use AI to improve dialogue quality with the Enhance Speech (beta) feature, and GPU-accelerated decoding of ARRIRAW camera-original digital film clips runs up to 60% faster on RTX GPUs than on an Apple MacBook Pro 16 M2 Max in Premiere Pro. Plus, take advantage of improved rotoscoping quality with the Next-Gen Roto Brush (version 3.0) feature now available in After Effects.

Photographers and 2D artists now have new Lens Blur effects in Lightroom, complementing ongoing optimizations that improve performance in its Select Object, Select People and Select Sky features.

These advanced features are further enhanced by NVIDIA Studio Drivers, free for RTX GPU owners, which add performance and reliability. The October Studio Driver is available for download now.

Finally, 3D artist SouthernShotty returns to In the NVIDIA Studio to share his 3D montage of a mix of beautifully hand-crafted worlds — built with Adobe apps and Blender and featuring AI-powered workflows accelerated by his GeForce RTX 4090 Laptop GPU.

MAXimizing Creativity

Adobe Creative Cloud and Substance 3D apps run fastest on NVIDIA RTX GPUs — and recent updates show continued time-saving performance gains.

Tested on NVIDIA Studio laptops with GeForce RTX 4050 and 4090 Laptop GPUs with Intel Core i9 13th Gen; MacBook Pro 14″ with M2 Pro; and MacBook Pro 16″ with M2 Max. Performance measures total time to apply Enhanced Speech effect to video clip within Adobe Premiere Pro.

Premiere Pro’s Enhance Speech (beta) feature uses AI to remove noise and improve the quality of dialogue clips so that they sound professionally recorded. Tasks are completed 8x faster with a GeForce RTX 4090 Laptop GPU compared to a MacBook Pro 16 with M2 Max.

Tested on NVIDIA Studio laptops with GeForce RTX 4050 and 4090 Laptop GPUs with Intel Core i9 13th Gen; MacBook Pro 14″ with M2 Pro; and MacBook Pro 16″ with M2 Max. Performance measures total time to export ARRIRAW footage within Adobe Premiere Pro.

Premiere Pro professionals use ARRIRAW footage — the only format that fully retains a camera’s natural color response and great exposure latitude. ARRIRAW video exports can be done 1.6x faster on GeForce RTX 4090 Laptop GPUs than on the MacBook Pro 16 with M2 Max.

Additionally, After Effects users can access the Next-Gen Roto Brush feature in beta, powered by a brand-new AI model. It’s ideal for isolating subjects such as overlapping limbs, hair and other transparencies more easily, saving time.

RTX GPUs shine in 3D workloads. Substance 3D Stager’s new AI-powered, GPU-accelerated denoiser allows almost instantaneous photorealistic rendering.

Substance 3D Modeler’s recent Hardware Ray Tracing in Capture Mode capability uses NVIDIA technology to export high-quality screenshots 2.4x faster than before.

Meanwhile, Substance 3D Sampler’s AI UpScale feature increases detail for low-quality textures and its Image to Material feature makes it easier to create high-quality materials from a single photograph.

Lens Blur in Adobe Lightroom.

Photographers have long used the popular Super Resolution feature in Adobe Camera Raw, which is supported by Photoshop and runs 3x faster on a GeForce RTX 4090 Laptop GPU compared to a MacBook Pro 16 M2 Max. Now, Lightroom users have AI-driven capabilities with the Lens Blur feature for applying realistic lens blur effects, Point Color for precise color adjustments to speed up color correction, and High Dynamic Range Output for edits and renders in an HDR color space.

Adobe Firefly Glows #76B900

Adobe Firefly provides users with generative AI features, utilizing NVIDIA GPUs in the cloud.

Firefly features such as Generative Fill — to add, remove and expand content in Photoshop, and Generative Expand to expand scenes with generative content — help complete tasks instantly in Adobe Photoshop.

Adobe Firefly-powered feature Generative Fill in Adobe Photoshop.

Adobe Illustrator offers the Generative Recolor feature, which enables graphic designers to explore a wide variety of colors, palettes and themes in their work without having to do tedious manual recoloring. Discovering the perfect combination of colors now takes just a few seconds.

Adobe Firefly-powered feature Generative Recolor in Adobe Illustrator.

Adobe Express offers the Text to Image feature to create incredible imagery from standard prompts, and the Text Effects feature helps stylize standard text for use in creating flyers, resumes, social media reels and more.

These powerful AI capabilities were developed with the creative community in mind — guided by AI ethics principles of content and data transparency — to ensure ethically and morally responsible output.

NVIDIA technology will continue to support new Adobe Firefly-powered features from the cloud as they become available to photographers, illustrators, designers, video editors, 3D artists and more.

MAXed Out AI Fun

Independent filmmaker and artist SouthernShotty knows the challenges of producing content alone and how daunting the process can be.

SouthernShotty’s artwork evokes childlike emotions with impressive visuals.

“I’m a big fan of the NVIDIA Studio Driver support, because it adds stability and reliability.” – SouthernShotty

As such, SouthernShotty is always looking for tools and techniques to ease the creative process. To accelerate his workflow, he combined new Adobe AI capabilities accelerated by his GeForce RTX 4090 GPU to achieve incredible efficiency.

The artist kept his 3D models fairly simple, focusing on textures to ensure that the world would match his vision. He deployed one of his favorite features, the AI-powered Image to Material in Adobe Substance 3D Sampler, to convert images to physically based rendering textures.

 

Applying textures in Blender.

“It’s so fast that I can pretty much preview my entire scene in real time and see the final result before I ever hit the render button.” – SouthernShotty

RTX-accelerated light and ambient occlusion baking allowed SouthernShotty to realize the desired visual effect in seconds.

His RTX GPU continued to play an essential role as he used Blender Cycles’ RTX-accelerated OptiX ray tracing in the viewport for interactive, photorealistic rendering.

As the 3D montage progresses, the main character appears and reappears in several new environments. Each new location is featured for only a second or two, but SouthernShotty still needed to create a fully fleshed out environment for each.

Normally this would take a substantial amount of time, but an AI assist from Adobe Firefly helped speed the process.

Adobe is committed to developing generative AI responsibly, with creators at the center.

SouthernShotty opened the app, entered “fantasy mushroom forest” as the text prompt and then made minor adjustments, selecting the digital art style, golden hour lighting and wide-angle settings for composition. When satisfied with the result, he downloaded the image for further editing in Photoshop.

An entirely new image is generated in minutes with Adobe Firefly, powered by GeForce RTX GPUs.

SouthernShotty then used the AI-powered Generative Fill feature to remove unwanted background elements. He used the Neural Filters optimization to color match a castle element added in the background, then used Generative Fill again to effortlessly blend the castle in with the trees.

Finally, SouthernShotty used the Neural Filters optimization in the new Lens Blur feature to add depth to the scene — first exporting depth as a separate layer and then editing in Blender to complete the scene.

Editing the depth map in Blender.

“My entire process was sprinkled with GPU-acceleration and AI-enabled features,” said SouthernShotty. “In Blender, the GeForce RTX 4090 GPU accelerated everything — but especially the live render view in my viewport, which was crucial to visualizing my scenes.”

Check out SouthernShotty’s YouTube channel for Blender tutorials on characters, animation, rigging and more.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More