Research award recipients named as part of the JHU + Amazon Initiative for Interactive AI (AI2AI), now in its second year.
“I want to help people automate boring tasks”
Former Amazon applied science intern Margarida Ferreira conducts research to make complex cloud resources easier to manage.
How Veriff decreased deployment time by 80% using Amazon SageMaker multi-model endpoints
Veriff is an identity verification platform partner for innovative growth-driven organizations, including pioneers in financial services, FinTech, crypto, gaming, mobility, and online marketplaces. They provide advanced technology that combines AI-powered automation with human feedback, deep insights, and expertise.
Veriff delivers a proven infrastructure that enables their customers to have trust in the identities and personal attributes of their users across all the relevant moments in their customer journey. Veriff is trusted by customers such as Bolt, Deel, Monese, Starship, Super Awesome, Trustpilot, and Wise.
As an AI-powered solution, Veriff needs to create and run dozens of machine learning (ML) models in a cost-effective way. These models range from lightweight tree-based models to deep learning computer vision models, which need to run on GPUs to achieve low latency and improve the user experience. Veriff is also currently adding more products to its offering, targeting a hyper-personalized solution for its customers. Serving different models for different customers adds to the need for a scalable model serving solution.
In this post, we show you how Veriff standardized their model deployment workflow using Amazon SageMaker, reducing costs and development time.
Infrastructure and development challenges
Veriff’s backend architecture is based on a microservices pattern, with services running on different Kubernetes clusters hosted on AWS infrastructure. This approach was initially used for all company services, including microservices that run expensive computer vision ML models.
Some of these models required deployment on GPU instances. Conscious of the comparatively higher cost of GPU-backed instance types, Veriff developed a custom solution on Kubernetes to share a given GPU’s resources between different service replicas. A single GPU typically has enough VRAM to hold multiple of Veriff’s computer vision models in memory.
Although the solution did alleviate GPU costs, it also came with the constraint that data scientists needed to indicate beforehand how much GPU memory their model would require. Furthermore, DevOps engineers were burdened with manually provisioning GPU instances in response to demand patterns. This caused operational overhead and overprovisioning of instances, which resulted in a suboptimal cost profile.
Apart from GPU provisioning, this setup also required data scientists to build a REST API wrapper for each model, which was needed to provide a generic interface for other company services to consume, and to encapsulate preprocessing and postprocessing of model data. These APIs required production-grade code, which made it challenging for data scientists to productionize models.
Veriff’s data science platform team looked for alternatives to this approach. The main objective was to support the company’s data scientists with a better transition from research to production by providing simpler deployment pipelines. The secondary objective was to reduce the operational costs of provisioning GPU instances.
Solution overview
Veriff required a new solution that solved two problems:
- Allow building REST API wrappers around ML models with ease
- Allow managing provisioned GPU instance capacity optimally and, if possible, automatically
Ultimately, the ML platform team converged on the decision to use SageMaker multi-model endpoints (MMEs). This decision was driven by MME support for NVIDIA’s Triton Inference Server (an ML-focused server that makes it easy to wrap models as REST APIs; Veriff was also already experimenting with Triton), as well as its capability to natively manage the auto scaling of GPU instances via simple auto scaling policies.
Two MMEs were created at Veriff, one for staging and one for production. This approach allows them to run testing steps in a staging environment without affecting the production models.
SageMaker MMEs
SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker MMEs provide a scalable and cost-effective solution for deploying a large number of models for real-time inference. MMEs use a shared serving container and a fleet of resources that can use accelerated instances such as GPUs to host all of your models. This reduces hosting costs by maximizing endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading and unloading models in memory and scaling them based on the endpoint’s traffic patterns. In addition, all SageMaker real-time endpoints benefit from built-in capabilities to manage and monitor models, such as shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).
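For illustration, the following is a minimal sketch of how a client can invoke a specific model hosted on an MME with the AWS SDK for Python; the endpoint name, model archive name, and payload contents are hypothetical and depend on the Triton model configuration.

```python
import json

import boto3

# Runtime client used to call SageMaker real-time endpoints
smr = boto3.client("sagemaker-runtime")

# A request payload following the model's input schema (contents hypothetical)
payload = {
    "inputs": [
        {"name": "INPUT__0", "shape": [1, 3], "datatype": "FP32", "data": [0.1, 0.2, 0.3]}
    ]
}

# On a multi-model endpoint, TargetModel selects which model archive in the
# endpoint's S3 prefix serves this request; SageMaker loads it on demand.
response = smr.invoke_endpoint(
    EndpointName="mme-staging",                        # hypothetical endpoint name
    TargetModel="screen_detection_pipeline_1.tar.gz",  # hypothetical model archive
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```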
Custom Triton ensemble models
There were several reasons why Veriff decided to use Triton Inference Server, the main ones being:
- It allows data scientists to build REST APIs from models by arranging model artifact files in a standard directory format (no code solution)
- It’s compatible with all major AI frameworks (PyTorch, TensorFlow, XGBoost, and more)
- It provides ML-specific low-level and server optimizations such as dynamic batching of requests
Using Triton allows data scientists to deploy models with ease because they only need to build formatted model repositories instead of writing code to build REST APIs (Triton also supports Python models if custom inference logic is required). This decreases model deployment time and gives data scientists more time to focus on building models instead of deploying them.
Another important feature of Triton is that it allows you to build model ensembles, which are groups of models that are chained together. These ensembles can be run as if they were a single Triton model. Veriff currently employs this feature to deploy preprocessing and postprocessing logic with each ML model using Python models (as mentioned earlier), ensuring that there are no mismatches in the input data or model output when models are used in production.
The following is what a typical Triton model repository looks like for this workload:
The model.py file contains the preprocessing and postprocessing code. The trained model weights are in the screen_detection_inferencer directory, under model version 1 (the model is in ONNX format in this example, but can also be in TensorFlow, PyTorch, or other formats). The ensemble model definition is in the screen_detection_pipeline directory, where inputs and outputs between steps are mapped in a configuration file.
Additional dependencies needed to run the Python models are detailed in a requirements.txt file, and need to be conda-packed to build a Conda environment (python_env.tar.gz). For more information, refer to Managing Python Runtime and Libraries. Also, config files for Python steps need to point to python_env.tar.gz using the EXECUTION_ENV_PATH directive.
The model folder then needs to be TAR compressed and renamed using model_version.txt. Finally, the resulting <model_name>_<model_version>.tar.gz file is copied to the Amazon Simple Storage Service (Amazon S3) bucket connected to the MME, allowing SageMaker to detect and serve the model.
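As an illustration of those last two steps, the following sketch compresses a local Triton model repository and copies it to the MME’s S3 bucket; the folder, bucket, and key names are hypothetical.

```python
import tarfile

import boto3

model_name = "screen_detection_pipeline"  # hypothetical model repository folder
model_version = "1"                       # hypothetical version
archive = f"{model_name}_{model_version}.tar.gz"

# Compress the local Triton model repository into a TAR archive
with tarfile.open(archive, "w:gz") as tar:
    tar.add(model_name, arcname=model_name)

# Copy the archive to the S3 bucket connected to the MME so that
# SageMaker can detect and serve the model
s3 = boto3.client("s3")
s3.upload_file(archive, "mme-staging-bucket", archive)
```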
Model versioning and continuous deployment
As the previous section made apparent, building a Triton model repository is straightforward. However, running all the necessary steps to deploy it is tedious and error-prone if done manually. To overcome this, Veriff built a monorepo containing all models to be deployed to MMEs, where data scientists collaborate in a Gitflow-like approach. This monorepo has the following features:
- It’s managed using Pants.
- Code quality tools such as Black and MyPy are applied using Pants.
- Unit tests are defined for each model, which check that the model output is the expected output for a given model input.
- Model weights are stored alongside model repositories. These weights can be large binary files, so DVC is used to sync them with Git in a versioned manner.
This monorepo is integrated with a continuous integration (CI) tool. For every new push to the repo or new model, the following steps are run:
- Pass the code quality check.
- Download the model weights.
- Build the Conda environment.
- Spin up a Triton server using the Conda environment and use it to process requests defined in unit tests.
- Build the final model TAR file (
<model_name>_<model_version>.tar.gz
).
These steps make sure that models have the quality required for deployment, so for every push to a repo branch, the resulting TAR file is copied (in another CI step) to the staging S3 bucket. When pushes are done in the main branch, the model file is copied to the production S3 bucket. The following diagram depicts this CI/CD system.
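A condensed sketch of that final CI step is shown below; the branch environment variable and bucket names are hypothetical and depend on the CI tool in use.

```python
import os

import boto3

# The CI tool exposes the branch being built (variable name is hypothetical)
branch = os.environ.get("CI_BRANCH", "feature/new-model")

# Pushes to main are promoted to production; all other branches go to staging
bucket = "mme-production-bucket" if branch == "main" else "mme-staging-bucket"

artifact = "screen_detection_pipeline_1.tar.gz"  # built in the previous CI step
boto3.client("s3").upload_file(artifact, bucket, artifact)
```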
Cost and deployment speed benefits
Using MMEs allows Veriff to use a monorepo approach to deploy models to production. In summary, Veriff’s new model deployment workflow consists of the following steps:
- Create a branch in the monorepo with the new model or model version.
- Define and run unit tests in a development machine.
- Push the branch when the model is ready to be tested in the staging environment.
- Merge the branch into main when the model is ready to be used in production.
With this new solution in place, deploying a model at Veriff is a straightforward part of the development process. New model development time has decreased from 10 days to an average of 2 days.
The managed infrastructure provisioning and auto scaling features of SageMaker brought Veriff added benefits. They used the InvocationsPerInstance CloudWatch metric to scale according to traffic patterns, saving on costs without sacrificing reliability. To define the threshold value for the metric, they performed load testing on the staging endpoint to find the best trade-off between latency and cost.
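The following sketch shows how such a target tracking policy can be attached to an endpoint variant with Application Auto Scaling; the endpoint name, variant name, capacity limits, and threshold value are hypothetical.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# The scalable target is the instance count behind the endpoint's variant
resource_id = "endpoint/mme-production/variant/AllTraffic"  # hypothetical names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance; the target value comes from load testing
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # hypothetical threshold found during load testing
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```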
After deploying seven production models to MMEs and analyzing spend, Veriff reported a 75% cost reduction in GPU model serving as compared to the original Kubernetes-based solution. Operational costs were reduced as well, because the burden of provisioning instances manually was lifted from the company’s DevOps engineers.
Conclusion
In this post, we reviewed why Veriff chose SageMaker MMEs over self-managed model deployment on Kubernetes. SageMaker takes on the undifferentiated heavy lifting, allowing Veriff to decrease model development time, increase engineering efficiency, and dramatically lower the cost for real-time inference while maintaining the performance needed for their business-critical operations. Finally, we showcased Veriff’s simple yet effective model deployment CI/CD pipeline and model versioning mechanism, which can be used as a reference implementation of combining software development best practices and SageMaker MMEs. You can find code samples on hosting multiple models using SageMaker MMEs on GitHub.
About the Authors
Ricard Borràs is a Senior Machine Learning Engineer at Veriff, where he leads MLOps efforts. He helps data scientists build faster and better AI/ML products by building a data science platform at the company and combining several open source solutions with AWS services.
João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model large-scale training and inference optimization, and more broadly building large-scale ML platforms on AWS.
Miguel Ferreira works as a Sr. Solutions Architect at AWS based in Helsinki, Finland. AI/ML has been a lifelong interest and he has helped multiple customers integrate Amazon SageMaker into their ML workflows.
2023 SCOT INFORMS scholarship recipients announced
Program is aimed at expanding participation in operations research, management science, and analytics research for those from underrepresented backgrounds.
Learning to learn learning-rate schedules
In a series of papers, Amazon researchers performed a theoretical analysis of a simplified problem that led to a learnable learning-rate scheduler, applied that scheduler to a more complex neural model, and distilled the results into a practical algorithm.
Improve performance of Falcon models with Amazon SageMaker
What is the optimal framework and configuration for hosting large language models (LLMs) for text-generating generative AI applications? Despite the abundance of options for serving LLMs, this is a hard question to answer due to the size of the models, varying model architectures, performance requirements of applications, and more. The Amazon SageMaker Large Model Inference (LMI) container makes it straightforward to serve LLMs by bringing together a host of different frameworks and techniques that optimize the deployment of LLMs. The LMI container has a powerful serving stack called DJL Serving that is agnostic to the underlying LLM. It provides system-level configuration parameters that can be tuned to extract the best performance of the hosting infrastructure for a given LLM. It also has support for recent optimizations like continuous batching, also known as iterative batching or rolling batching, which provides significant improvements in throughput.
In an earlier post, we showed how you can use the LMI container to deploy the Falcon family of models on SageMaker. In this post, we demonstrate how to improve the throughput and latency of serving Falcon-40B with techniques like continuous batching. We also provide an intuitive understanding of configuration parameters provided by the SageMaker LMI container that can help you find the best configuration for your real-world application.
Fundamentals of text-generative inference for LLMs
Let’s first look at a few fundamentals on how to perform inference for LLMs for text generation.
Forward pass, activations, and the KV cache
Given an input sequence of tokens, they are run in a forward pass across all the layers of the LLM (such as Falcon) to generate the next token. A forward pass refers to the process of input data being passed through a neural network to produce an output. In the case of text generation, the forward pass involves feeding an initial seed or context into the language model and generating the next character or token in the sequence. To generate a sequence of text, the process is often done iteratively, meaning it is repeated for each step or position in the output sequence. At each iteration, the model generates the next character or token, which becomes part of the generated text, and this process continues until the desired length of text is generated.
Text generation with language models like Falcon or GPT is autoregressive. This means that the model generates one token at a time while conditioning on the previously generated tokens. In other words, at each iteration, the model takes the previously generated text as input and predicts the next token based on that context. As mentioned in vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, in this autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate the next tokens. These cached key and value tensors are often referred to as the KV cache.
Prefill and decode phases
In an autoregressive decoding process, like the one used in text generation with language models such as Falcon, there are typically two main phases: the prefill phase and the decode phase. These phases are crucial for generating coherent and contextually relevant text.
The prefill phase includes the following:
- Initial context – The prefill phase begins with an initial context or seed text provided by the user. This initial context can be a sentence, a phrase, or even just a single word. It sets the starting point for text generation and provides context for what comes next.
- Model conditioning – The provided context is used to condition the language model. The model takes this context as input and generates the next token (word or character) in the sequence based on its understanding of the context.
- Token generation – The model generates one token at a time, predicting what should come next in the text. This token is appended to the context, effectively extending it.
- Iterative process – The process of generating tokens is repeated iteratively. At each step, the model generates a token while considering the updated context, which now includes the tokens generated in previous steps.
The prefill phase continues until a predetermined stopping condition is met. This condition can be a maximum length for the generated text, a specific token that signals the end of the text, or any other criteria set by the user or the application.
The decode phase includes the following:
- Completion – After the prefill phase, you have a partially generated text that may be incomplete or cut off at some point. The decode phase is responsible for completing the text to make it coherent and grammatically correct.
- Continuation from the last token – In the decode phase, the model starts from the last token generated during the prefill phase. It uses this token as the initial context and generates the next token to continue the text.
- Iterative completion – Like in the prefill phase, the process of generating tokens is again iterative. The model generates one token at a time, conditioning on the preceding tokens in the sequence.
- Stopping condition – The decode phase also has a stopping condition, which might be the same as in the prefill phase, such as reaching a maximum length or encountering an end-of-text token. When this condition is met, the generation process stops.
The combination of the prefill and decode phases allows autoregressive models to generate text that builds on an initial context and produces coherent, contextually relevant, and contextually consistent sequences of text.
Refer to A Distributed Serving System for Transformer-Based Generative Models for a detailed explanation of the process.
Optimizing throughput using dynamic batching
So far, we’ve only talked about a single input. In practice, we expect to deal with multiple requests coming in randomly from the application clients for inference concurrently or in a staggered fashion. In the traditional way, basic batching can be used to increase the throughput and the utilization of the computing resources of the GPU. Batching is effectively combining the numerical representations of more than one request in a batch and performing parallel runs of the autoregressive forward passes. This intelligent batching is done at the serving side. SageMaker LMI’s DJLServing server can be configured to batch together multiple requests to process them in parallel by setting the following parameters in serving.properties:
- max_batch_delay = 100 – The maximum delay for batch aggregation in milliseconds. The default value is 100 milliseconds.
- batch_size = 32 – The dynamic batch size. The default is 1.
This basically means that DJLServing queues up requests for up to 100 milliseconds, or until the number of queued requests reaches the specified batch_size, and then schedules the batch to run on the backend for inference. This is known as dynamic batching. It’s dynamic because the batch size may change across batches depending on how many requests were added in that time duration. However, because requests might have different characteristics (for example, some requests might be of shape 20 tokens of input and 500 tokens of output, whereas others might be reversed, with 500 tokens of input but only 20 for output), some requests might complete processing faster than others in the same batch. This could result in underutilization of the GPU while waiting for all in-flight requests in the batch to complete their decode stage, even if there are additional requests waiting to be processed in the queue. The following diagram illustrates this process.
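To make this concrete, the following sketch sends several requests concurrently so that the server can aggregate them into a batch; the endpoint name and payload format are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

smr = boto3.client("sagemaker-runtime")

def generate(prompt):
    # Each call is an independent client request; DJLServing decides how to batch them
    response = smr.invoke_endpoint(
        EndpointName="falcon-40b-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 128}}),
    )
    return response["Body"].read().decode("utf-8")

prompts = [f"Summarize support ticket number {i} in one sentence." for i in range(8)]

# Firing requests in parallel lets several of them arrive within the
# max_batch_delay window and be scheduled together
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(generate, prompts))
```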
Optimizing throughput using continuous batching
With continuous batching, also known as iterative or rolling batching, we take advantage of the differences between the prefill and decode stages. To activate continuous batching, DJLServing provides the following additional configurations as per serving.properties:
- engine=MPI – We encourage you to use the MPI engine for continuous batching.
- option.rolling_batch=auto or lmi-dist – We recommend using auto because it will automatically pick the most appropriate rolling batch algorithm along with other optimizations in the future.
- option.max_rolling_batch_size=32 – This limits the number of concurrent requests. The default is 32.
With continuous batching, the serving stack (DJLServing) doesn’t wait for all in-flight requests in a batch to complete its decode stage. Rather, at logical breaks (at the end of one iteration in the decode stage), it pulls in additional requests that are waiting in the queue while the current batch is still processing (hence the name rolling batch). It does this check for pending requests at the end of each iteration of the decode stage. Remember, for each request, we need to run the prefill stage followed by the sequential decode stage. Because we can process all the tokens from the initial prompt of a request in parallel for its prefill stage, anytime a new request is pulled in, we temporarily pause the decode stage of in-flight requests of the batch—we temporarily save its KV cache and activations in memory and run the prefill stage of the new requests.
The size of this cache can be configured with the following option:
- option.max_rolling_batch_prefill_tokens=1024 – Limits the number of simultaneous prefill tokens saved in the cache for the rolling batch (between the decode and the prefill stages)
When the prefill is complete, we combine the new requests and the old paused requests in a new rolling batch, which can proceed with their decode stage in parallel. Note that the old paused requests can continue their decode stage where they left off and the new requests will start from their first new token.
You might have already realized that continuous batching is very similar to how we naturally parallelize tasks in our daily lives. We have messages, emails, and phone notifications (potentially new requests) coming in at random times (analogous to multiple requests coming in a random, staggered fashion to GPUs). This all happens while we go about completing our in-flight tasks, such as composing emails, coding, and participating in meetings (analogous to the tasks currently processing on the GPUs). At logical breaks, we pause our in-flight tasks and check our notifications to decide if some action is required on our part; if there is, we add it to our in-flight tasks (a real-life rolling batch) or put it on a to-do list (the queue).
Putting it all together: How to think about memory utilization of GPUs
It’s recommended to load test your model to see which configuration is the most cost-effective for your business use case. To build an understanding, let’s visualize the memory footprint of the GPUs as the model is loaded and as successive requests are processed in a rolling batch. For this post, let’s assume we are loading the Falcon-40B model onto one of the G5 instance types, which come with NVIDIA A10G GPUs, each with 24 GB of memory. Note that a similar understanding is applicable for the p3, p4, and p5 instance types, which come with the V100, A100, and H100 GPU series.
The following is the overview of getting an approximate value of total memory required to serve Falcon-40B:
- Model size = Number of model parameters (40 billion for Falcon-40B) x 4 bytes per parameter (for FP32) = 160 GB
- Approximate total memory required to load Falcon-40B for inference = Model size (160 GB) + KV cache (attention cache) (approximately 20 GB) + Additional memory overhead from ML frameworks (approximately 2 GB)
For Falcon-40B, if we compress the model by quantizing the model to the bfloat16 (2 bytes) data type, the model size becomes approximately 80 GB. As you can see, this is still larger than the memory supported by one accelerator device, so we need to adopt a model partitioning (sharding) technique with a special tensor parallelism (TP) approach and shard the model across multiple accelerator devices. Let’s assume that we have chosen g5.24xlarge, which has 4 A10G GPU devices. If we configure DJLServing (serving.properties) with the following, we can expect that the 80 GB of model weights will be divided equally across all 4 GPUs:
- option.tensor_parallel_degree = 4 or 8, or use max (maximum GPUs detected on the instance)
With tensor_parallel_degree set to 4, about 20 GB of each 24 GB GPU’s memory (approximately 84%) is already utilized even before a single request has been processed. The remaining 16% of the GPU memory will be used for the KV cache for the incoming requests. It’s possible that for your business scenario and its latency and throughput requirements, 2–3 GB of the remaining memory is more than enough. If not, you can increase the instance size to g5.48xlarge, which has 8 GPUs and uses tensor_parallel_degree set to 8. In such a case, only approximately 10 GB of the available 24 GB memory of each GPU is utilized for model weights, and we have about 60% of the remaining GPU memory for the activations and KV cache. Intuitively, we can see that this configuration may allow us to have a higher throughput. Additionally, because we have a larger buffer now, we can increase the max_rolling_batch_prefill_tokens and max_rolling_batch_size parameters to further optimize throughput. Together, these two parameters control the preallocations of the activation prefills and KV cache for the model. A larger number for these two parameters will correlate with a larger throughput, assuming you have enough buffer for the KV cache in the GPU memory.
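The back-of-the-envelope arithmetic above can be written out as a short sketch; the 24 GB per GPU and the FP16 weight size follow the assumptions used in this post.

```python
# Rough per-GPU memory estimate for serving Falcon-40B with tensor parallelism
params_billion = 40
bytes_per_param = 2              # bfloat16/FP16 weights
gpu_memory_gb = 24               # NVIDIA A10G on G5 instances
tensor_parallel_degree = 4       # g5.24xlarge has 4 GPUs; use 8 for g5.48xlarge

model_size_gb = params_billion * bytes_per_param            # ~80 GB in FP16
weights_per_gpu_gb = model_size_gb / tensor_parallel_degree  # ~20 GB per GPU

# Memory left on each GPU for activations and the KV cache
headroom_gb = gpu_memory_gb - weights_per_gpu_gb
print(f"Weights per GPU: {weights_per_gpu_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```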
Continuous batching with PagedAttention
PagedAttention is a new optimization algorithm developed by UC Berkeley that improves the continuous batching process by allowing the attention cache (KV cache) to be non-contiguous by allocating memory in fixed-size pages or blocks. This is inspired by virtual memory and paging concepts used by operating systems.
As per the vLLM paper, the attention cache of each sequence of tokens is partitioned into blocks and mapped to physical blocks through a block table. During the computation of attention, a PagedAttention kernel can use the block table to efficiently fetch the blocks from physical memory. This results in a significant reduction of memory waste and allows for larger batch size, increased GPU utilization, and higher throughput.
Performance comparison
To ensure effective load testing of your deployment configuration, it’s recommended to begin by considering the business scenario and clearly defining the characteristics of the input and output for the LLM-based application. For instance, if you are working on a call center summarization use case, the input could consist of larger text, such as a 500-token chat transcript between a customer service agent and a customer, but the output might be relatively smaller, around 100 tokens, representing a summary of the transcript. On the other hand, if you’re working on a code generation scenario, the input could be as short as 15 tokens, like “write an efficient implementation in Python for describing all EC2 resources, including pagination,” but the output could be much larger, reaching 500 tokens. It’s also important to consider whether achieving lower latency or maximizing throughput is the top priority for your specific scenario.
After gaining a comprehensive understanding of the business scenario, you can analyze and determine the optimal configuration for your hosting environment. In this context, the hosting environment encompasses various key elements, including the instance type and other configuration parameters such as tensor_parallel_degree, max_rolling_batch_size, max_rolling_batch_prefill_tokens, and more. Our objective is to identify the most effective setup to support our response time, throughput, and model output quality requirements.
In our analysis, we benchmarked the performance to illustrate the benefits of continuous batching over traditional dynamic batching. We used the configurations detailed in the following table in serving.properties for dynamic batching and iterative batching, using an LMI container on SageMaker.
Dynamic Batching | Continuous Batching | Continuous Batching with PagedAttention |
engine=Python option.model_id=tiiuae/falcon-40b option.tensor_parallel_degree=8 option.dtype=fp16 batch_size=4 max_batch_delay=100 option.trust_remote_code = true |
engine = MPI option.model_id = {{s3_url}} option.trust_remote_code = true option.tensor_parallel_degree = 8 option.max_rolling_batch_size = 32 option.rolling_batch = auto option.dtype = fp16 option.max_rolling_batch_prefill_tokens = 1024 option.paged_attention = False |
engine = MPI option.model_id = {{s3_url}} option.trust_remote_code = true option.tensor_parallel_degree = 8 option.max_rolling_batch_size = 32 option.rolling_batch = auto option.dtype = fp16 option.max_rolling_batch_prefill_tokens = 1024 option.paged_attention = True |
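As a rough sketch of how one of these configurations can be deployed with the SageMaker Python SDK, the following uses a DJL (LMI) container image and an S3 archive that bundles the serving.properties file; the image version, S3 paths, role ARN, and endpoint name are hypothetical placeholders.

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # hypothetical role ARN

# LMI (DJL) container image; the version shown is illustrative and may need updating
image_uri = sagemaker.image_uris.retrieve(
    framework="djl-deepspeed", region=session.boto_region_name, version="0.23.0"
)

# model_data points to an archive containing serving.properties with one of the
# configurations from the table above (path is hypothetical)
model = Model(
    image_uri=image_uri,
    model_data="s3://my-bucket/falcon-40b/code.tar.gz",
    role=role,
    sagemaker_session=session,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    endpoint_name="falcon-40b-lmi",              # hypothetical endpoint name
    container_startup_health_check_timeout=900,  # large models take time to load
)
```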
The two configurations were benchmarked for Falcon-40B with the FP16 data type deployed on ml.g5.48xlarge in a couple of different scenarios that represent real-world applications:
- A small number of input tokens with a large number of tokens being generated – In this scenario, the number of input tokens was fixed at 32 and 128 new tokens were generated
Batching Strategy | Throughput (tokens/sec) | Latency p90 (secs) |
Dynamic Batching | 5.53 | 58.34 |
Continuous Batching | 56.04 | 4.74 |
Continuous Batching with PagedAttention | 59.18 | 4.76 |
- A large input with a small number of tokens being generated – Here, we fix the number of input tokens at 256 and prompt the LLM to summarize the input to 32 tokens
Batching Strategy | Throughput (tokens/sec) | Latency p90 (secs) |
Dynamic Batching | 19.96 | 59.31 |
Continuous Batching | 46.69 | 3.88 |
Continuous Batching with PagedAttention | 44.75 | 2.67 |
We can see that continuous batching with PagedAttention provides approximately 10 times higher throughput in scenario 1 and 2.3 times higher throughput in scenario 2, compared to using dynamic batching on SageMaker with the LMI container.
Conclusion
In this post, we looked at how LLMs use memory and explained how continuous batching improves the throughput using an LMI container on SageMaker. We demonstrated the benefits of continuous batching for Falcon-40B using an LMI container on SageMaker by showing benchmark results. You can find the code on the GitHub repo.
About the Authors
Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.
Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or watching sports.
Abhi Sodhani holds the position of Senior AI/ML Solutions Architect at AWS, where he specializes in offering technical expertise and guidance on Generative AI and ML solutions to customers. His primary focus is to assist Digital Native Businesses in realizing the full potential of Generative AI and ML technologies, enabling them to achieve their business objectives effectively. Beyond his professional endeavors, Abhi exhibits a strong passion for intellectual pursuits such as reading, as well as engaging in activities that promote physical and mental well-being, such as yoga, meditation.
Qing Lan is a Software Development Engineer at AWS. He has been working on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.
Index your web crawled content using the new Web Crawler for Amazon Kendra
Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.
Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to provide you with a fully managed experience and simplify the process of indexing your content from a variety of data sources in the enterprise.
One such unstructured data repository is websites, both internal and external. Sites may need to be crawled to create news feeds, analyze language use, or create bots to answer questions based on the website data.
We’re excited to announce that you can now use the new Amazon Kendra Web Crawler to search for answers from content stored in internal and external websites or create chatbots. In this post, we show how to index information stored in websites and use the intelligent search in Amazon Kendra to search for answers from content stored in internal and external websites. In addition, the ML-powered intelligent search can accurately get answers for your questions from unstructured documents with natural language narrative content, for which keyword search is not very effective.
The Web Crawler offers the following new features:
- Support for Basic, NTLM/Kerberos, Form, and SAML authentication
- The ability to specify up to 100 seed URLs and store connection configuration in Amazon Simple Storage Service (Amazon S3)
- Support for a web and internet proxy with the ability to provide proxy credentials
- Support for crawling dynamic content, such as a website containing JavaScript
- Field mapping and regex filtering features
Solution overview
With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a crawled website using the Amazon Kendra Web Crawler. The solution consists of the following steps:
- Choose an authentication mechanism for the website (if required) and store the details in AWS Secrets Manager.
- Create an Amazon Kendra index.
- Create a Web Crawler data source V2 via the Amazon Kendra console.
- Run a sample query to test the solution.
Prerequisites
To try out the Amazon Kendra Web Crawler, you need the following:
- A website to crawl.
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic knowledge of AWS.
Gather authentication details
For protected and secure websites, the following authentication types and standards are supported:
- Basic
- NTLM/Kerberos
- Form authentication
- SAML
You need the authentication information when you set up the data source.
For basic or NTLM authentication, you need to provide your Secrets Manager secret, user name, and password.
Form and SAML authentication require additional information, as shown in the following screenshot. Some of the fields, like User name button Xpath, are optional and will depend on whether the site you are crawling uses a button after entering the user name. Also note that you will need to know how to determine the XPath of the user name and password fields and the submit buttons.
Create an Amazon Kendra index
To create an Amazon Kendra index, complete the following steps:
- On the Amazon Kendra console, choose Create an Index.
- For Index name, enter a name for the index (for example, Web Crawler).
- Enter an optional description.
- For Role name, enter an IAM role name.
- Configure optional encryption settings and tags.
- Choose Next.
- In the Configure user access control section, leave the settings at their defaults and choose Next.
- For Provisioning editions, select Developer edition and choose Next.
- On the review page, choose Create.
This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
Create an Amazon Kendra Web Crawler data source
Complete the following steps to create your data source:
- On the Amazon Kendra console, choose Data sources in the navigation pane.
- Locate the WebCrawler connector V2.0 tile and choose Add connector.
- For Data source name, enter a name (for example, crawl-fda).
- Enter an optional description.
- Choose Next.
- In the Source section, select Source URL and enter a URL. For this post, we use https://www.fda.gov/ as an example source URL.
- In the Authentication section, choose the appropriate authentication based on the site that you want to crawl. For this post, we select No authentication because it’s a public site and doesn’t need authentication.
- In the Web proxy section, you can specify a Secrets Manager secret (if required).
- Choose Create and Add New Secret.
- Enter the authentication details that you gathered previously.
- Choose Save.
- In the IAM role section, choose Create a new role and enter a name (for example, AmazonKendra-Web Crawler-datasource-role).
- Choose Next.
- In the Sync scope section, configure your sync settings based on the site you are crawling. For this post, we leave all the default settings.
- For Sync mode, choose how you want to update your index. For this post, we select Full sync.
- For Sync run schedule, choose Run on demand.
- Choose Next.
- Optionally, you can set field mappings. For this post, we keep the defaults for now.
Mapping fields is a useful exercise where you can substitute field names to values that are user-friendly and that fit in your organization’s vocabulary.
- Choose Next.
- Choose Add data source.
- To sync the data source, choose Sync now on the data source details page.
- Wait for the sync to complete.
Example of an authenticated website
If you want to crawl a site that has authentication, then in the Authentication section in the previous steps, you need to specify the authentication details. The following is an example if you selected Form authentication.
- In the Source section, select Source URL and enter a URL. For this example, we use https://accounts.autodesk.com.
- In the Authentication section, select Form authentication.
- In the Web proxy section, specify your Secrets Manager secret. This is required for any option other than No authentication.
- Choose Create and Add New Secret.
- Enter the authentication details that you gathered previously.
- Choose Save.
Test the solution
Now that you have ingested the content from the site into your Amazon Kendra index, you can test some queries.
- Go to your index and choose Search indexed content.
- Enter a sample search query and test out your search results (your query will vary based on the contents of the site you crawled and the query entered).
Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from the site you crawled.
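If you want to query the index programmatically rather than through the console, a minimal sketch with the AWS SDK for Python looks like the following; the index ID and query text are placeholders.

```python
import boto3

kendra = boto3.client("kendra")

index_id = "12345678-1234-1234-1234-123456789012"  # placeholder index ID

response = kendra.query(
    IndexId=index_id,
    QueryText="What does the FDA regulate?",
)

# Print the type and document title of the top results
for item in response["ResultItems"][:3]:
    title = item.get("DocumentTitle", {}).get("Text", "")
    print(item.get("Type"), "-", title)
```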
Clean up
To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra Web Crawler V2, delete that data source.
Conclusion
With the new Amazon Kendra Web Crawler V2, organizations can crawl any website that is public or behind authentication and use it for intelligent search powered by Amazon Kendra.
To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.
About the Author
Jiten Dedhia is a Sr. Solutions Architect with over 20 years of experience in the software industry. He has worked with global financial services clients, providing them advice on modernizing by using services provided by AWS.
Gunwant Walbe is a Software Development Engineer at Amazon Web Services. He is an avid learner and keen to adopt new technologies. He develops complex business applications, and Java is his primary language of choice.
USC + Amazon Center on Secure and Trusted Machine Learning selects six new research projects
Faculty and academic-fellow projects are focused on various aspects of trustworthy machine learning.
New – No-code generative AI capabilities now available in Amazon SageMaker Canvas
Launched in 2021, Amazon SageMaker Canvas is a visual, point-and-click service that allows business analysts and citizen data scientists to use ready-to-use machine learning (ML) models and build custom ML models to generate accurate predictions without the need to write any code. Ready-to-use models enable you to derive immediate insights from text, image, and document data (such as sentiment analysis, document processing, or object detection in images). Custom models allow you to build predictive models for use cases such as demand forecasting, customer churn, and defect detection in manufacturing.
We are excited to announce that SageMaker Canvas is expanding its support of ready-to-use models to include foundation models (FMs), enabling you to use generative AI to generate and summarize content. You can use natural language with a conversational chat interface to perform tasks such as creating narratives, reports, and blog posts; answering questions; summarizing notes and articles; and explaining concepts, without writing a single line of code. Your data is not used to improve the base models, is not shared with third-party model providers, and stays entirely within your secure AWS environment.
SageMaker Canvas allows you to access a variety of FMs, including Amazon Bedrock models (such as Claude 2 from Anthropic and Jurassic-2 from AI21 Labs) and publicly available Amazon SageMaker JumpStart models (including Falcon-7B-Instruct, Falcon-40B-Instruct, and MPT-7B-Instruct). You may use a single model or up to three models to compare model responses side by side. In SageMaker Canvas, Amazon Bedrock models are always active, allowing you to use them instantly. SageMaker JumpStart models can be started and deployed in your AWS account on demand and are automatically shut down after two hours of inactivity.
Let’s explore how to use the generative AI capabilities of SageMaker Canvas. For this post, we work with a fictitious enterprise customer support use case as an example.
Prerequisites
Complete the following prerequisite steps:
- Create an AWS account.
- Set up SageMaker Canvas and optionally configure it to use a VPC without internet access.
- Set up model access in Amazon Bedrock.
- Request service quota increases for g5.12xlarge and g5.2xlarge, if required, in your Region. These instances are required to host the SageMaker JumpStart model endpoints. Other instances may be selected based on availability.
Handling customer complaints
Let’s say that you’re a customer support analyst who handles complaints for a bicycle company. When receiving a customer complaint, you can use SageMaker Canvas to analyze the complaint and generate a personalized response to the customer. To do so, complete the following steps:
- On the SageMaker console, choose Canvas in the navigation pane.
- Choose your domain and user profile and choose Open Canvas to open the SageMaker Canvas application.
SageMaker Canvas is also accessible using single sign-on or other existing identity providers (IdPs) without having to first access the SageMaker console.
- Choose Generate, extract and summarize content to open the chat console.
- With the Claude 2 model selected, enter your instructions to retrieve the customer sentiment for the provided complaint and press Enter.
- You may want to know the specific problems with the bicycle, especially if it’s a long complaint. So, ask for the problems with the bicycle. Note that you don’t have to repost the complaint because SageMaker Canvas stores the context for your chat.
Now that we understand the customer’s problem, you can send them a response including a link to the company’s feedback form.
- In the input window, request a response to the customer complaint.
- If you want to generate another response from the FM, choose the refresh icon in the response section.
The original response and all new responses are paginated within the response section. Note that the new response is different from the original response. You can choose the copy icon in the response section to copy the response to an email or document, as required.
- You can also modify the model’s response by requesting specific changes. For example, let’s ask the model to add a $50 gift card offer to the email response.
Comparing model responses
You can compare the model responses from multiple models (up to three). Let’s compare two Amazon Bedrock models (Claude 2 and Jurassic-2 Ultra) with a SageMaker JumpStart model (Falcon-7B-Instruct) to evaluate and find the best model for your use case:
- Choose New chat to open a chat interface.
- On the model drop-down menu, choose Start up another model.
- On the Foundation models page, under Amazon SageMaker JumpStart models, choose Falcon-7B-Instruct and in the right pane, choose Start up model.
The model will take around 10 minutes to start.
- On the Foundation models page, confirm that the Falcon-7B-Instruct model is active before proceeding to the next step.
- Choose New chat to open a chat interface.
- Choose Compare to display a drop-down menu for the second model, then choose Compare again to display a drop-down menu for the third model.
- Choose the Falcon-7B-Instruct model on the first drop-down menu, Claude 2 on the second drop-down menu, and Jurassic-2 Ultra on the third drop-down menu.
- Enter your instructions in the chat input box and press Enter.
You will see responses from all three models.
Clean up
Any SageMaker JumpStart models started from SageMaker Canvas will be automatically shut down after 2 hours of inactivity. If you want to shut down these models sooner to save costs, follow the instructions in this section. Note that Amazon Bedrock models are not deployed in your account, so there is no need to shut these down.
- To shut down the Falcon-7B-Instruct SageMaker JumpStart model, you can choose from two methods:
- On the results comparison page, choose the Falcon-7B-Instruct model’s options menu (three dots), then choose Shut down model.
- Alternatively, choose New chat, and on the model drop-down menu, choose Start up another model. Then, on the Foundation models page, under Amazon SageMaker JumpStart models, choose Falcon-7B-Instruct and in the right pane, choose Shut down model.
- Choose Log out in the left pane to log out of the SageMaker Canvas application to stop the consumption of SageMaker Canvas workspace instance hours and release all resources used by the workspace instance.
Conclusion
In this post, you learned how to use SageMaker Canvas to generate text with ready-to-use models from Amazon Bedrock and SageMaker JumpStart. You used the Claude 2 model to analyze the sentiment of a customer complaint, ask questions, and generate a response without a single line of code. You also started a publicly available model and compared responses from three models.
For Amazon Bedrock models, you are charged based on the volume of input tokens and output tokens as per the Amazon Bedrock pricing page. Because SageMaker JumpStart models are deployed on SageMaker instances, you are charged for the duration of usage based on the instance type as per the Amazon SageMaker pricing page.
SageMaker Canvas continues to democratize AI with a no-code visual, interactive workspace that allows business analysts to build ML models that address a wide variety of use cases. Try out the new generative AI capabilities in SageMaker Canvas today! These capabilities are available in all Regions where Amazon Bedrock or SageMaker JumpStart are available.
About the Authors
Anand Iyer has been a Principal Solutions Architect at AWS since 2016. Anand has helped global healthcare, financial services, and telecommunications clients architect and implement enterprise software solutions using AWS and hybrid cloud technologies. He has an MS in Computer Science from Louisiana State University Baton Rouge, and an MBA from USC Marshall School of Business, Los Angeles. He is AWS certified in the areas of Security, Solutions Architecture, and DevOps Engineering.
Gavin Satur is a Principal Solutions Architect at Amazon Web Services. He works with enterprise customers to build strategic, well-architected solutions and is passionate about automation. Outside of work, he enjoys family time, tennis, cooking, and traveling.
Gunjan Jain is an AWS Solutions Architect in SoCal and primarily works with large financial services companies. He helps with cloud adoption, cloud optimization, and adopting best practices for being Well-Architected on the cloud.
Harpreet Dhanoa, a seasoned Senior Solutions Architect at AWS, has a strong background in designing and building scalable distributed systems. He is passionate about machine learning, observability, and analytics. He enjoys helping large-scale customers build their cloud enterprise strategy and transform their business in AWS. In his free time, Harpreet enjoys playing basketball with his two sons and spending time with his family.
Whisper models for automatic speech recognition now available in Amazon SageMaker JumpStart
Today, we’re excited to announce that the OpenAI Whisper foundation model is available for customers using Amazon SageMaker JumpStart. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680 thousand hours of labelled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. SageMaker JumpStart is the machine learning (ML) hub of SageMaker that provides access to foundation models in addition to built-in algorithms and end-to-end solution templates to help you quickly get started with ML.
You can also do ASR using Amazon Transcribe, a fully managed and continuously trained automatic speech recognition service.
In this post, we show you how to deploy the OpenAI Whisper model and invoke the model to transcribe and translate audio.
The OpenAI Whisper model uses the huggingface-pytorch-inference container. As a SageMaker JumpStart model hub customer, you can use ASR without having to maintain the model script outside of the SageMaker SDK. SageMaker JumpStart models also improve security posture with endpoints that enable network isolation.
Foundation models in SageMaker
SageMaker JumpStart provides access to a range of models from popular model hubs including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained on billions of parameters and can be adapted to a wide category of use cases, such as text summarization, generating digital art, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.
You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using pre-built notebooks from model providers. Because the models are hosted and deployed on AWS, you can trust that your data, whether used for evaluating the model or using it at scale, won’t be shared with third parties.
OpenAI Whisper foundation models
Whisper is a pre-trained model for ASR and speech translation. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford and others from OpenAI. The original code can be found in this GitHub repository.
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680 thousand hours of labelled speech data annotated using large-scale weak supervision. Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning.
The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.
Whisper checkpoints come in five configurations of varying model sizes. The smallest four are trained on either English-only or multilingual data. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints are available on the Hugging Face hub. The checkpoints are summarized in the following table with links to the models on the hub:
Model name | Number of parameters | Multilingual |
whisper-tiny | 39 M | Yes |
whisper-base | 74 M | Yes |
whisper-small | 244 M | Yes |
whisper-medium | 769 M | Yes |
whisper-large | 1550 M | Yes |
whisper-large-v2 | 1550 M | Yes |
Let’s explore how you can use Whisper models in SageMaker JumpStart.
OpenAI Whisper foundation models WER and latency comparison
The word error rate (WER) for different OpenAI Whisper models based on the LibriSpeech test-clean is shown in the following table. WER is a common metric for the performance of a speech recognition or machine translation system. It measures the difference between the reference text (the ground truth or the correct transcription) and the output of an ASR system in terms of the number of errors, including substitutions, insertions, and deletions that are needed to transform the ASR output into the reference text. These numbers have been taken from the Hugging Face website.
Model | WER (percent) |
whisper-tiny | 7.54 |
whisper-base | 5.08 |
whisper-small | 3.43 |
whisper-medium | 2.9 |
whisper-large | 3 |
whisper-large-v2 | 3 |
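As a quick illustration of how the WER metric described above is computed, the following sketch uses the open source jiwer package; the reference and hypothesis strings are made up.

```python
# pip install jiwer
import jiwer

reference = "we are living in very exciting times with machine learning"
hypothesis = "we are living in very exciting times with machine lighting"

# WER = (substitutions + insertions + deletions) / number of reference words
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # one substituted word out of ten -> 10.00%
```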
For this post, we took the following audio file and compared the latency of speech recognition across different Whisper models. Latency is the amount of time from the moment that a user sends a request until the time that your application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same audio file with the model hosted on an ml.g5.2xlarge instance.
Model | Average latency (s) | Model output |
whisper-tiny | 0.43 | We are living in very exciting times with machine lighting. The speed of ML model development will really actually increase. But you won’t get to that end state that we won in the next coming years. Unless we actually make these models more accessible to everybody. |
whisper-base | 0.49 | We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we won in the next coming years. Unless we actually make these models more accessible to everybody. |
whisper-small | 0.84 | We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody. |
whisper-medium | 1.5 | We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody. |
whisper-large | 1.96 | We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody. |
whisper-large-v2 | 1.98 | We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody. |
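For reference, averages like the ones above can be reproduced with a simple timing loop. The following sketch assumes a model already deployed behind a SageMaker predictor (deployment is covered in the walkthrough below) and the audio file loaded as bytes:

```python
import time

def average_latency(predictor, audio_bytes, n_requests=100):
    """Average end-to-end latency (in seconds) over repeated identical requests."""
    total = 0.0
    for _ in range(n_requests):
        start = time.perf_counter()
        predictor.predict(audio_bytes)  # transcription result is discarded here
        total += time.perf_counter() - start
    return total / n_requests
```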
Solution walkthrough
You can deploy Whisper models using the Amazon SageMaker console or an Amazon SageMaker notebook. In this post, we demonstrate both approaches: deploying through the SageMaker Studio console and deploying from a SageMaker notebook, and then using the deployed model for speech recognition and language translation. The code used in this post can be found in this GitHub notebook.
Let’s expand each step in detail.
Deploy Whisper from the console
- To get started with SageMaker JumpStart, open the Amazon SageMaker Studio console, go to the SageMaker JumpStart launch page, and select Get Started with JumpStart.
- To choose a Whisper model, you can either use the tabs at the top or use the search box at the top right, as shown in the following screenshot. For this example, enter `Whisper` in the search box at the top right and select the appropriate Whisper model from the dropdown menu.
- After you select the Whisper model, you can use the console to deploy the model. You can select an instance for deployment or use the default.
Deploy the foundation model from a SageMaker notebook
The steps to deploy the model and then use it for different tasks are as follows:
- Set up
- Select a model
- Retrieve artifacts and deploy an endpoint
- Use deployed model for ASR
- Use deployed model for language translation
- Clean up the endpoint
Set up
This notebook was tested on an ml.t3.medium instance in SageMaker Studio with the Python 3 (data science) kernel and in an Amazon SageMaker notebook instance with the `conda_python3` kernel.
Select a pre-trained model
Set up a SageMaker Session using Boto3, and then select the model ID that you want to deploy.
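A minimal setup might look like the following sketch; the model ID shown here is illustrative, so confirm the exact identifier in the SageMaker JumpStart model catalog or the accompanying notebook:

```python
import sagemaker

# Create a SageMaker session backed by the default Boto3 session and credentials
session = sagemaker.Session()

# JumpStart model ID for the Whisper large-v2 checkpoint (illustrative --
# verify the exact ID in the JumpStart catalog before deploying)
model_id = "huggingface-asr-whisper-large-v2"
```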
Retrieve artifacts and deploy an endpoint
Using SageMaker, you can perform inference on the pre-trained model, even without fine-tuning it first on a new dataset. To host the pre-trained model, create an instance of sagemaker.model.Model and deploy it. The following code uses the default instance ml.g5.2xlarge for the inference endpoint of a whisper-large-v2 model. You can deploy the model on other instance types by passing `instance_type` to the `JumpStartModel` class. The deployment might take a few minutes.
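A minimal deployment sketch with the JumpStartModel class from the SageMaker Python SDK might look like this; the instance type is passed explicitly here only to make the choice visible:

```python
from sagemaker.jumpstart.model import JumpStartModel

# Create a deployable model object for the selected JumpStart model ID
model = JumpStartModel(model_id=model_id, instance_type="ml.g5.2xlarge")

# Deploy a real-time inference endpoint; this can take several minutes
predictor = model.deploy()
```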
Automatic speech recognition
Next, you read the sample audio file, sample1.wav, from a SageMaker JumpStart public Amazon Simple Storage Service (Amazon S3) location and pass it to the predictor for speech recognition. You can replace this sample file with any other sample audio file, but make sure the .wav file is sampled at 16 kHz, because that sample rate is required by the automatic speech recognition models. The input audio file must be less than 30 seconds long.
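A minimal transcription call could then look like the following sketch; the local file name is illustrative, and the JumpStart predictor is assumed to handle serialization of the audio payload and deserialization of the JSON response:

```python
# Read a 16 kHz mono .wav file shorter than 30 seconds as raw bytes
with open("sample1.wav", "rb") as f:
    audio_bytes = f.read()

# Send the audio to the deployed endpoint and print the transcription
response = predictor.predict(audio_bytes)
print(response)
```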
This model supports many parameters when performing inference. They include:
- `max_length`: The model generates text until the output length reaches `max_length`. If specified, it must be a positive integer.
- `language` and `task`: Specify the output language and task here. The model supports the task of transcription or translation.
- `max_new_tokens`: The maximum number of tokens to generate.
- `num_return_sequences`: The number of output sequences returned. If specified, it must be a positive integer.
- `num_beams`: The number of beams used in the beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
- `no_repeat_ngram_size`: The model ensures that a sequence of words of `no_repeat_ngram_size` isn't repeated in the output sequence. If specified, it must be a positive integer greater than 1.
- `temperature`: This controls the randomness in the output. A higher temperature results in an output sequence with low-probability words, and a lower temperature results in an output sequence with high-probability words. If `temperature` approaches 0, it results in greedy decoding. If specified, it must be a positive float.
- `early_stopping`: If `True`, text generation is finished when all beam hypotheses reach the end-of-sentence token. If specified, it must be Boolean.
- `do_sample`: If `True`, sample the next word according to its likelihood. If specified, it must be Boolean.
- `top_k`: In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
- `top_p`: In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
You can specify any subset of the preceding parameters when invoking an endpoint. Next, we show you an example of how to invoke an endpoint with these arguments.
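The following sketch shows one way such a parameterized request could be assembled; the payload field names are illustrative, so check the accompanying notebook for the exact schema expected by the Whisper container:

```python
# Illustrative payload: audio plus a subset of the generation parameters above
payload = {
    "audio_input": audio_bytes.hex(),  # audio bytes encoded for a JSON payload
    "language": "english",
    "task": "transcribe",
    "max_new_tokens": 400,
    "num_beams": 4,
    "temperature": 0.0,
}

response = predictor.predict(payload)
print(response)
```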
Language translation
To showcase language translation using Whisper models, use the following audio file in French and translate it to English. The file must be sampled at 16 kHz (as required by the ASR models), so resample your files if necessary and make sure they don't exceed 30 seconds.
- Download `sample_french1.wav` from the SageMaker JumpStart public S3 location so it can be passed in the payload for translation by the Whisper model.
- Set the task parameter to `translate` and the language to `French` to force the Whisper model to perform speech translation.
- Use the predictor to predict the translation of the audio, as shown in the sketch after this list. If you receive a client error (error 413), check the payload size sent to the endpoint. Payloads for SageMaker InvokeEndpoint requests are limited to about 6 MB.
- The predictor returns the text output translated to English from the French audio file.
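A minimal sketch of the translation call, reusing the same deployed predictor and the illustrative payload schema from the previous section:

```python
# Read the French sample (resampled to 16 kHz if necessary) as raw bytes
with open("sample_french1.wav", "rb") as f:
    french_audio_bytes = f.read()

# Illustrative payload: setting the task to "translate" asks Whisper to
# produce English text regardless of the source language
payload = {
    "audio_input": french_audio_bytes.hex(),
    "language": "french",
    "task": "translate",
}

response = predictor.predict(payload)
print(response)
```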
Clean up
After you’ve tested the endpoint, delete the SageMaker inference endpoint and delete the model to avoid incurring charges.
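With the predictor returned by the SageMaker Python SDK, this can be done with two calls:

```python
# Delete the model and the endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()
```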
Conclusion
In this post, we showed you how to test and use OpenAI Whisper models to build interesting applications using Amazon SageMaker. Try out the foundation model in SageMaker today and let us know your feedback!
This guidance is for informational purposes only. You should still perform your own independent assessment and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the third-party model referenced in this guidance. AWS has no control or authority over the third-party model referenced in this guidance and does not make any representations or warranties that the third-party model is secure, virus-free, operational, or compatible with your production environment and standards. AWS does not make any representations, warranties, or guarantees that any information in this guidance will result in a particular outcome or result.
About the authors
Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He earned his master's degree from the Courant Institute of Mathematical Sciences and his B.Tech from IIT Delhi. He has experience working on a diverse range of machine learning problems in natural language processing, computer vision, and time series analysis.
Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He earned his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers at NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.