Host ML models on Amazon SageMaker using Triton: ONNX Models
ONNX (Open Neural Network Exchange) is an open-source standard for representing deep learning models widely supported by many providers. ONNX provides tools for optimizing and quantizing models to reduce the memory and compute needed to run machine learning (ML) models. One of the biggest benefits of ONNX is that it provides a standardized format for representing and exchanging ML models between different frameworks and tools. This allows developers to train their models in one framework and deploy them in another without the need for extensive model conversion or retraining. For these reasons, ONNX has gained significant importance in the ML community.
In this post, we showcase how to deploy ONNX-based models for multi-model endpoints (MMEs) that use GPUs. This is a continuation of the post Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints, where we showed how to deploy PyTorch and TensorRT versions of ResNet50 models on Nvidia’s Triton Inference server. In this post, we use the same ResNet50 model in ONNX format along with an additional natural language processing (NLP) example model in ONNX format to show how it can be deployed on Triton. Furthermore, we benchmark the ResNet50 model and see the performance benefits that ONNX provides when compared to PyTorch and TensorRT versions of the same model, using the same input.
ONNX Runtime
ONNX Runtime is a runtime engine for ML inference designed to optimize the performance of models across multiple hardware platforms, including CPUs and GPUs. It supports models exported from ML frameworks like PyTorch and TensorFlow. It facilitates performance tuning to run models cost-efficiently on the target hardware and has support for features like quantization and hardware acceleration, making it one of the ideal choices for deploying efficient, high-performance ML applications. For examples of how ONNX models can be optimized for Nvidia GPUs with TensorRT, refer to TensorRT Optimization (ORT-TRT) and ONNX Runtime with TensorRT optimization.
The Amazon SageMaker Triton container flow is depicted in the following diagram.
Users can send an HTTPS request with the input payload for real-time inference behind a SageMaker endpoint. The user can specify a TargetModel
header that contains the name of the model that the request in question is destined to invoke. Internally, the SageMaker Triton container implements an HTTP server with the same contracts as mentioned in How Containers Serve Requests. It has support for dynamic batching and supports all the backends that Triton provides. Based on the configuration, the ONNX runtime is invoked and the request is processed on CPU or GPU as predefined in the model configuration provided by the user.
Solution overview
To use the ONNX backend, complete the following steps:
- Compile the model to ONNX format.
- Configure the model.
- Create the SageMaker endpoint.
Prerequisites
Ensure that you have access to an AWS account with sufficient AWS Identity and Access Management (IAM) permissions to create a notebook, access an Amazon Simple Storage Service (Amazon S3) bucket, and deploy models to SageMaker endpoints. See Create execution role for more information.
Compile the model to ONNX format
The transformers library provides a convenient method to compile the PyTorch model to ONNX format. The following code achieves the transformation for the NLP model:
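As an illustrative sketch (the post itself relies on the Hugging Face conversion tool described next), the same export can be performed directly with torch.onnx.export, which makes the under-the-hood steps explicit. The model name, file name, tensor names, and opset shown here are assumptions:

```python
# Illustrative sketch only: model ID, file name, and opset are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

# Dummy inputs so ONNX can record the set of operations run
dummy = tokenizer("This is a sample input", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)
```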
Exporting models (either PyTorch or TensorFlow) is easily achieved through the conversion tool provided as part of the Hugging Face transformers repository.
The following is what happens under the hood:
- Allocate the model from transformers (PyTorch or TensorFlow).
- Forward dummy inputs through the model. This way, ONNX can record the set of operations run.
- The transformers library inherently takes care of dynamic axes when exporting the model.
- Save the graph along with the network parameters.
A similar mechanism is followed for the computer vision use case from the torchvision model zoo:
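A minimal sketch of that export, assuming the standard torchvision ResNet50 weights and illustrative tensor names:

```python
# Illustrative sketch: weights flag, file name, tensor names, and opset are assumptions.
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=13,
)
```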
Configure the model
In this section, we configure the computer vision and NLP models. We show how to deploy a pre-trained ResNet50 model and a pre-trained RoBERTa Large model on a SageMaker MME by using Triton Inference Server model configurations. The ResNet50 notebook is available on GitHub. The RoBERTa notebook is also available on GitHub. For ResNet50, we use the Docker approach to create an environment that already has all the dependencies required to build our ONNX model and generate the model artifacts needed for this exercise. This approach makes it much easier to share dependencies and create the exact environment that is needed to accomplish this task.
The first step is to create the ONNX model package per the directory structure specified in ONNX Models. Our aim is to use the minimal model repository for an ONNX model contained in a single file, as follows:
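For reference, a minimal Triton model repository for a single-file ONNX model looks like the following sketch (the model and version names are placeholders; for an MME, each model directory is packaged as its own .tar.gz archive):

```
model_repository/
└── roberta-large/          # one directory per model (packaged as roberta-large.tar.gz for MME)
    ├── config.pbtxt
    └── 1/                  # numeric model version
        └── model.onnx
```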
Next, we create the model configuration file that describes the inputs, outputs, and backend configurations for the Triton Server to pick up and invoke the appropriate kernels for ONNX. This file is known as config.pbtxt
and is shown in the following code for the RoBERTA use case. Note that the BATCH
dimension is omitted from the config.pbtxt
. However, when sending the data to the model, we include the batch dimension. The following code also shows how you can add this feature with model configuration files to set dynamic batching with a preferred batch size of 5 for the actual inference. With the current settings, the model instance is invoked instantly when the preferred batch size of 5 is met or the delay time of 100 microseconds has elapsed since the first request reached the dynamic batcher.
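The following is a minimal sketch of such a config.pbtxt for the RoBERTa model. The tensor names, dimensions, and max_batch_size are assumptions (they must match your exported ONNX graph); the dynamic_batching block reflects the preferred batch size of 5 and 100-microsecond delay described above.

```
name: "roberta-large"
backend: "onnxruntime"
max_batch_size: 128
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, 1024 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: 5
  max_queue_delay_microseconds: 100
}
```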
The following is a similar configuration file for the computer vision use case:
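A comparable sketch for the ResNet50 model might look as follows; again, the tensor names and shapes are assumptions that must match the exported graph.

```
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```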
Create the SageMaker endpoint
We use the Boto3 APIs to create the SageMaker endpoint. For this post, we show the steps for the RoBERTA notebook, but these are common steps and will be the same for the ResNet50 model as well.
Create a SageMaker model
We now create a SageMaker model. We use the Amazon Elastic Container Registry (Amazon ECR) image and the model artifact from the previous step to create the SageMaker model.
Create the container
To create the container, we pull the appropriate image from Amazon ECR for Triton Server. SageMaker allows us to customize and inject various environment variables. Some of the key features are the ability to set the BATCH_SIZE
; we can set this per model in the config.pbtxt
file, or we can define a default value here. For models that can benefit from larger shared memory size, we can set those values under SHM
variables. To enable logging, set the log verbose
level to true
. We use the following code to create the model to use in our endpoint:
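The following is a minimal boto3 sketch. The image URI, role, model name, S3 prefix, and shared memory values are placeholders, and the environment variable names follow the SageMaker Triton container conventions (verify them against the container documentation):

```python
import boto3

sm_client = boto3.client("sagemaker")

container = {
    "Image": triton_image_uri,                   # SageMaker Triton image URI (assumed to be defined)
    "ModelDataUrl": "s3://<bucket>/<prefix>/",   # S3 prefix that holds the model .tar.gz packages
    "Mode": "MultiModel",                        # multi-model endpoint
    "Environment": {
        # Shared memory sizes for models that benefit from larger SHM (placeholder values)
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "16777216",
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "1048576",
        # Verbose logging
        "SAGEMAKER_TRITON_LOG_VERBOSE": "true",
    },
}

create_model_response = sm_client.create_model(
    ModelName="triton-onnx-mme",    # placeholder name
    ExecutionRoleArn=role,          # assumed to be defined
    PrimaryContainer=container,
)
```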
Create a SageMaker endpoint
You can use any instances with multiple GPUs for testing. In this post, we use a g4dn.4xlarge instance. We don’t set the VolumeSizeInGB
parameters because this instance comes with local instance storage. The VolumeSizeInGB
parameter is applicable to GPU instances supporting the Amazon Elastic Block Store (Amazon EBS) volume attachment. We can leave the model download timeout and container startup health check at the default values. For more details, refer to CreateEndpointConfig.
Lastly, we create a SageMaker endpoint:
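A sketch of the endpoint configuration and endpoint creation calls follows; the names are placeholders and reuse the model created earlier:

```python
endpoint_config_name = "triton-onnx-mme-config"     # placeholder names
endpoint_name = "triton-onnx-mme-endpoint"

sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "triton-onnx-mme",
            "InstanceType": "ml.g4dn.4xlarge",      # local instance storage, so no VolumeSizeInGB
            "InitialInstanceCount": 1,
        }
    ],
)

sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
```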
Invoke the model endpoint
For this NLP model, we pass in the input_ids
and attention_mask
to the model as part of the payload. The following code shows how to create the tensors:
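A minimal sketch, assuming the roberta-large tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoding = tokenizer("This is a sample input sentence", return_tensors="np")

input_ids = encoding["input_ids"]            # shape: (1, sequence_length), int64
attention_mask = encoding["attention_mask"]  # shape: (1, sequence_length), int64
```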
We now create the appropriate payload by ensuring the data type matches what we configured in the config.pbtxt
. This also gives us the tensors with the batch dimension included, which is what Triton expects. We use the JSON format to invoke the model. Triton also provides a native binary invocation method for the model.
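The following sketch builds the Triton-style JSON payload and invokes the endpoint; the endpoint name and the model package name passed to TargetModel are placeholders:

```python
import json

import boto3

runtime_client = boto3.client("sagemaker-runtime")

payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": list(input_ids.shape),
            "datatype": "INT64",
            "data": input_ids.tolist(),
        },
        {
            "name": "attention_mask",
            "shape": list(attention_mask.shape),
            "datatype": "INT64",
            "data": attention_mask.tolist(),
        },
    ]
}

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,             # assumed to be defined above
    ContentType="application/json",
    Body=json.dumps(payload),
    TargetModel="roberta-large-v0.tar.gz",  # placeholder model package name
)
```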
Note the TargetModel
parameter in the preceding code. We send the name of the model to be invoked as a request header because this is a multi-model endpoint; therefore, we can invoke multiple models at runtime on an already deployed inference endpoint by changing this parameter. This shows the power of multi-model endpoints!
To output the response, we can use the following code:
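A minimal sketch of parsing the Triton-style JSON response:

```python
result = json.loads(response["Body"].read().decode("utf8"))

# Each entry in "outputs" carries the tensor name, datatype, shape, and flattened data
for output in result["outputs"]:
    print(output["name"], output["shape"])
```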
ONNX for performance tuning
The ONNX backend uses C++ arena memory allocation. Arena allocation is a C++-only feature that helps you optimize your memory usage and improve performance. Memory allocation and deallocation constitute a significant fraction of CPU time spent in protocol buffers code. By default, new object creation performs heap allocations for each object, each of its sub-objects, and several field types, such as strings. These allocations occur in bulk when parsing a message and when building new messages in memory, and associated deallocations happen when messages and their sub-object trees are freed.
Arena-based allocation has been designed to reduce this performance cost. With arena allocation, new objects are allocated out of a large piece of pre-allocated memory called the arena. Objects can all be freed at once by discarding the entire arena, ideally without running destructors of any contained object (though an arena can still maintain a destructor list when required). This makes object allocation faster by reducing it to a simple pointer increment, and makes deallocation almost free. Arena allocation also provides greater cache efficiency: when messages are parsed, they are more likely to be allocated in continuous memory, which makes traversing messages more likely to hit hot cache lines. The downside of arena-based allocation is the C++ heap memory will be over-allocated and stay allocated even after the objects are deallocated. This might lead to out of memory or high CPU memory usage. To achieve the best of both worlds, we use the following configurations provided by Triton and ONNX:
- arena_extend_strategy – This parameter refers to the strategy used to grow the memory arena with regards to the size of the model. We recommend setting the value to 1 (kSameAsRequested), which is not the default value. The reasoning is as follows: the drawback of the default arena extend strategy (kNextPowerOfTwo) is that it might allocate more memory than needed, which could be a waste. As the name suggests, kNextPowerOfTwo (the default) extends the arena by a power of 2, whereas kSameAsRequested extends by a size that is the same as the allocation request each time. kSameAsRequested is suited for advanced configurations where you know the expected memory usage in advance. In our testing, because we know the size of models is a constant value, we can safely choose kSameAsRequested.
- gpu_mem_limit – We set the value to the CUDA memory limit. To use all possible memory, pass in the maximum size_t. It defaults to SIZE_MAX if nothing is specified. We recommend keeping it as the default.
- enable_cpu_mem_arena – This enables the memory arena on CPU. The arena may pre-allocate memory for future usage. Set this option to false if you don’t want it. The default is True. If you disable the arena, heap memory allocation will take time, so inference latency will increase. In our testing, we left it as the default.
- enable_mem_pattern – This parameter refers to the internal memory allocation strategy based on input shapes. If the shapes are constant, we can enable this parameter to generate a memory pattern for the future and save some allocation time, making it faster. Use 1 to enable the memory pattern and 0 to disable it. It’s recommended to set this to 1 when the input features are expected to be the same. The default value is 1.
- do_copy_in_default_stream – In the context of the CUDA execution provider in ONNX, a compute stream is a sequence of CUDA operations that run asynchronously on the GPU. The ONNX runtime schedules operations in different streams based on their dependencies, which helps minimize the idle time of the GPU and achieve better performance. We recommend using the default setting of 1, which uses the same stream for copying and compute; however, you can use 0 to use separate streams for copying and compute, which might result in the device pipelining the two activities. In our testing of the ResNet50 model, we used both 0 and 1 but couldn’t find any appreciable difference between the two in terms of performance and memory consumption of the GPU device.
- Graph optimization – The ONNX backend for Triton supports several parameters that help fine-tune the model size as well as the runtime performance of the deployed model. When the model is converted to the ONNX representation (the first box in the following diagram at the IR stage), the ONNX runtime provides graph optimizations at three levels: basic, extended, and layout optimizations. You can activate all levels of graph optimizations by adding the corresponding parameters in the model configuration file (see the configuration sketch after this list).
- cudnn_conv_algo_search – Because we’re using CUDA-based Nvidia GPUs in our testing, for our computer vision use case with the ResNet50 model, we can use the CUDA execution provider-based optimization at the fourth layer in the following diagram with the cudnn_conv_algo_search parameter. The default option is exhaustive (0), but when we changed this configuration to 1 (HEURISTIC), we saw the model latency in steady state reduce to 160 milliseconds. This happens because the ONNX runtime invokes the lighter-weight cudnnGetConvolutionForwardAlgorithm_v7 forward pass and therefore reduces latency with adequate performance.
- Run mode – The next step is selecting the correct execution_mode at layer 5 in the following diagram. This parameter controls whether you want to run operators in your graph sequentially or in parallel. When the model has many branches in its graph, setting this option to ExecutionMode.ORT_PARALLEL (1) will usually give you better performance. The default mode is sequential, so you can enable parallel execution to suit your needs.
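The following is a hedged sketch of how these options might appear in config.pbtxt. The graph optimization level uses Triton’s optimization.graph setting; the remaining keys follow the Triton ONNX Runtime backend parameter conventions, but the exact key names and where each belongs (top-level parameters versus the CUDA execution accelerator block) can vary by backend version, so verify them against the backend documentation before use.

```
optimization {
  graph : { level : 1 }
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "cuda"
        parameters { key: "cudnn_conv_algo_search" value: "1" }
        parameters { key: "do_copy_in_default_stream" value: "1" }
      }
    ]
  }
}
parameters { key: "execution_mode" value: { string_value: "1" } }
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
parameters { key: "enable_mem_pattern" value: { string_value: "1" } }
```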
For a deeper understanding of the opportunities for performance tuning in ONNX, refer to the following figure.
Benchmark numbers and performance tuning
By turning on the graph optimizations, cudnn_conv_algo_search
, and parallel run mode parameters in our testing of the ResNet50 model, we saw the cold start time of the ONNX model graph reduce from 4.4 seconds to 1.61 seconds. An example of a complete model configuration file is provided in the ONNX configuration section of the following notebook.
The testing benchmark results are as follows:
- PyTorch – 176 milliseconds, cold start 6 seconds
- TensorRT – 174 milliseconds, cold start 4.5 seconds
- ONNX – 168 milliseconds, cold start 4.4 seconds
The following graphs visualize these metrics.
Furthermore, based on our testing of computer vision use cases, we recommend sending the request payload in binary format using the HTTP client provided by Triton, because it significantly improves model invoke latency.
Other parameters that SageMaker exposes for ONNX on Triton are as follows:
- Dynamic batching – Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. Creating a batch of requests typically results in increased throughput. The dynamic batcher should be used for stateless models. The dynamically created batches are distributed to all model instances configured for the model.
- Maximum batch size – The max_batch_size property indicates the maximum batch size that the model supports for the types of batching that can be exploited by Triton. If the model’s batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case, max_batch_size should be set to a value greater than or equal to 1, which indicates the maximum batch size that Triton should use with the model.
- Default max batch size – The default-max-batch-size value is used for max_batch_size during autocomplete when no other value is found. The onnxruntime backend will set the max_batch_size of the model to this default value if autocomplete has determined the model is capable of batching requests and max_batch_size is 0 in the model configuration or max_batch_size is omitted from the model configuration. If max_batch_size is more than 1 and no scheduler is provided, the dynamic batch scheduler will be used. The default max batch size is 4.
Clean up
Ensure that you delete the model, model configuration, and model endpoint after running the notebook. The steps to do this are provided at the end of the sample notebook in the GitHub repo.
Conclusion
In this post, we dove deep into the ONNX backend that Triton Inference Server supports on SageMaker. This backend provides for GPU acceleration of your ONNX models. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to use this capability using single-model and multi-model endpoints. MMEs allow a better balance of performance and cost savings. To get started with MME support for GPU, see Host multiple models in one container behind one endpoint.
We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.
About the authors
Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.
Fast-track graph ML with GraphStorm: A new way to solve problems on enterprise-scale graphs
We are excited to announce the open-source release of GraphStorm 0.1, a low-code enterprise graph machine learning (ML) framework to build, train, and deploy graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly take into account the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, and search/retrieval problems.
Until now, it has been notoriously hard to build, train, and deploy graph ML solutions for complex enterprise graphs that easily have billions of nodes, hundreds of billions of edges, and dozens of attributes—just think about a graph capturing Amazon.com products, product attributes, customers, and more. With GraphStorm, we release the tools that Amazon uses internally to bring large-scale graph ML solutions to production. GraphStorm doesn’t require you to be an expert in graph ML and is available under the Apache v2.0 license on GitHub. To learn more about GraphStorm, visit the GitHub repository.
In this post, we provide an introduction to GraphStorm, its architecture, and an example use case of how to use it.
Introducing GraphStorm
Graph algorithms and graph ML are emerging as state-of-the-art solutions for many important business problems like predicting transaction risks, anticipating customer preferences, detecting intrusions, optimizing supply chains, social network analysis, and traffic prediction. For example, Amazon GuardDuty, the native AWS threat detection service, uses a graph with billions of edges to improve the coverage and accuracy of its threat intelligence. This allows GuardDuty to categorize previously unseen domains as highly likely to be malicious or benign based on their association to known malicious domains. By using Graph Neural Networks (GNNs), GuardDuty is able to enhance its capability to alert customers.
However, developing, launching, and operating graph ML solutions takes months and requires graph ML expertise. As a first step, a graph ML scientist has to build a graph ML model for a given use case using a framework like the Deep Graph Library (DGL). Training such models is challenging due to the size and complexity of graphs in enterprise applications, which routinely reach billions of nodes, hundreds of billions of edges, different node and edge types, and hundreds of node and edge attributes. Enterprise graphs can require terabytes of memory storage, requiring graph ML scientists to build complex training pipelines. Finally, after a model has been trained, they have to be deployed for inference, which requires inference pipelines that are just as difficult to build as the training pipelines.
GraphStorm 0.1 is a low-code enterprise graph ML framework that allows ML practitioners to easily pick predefined graph ML models that have been proven to be effective, run distributed training on graphs with billions of nodes, and deploy the models into production. GraphStorm offers a collection of built-in graph ML models, such as Relational Graph Convolutional Networks (RGCN), Relational Graph Attention Networks (RGAT), and Heterogeneous Graph Transformer (HGT) for enterprise applications with heterogeneous graphs, which allow ML engineers with little graph ML expertise to try out different model solutions for their task and select the right one quickly. End-to-end distributed training and inference pipelines, which scale to billion-scale enterprise graphs, make it easy to train, deploy, and run inference. If you are new to GraphStorm or graph ML in general, you will benefit from the pre-defined models and pipelines. If you are an expert, you have all options to tune the training pipeline and model architecture to get the best performance. GraphStorm is built on top of the DGL, a widely popular framework for developing GNN models, and available as open-source code under the Apache v2.0 license.
“GraphStorm is designed to help customers experiment and operationalize graph ML methods for industry applications to accelerate the adoption of graph ML,” says George Karypis, Senior Principal Scientist in Amazon AI/ML research. “Since its release inside Amazon, GraphStorm has reduced the effort to build graph ML-based solutions by up to five times.”
“GraphStorm enables our team to train GNN embedding in a self-supervised manner on a graph with 288 million nodes and 2 billion edges,” says Haining Yu, Principal Applied Scientist at Amazon Measurement, Ad Tech, and Data Science. “The pre-trained GNN embeddings show a 24% improvement on a shopper activity prediction task over a state-of-the-art BERT-based baseline; it also exceeds benchmark performance in other ads applications.”
“Before GraphStorm, customers could only scale vertically to handle graphs of 500 million edges,” says Brad Bebee, GM for Amazon Neptune and Amazon Timestream. “GraphStorm enables customers to scale GNN model training on massive Amazon Neptune graphs with tens of billions of edges.”
GraphStorm technical architecture
The following figure shows the technical architecture of GraphStorm.
GraphStorm is built on top of PyTorch and can run on a single GPU, multiple GPUs, and multiple GPU machines. It consists of three layers (marked in the yellow boxes in the preceding figure):
- Bottom layer (Dist GraphEngine) – The bottom layer provides the basic components to enable distributed graph ML, including distributed graphs, distributed tensors, distributed embeddings, and distributed samplers. GraphStorm provides efficient implementations of these components to scale graph ML training to billion-node graphs.
- Middle layer (GS training/inference pipeline) – The middle layer provides trainers, evaluators, and predictors to simplify model training and inference for both built-in models and your custom models. Basically, by using the API of this layer, you can focus on the model development without worrying about how to scale the model training.
- Top layer (GS general model zoo) – The top layer is a model zoo with popular GNN and non-GNN models for different graph types. As of this writing, it provides RGCN, RGAT, and HGT for heterogeneous graphs and BERTGNN for textual graphs. In the future, we will add support for temporal graph models such as TGAT for temporal graphs as well as TransE and DistMult for knowledge graphs.
How to use GraphStorm
After installing GraphStorm, you only need three steps to build and train graph ML models for your application.
First, you preprocess your data (potentially including your custom feature engineering) and transform it into a table format required by GraphStorm. For each node type, you define a table that lists all nodes of that type and their features, providing a unique ID for each node. For each edge type, you similarly define a table in which each row contains the source and destination node IDs for an edge of that type (for more information, see Use Your Own Data Tutorial). In addition, you provide a JSON file that describes the overall graph structure.
Second, via the command line interface (CLI), you use GraphStorm’s built-in construct_graph
component for some GraphStorm-specific data processing, which enables efficient distributed training and inference.
Third, you configure the model and training in a YAML file (example) and, again using the CLI, invoke one of the five built-in components (gs_node_classification
, gs_node_regression
, gs_edge_classification
, gs_edge_regression
, gs_link_prediction
) as training pipelines to train the model. This step results in the trained model artifacts. To do inference, you need to repeat the first two steps to transform the inference data into a graph using the same GraphStorm component (construct_graph
) as before.
Finally, you can invoke one of the five built-in components, the same that was used for model training, as an inference pipeline to generate embeddings or prediction results.
The overall flow is also depicted in the following figure.
In the following section, we provide an example use case.
Make predictions on raw OAG data
For this post, we demonstrate how easily GraphStorm can enable graph ML training and inference on a large raw dataset. The Open Academic Graph (OAG) contains five entities (papers, authors, venues, affiliations, and field of study). The raw dataset is stored in JSON files totaling over 500 GB.
Our task is to build a model to predict the field of study of a paper. To predict the field of study, you can formulate it as a multi-label classification task, but it’s difficult to use one-hot encoding to store the labels because there are hundreds of thousands of fields. Therefore, you should create field of study nodes and formulate this problem as a link prediction task, predicting which field of study nodes a paper node should connect to.
To model this dataset with a graph method, the first step is to process the dataset and extract entities and edges. You can extract five types of edges from the JSON files to define a graph, shown in the following figure. You can use the Jupyter notebook in the GraphStorm example code to process the dataset and generate five entity tables for each entity type and five edge tables for each edge type. The Jupyter notebook also generates BERT embeddings on the entities with text data, such as papers.
After defining the entities and edges between the entities, you can create mag_bert.json
, which defines the graph schema, and invoke the built-in graph construction pipeline construct_graph
in GraphStorm to build the graph (see the following code). Even though the GraphStorm graph construction pipeline runs in a single machine, it supports multi-processing to process nodes and edge features in parallel (--num_processes
) and can store entity and edge features on external memory (--ext-mem-workspace
) to scale to large datasets.
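The following is a hypothetical sketch of that invocation. Apart from --num_processes and --ext-mem-workspace, which are mentioned above, the module path, flag spellings, and paths are assumptions and may differ between GraphStorm versions, so check the GraphStorm documentation before running it.

```bash
# Sketch only: flag spellings and paths are assumptions; consult the GraphStorm docs.
python3 -m graphstorm.gconstruct.construct_graph \
    --conf-file mag_bert.json \
    --output-dir /data/mag_bert_constructed \
    --graph-name mag \
    --num-parts 4 \
    --num_processes 16 \
    --ext-mem-workspace /mnt/ext_mem
```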
To process such a large graph, you need a large-memory CPU instance to construct the graph. You can use an Amazon Elastic Compute Cloud (Amazon EC2) r6id.32xlarge instance (128 vCPU and 1 TB RAM) or r6a.48xlarge instances (192 vCPU and 1.5 TB RAM) to construct the OAG graph.
After constructing a graph, you can use gs_link_prediction
to train a link prediction model on four g5.48xlarge instances. When using the built-in models, you only invoke one command line to launch the distributed training job. See the following code:
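The following is a hypothetical sketch of that single launch command; the launcher arguments (IP list, number of trainers and servers, partition config, and YAML file) are assumptions and should be adapted to your cluster and GraphStorm version.

```bash
# Sketch only: argument names and paths are assumptions; adapt to your setup.
python3 -m graphstorm.run.gs_link_prediction \
    --workspace /data \
    --part-config /data/mag_bert_constructed/mag.json \
    --ip-config /data/ip_list.txt \
    --num-trainers 8 \
    --num-servers 1 \
    --cf mag_lp.yaml \
    --save-model-path /data/mag_lp_model
```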
After the model training, the model artifact is saved in the folder /data/mag_lp_model
.
Now you can run link prediction inference to generate GNN embeddings and evaluate the model performance. GraphStorm provides multiple built-in evaluation metrics to evaluate model performance. For link prediction problems, for example, GraphStorm automatically outputs the metric mean reciprocal rank (MRR). MRR is a valuable metric for evaluating graph link prediction models because it assesses how high the actual links are ranked among the predicted links. This captures the quality of predictions, making sure our model correctly prioritizes true connections, which is our objective here.
You can run inference with one command line, as shown in the following code. In this case, the model reaches an MRR of 0.31 on the test set of the constructed graph.
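A hypothetical sketch of that inference command follows; it mirrors the training launch, and the flag names and paths are assumptions to verify against the GraphStorm documentation.

```bash
# Sketch only: flags and paths are assumptions.
python3 -m graphstorm.run.gs_link_prediction \
    --inference \
    --workspace /data \
    --part-config /data/mag_bert_constructed/mag.json \
    --ip-config /data/ip_list.txt \
    --cf mag_lp.yaml \
    --restore-model-path /data/mag_lp_model \
    --save-embed-path /data/mag_lp_emb
```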
Note that the inference pipeline generates embeddings from the link prediction model. To solve the problem of finding the field of study for any given paper, simply perform a k-nearest neighbor search on the embeddings.
Conclusion
GraphStorm is a new graph ML framework that makes it easy to build, train, and deploy graph ML models on industry graphs. It addresses some key challenges in graph ML, including scalability and usability. It provides built-in components to process billion-scale graphs from raw input data to model training and model inference and has enabled multiple Amazon teams to train state-of-the-art graph ML models in various applications. Check out our GitHub repository for more information.
About the Authors
Da Zheng is a senior applied scientist at AWS AI/ML research leading a graph machine learning team to develop techniques and frameworks to put graph machine learning in production. Da got his PhD in computer science from the Johns Hopkins University.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting advanced science teams like the graph machine learning group and improving products like Amazon DataZone with ML capabilities. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems/robotics scientist – a field in which he holds a PhD.
More-inclusive speech recognition with cross-utterance rescoring
In a top-3% paper at ICASSP, Amazon researchers adapt graph-based label propagation to improve speech recognition on underrepresented pronunciations.
Get started with the open-source Amazon SageMaker Distribution
Data scientists need a consistent and reproducible environment for machine learning (ML) and data science workloads that enables managing dependencies and is secure. AWS Deep Learning Containers already provides pre-built Docker images for training and serving models in common frameworks such as TensorFlow, PyTorch, and MXNet. To improve this experience, we announced a public beta of the SageMaker open-source distribution at 2023 JupyterCon. This provides a unified end-to-end ML experience across ML developers of varying levels of expertise. Developers no longer need to switch between different framework containers for experimentation, or as they move from local JupyterLab environments and SageMaker notebooks to production jobs on SageMaker. The open-source SageMaker Distribution supports the most common packages and libraries for data science, ML, and visualization, such as TensorFlow, PyTorch, Scikit-learn, Pandas, and Matplotlib. You can start using the container from the Amazon ECR Public Gallery starting today.
In this post, we show you how you can use the SageMaker open-source distribution to quickly experiment on your local environment and easily promote them to jobs on SageMaker.
Solution overview
For our example, we showcase training an image classification model using PyTorch. We use the KMNIST dataset available publicly on PyTorch. We train a neural network model, test the model’s performance, and finally print the training and test loss. The full notebook for this example is available in the SageMaker Studio Lab examples repository. We start experimentation on a local laptop using the open-source distribution, move it to Amazon SageMaker Studio to use a larger instance, and then schedule the notebook as a notebook job.
Prerequisites
You need the following prerequisites:
- Docker installed.
- An active AWS account with administrator permissions.
- An environment with the AWS Command Line Interface (AWS CLI) and Docker installed.
- An existing SageMaker domain. To create a domain, refer to Onboard to Amazon SageMaker Domain.
Set up your local environment
You can directly start using the open-source distribution on your local laptop. To start JupyterLab, run the following commands on your terminal:
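A minimal sketch of that command follows; the repository path and tag are assumptions, so browse the Amazon ECR Public Gallery for the current image tags, and omit the trailing command if the image’s default entrypoint already starts JupyterLab.

```bash
# Sketch: repository path and tag are assumptions (use a *-gpu tag on GPU machines).
export ECR_IMAGE_ID='latest-cpu'
docker run -it \
    -p 8888:8888 \
    public.ecr.aws/sagemaker/sagemaker-distribution:$ECR_IMAGE_ID \
    jupyter-lab --no-browser --ip 0.0.0.0
```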
You can replace ECR_IMAGE_ID
with any of the image tags available in the Amazon ECR Public Gallery, or choose the latest-gpu
tag if you are using a machine that supports GPU.
This command will start JupyterLab and provide a URL on the terminal, like http://127.0.0.1:8888/lab?token=<token>
. Copy the link and enter it in your preferred browser to start JupyterLab.
Set up Studio
Studio is an end-to-end integrated development environment (IDE) for ML that lets developers and data scientists build, train, deploy, and monitor ML models at scale. Studio provides an extensive list of first-party images with common frameworks and packages, such as Data Science, TensorFlow, PyTorch, and Spark. These images make it simple for data scientists to get started with ML by simply choosing a framework and instance type of their choice for compute.
You can now use the SageMaker open-source distribution on Studio using Studio’s bring your own image feature. To add the open-source distribution to your SageMaker domain, complete the following steps:
- Add the open-source distribution to your account’s Amazon Elastic Container Registry (Amazon ECR) repository by running commands on your terminal (a combined CLI sketch for this and the following step appears after this list):
- Create a SageMaker image and attach the image to the Studio domain:
- On the SageMaker console, launch Studio by choosing your domain and existing user profile.
- Optionally, restart Studio by following the steps in Shut down and update SageMaker Studio.
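The following is a combined, hedged sketch of the first two steps above. The Region, account ID, repository and image names, role ARN, domain ID, kernel name, and image tag are all placeholders or assumptions; adjust them to your environment before running.

```bash
# Step 1 (sketch): copy the public image into your private Amazon ECR repository.
export REGION=us-east-1
export ACCOUNT_ID=<your-account-id>
export ECR_IMAGE_ID=latest-cpu        # must match the tag you want to use
export REPO=sagemaker-distribution    # placeholder repository name

aws ecr create-repository --repository-name $REPO --region $REGION
aws ecr get-login-password --region $REGION | \
    docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker pull public.ecr.aws/sagemaker/sagemaker-distribution:$ECR_IMAGE_ID
docker tag public.ecr.aws/sagemaker/sagemaker-distribution:$ECR_IMAGE_ID \
    $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:$ECR_IMAGE_ID
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:$ECR_IMAGE_ID

# Step 2 (sketch): register the image with SageMaker and attach it to the Studio domain.
# The kernel name must match the kernelspec inside the image (assumed here to be python3).
aws sagemaker create-image --image-name sagemaker-distribution \
    --role-arn <execution-role-arn> --region $REGION
aws sagemaker create-image-version --image-name sagemaker-distribution \
    --base-image $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:$ECR_IMAGE_ID --region $REGION
aws sagemaker create-app-image-config --app-image-config-name sagemaker-distribution-config \
    --kernel-gateway-image-config '{"KernelSpecs": [{"Name": "python3"}]}' --region $REGION
aws sagemaker update-domain --domain-id <your-domain-id> --region $REGION \
    --default-user-settings '{"KernelGatewayAppSettings": {"CustomImages": [{"ImageName": "sagemaker-distribution", "AppImageConfigName": "sagemaker-distribution-config"}]}}'
```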
Download the notebook
Download the sample notebook locally from the GitHub repo.
Open the notebook in your choice of IDE and add a cell to the beginning of the notebook to install torchsummary
. The torchsummary
package is not part of the distribution, and installing this on the notebook will ensure the notebook runs end to end. We recommend using conda
or micromamba
to manage environments and dependencies. Add the following cell to the notebook and save the notebook:
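A minimal sketch of that cell is shown below; pip is used here for brevity, and a conda or micromamba install works as well if the package is available on your channels.

```python
# First cell of the notebook: install torchsummary, which is not part of the distribution.
%pip install torchsummary
```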
Experiment on the local notebook
Upload the notebook to the JupyterLab UI you launched by choosing the upload icon as shown in the following screenshot.
When it’s uploaded, launch the cv-kmnist.ipynb
notebook. You can start running the cells immediately, without having to install any dependencies such as torch, matplotlib, or ipywidgets.
If you followed the preceding steps, you can see that you can use the distribution locally from your laptop. In the next step, we use the same distribution on Studio to take advantage of Studio’s features.
Move the experimentation to Studio (optional)
Optionally, let’s promote the experimentation to Studio. One of the advantages of Studio is that the underlying compute resources are fully elastic, so you can easily dial the available resources up or down, and the changes take place automatically in the background without interrupting your work. If you wanted to run the same notebook from earlier on a larger dataset and compute instance, you can migrate to Studio.
Navigate to the Studio UI you launched earlier and choose the upload icon to upload the notebook.
After you launch the notebook, you will be prompted to choose the image and instance type. On the kernel launcher, choose sagemaker-runtime
as the image and an ml.t3.medium
instance, then choose Select.
You can now run the notebook end to end without needing any changes on the notebook from your local development environment to Studio notebooks!
Schedule the notebook as a job
When you’re done with your experimentation, SageMaker provides multiple options to productionalize your notebook, such as training jobs and SageMaker pipelines. One such option is to directly run the notebook itself as a non-interactive, scheduled notebook job using SageMaker notebook jobs. For example, you might want to retrain your model periodically, or get inferences on incoming data periodically and generate reports for consumption by your stakeholders.
From Studio, choose the notebook job icon to launch the notebook job. If you have installed the notebook jobs extension locally on your laptop, you can also schedule the notebook directly from your laptop. See Installation Guide to set up the notebook jobs extension locally.
The notebook job automatically uses the ECR image URI of the open-source distribution, so you can directly schedule the notebook job.
Choose Run on schedule, choose a schedule, for example every week on Saturday, and choose Create. You can also choose Run now if you’d like to view the results immediately.
When the first notebook job is complete, you can view the notebook outputs directly from the Studio UI by choosing Notebook under Output files.
Additional considerations
In addition to using the publicly available ECR image directly for ML workloads, the open-source distribution offers the following advantages:
- The Dockerfile used to build the image is available publicly for developers to explore and build their own images. You can also inherit this image as the base image and install your custom libraries to have a reproducible environment.
- If you’re not used to Docker and prefer to use Conda environments on your JupyterLab environment, we provide an
env.out
file for each of the published versions. You can use the instructions in the file to create your own Conda environment that will mimic the same environment. For example, see the CPU environment file cpu.env.out. - You can use the GPU versions of the image to run GPU-compatible workloads such as deep learning and image processing.
Clean up
Complete the following steps to clean up your resources:
- If you have scheduled your notebook to run on a schedule, pause or delete the schedule on the Notebook Job Definitions tab to avoid paying for future jobs.
- Shut down all Studio apps to avoid paying for unused compute usage. See Shut down and Update Studio Apps for instructions.
- Optionally, delete the Studio domain if you created one.
Conclusion
Maintaining a reproducible environment across different stages of the ML lifecycle is one of the biggest challenges for data scientists and developers. With the SageMaker open-source distribution, we provide an image with mutually compatible versions of the most common ML frameworks and packages. The distribution is also open source, providing developers with transparency into the packages and build processes, making it easier to customize their own distribution.
In this post, we showed you how to use the distribution on your local environment, on Studio, and as the container for your training jobs. This feature is currently in public beta. We encourage you to try this out and share your feedback and issues on the public GitHub repository!
About the authors
Durga Sury is an ML Solutions Architect on the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and long walks with her 5-year-old husky.
Ketan Vijayvargiya is a Senior Software Development Engineer in Amazon Web Services (AWS). His focus areas are machine learning, distributed systems and open source. Outside work, he likes to spend his time self-hosting and enjoying nature.
Exploring Generative AI in conversational experiences: An Introduction with Amazon Lex, Langchain, and SageMaker Jumpstart
Customers expect quick and efficient service from businesses in today’s fast-paced world. But providing excellent customer service can be significantly challenging when the volume of inquiries outpaces the human resources employed to address them. However, businesses can meet this challenge while providing personalized and efficient customer service with the advancements in generative artificial intelligence (generative AI) powered by large language models (LLMs).
Generative AI chatbots have gained widespread attention for their ability to imitate human intellect. However, unlike task-oriented bots, these bots use LLMs for text analysis and content generation. LLMs are based on the Transformer architecture, a deep learning neural network introduced in June 2017 that can be trained on a massive corpus of unlabeled text. This approach creates a more human-like conversation experience and accommodates several topics.
As of this writing, companies of all sizes want to use this technology but need help figuring out where to start. If you are looking to get started with generative AI and the use of LLMs in conversational AI, this post is for you. We have included a sample project to quickly deploy an Amazon Lex bot that consumes a pre-trained open-source LLM. The code also includes the starting point to implement a custom memory manager. This mechanism allows an LLM to recall previous interactions to keep the conversation’s context and pace. Finally, it’s essential to highlight the importance of experimenting with fine-tuning prompts and LLM randomness and determinism parameters to obtain consistent results.
Solution overview
The solution integrates an Amazon Lex bot with a popular open-source LLM from Amazon SageMaker JumpStart, accessible through an Amazon SageMaker endpoint. We also use LangChain, a popular framework that simplifies LLM-powered applications. Finally, we use a QnABot to provide a user interface for our chatbot.
First, we start by describing each component in the preceding diagram:
- JumpStart offers pre-trained open-source models for various problem types. This enables you to begin machine learning (ML) quickly. It includes the FLAN-T5-XL model, an LLM deployed into a deep learning container. It performs well on various natural language processing (NLP) tasks, including text generation.
- A SageMaker real-time inference endpoint enables fast, scalable deployment of ML models for predicting events. With the ability to integrate with Lambda functions, the endpoint allows for building custom applications.
- The AWS Lambda function uses the requests from the Amazon Lex bot or the QnABot to prepare the payload to invoke the SageMaker endpoint using LangChain. LangChain is a framework that lets developers create applications powered by LLMs.
- The Amazon Lex V2 bot has the built-in
AMAZON.FallbackIntent
intent type. It is triggered when a user’s input doesn’t match any intents in the bot. - The QnABot is an open-source AWS solution to provide a user interface to Amazon Lex bots. We configured it with a Lambda hook function for a
CustomNoMatches
item, and it triggers the Lambda function when QnABot can’t find an answer. We assume you have already deployed it and included the steps to configure it in the following sections.
The solution is described at a high level in the following sequence diagram.
Major tasks performed by the solution
In this section, we look at the major tasks performed in our solution. This solution’s entire project source code is available for your reference in this GitHub repository.
Handling chatbot fallbacks
The Lambda function handles the “don’t know” answers via AMAZON.FallbackIntent
in Amazon Lex V2 and the CustomNoMatches
item in QnABot. When triggered, this function looks at the request for a session and the fallback intent. If there is a match, it hands off the request to a Lex V2 dispatcher; otherwise, the QnABot dispatcher uses the request. See the following code:
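The following is a minimal sketch of that dispatch logic; the function names and event fields shown are assumptions, and the full implementation is in the companion GitHub repository.

```python
# Sketch of the fallback routing (names and event fields are assumptions).
def dispatch_lexv2(request):
    """Hand the request off to the Lex V2 dispatcher (sketched elsewhere)."""
    ...

def dispatch_qnabot(request):
    """Hand the request off to the QnABot dispatcher (sketched elsewhere)."""
    ...

def lambda_handler(event, context):
    # A native Lex V2 invocation carries sessionState at the top level;
    # QnABot forwards its request under a different envelope.
    if "sessionState" in event:
        if event["sessionState"]["intent"]["name"] == "FallbackIntent":
            return dispatch_lexv2(event)
    return dispatch_qnabot(event)
```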
Providing memory to our LLM
To preserve the LLM memory in a multi-turn conversation, the Lambda function includes a LangChain custom memory class mechanism that uses the Amazon Lex V2 Sessions API to keep track of the session attributes with the ongoing multi-turn conversation messages and to provide context to the conversational model via previous interactions. See the following code:
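A minimal sketch of such a memory class follows; the class and attribute names are assumptions, and the full implementation is in the companion GitHub repository.

```python
from typing import Any, Dict, List

from langchain.schema import BaseMemory


class LexConversationalMemory(BaseMemory):
    """Exposes chat history that the Lambda function loaded from the Lex V2
    session attributes, so the LLM sees previous turns of the conversation."""

    lex_conv_history: str = ""        # history string pulled from the session attributes
    memory_key: str = "chat_history"  # prompt variable this memory fills

    @property
    def memory_variables(self) -> List[str]:
        return [self.memory_key]

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, str]:
        # Expose the Lex-managed history to the prompt under {chat_history}
        return {self.memory_key: self.lex_conv_history}

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        # The Lambda function writes the new turn back into the Lex session
        # attributes after the chain returns, so nothing is stored in-process here.
        pass

    def clear(self) -> None:
        self.lex_conv_history = ""
```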
The following is the sample code we created for introducing the custom memory class in a LangChain ConversationChain:
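A hedged sketch is shown here: endpoint_name, content_handler, conversation_history, user_utterance, and prompt_template (defined in the next section) are assumed to exist in the Lambda function, and the model_kwargs values are illustrative.

```python
from langchain.chains import ConversationChain
from langchain.llms.sagemaker_endpoint import SagemakerEndpoint

# Sketch: the LLM wraps the SageMaker endpoint hosting the FLAN-T5-XL model.
sm_flan_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name="us-east-1",
    content_handler=content_handler,
    model_kwargs={"temperature": 0.0, "max_length": 256},
)

conversation = ConversationChain(
    llm=sm_flan_llm,
    prompt=prompt_template,  # defined in the Prompt definition section below
    memory=LexConversationalMemory(lex_conv_history=conversation_history),
    verbose=True,
)

llm_response = conversation.predict(input=user_utterance)
```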
Prompt definition
A prompt for an LLM is a question or statement that sets the tone for the generated response. Prompts function as a form of context that helps direct the model toward generating relevant responses. See the following code:
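A sketch of such a prompt follows; the wording and variable names are illustrative and should be tuned for your use case, with chat_history filled by the custom memory class.

```python
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["chat_history", "input"],
    template=(
        "The following is a friendly conversation between a human and an AI. "
        "The AI is talkative and provides lots of specific details from its context. "
        "If the AI does not know the answer to a question, it truthfully says it does not know.\n\n"
        "Conversation history:\n{chat_history}\n\n"
        "Human: {input}\n"
        "AI:"
    ),
)
```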
Using an Amazon Lex V2 session for LLM memory support
Amazon Lex V2 initiates a session when a user interacts with a bot. A session persists over time unless manually stopped or timed out. A session stores metadata and application-specific data known as session attributes. Amazon Lex updates client applications when the Lambda function adds or changes session attributes. The QnABot includes an interface to set and get session attributes on top of Amazon Lex V2.
In our code, we used this mechanism to build a custom memory class in LangChain to keep track of the conversation history and enable the LLM to recall short-term and long-term interactions. See the following code:
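The following is a minimal sketch of the session-attribute plumbing around that memory class; the attribute key names are assumptions.

```python
# Sketch: read the accumulated history from the Lex V2 session attributes before
# calling the LLM, and append the new turn afterward so the next request sees it.
def get_session_attributes(lex_event):
    return lex_event.get("sessionState", {}).get("sessionAttributes") or {}

def append_turn(session_attributes, user_input, llm_response):
    history = session_attributes.get("conversation_history", "")
    history += f"\nHuman: {user_input}\nAI: {llm_response}"
    session_attributes["conversation_history"] = history
    return session_attributes
```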
Prerequisites
To get started with the deployment, you need to fulfill the following prerequisites:
- Access to the AWS Management Console via a user who can launch AWS CloudFormation stacks
- Familiarity navigating the Lambda and Amazon Lex consoles
Deploy the solution
To deploy the solution, proceed with the following steps:
- Choose Launch Stack to launch the solution in the
us-east-1
Region: - For Stack name, enter a unique stack name.
- For HFModel, we use the
Hugging Face Flan-T5-XL
model available on JumpStart. - For HFTask, enter
text2text
. - Keep S3BucketName as is.
These are used to find Amazon Simple Storage Service (Amazon S3) assets needed to deploy the solution and may change as updates to this post are published.
- Acknowledge the capabilities.
- Choose Create stack.
There should be four successfully created stacks.
Configure the Amazon Lex V2 bot
No configuration is needed for the Amazon Lex V2 bot; our CloudFormation template already did the heavy lifting.
Configure the QnABot
We assume you already have an existing QnABot deployed in your environment. But if you need help, follow these instructions to deploy it.
- On the AWS CloudFormation console, navigate to the main stack that you deployed.
- On the Outputs tab, make a note of the
LambdaHookFunctionArn
because you need to insert it in the QnABot later.
- Log in to the QnABot Designer User Interface (UI) as an administrator.
- In the Questions UI, add a new question.
- Enter the following values:
- ID –
CustomNoMatches
- Question –
no_hits
- Answer – Any default answer for “don’t know”
- ID –
- Choose Advanced and go to the Lambda Hook section.
- Enter the Amazon Resource Name (ARN) of the Lambda function you noted previously.
- Scroll down to the bottom of the section and choose Create.
You get a window with a success message.
Your question is now visible on the Questions page.
Test the solution
Let’s proceed with testing the solution. First, it’s worth mentioning that we deployed the FLAN-T5-XL model provided by JumpStart without any fine-tuning. This may have some unpredictability, resulting in slight variations in responses.
Test with an Amazon Lex V2 bot
This section helps you test the Amazon Lex V2 bot integration with the Lambda function that calls the LLM deployed in the SageMaker endpoint.
- On the Amazon Lex console, navigate to the bot entitled
Sagemaker-Jumpstart-Flan-LLM-Fallback-Bot
.
This bot has been configured to call the Lambda function that invokes the SageMaker endpoint hosting the LLM as a fallback intent when no other intents are matched. - Choose Intents in the navigation pane.
On the top right, a message reads, “English (US) has not built changes.”
- Choose Build.
- Wait for it to complete.
Finally, you get a success message, as shown in the following screenshot.
- Choose Test.
A chat window appears where you can interact with the model.
We recommend exploring the built-in integrations between Amazon Lex bots and Amazon Connect, as well as messaging platforms (Facebook, Slack, Twilio SMS) and third-party contact centers using Amazon Chime SDK and Genesys Cloud, for example.
Test with a QnABot instance
This section tests the QnABot on AWS integration with the Lambda function that calls the LLM deployed in the SageMaker endpoint.
- Open the tools menu in the top left corner.
- Choose QnABot Client.
- Choose Sign In as Admin.
- Enter any question in the user interface.
- Evaluate the response.
Clean up
To avoid incurring future charges, delete the resources created by our solution by following these steps:
- On the AWS CloudFormation console, select the stack named
SagemakerFlanLLMStack
(or the custom name you set to the stack). - Choose Delete.
- If you deployed the QnABot instance for your tests, select the QnABot stack.
- Choose Delete.
Conclusion
In this post, we explored the addition of open-domain capabilities to a task-oriented bot that routes the user requests to an open-source large language model.
We encourage you to:
- Save the conversation history to an external persistence mechanism. For example, you can save the conversation history to Amazon DynamoDB or an S3 bucket and retrieve it in the Lambda function hook. In this way, you don’t need to rely on the internal non-persistent session attributes management offered by Amazon Lex.
- Experiment with summarization – In multiturn conversations, it’s helpful to generate a summary that you can use in your prompts to add context and limit the usage of conversation history. This helps to prune the bot session size and keep the Lambda function memory consumption low.
- Experiment with prompt variations – Modify the original prompt description that matches your experimentation purposes.
- Adapt the language model for optimal results – You can do this by fine-tuning the advanced LLM parameters such as randomness (
temperature
) and determinism (top_p
) according to your applications. We demonstrated a sample integration using a pre-trained model with sample values, but have fun adjusting the values for your use cases.
In our next post, we plan to help you discover how to fine-tune pre-trained LLM-powered chatbots with your own data.
Are you experimenting with LLM chatbots on AWS? Tell us more in the comments!
Resources and references
- Companion source code for this post
- Amazon Lex V2 Developer Guide
- AWS Solutions Library: QnABot on AWS
- Text2Text Generation with FLAN T5 models
- LangChain – Building applications with LLMs
- Amazon SageMaker Examples with Jumpstart Foundation Models
- Amazon Bedrock – The easiest way to build and scale generative AI applications with foundation models
- Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models
About the Authors
Marcelo Silva is an experienced tech professional who excels in designing, developing, and implementing cutting-edge products. Starting off his career at Cisco, Marcelo worked on various high-profile projects including deployments of the first ever carrier routing system and the successful rollout of ASR9000. His expertise extends to cloud technology, analytics, and product management, having served as senior manager for several companies like Cisco, Cape Networks, and AWS before joining GenAI. Currently working as a Conversational AI/GenAI Product Manager, Marcelo continues to excel in delivering innovative solutions across industries.
Victor Rojo is a highly experienced technologist who is passionate about the latest in AI, ML, and software development. With his expertise, he played a pivotal role in bringing Amazon Alexa to the US and Mexico markets while spearheading the successful launch of Amazon Textract and AWS Contact Center Intelligence (CCI) to AWS Partners. As the current Principal Tech Leader for the Conversational AI Competency Partners program, Victor is committed to driving innovation and bringing cutting-edge solutions to meet the evolving needs of the industry.
Justin Leto is a Sr. Solutions Architect at Amazon Web Services with a specialization in machine learning. His passion is helping customers harness the power of machine learning and AI to drive business growth. Justin has presented at global AI conferences, including AWS Summits, and lectured at universities. He leads the NYC machine learning and AI meetup. In his spare time, he enjoys offshore sailing and playing jazz. He lives in New York City with his wife and baby daughter.
Ryan Gomes is a Data & ML Engineer with the AWS Professional Services Intelligence Practice. He is passionate about helping customers achieve better outcomes through analytics and machine learning solutions in the cloud. Outside work, he enjoys fitness, cooking, and spending quality time with friends and family.
Mahesh Birardar is a Sr. Solutions Architect at Amazon Web Services with specialization in DevOps and Observability. He enjoys helping customers implement cost-effective architectures that scale. Outside work, he enjoys watching movies and hiking.
Kanjana Chandren is a Solutions Architect at Amazon Web Services (AWS) who is passionate about Machine Learning. She helps customers in designing, implementing and managing their AWS workloads. Outside of work she loves travelling, reading and spending time with family and friends.
Introducing popularity tuning for Similar-Items in Amazon Personalize
Amazon Personalize now enables popularity tuning for its Similar-Items recipe (aws-similar-items
). Similar-Items generates recommendations that are similar to the item that a user selects, helping users discover new items in your catalog based on the previous behavior of all users and item metadata. Previously, this capability was only available for SIMS, the other Related_Items
recipe within Amazon Personalize.
Every customer’s item catalog and the way that users interact with it are unique to their business. When recommending similar items, some customers may want to place more emphasis on popular items because they increase the likelihood of user interaction, while others may want to de-emphasize popular items to surface recommendations that are more similar to the selected item but are less widely known. This launch gives you more control over the degree to which popularity influences Similar-Items recommendations, so you can tune the model to meet your particular business needs.
In this post, we show you how to tune popularity for the Similar-Items recipe. We specify a value closer to zero to include more popular items, and specify a value closer to 1 to place less emphasis on popularity.
Example use cases
To explore the impact of this new feature in greater detail, let’s review two examples. [1]
First, we used the Similar-Items recipe to find recommendations similar to Disney’s 1994 movie The Lion King (IMDB record). When the popularity discount is set to 0, Amazon Personalize recommends movies that have a high frequency of occurrence (are popular). In this example, the movie Seven (a.k.a. Se7en), which occurred 19,295 times in the dataset, is recommended at rank 3.0.
By tuning the popularity discount to a value of 0.4 for The Lion King recommendations, we see that the rank of the movie Seven drops to 4.0. We also see movies from the Children genre like Babe, Beauty and the Beast, Aladdin, and Snow White and the Seven Dwarfs get recommended at a higher rank despite their lower overall popularity in the dataset.
Let’s explore another example. We used the Similar-Items recipe to find recommendations similar to Disney and Pixar’s 1995 movie Toy Story (IMDB record). When the popularity discount is set to 0, Amazon Personalize recommends movies that have a high frequency of occurrence in the dataset. In this example, we see that the movie Twelve Monkeys (a.k.a. 12 Monkeys), which occurred 6,678 times in the dataset, is recommended at rank 5.0.
By tuning the popularity discount to a value of 0.4 for Toy Story recommendations, we see that Twelve Monkeys is no longer recommended in the top 10. We also see movies from the Children genre like Aladdin, Toy Story 2, and A Bug’s Life get recommended at a higher rank despite their lower overall popularity in the dataset.
Placing greater emphasis on more popular content can help increase the likelihood that users will engage with item recommendations. Reducing emphasis on popularity may surface recommendations that seem more relevant to the queried item, but may be less popular with users. You can tune the degree of importance placed on popularity to meet your business needs for a specific personalization campaign.
Implement popularity tuning
To tune popularity for the Similar-Items recipe, configure the popularity_discount_factor hyperparameter via the AWS Management Console, the AWS SDKs, or the AWS Command Line Interface (AWS CLI).
The following is sample code setting the popularity discount factor to 0.5 via the AWS SDK:
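The snippet below is a minimal sketch using boto3’s create_solution call, passing popularity_discount_factor as an algorithm hyperparameter. The solution name and dataset group ARN are placeholders for illustration.

```python
import boto3

personalize = boto3.client("personalize")

# Create a Similar-Items solution with popularity tuning.
# The solution name and dataset group ARN below are placeholders.
response = personalize.create_solution(
    name="similar-items-popularity-tuned",
    datasetGroupArn="arn:aws:personalize:us-east-1:111122223333:dataset-group/movies",
    recipeArn="arn:aws:personalize:::recipe/aws-similar-items",
    solutionConfig={
        "algorithmHyperParameters": {
            # 0 keeps full emphasis on popularity; values closer to 1 discount it.
            "popularity_discount_factor": "0.5"
        }
    },
)
print(response["solutionArn"])
```

You would then create a solution version and campaign as usual; only the ranking behavior of the resulting recommendations changes.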
The following screenshot shows setting the popularity discount factor to 0.3 on the Amazon Personalize console.
Conclusion
With popularity tuning, you can now further refine the Similar-Items recipe within Amazon Personalize to control the degree to which popularity influences item recommendations. This gives you greater control over defining the end-user experience and what is included or excluded in your Similar-Items recommendations.
For more details on how to implement popularity tuning for the Similar-Items recipe, refer to the documentation.
References
[1] Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
About the Authors
Julia McCombs Clark is a Sr. Technical Product Manager on the Amazon Personalize team.
Nihal Harish is a Software Development Engineer on the Amazon Personalize team.
Accelerate PyTorch with DeepSpeed to train large language models with Intel Habana Gaudi-based DL1 EC2 instances
Training large language models (LLMs) with billions of parameters can be challenging. In addition to designing the model architecture, researchers need to set up state-of-the-art training techniques for distributed training like mixed precision support, gradient accumulation, and checkpointing. With large models, the training setup is even more challenging because the available memory in a single accelerator device bounds the size of models trained using only data parallelism, and using model parallel training requires an additional level of modification to the training code. Libraries such as DeepSpeed (an open-source deep learning optimization library for PyTorch) address some of these challenges and can help accelerate model development and training.
In this post, we set up training on the Intel Habana Gaudi-based Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and quantify the benefits of using a scaling framework such as DeepSpeed. We present scaling results for an encoder-type transformer model (BERT with 340 million to 1.5 billion parameters). For the 1.5-billion-parameter model, we achieved a scaling efficiency of 82.7% across 128 accelerators (16 dl1.24xlarge instances) using DeepSpeed ZeRO stage 1 optimizations. The optimizer states were partitioned by DeepSpeed to train large models using the data parallel paradigm. This approach has been extended to train a 5-billion-parameter model using data parallelism. We also used Gaudi’s native support of the BF16 data type for reduced memory size and increased training performance compared to using the FP32 data type. As a result, we achieved pre-training (phase 1) model convergence within 16 hours (our target was to train a large model within a day) for the BERT 1.5-billion-parameter model using the wikicorpus-en dataset.
Training setup
We provisioned a managed compute cluster comprising 16 dl1.24xlarge instances using AWS Batch. We developed an AWS Batch workshop that illustrates the steps to set up the distributed training cluster with AWS Batch. Each dl1.24xlarge instance has eight Habana Gaudi accelerators, each with 32 GB of memory and a full mesh RoCE network between cards, with a total bi-directional interconnect bandwidth of 700 Gbps per card (see Amazon EC2 DL1 instances Deep Dive for more information). The dl1.24xlarge cluster also used four AWS Elastic Fabric Adapters (EFA), with a total of 400 Gbps interconnect between nodes.
The distributed training workshop illustrates the steps to set up the distributed training cluster. The workshop shows the distributed training setup using AWS Batch and in particular, the multi-node parallel jobs feature to launch large-scale containerized training jobs on fully managed clusters. More specifically, a fully managed AWS Batch compute environment is created with DL1 instances. The containers are pulled from Amazon Elastic Container Registry (Amazon ECR) and launched automatically into the instances in the cluster based on the multi-node parallel job definition. The workshop concludes by running a multi-node, multi-HPU data parallel training of a BERT (340 million to 1.5 billion parameters) model using PyTorch and DeepSpeed.
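The multi-node parallel job definition is the piece that tells AWS Batch to span a single training job across instances. The following is a rough boto3 sketch, not the workshop’s actual definition: the image URI, resource sizes, and command are placeholders, and the real definition also configures the Habana devices and EFA networking.

```python
import boto3

batch = boto3.client("batch")

# Register a multi-node parallel job definition spanning 16 DL1 instances.
# Image URI, resource sizes, and command are illustrative placeholders.
batch.register_job_definition(
    jobDefinitionName="bert-deepspeed-multinode",
    type="multinode",
    nodeProperties={
        "numNodes": 16,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:15",
                "container": {
                    "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/bert-deepspeed:latest",
                    "command": ["bash", "run_pretraining.sh"],
                    "resourceRequirements": [
                        {"type": "VCPU", "value": "96"},
                        {"type": "MEMORY", "value": "760000"},
                    ],
                },
            }
        ],
    },
)
```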
BERT 1.5B pre-training with DeepSpeed
Habana SynapseAI v1.5 and v1.6 support DeepSpeed ZeRO1 optimizations. The Habana fork of the DeepSpeed GitHub repository includes the modifications necessary to support the Gaudi accelerators. There is full support of distributed data parallel (multi-card, multi-instance), ZeRO1 optimizations, and BF16 data types.
All these features are enabled on the BERT 1.5B model reference repository, which introduces a 48-layer, 1600-hidden dimension, and 25-head bi-directional encoder model, derived from a BERT implementation. The repository also contains the baseline BERT Large model implementation: a 24-layer, 1024-hidden, 16-head, 340-million-parameter neural network architecture. The pre-training modeling scripts are derived from the NVIDIA Deep Learning Examples repository to download the wikicorpus_en data, preprocess the raw data into tokens, and shard the data into smaller h5 datasets for distributed data parallel training. You can adopt this generic approach to train your own PyTorch model architectures using your datasets on DL1 instances.
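As a rough illustration of how these pieces fit together, the following is a minimal DeepSpeed sketch, not the repository’s actual code, enabling ZeRO stage 1 and BF16. The configuration keys follow upstream DeepSpeed’s schema and the Habana fork may differ; the stand-in model is arbitrary, the batch and learning-rate values mirror the settings described in the scaling results below, and in practice the script is launched with the DeepSpeed launcher across the cluster.

```python
import torch
import deepspeed

# Illustrative DeepSpeed configuration: ZeRO stage 1 partitions optimizer
# states across data-parallel workers, and BF16 reduces memory footprint.
# Micro-batch 16 with 24 accumulation steps gives the 384 effective batch
# size per accelerator discussed later in this post.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 24,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 1.5e-3}},
}

# Stand-in model; in the workshop this is the 48-layer BERT 1.5B encoder.
model = torch.nn.Sequential(
    torch.nn.Linear(1600, 6400),
    torch.nn.GELU(),
    torch.nn.Linear(6400, 1600),
)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Training loop sketch: DeepSpeed handles gradient accumulation and the
# optimizer step behind backward() and step().
# for batch in dataloader:
#     loss = compute_loss(model_engine, batch)
#     model_engine.backward(loss)
#     model_engine.step()
```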
Pre-training (phase 1) scaling results
For pre-training large models at scale, we mainly focused on two aspects of the solution: training performance, as measured by the time to train, and cost-effectiveness of arriving at a fully converged solution. Next, we dive deeper into these two metrics with BERT 1.5B pre-training as an example.
Scaling performance and time to train
We start by measuring the performance of the BERT Large implementation as a baseline for scalability. The following table lists the measured throughput of sequences per second from 1-8 dl1.24xlarge instances (with eight accelerator devices per instance). Using the single-instance throughput as baseline, we measured the efficiency of scaling across multiple instances, which is an important lever to understand the price-performance training metric.
Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
1 | 8 | 1,379.76 | 172.47 | 100.0% |
2 | 16 | 2,705.57 | 169.10 | 98.04% |
4 | 32 | 5,291.58 | 165.36 | 95.88% |
8 | 64 | 9,977.54 | 155.90 | 90.39% |
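As a quick check of how the last column is derived, scaling efficiency is the per-accelerator throughput at a given scale divided by the single-instance per-accelerator baseline, using the figures from the preceding table:

```python
# Per-accelerator throughput from the table above.
baseline_per_accel = 1379.76 / 8      # 1 instance, 8 accelerators -> 172.47
eight_inst_per_accel = 9977.54 / 64   # 8 instances, 64 accelerators -> 155.90

scaling_efficiency = eight_inst_per_accel / baseline_per_accel
print(f"{scaling_efficiency:.2%}")    # ~90.39%, matching the table
```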
The following figure illustrates the scaling efficiency.
For BERT 1.5B, we modified the hyperparameters for the model in the reference repository to guarantee convergence. The effective batch size per accelerator was set to 384 (for maximum memory utilization), with micro-batches of 16 per step and 24 steps of gradient accumulation. Learning rates of 0.0015 and 0.003 were used for 8 and 16 nodes, respectively. With these configurations, we achieved convergence of the phase 1 pre-training of BERT 1.5B across 8 dl1.24xlarge instances (64 accelerators) in approximately 25 hours, and 15 hours across 16 dl1.24xlarge instances (128 accelerators). The following figure shows the average loss as a function of number of training epochs, as we scale up the number of accelerators.
With the configuration described earlier, we obtained 85% strong scaling efficiency with 64 accelerators and 83% with 128 accelerators, from a baseline of 8 accelerators in a single instance. The following table summarizes the parameters.
Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
1 | 8 | 276.66 | 34.58 | 100.0% |
8 | 64 | 1,883.63 | 29.43 | 85.1% |
16 | 128 | 3,659.15 | 28.59 | 82.7% |
The following figure illustrates the scaling efficiency.
Conclusion
In this post, we evaluated support for DeepSpeed by Habana SynapseAI v1.5/v1.6 and how it helps scale LLM training on Habana Gaudi accelerators. Pre-training of a 1.5-billion-parameter BERT model took 16 hours to converge on a cluster of 128 Gaudi accelerators, with 85% strong scaling. We encourage you to take a look at the architecture demonstrated in the AWS workshop and consider adopting it to train custom PyTorch model architectures using DL1 instances.
About the authors
Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning, building, and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.
RJ is an engineer in Search M5 team leading the efforts for building large scale deep learning systems for training and inference. Outside of work he explores different cuisines of food and plays racquet sports.
Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.
Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real time inference.
Pierre-Yves Aquilanti is Head of Frameworks ML Solutions at Amazon Web Services where he helps develop the industry’s best cloud based ML Frameworks solutions. His background is in High Performance Computing and prior to joining AWS, Pierre-Yves was working in the Oil & Gas industry. Pierre-Yves is originally from France and holds a Ph.D. in Computer Science from the University of Lille.
Retrain ML models and automate batch predictions in Amazon SageMaker Canvas using updated datasets
You can now retrain machine learning (ML) models and automate batch prediction workflows with updated datasets in Amazon SageMaker Canvas, thereby making it easier to continuously learn, improve model performance, and drive efficiency. An ML model’s effectiveness depends on the quality and relevance of the data it’s trained on. As time progresses, the underlying patterns, trends, and distributions in the data may change. By updating the dataset, you ensure that the model learns from the most recent and representative data, thereby improving its ability to make accurate predictions. Canvas now supports updating datasets automatically and manually, enabling you to use the latest version of your tabular, image, and document datasets for training ML models.
After the model is trained, you may want to run predictions on it. Running batch predictions on an ML model enables processing multiple data points simultaneously instead of making predictions one by one. Automating this process provides efficiency, scalability, and timely decision-making. After the predictions are generated, they can be further analyzed, aggregated, or visualized to gain insights, identify patterns, or make informed decisions based on the predicted outcomes. Canvas now supports setting up an automated batch prediction configuration and associating a dataset to it. When the associated dataset is refreshed, either manually or on a schedule, a batch prediction workflow will be triggered automatically on the corresponding model. Results of the predictions can be viewed inline or downloaded for later review.
In this post, we show how to retrain ML models and automate batch predictions using updated datasets in Canvas.
Overview of solution
For our use case, we play the part of a business analyst for an ecommerce company. Our product team wants us to determine the most critical metrics that influence a shopper’s purchase decision. For this, we train an ML model in Canvas with a customer website online session dataset from the company. We evaluate the model’s performance and, if needed, retrain the model with additional data to see whether it improves the performance of the existing model. To do so, we use the auto update dataset capability in Canvas and retrain our existing ML model with the latest version of the training dataset. Then we configure automatic batch prediction workflows—when the corresponding prediction dataset is updated, it automatically triggers the batch prediction job on the model and makes the results available for us to review.
The workflow steps are as follows:
- Upload the downloaded customer website online session data to Amazon Simple Storage Service (Amazon S3) and create a new training dataset in Canvas. For the full list of supported data sources, refer to Importing data in Amazon SageMaker Canvas.
- Build ML models and analyze their performance metrics. Refer to the steps on how to build a custom ML Model in Canvas and evaluate a model’s performance.
- Set up auto update on the existing training dataset and upload new data to the Amazon S3 location backing this dataset. Upon completion, it should create a new dataset version.
- Use the latest version of the dataset to retrain the ML model and analyze its performance.
- Set up automatic batch predictions on the better performing model version and view the prediction results.
You can perform these steps in Canvas without writing a single line of code.
Overview of data
The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period. The following table outlines the data schema.
Column Name | Data Type | Description |
Administrative | Numeric | Number of pages visited by the user for user account management-related activities. |
Administrative_Duration | Numeric | Amount of time spent in this category of pages. |
Informational | Numeric | Number of pages of this type (informational) that the user visited. |
Informational_Duration | Numeric | Amount of time spent in this category of pages. |
ProductRelated | Numeric | Number of pages of this type (product related) that the user visited. |
ProductRelated_Duration | Numeric | Amount of time spent in this category of pages. |
BounceRates | Numeric | Percentage of visitors who enter the website through that page and exit without triggering any additional tasks. |
ExitRates | Numeric | Average exit rate of the pages visited by the user. This is the percentage of people who left your site from that page. |
Page Values | Numeric | Average page value of the pages visited by the user. This is the average value for a page that a user visited before landing on the goal page or completing an ecommerce transaction (or both). |
SpecialDay | Binary | The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (such as Mother’s Day or Valentine’s Day) in which the sessions are more likely to be finalized with a transaction. |
Month | Categorical | Month of the visit. |
OperatingSystems | Categorical | Operating system of the visitor. |
Browser | Categorical | Browser used by the user. |
Region | Categorical | Geographic region from which the session was started by the visitor. |
TrafficType | Categorical | Traffic source through which the user entered the website. |
VisitorType | Categorical | Whether the customer is a new user, returning user, or other. |
Weekend | Binary | Whether the customer visited the website on the weekend. |
Revenue | Binary | Whether a purchase was made. |
Revenue is the target column, which will help us predict whether a shopper will make a purchase.
The first step is to download the dataset that we will use. Note that this dataset is courtesy of the UCI Machine Learning Repository.
Prerequisites
For this walkthrough, complete the following prerequisite steps:
- Split the downloaded CSV that contains 20,000 rows into multiple smaller chunk files (see the sketch after this list).
This is so that we can showcase the dataset update functionality. Ensure all the CSV files have the same headers; otherwise, you may run into schema mismatch errors while creating a training dataset in Canvas.
- Create an S3 bucket and upload online_shoppers_intentions1-3.csv to the S3 bucket.
- Set aside 1,500 rows from the downloaded CSV to run batch predictions on after the ML model is trained.
- Remove the Revenue column from these files so that when you run batch prediction on the ML model, that is the value your model will predict.
Ensure all the predict*.csv files have the same headers; otherwise, you may run into schema mismatch errors while creating a prediction (inference) dataset in Canvas.
- Perform the necessary steps to set up a SageMaker domain and Canvas app.
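The following is a minimal sketch of this data preparation in pandas. It assumes the downloaded UCI file is named online_shoppers_intention.csv; the chunk size is arbitrary, and the output file names follow the conventions used in this post.

```python
import pandas as pd

# Assumes the UCI download is named online_shoppers_intention.csv.
df = pd.read_csv("online_shoppers_intention.csv")

# Hold out 1,500 rows for batch predictions and drop the target column,
# since Revenue is the value the model will predict.
holdout = df.tail(1500).drop(columns=["Revenue"])
holdout.to_csv("predict1.csv", index=False)

# Split the remaining rows into smaller chunks that keep the same header,
# so they can be uploaded to Amazon S3 one batch at a time.
train_df = df.iloc[:-1500]
chunk_size = 3000  # illustrative
for i, start in enumerate(range(0, len(train_df), chunk_size), start=1):
    train_df.iloc[start:start + chunk_size].to_csv(
        f"online_shoppers_intentions{i}.csv", index=False
    )
```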
Create a dataset
To create a dataset in Canvas, complete the following steps:
- In Canvas, choose Datasets in the navigation pane.
- Choose Create and choose Tabular.
- Give your dataset a name. For this post, we call our training dataset OnlineShoppersIntentions.
- Choose Create.
- Choose your data source (for this post, our data source is Amazon S3).
Note that as of this writing, the dataset update functionality is only supported for Amazon S3 and locally uploaded data sources.
- Select the corresponding bucket and upload the CSV files for the dataset.
You can now create a dataset with multiple files.
- Preview all the files in the dataset and choose Create dataset.
We now have version 1 of the OnlineShoppersIntentions dataset, created with three files.
- Choose the dataset to view the details.
The Data tab shows a preview of the dataset.
- Choose Dataset details to view the files that the dataset contains.
The Dataset files pane lists the available files.
- Choose the Version History tab to view all the versions for this dataset.
We can see our first dataset version has three files. Any subsequent version will include all the files from previous versions and will provide a cumulative view of the data.
Train an ML model with version 1 of the dataset
Let’s train an ML model with version 1 of our dataset.
- In Canvas, choose My models in the navigation pane.
- Choose New model.
- Enter a model name (for example, OnlineShoppersIntentionsModel), select the problem type, and choose Create.
- Select the dataset. For this post, we select the OnlineShoppersIntentions dataset.
By default, Canvas will pick up the most current dataset version for training.
- On the Build tab, choose the target column to predict. For this post, we choose the Revenue column.
- Choose Quick build.
The model training will take 2–5 minutes to complete. In our case, the trained model gives us a score of 89%.
Set up automatic dataset updates
Let’s update our dataset using the auto update functionality, bring in more data, and see if the model performance improves with the new version of the dataset. Datasets can also be updated manually.
- On the Datasets page, select the OnlineShoppersIntentions dataset and choose Update dataset.
- You can either choose Manual update, which is a one-time update option, or Automatic update, which allows you to automatically update your dataset on a schedule. For this post, we showcase the automatic update feature.
You’re redirected to the Auto update tab for the corresponding dataset. We can see that Enable auto update is currently disabled.
- Toggle Enable auto update to on and specify the data source (as of this writing, Amazon S3 data sources are supported for auto updates).
- Select a frequency and enter a start time.
- Save the configuration settings.
An auto update dataset configuration has been created. It can be edited at any time. When a corresponding dataset update job is triggered on the specified schedule, the job will appear in the Job history section.
- Next, let’s upload the online_shoppers_intentions4.csv, online_shoppers_intentions5.csv, and online_shoppers_intentions6.csv files to our S3 bucket.
We can view our files in the dataset-update-demo S3 bucket.
The dataset update job will get triggered at the specified schedule and create a new version of the dataset.
When the job is complete, dataset version 2 will have all the files from version 1 and the additional files processed by the dataset update job. In our case, version 1 has three files and the update job picked up three additional files, so the final dataset version has six files.
We can view the new version that was created on the Version history tab.
The Data tab contains a preview of the dataset and provides a list of all the files in the latest version of the dataset.
Retrain the ML model with an updated dataset
Let’s retrain our ML model with the latest version of the dataset.
- On the My models page, choose your model.
- Choose Add version.
- Select the latest dataset version (v2 in our case) and choose Select dataset.
- Keep the target column and build configuration similar to the previous model version.
When the training is complete, let’s evaluate the model performance. The following screenshot shows that adding additional data and retraining our ML model has helped improve our model performance.
Create a prediction dataset
With an ML model trained, let’s create a dataset for predictions and run batch predictions on it.
- On the Datasets page, create a tabular dataset.
- Enter a name and choose Create.
- In our S3 bucket, upload one file with 500 rows to predict.
Next, we set up auto updates on the prediction dataset.
- Toggle Enable auto update to on and specify the data source.
- Select the frequency and specify a starting time.
- Save the configuration.
Automate the batch prediction workflow on an auto updated predictions dataset
In this step, we configure our auto batch prediction workflows.
- On the My models page, navigate to version 2 of your model.
- On the Predict tab, choose Batch prediction and Automatic.
- Choose Select dataset to specify the dataset to generate predictions on.
- Select the predict dataset that we created earlier and choose Choose dataset.
- Choose Set up.
We now have an automatic batch prediction workflow. This will be triggered when the predict dataset is automatically updated.
Now let’s upload more CSV files to the predict S3 folder.
This operation will trigger an auto update of the predict dataset.
This will in turn trigger the automatic batch prediction workflow and generate predictions for us to view.
We can view all automations on the Automations page.
Thanks to the automatic dataset update and automatic batch prediction workflows, we can use the latest version of the tabular, image, and document dataset for training ML models, and build batch prediction workflows that get automatically triggered on every dataset update.
Clean up
To avoid incurring future charges, log out of Canvas. Canvas bills you for the duration of the session, and we recommend logging out of Canvas when you’re not using it. Refer to Logging out of Amazon SageMaker Canvas for more details.
Conclusion
In this post, we discussed how we can use the new dataset update capability to build new dataset versions and train our ML models with the latest data in Canvas. We also showed how we can efficiently automate the process of running batch predictions on updated data.
To start your low-code/no-code ML journey, refer to the Amazon SageMaker Canvas Developer Guide.
Special thanks to everyone who contributed to the launch.
About the Authors
Janisha Anand is a Senior Product Manager on the SageMaker No/Low-Code ML team, which includes SageMaker Canvas and SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.
Prashanth is a Software Development Engineer at Amazon SageMaker and mainly works with SageMaker low-code and no-code products.
Esha Dutta is a Software Development Engineer at Amazon SageMaker. She focuses on building ML tools and products for customers. Outside of work, she enjoys the outdoors, yoga, and hiking.
Expedite the Amazon Lex chatbot development lifecycle with Test Workbench
Amazon Lex is excited to announce Test Workbench, a new bot testing solution that provides tools to simplify and automate the bot testing process. During bot development, testing is the phase where developers check whether a bot meets the specific requirements, needs, and expectations by identifying errors, defects, or bugs in the system before scaling. Testing helps validate bot performance on several fronts, such as conversational flow (understanding user queries and responding accurately), intent overlap handling, and consistency across modalities. However, testing is often manual, error-prone, and non-standardized. Test Workbench standardizes automated test management by allowing chatbot development teams to generate, maintain, and execute test sets with a consistent methodology and avoid custom scripting and ad hoc integrations. In this post, you will learn how Test Workbench streamlines automated testing of a bot’s voice and text modalities and provides accuracy and performance measures for parameters such as audio transcription, intent recognition, and slot resolution for both single utterance inputs and multi-turn conversations. This allows you to quickly identify bot improvement areas and maintain a consistent baseline to measure accuracy over time and observe any accuracy regression due to bot updates.
Amazon Lex is a fully managed service for building conversational voice and text interfaces. Amazon Lex helps you build and deploy chatbots and virtual assistants on websites, contact center services, and messaging channels. Amazon Lex bots help increase interactive voice response (IVR) productivity, automate simple tasks, and drive operational efficiencies across the organization. Test Workbench for Amazon Lex standardizes and simplifies the bot testing lifecycle, which is critical to improving bot design.
Features of Test Workbench
Test Workbench for Amazon Lex includes the following features:
- Generate test datasets automatically from a bot’s conversation logs
- Upload manually built test set baselines
- Perform end-to-end testing of single input or multi-turn conversations
- Test both audio and text modalities of a bot
- Review aggregated and drill-down metrics for bot dimensions:
- Speech transcription
- Intent recognition
- Slot resolution (including multi-valued slots or composite slots)
- Context tags
- Session attributes
- Request attributes
- Runtime hints
- Time delay in seconds
Prerequisites
To test this feature, you should have the following:
- An AWS account with administrator access
- A sample retail bot imported via the Amazon Lex console (for more information, refer to importing a bot)
- A test set source, either from:
- Conversation logs enabled for the bot to store bot interactions, or
- A sample retail test set that can be imported following the instructions provided in this post
In addition, you should have knowledge and understanding of the following services and features:
Create a test set
To create your test set, complete the following steps:
- On the Amazon Lex console, under Test workbench in the navigation pane, choose Test sets.
You can review a list of existing test sets, including basic information such as name, description, number of test inputs, modality, and status. In the following steps, you can choose between generating a test set from the conversation logs associated with the bot or uploading an existing manually built test set in a CSV file format.
- Choose Create test set.
- Generating test sets from conversation logs allows you to do the following:
- Include real multi-turn conversations from the bot’s logs in CloudWatch
- Include audio logs and conduct tests that account for real speech nuances, background noises, and accents
- Speed up the creation of test sets
- Uploading a manually built test set allows you to do the following:
- Test new bots for which there is no production data
- Perform regression tests on existing bots for any new or modified intents, slots, and conversation flows
- Test carefully crafted and detailed scenarios that specify session attributes and request attributes
To generate a test set, complete the following steps. To upload a manually built test set, skip to step 7.
- Choose Generate a baseline test set.
- Choose your options for Bot name, Bot alias, and Language.
- For Time range, set a time range for the logs.
- For Existing IAM role, choose a role.
Ensure that the IAM role is able to grant you access to retrieve information from the conversation logs. Refer to Creating IAM roles to create an IAM role with the appropriate policy.
- If you prefer to use a manually created test set, select Upload a file to this test set.
- For Upload a file to this test set, choose from the following options:
- Select Upload from S3 bucket to upload a CSV file from an Amazon Simple Storage Service (Amazon S3) bucket.
- Select Upload a file to this test set to upload a CSV file from your computer.
You can use the sample test set provided in this post. For more information about templates, choose the CSV Template link on the page.
- For Modality, select the modality of your test set, either Text or Audio.
Test Workbench provides testing support for audio and text input formats.
- For S3 location, enter the S3 bucket location where the results will be stored.
- Optionally, choose an AWS Key Management Service (AWS KMS) key to encrypt output transcripts.
- Choose Create.
Your newly created test set will be listed on the Test sets page with one of the following statuses:
- Ready for annotation – For test sets generated from Amazon Lex bot conversation logs, the annotation step serves as a manual gating mechanism to ensure quality test inputs. By annotating values for expected intents and expected slots for each test line item, you indicate the “ground truth” for that line. The test results from the bot run are collected and compared against the ground truth to mark test results as pass or fail. This line-level comparison then allows for creating aggregated measures.
- Ready for testing – This indicates that the test set is ready to be executed against an Amazon Lex bot.
- Validation error – Uploaded test files are checked for errors such as exceeding maximum supported length, invalid characters in intent names, or invalid Amazon S3 links containing audio files. If the test set is in the Validation error state, download the file showing the validation details to see test input issues or errors on a line-by-line basis. Once they are addressed, you can manually upload the corrected test set CSV into the test set.
Executing a test set
A test set is decoupled from a bot. The same test set can be executed against a different bot or bot alias in the future as your business use case evolves. To report performance metrics of a bot against the baseline test data, complete the following steps (a programmatic sketch using the AWS SDK follows these steps):
- Import the sample bot definition and build the bot (refer to Importing a bot for guidance).
- On the Amazon Lex console, choose Test sets in the navigation pane.
- Choose your validated test set.
Here you can review basic information about the test set and the imported test data.
- Choose Execute test.
- Choose the appropriate options for Bot name, Bot alias, and Language.
- For Test type, select Audio or Text.
- For Endpoint selection, select either Streaming or Non-streaming.
- Choose Validate discrepancy to validate your test dataset.
Before executing a test set, you can validate test coverage, including identifying intents and slots present in the test set but not in the bot. This early warning helps set tester expectations for unexpected test failures. If discrepancies between your test dataset and your bot are detected, the Execute test page will update with the View details button.
Intents and slots found in the test data set but not in the bot alias are listed as shown in the following screenshots.
- After you validate the discrepancies, choose Execute to run the test.
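If you prefer to start test runs programmatically rather than from the console, the AWS SDK exposes the Test Workbench operations through the lexv2-models client. The following sketch assumes the StartTestExecution parameters shown (a test set ID, a bot alias target, API mode, and modality); the IDs are placeholders, so confirm the exact shapes against the current SDK documentation.

```python
import boto3

lexv2 = boto3.client("lexv2-models")

# Start a test run against a specific bot alias and locale.
# All IDs below are placeholders; parameter shapes are assumptions to verify.
response = lexv2.start_test_execution(
    testSetId="EXAMPLETESTSET1",
    target={
        "botAliasTarget": {
            "botId": "EXAMPLEBOTID",
            "botAliasId": "TSTALIASID",
            "localeId": "en_US",
        }
    },
    apiMode="NonStreaming",
    testExecutionModality="Text",
)
print(response["testExecutionId"])
```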
Review results
The performance measures generated after executing a test set help you identify areas of bot design that need improvements and are useful for expediting bot development and delivery to support your customers. Test Workbench provides insights on intent classification and slot resolution in end-to-end conversation and single-line input level. The completed test runs are stored with timestamps in your S3 bucket, and can be used for future comparative reviews.
- On the Amazon Lex console, choose Test results in the navigation pane.
- Choose the test result ID for the results you want to review.
On the next page, the test results will include a breakdown of results organized in four main tabs: Overall results, Conversation results, Intent and slot results, and Detailed results.
Overall results
The Overall results tab contains three main sections:
- Test set input breakdown — A chart showing the total number of end-to-end conversations and single input utterances in the test set.
- Single input breakdown — A chart showing the number of passed or failed single inputs.
- Conversation breakdown — A chart showing the number of passed or failed multi-turn inputs.
For test sets run in audio modality, speech transcription charts are provided to show the number of passed or failed speech transcriptions on both single input and conversation types. In audio modality, a single input or multi-turn conversation could pass the speech transcription test, yet fail the overall end-to-end test. This can be caused, for instance, by a slot resolution or an intent recognition issue.
Conversation results
Test Workbench helps you drill down into conversation failures that can be attributed to specific intents or slots. The Conversation results tab is organized into three main areas, covering all intents and slots used in the test set:
- Conversation pass rates — A table used to visualize which intents and slots are responsible for possible conversation failures.
- Conversation intent failure metrics — A bar graph showing the top five worst performing intents in the test set, if any.
- Conversation slot failure metrics — A bar graph showing the top five worst performing slots in the test set, if any.
Intent and slot results
The Intent and slot results tab provides drill-down metrics for bot dimensions such as intent recognition and slot resolution.
- Intent recognition metrics — A table showing the intent recognition success rate.
- Slot resolution metrics — A table showing the slot resolution success rate.
Detailed results
You can access a detailed report of the executed test run on the Detailed results tab. A table is displayed to show the actual transcription, output intent, and slot values in a test set. The report can be downloaded as a CSV for further analysis.
The line-level output provides insights to help improve the bot design and boost accuracy. For instance, misrecognized or missed speech inputs such as branded words can be added to custom vocabulary of an intent or as utterances under an intent.
To further improve conversation design, refer to this post, which outlines best practices on using ML to create a bot that will delight your customers by accurately understanding them.
Conclusion
In this post, we presented the Test Workbench for Amazon Lex, a native capability that standardizes a chatbot automated testing process and allows developers and conversation designers to streamline and iterate quickly through bot design and development.
We look forward to hearing how you use this new functionality of Amazon Lex and welcome your feedback! For any questions, bugs, or feature requests, please reach out to us through AWS re:Post for Amazon Lex or your AWS Support contacts.
To learn more, see Amazon Lex FAQs and the Amazon Lex V2 Developer Guide.
About the authors
Sandeep Srinivasan is a Product Manager on the Amazon Lex team. As a keen observer of human behavior, he is passionate about customer experience. He spends his waking hours at the intersection of people, technology, and the future.
Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specializes in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and family.