Achieve low-latency hosting for decision tree-based ML models on NVIDIA Triton Inference Server on Amazon SageMaker

Machine learning (ML) model deployments can have very demanding performance and latency requirements for businesses today. Use cases such as fraud detection and ad placement are examples where milliseconds matter and are critical to business success. Strict service level agreements (SLAs) need to be met, and a typical request may require multiple steps such as preprocessing, data transformation, model selection logic, model aggregation, and postprocessing. At scale, this often means serving a huge volume of traffic while keeping latency low. Common design patterns include serial inference pipelines, ensembles (scatter-gather), and business logic workflows, which result in realizing the entire workflow of the request as a Directed Acyclic Graph (DAG). However, as workflows get more complex, this can lead to an increase in overall response times, which in turn can negatively impact the end-user experience and jeopardize business goals. Triton can address these use cases by composing multiple models into a pipeline with input and output tensors connected between them, helping you serve these workloads efficiently.

As you evaluate your goals in relation to ML model inference, many options can be considered, but few are as capable and proven as Amazon SageMaker with Triton Inference Server. SageMaker with Triton Inference Server has been a popular choice for many customers because it’s purpose-built to maximize throughput and hardware utilization with ultra-low (single-digit millisecond) inference latency. It supports a wide range of ML frameworks (including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT) and infrastructure backends, including NVIDIA GPUs, CPUs, and AWS Inferentia. Additionally, Triton Inference Server is integrated with SageMaker, a fully managed end-to-end ML service, providing real-time inference options for model hosting.

In this post, we walk through deploying a fraud detection ensemble workload to SageMaker with Triton Inference Server.

Solution overview

It’s essential for any project to have a list of requirements and an effort estimation in order to approximate the total cost of the project and the return on investment (ROI) that supports the organization’s decision. The same considerations apply when moving a workload to Triton.

Effort estimation is key in software development, and its measurement is often based on incomplete, uncertain, and noisy inputs. ML workloads are no different. Multiple factors affect an architecture for ML inference, including the following:

  • Client-side latency budget – This specifies the maximum acceptable client-side round-trip waiting time for an inference response, commonly expressed in percentiles. For workloads with a latency budget in the low tens of milliseconds or less, network transfers can become expensive, so hosting models at the edge may be a better fit.
  • Data payload size – The payload, often referred to as the message body, is the request data transmitted from the client to the model, as well as the response data transmitted from the model back to the client. The payload size often has a major impact on latency and should be taken into consideration.
  • Data format – This specifies how the payload is sent to the ML model. The format can be human-readable, such as JSON or CSV, but there are also binary formats, which are often compressed and smaller in size. This is a trade-off between compression overhead and transfer size: CPU cycles and latency are added to compress or decompress data in order to save bytes transferred over the network. This post shows how to utilize both JSON and binary formats.
  • Software stack and components required – A stack is a collection of components that operate together to support an ML application, including the operating system, runtimes, and software layers. Triton comes with built-in support for popular ML frameworks through components called backends, such as ONNX, TensorFlow, FIL, OpenVINO, native Python, and others. You can also author a custom backend for your own homegrown components. This post goes over an XGBoost model and its data preprocessing, which we migrate to the NVIDIA-provided FIL backend and the Python backend, respectively.

All these factors should play a vital part in evaluating how your workloads perform, but in this use case we focus on the work needed to move your ML models to be hosted in SageMaker with Triton Inference Server. Specifically, we use an example of a fraud detection ensemble composed of an XGBoost model with preprocessing logic written in Python.

NVIDIA Triton Inference Server

Triton Inference Server has been designed from the ground up to enable teams to deploy, run, and scale trained AI models from any framework on GPU- or CPU-based infrastructure. In addition, it has been optimized to offer high-performance inference at scale with features like dynamic batching, concurrent model execution, optimal model configuration, model ensembles, and support for streaming inputs.

The following diagram shows an example NVIDIA Triton ensemble pipeline.

Workloads should take into account the capabilities that Triton provides along with SageMaker hosting to maximize the benefits offered. For example, Triton supports both HTTP and gRPC protocols as well as a C API, which allow for flexibility as well as payload optimization when needed. As previously mentioned, Triton supports several popular frameworks out of the box, including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT. These frameworks are supported through Triton backends, and in the rare event that a backend doesn’t support your use case, Triton allows you to implement your own and integrate it easily.

The following diagram shows an example of the NVIDIA Triton architecture.

NVIDIA Triton on SageMaker

SageMaker hosting services are the set of SageMaker features aimed at making model deployment and serving easier. They provide a variety of options to easily deploy, auto scale, monitor, and optimize ML models tailored for different use cases. This means that you can optimize your deployments for all types of usage patterns, from persistent and always-available endpoints to serverless options, and from transient or long-running jobs to batch inference needs.

Under the SageMaker hosting umbrella is also the set of SageMaker inference Deep Learning Containers (DLCs), which come prepackaged with the appropriate model server software for their corresponding supported ML framework. This enables you to achieve high inference performance with no model server setup, which is often the most complex technical aspect of model deployment and in general isn’t part of a data scientist’s skill set. Triton Inference Server is now available in SageMaker DLCs.

This breadth of options, modularity, and ease of use of different serving frameworks makes SageMaker and Triton a powerful match.

NVIDIA FIL backend support

With the 22.05 version release of Triton, NVIDIA now supports forest models trained by several popular ML frameworks, including XGBoost, LightGBM, Scikit-learn, and cuML. When using the FIL backend for Triton, you should ensure that the model artifacts that you provide are supported. For example, FIL supports model_type xgboost, xgboost_json, lightgbm, or treelite_checkpoint, indicating whether the provided model is in XGBoost binary format, XGBoost JSON format, LightGBM text format, or Treelite binary format, respectively.

This backend support is essential for our example because FIL supports XGBoost models. The only consideration is to ensure that the model we deploy is saved in a format FIL supports, in this case the XGBoost binary or JSON format.

In addition to ensuring that you have the proper model format, other considerations should be taken into account. The FIL backend for Triton provides configurable options for developers to tune their workloads and optimize model run performance. The dynamic_batching configuration allows Triton to hold client-side requests and batch them on the server side in order to efficiently use FIL’s parallel computation to run inference on the entire batch together. The option max_queue_delay_microseconds offers a fail-safe control of how long Triton waits to form a batch. FIL comes with a Shapley explainer, which can be activated by the treeshap_output configuration; however, you should keep in mind that Shapley outputs hurt performance due to their output size. Another important aspect is storage_type, which trades off memory footprint against runtime. For example, using SPARSE storage can reduce memory consumption, whereas DENSE can improve model run performance at the expense of higher memory usage. Deciding the best choice for each of these depends on your workload and your latency budget, so we recommend a deeper look at all options in the FIL backend FAQ and the list of configurations available in FIL.
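
To make these options concrete, the following is a minimal sketch of a config.pbtxt for an XGBoost model on the FIL backend, written out from Python for convenience. The tensor names, dimensions, instance group, and parameter values are illustrative assumptions, not the exact configuration from the example notebooks; validate them against the FIL backend documentation for your Triton version.

```python
from pathlib import Path

# Illustrative config.pbtxt for an XGBoost model served by the FIL backend.
# Names, dims, and values below are assumptions for this sketch; adjust them
# to your model, hardware, and latency budget.
fil_config = """
backend: "fil"
max_batch_size: 32768
input [{ name: "input__0" data_type: TYPE_FP32 dims: [ 15 ] }]
output [{ name: "output__0" data_type: TYPE_FP32 dims: [ 1 ] }]
instance_group [{ count: 1 kind: KIND_GPU }]
parameters [
  { key: "model_type" value: { string_value: "xgboost_json" } },
  { key: "output_class" value: { string_value: "true" } },
  { key: "storage_type" value: { string_value: "SPARSE" } },
  { key: "treeshap_output" value: { string_value: "false" } }
]
dynamic_batching { max_queue_delay_microseconds: 100 }
"""

model_dir = Path("model_repository/fil")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(fil_config)
```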

Steps to host a model on Triton

Let’s look at our fraud detection use case as an example of what to consider when moving a workload to Triton.

Identify your workload

In this use case, we have a fraud detection model used during the checkout process of a retail customer. The inference pipeline uses an XGBoost algorithm with preprocessing logic that prepares the raw data for the model.

Identify current and target performance metrics and other goals that may apply

You may find that your end-to-end inference time takes too long to be acceptable. Your goal could be to go from tens of milliseconds of latency to single-digit millisecond latency for the same volume of requests and respective throughput. You determine that the bulk of the time is consumed by data preprocessing and the XGBoost model. Other factors such as network and payload size play a minimal role in the overhead associated with the end-to-end inference time.

Work backward to determine if Triton can host your workload based on your requirements

To determine if Triton can meet your requirements, you want to pay attention to two main areas of concern. The first is to ensure that Triton can serve your workload with an acceptable front-end option, such as HTTP, gRPC, or the C API.

As mentioned previously, it’s also critical to determine if Triton supports a backend that can serve your artifacts. Triton supports a number of backends that are tailor-made to support various frameworks like PyTorch and TensorFlow. Check to ensure that your models are supported and that you have the proper model format that Triton expects. To do this, first check to see what model formats the Triton backend supports. In many cases, this doesn’t require any changes for the model. In other cases, your model may require transformation to a different format. Depending on the source and target format, various options exist, such as transforming a Python pickle file to use Treelite’s binary checkpoint format.

For this use case, we determine the FIL backend can support the XGBoost model with no changes needed and that we can use the Python backend for the preprocessing. With the ensemble feature of Triton, you can further optimize your workload by avoiding costly network calls between hosting instances.

Create a plan and estimate the effort required to use Triton for hosting

Let’s talk about the plan to move your models to Triton. Every Triton deployment requires the following:

  • Model artifacts required by Triton backends
  • Triton configuration files
  • A model repository folder with the proper structure

We show an example of how to create these deployment dependencies later in this post.

Run the plan and validate the results

After you create the required files and artifacts in the properly structured model repository, you need to tune your deployment and test it to validate that you have now hit your target metrics.

At this point, you can use SageMaker Inference Recommender to determine which endpoint instance type is best for you based on your requirements. In addition, Triton provides tools to help you tune your model configuration for better performance.

Implementation

Now let’s look at the implementation details. For this, we have prepared two notebooks that provide an example of what can be expected. The first notebook shows the training of the given XGBoost model as well as the preprocessing logic used for both training and inference time. The second notebook shows how we prepare the artifacts needed for deployment on Triton.

The first notebook represents an existing notebook your organization has that uses the RAPIDS suite of libraries and the RAPIDS Conda kernel. It runs on a G4dn instance type provided by AWS, which is GPU accelerated by NVIDIA T4 GPUs.

Preprocessing tasks in this example benefit from GPU acceleration and heavily use the cuML and cuDF libraries. An example of this is in the following code, where we show categorical label encoding using cuML. We also serialize the encoders to a label_encoders.pkl file so that we can reuse them for preprocessing at inference time.

The first notebook concludes by training our XGBoost model and saving the artifacts accordingly.

In this scenario, the training code already existed and no changes are needed for the model at training time. Additionally, although we used GPU acceleration for preprocessing during training, we plan to use CPUs for preprocessing at inference time. We explain more later in the post.

Let’s now move on to the second notebook and recall what we need for a successful Triton deployment.

First, we need the model artifacts required by backends. The files that we need to create for this ensemble include:

  • Preprocessing artifacts (model.py, label_encoders.pkl)
  • XGBoost model artifacts (xgboost.json)

The Python backend in Triton requires us to use a Conda environment as a dependency. In this case, we use the Python backend to preprocess the raw data before feeding it into the XGBoost model being run in the FIL backend. Even though we originally used RAPIDS cuDF and cuML libraries to do the data preprocessing (as referenced earlier using our GPU), here we use Pandas and Scikit-learn as preprocessing dependencies for inference time (using our CPU). We do this for three reasons:

  • To show how to create a Conda environment for your dependencies and how to package it in the format expected by Triton’s Python backend.
  • By showing the preprocessing model running in the Python backend on the CPU while the XGBoost model runs on the GPU in the FIL backend, we illustrate how each model in Triton’s ensemble pipeline can run on a different framework backend, and run on different hardware with different configurations.
  • It highlights how the RAPIDS libraries (cuDF, cuML) are compatible with their CPU counterparts (Pandas, Scikit-learn). This way, we can show how LabelEncoders created in cuML can be used in Scikit-learn and vice-versa. Note that if you expect to preprocess large amounts of tabular data during inference time, you can still use RAPIDS to GPU-accelerate it.

Recall that we created the label_encoders.pkl file in the first notebook. There’s nothing more to do for category encoding other than include it in our model.py file for preprocessing.

To create the model.py file required by the Triton Python backend, we adhere to the formatting required by the backend and include our Python logic to process the incoming tensor and use the label encoder referenced earlier. You can review the file used for preprocessing.
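
The full preprocessing logic lives in the example notebook; as a rough, hedged sketch of the structure the Python backend expects, a model.py defines a TritonPythonModel class with initialize and execute methods. The tensor names, encoder file path, and column handling below are assumptions for illustration and must match your config.pbtxt and data.

```python
import pickle

import numpy as np
import pandas as pd
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the encoders serialized in the first notebook (path is illustrative).
        with open("/opt/ml/model/preprocessing/1/label_encoders.pkl", "rb") as f:
            self.label_encoders = pickle.load(f)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Tensor names are assumptions; they must match the model's config.pbtxt.
            raw = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            frame = pd.DataFrame(raw)

            # Apply each saved encoder to its column (column mapping is illustrative).
            for column, encoder in self.label_encoders.items():
                if column in frame.columns:
                    frame[column] = encoder.transform(frame[column])

            output = pb_utils.Tensor("OUTPUT", frame.to_numpy().astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses
```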

For the XGBoost model, nothing more needs to be done. We trained the model in the first notebook and Triton’s FIL backend requires no additional effort for XGBoost models.

Next, we need the Triton configuration files. Each model in the Triton ensemble requires a config.pbtxt file. In addition, we also create a config.pbtxt file for the ensemble as a whole. These files provide Triton with metadata about the ensemble, such as the inputs and outputs we expect, and help define the DAG associated with the ensemble.

Lastly, to deploy a model on Triton, we need our model repository folder to have the proper folder structure. Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own sub-directory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric sub-directory representing a version of the model. For our use case, the resulting structure should look like the following.

After we have these three prerequisites, we create a compressed file as packaging for deployment and upload it to Amazon Simple Storage Service (Amazon S3).
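
A hedged sketch of the packaging and upload step, assuming a local model_repository/ folder laid out as described (folder and bucket names are placeholders):

```python
import tarfile

import boto3

# Expected layout built in the previous steps (names are illustrative):
# model_repository/
#   ensemble/        config.pbtxt, 1/
#   preprocessing/   config.pbtxt, 1/model.py, packaged Conda environment
#   fil/             config.pbtxt, 1/xgboost.json

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model_repository", arcname=".")

bucket = "my-triton-artifacts-bucket"  # placeholder bucket name
key = "triton/fraud-ensemble/model.tar.gz"
boto3.client("s3").upload_file("model.tar.gz", bucket, key)
model_data_url = f"s3://{bucket}/{key}"
```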

We can now create a SageMaker model from the model repository we uploaded to Amazon S3 in the previous step.

In this step, we also provide the additional environment variable SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to Amazon S3. This variable is optional in the case of a single model. In the case of ensemble models, this key has to be specified for Triton to start up in SageMaker.

Additionally, you can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT and SAGEMAKER_TRITON_THREAD_COUNT to optimize the thread counts. Both configuration values help tune the number of threads that are running on your CPUs, so you can possibly gain better utilization by increasing these values for CPUs with a greater number of cores. In the majority of cases, the default values work well, but it may be worth experimenting to see if further efficiency can be gained for your workloads.
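
As a hedged boto3 sketch of this step, with a placeholder Triton DLC image URI, IAM role, and the S3 location from the previous step:

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholders: look up the SageMaker Triton container image URI for your Region,
# and substitute your own execution role and model data location.
triton_image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>"
execution_role_arn = "arn:aws:iam::<account>:role/<sagemaker-execution-role>"
model_data_url = "s3://my-triton-artifacts-bucket/triton/fraud-ensemble/model.tar.gz"

sm.create_model(
    ModelName="fraud-detection-triton",
    ExecutionRoleArn=execution_role_arn,
    PrimaryContainer={
        "Image": triton_image_uri,
        "ModelDataUrl": model_data_url,
        "Environment": {
            # Must match the ensemble's folder name inside the model repository.
            "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble",
            # Optional thread-count tuning discussed above (values are examples).
            "SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT": "4",
            "SAGEMAKER_TRITON_THREAD_COUNT": "8",
        },
    },
)
```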

With the preceding model, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint.

Lastly, we use the preceding endpoint configuration to create a new SageMaker endpoint and wait for the deployment to finish. The status changes to InService after the deployment is successful.
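
A minimal sketch of these last two steps, followed by a simple request you can send once the endpoint is InService. The instance type, endpoint names, input name, shape, and content type are assumptions for this sketch and must match your workload and the ensemble's config.pbtxt.

```python
import json

import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Endpoint configuration: instance type and count are assumptions for this sketch.
sm.create_endpoint_config(
    EndpointConfigName="fraud-detection-triton-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "fraud-detection-triton",
            "InstanceType": "ml.g4dn.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create the endpoint and block until it reaches InService.
sm.create_endpoint(
    EndpointName="fraud-detection-triton-endpoint",
    EndpointConfigName="fraud-detection-triton-config",
)
sm.get_waiter("endpoint_in_service").wait(EndpointName="fraud-detection-triton-endpoint")

# Simple test request using Triton's KServe-style JSON inference format.
payload = {
    "inputs": [
        {
            "name": "INPUT",       # must match the ensemble input in config.pbtxt
            "shape": [1, 15],
            "datatype": "FP32",
            "data": [0.0] * 15,    # placeholder feature values
        }
    ]
}
response = runtime.invoke_endpoint(
    EndpointName="fraud-detection-triton-endpoint",
    ContentType="application/octet-stream",  # check the SageMaker Triton container docs
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```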

That’s it! Your endpoint is now ready for testing and validation. At this point, you may want to use various tools to help optimize your instance types and configuration to get the best possible performance. The following figure provides an example of the gains that can be achieved by using the FIL backend for an XGBoost model on Triton.

Summary

In this post, we walked you through deploying an XGBoost ensemble workload to SageMaker with Triton Inference Server. Moving workloads to Triton on SageMaker can yield a highly beneficial return on investment. As with any adoption of technology, a vetting process and plan are key, and we detailed a five-step process to guide you through what to consider when moving your workloads. In addition, we dove deep into the steps needed to deploy an ensemble that uses Python preprocessing and an XGBoost model on Triton on SageMaker.

SageMaker provides the tools to remove the undifferentiated heavy lifting from each stage of the ML lifecycle, thereby facilitating the rapid experimentation and exploration needed to fully optimize your model deployments. SageMaker hosting support for Triton Inference Server enables low-latency, high transactions per second (TPS) workloads.

We highly recommend evaluating Triton Inference Server on SageMaker hosting for your inference needs; it can be well worth the effort to move your existing models to take advantage of this technology.

You can find the notebooks used for this example on GitHub.


About the author

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Bruno Aguiar de Melo is a Software Development Engineer at Amazon.com, where he helps science teams to build, deploy and release ML workloads. He is interested in instrumentation and controllable aspects within the ML modelling/design phase that must be considered and measured with the insight that model execution performance is just as important as model quality performance, particularly in latency constrained use cases. In his spare time, he enjoys wine, board games and cooking.

Eliuth Triana is a Developer Relations Manager at NVIDIA. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon ML/DL workloads, EC2 products, and AWS AI services. In addition, Eliuth is a passionate mountain biker, skier, and poker player.


Build a multi-lingual document translation workflow with domain-specific and language-specific customization

In the digital world, providing information in a local language isn’t novel, but it can be a tedious and expensive task. Advancements in machine learning (ML) and natural language processing (NLP) have made this task much easier and less expensive.

We have seen increased adoption of ML for multi-lingual data and document processing workloads. Enterprise and government customers are migrating their manual translation workloads to take advantage of automated ML translation services. Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation between several thousand language pairings that can be used for synchronous (real-time) or asynchronous translation tasks. For a complete list of available translation pairs, refer to Supported languages and language codes.

Customers migrating and modernizing their translation workloads need the ability to customize translations for their business domain. A translation workload may also need the ability to adapt to regional language dialects or usage. For example, the Spanish translation of "elderly" is anciano(a), but in Puerto Rico the word envejeciente is preferred.

In this post, we demonstrate how to incorporate Amazon Translate’s Active Custom Translation (ACT) feature. We propose a solution to create a multi-lingual document translation workflow with domain- and language-specific customizations that you can review and augment as needed to continuously improve results and delight end-users.

Solution overview

ACT produces custom-translated output without the need to build and maintain a custom translation model. Using ACT, Amazon Translate uses your preferred translation examples as parallel data to customize your translation results, eliminating the time and cost required to build and train a new machine learning model.

The solution covered in this post explains how to create a human-in-the-loop workflow using Amazon Augmented AI (Amazon A2I) to continuously improve the customized translation. Amazon A2I provides a simple way to integrate human oversight into your ML workflows, with no ML experience required. Amazon A2I makes it straightforward to integrate human judgement and AI into any ML application, regardless of whether it’s run on AWS or on another platform.

For more information, refer to the post Designing human review workflows with Amazon Translate and Amazon Augmented AI.

The following diagram displays the command flow and data flow of the solution. The command flow shows the logical sequence of events in the workflow. A data flow indicates how data is being created or used by various components in the solution.

The following sequence diagram shows two separate processes in the solution: the translation workflow (A) and the process to update parallel data (B).

The translation workflow is initiated by an Amazon CloudWatch scheduled event which starts the Translation Job Invoker AWS Lambda function. This function creates an asynchronous translation job in Amazon Translate, passing along the document to translate and the location of the parallel data to customize the translation. The translation job reads the parallel data, performs the translation, and writes the translated result back to an Amazon S3 bucket. As of this writing, only asynchronous translation jobs can use parallel data.
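
A hedged sketch of the call the Translation Job Invoker function makes, with placeholder job name, S3 locations, role, and language codes (the actual Lambda in the CloudFormation stack reads these values from its environment and the stack outputs):

```python
import uuid

import boto3

translate = boto3.client("translate")

translate.start_text_translation_job(
    JobName="document-translation-job",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://my-translation-bucket/input/",   # documents to translate
        "ContentType": "text/plain",
    },
    OutputDataConfig={"S3Uri": "s3://my-translation-bucket/output/"},
    DataAccessRoleArn="arn:aws:iam::<account>:role/<translate-data-access-role>",
    SourceLanguageCode="en",
    TargetLanguageCodes=["es"],
    ParallelDataNames=["my-parallel-data"],  # customization data maintained by the workflow
    ClientToken=str(uuid.uuid4()),
)
```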

When the translation job is complete, an event is generated that triggers the Translation Job Completion Handler Lambda function. This function creates a human workflow loop—the main component of the Amazon A2I portion of the workflow.

Human reviewers assess the translation and accept or modify the translation. Any corrections are used to update the translated document and also added to a customization dictionary. When the review is finalized, another event is generated to trigger the Workflow Completion Handler function. This function writes the latest translated document back to Amazon S3. The customization data is used to update an Amazon DynamoDB table with the source and translated text pairs.

To close the loop, we must incorporate this customization data stored in DynamoDB back into the parallel data stored in Amazon S3. To accomplish this, we use a scheduled CloudWatch event to trigger the Parallel Data Refresher function, which reads the data from the DynamoDB table, reformats it as parallel data, and updates the S3 bucket, storing the parallel data.
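
The refresh itself comes down to a single Amazon Translate API call once the TMX file is in Amazon S3; a minimal sketch, assuming the parallel data resource already exists and using placeholder names:

```python
import uuid

import boto3

translate = boto3.client("translate")

translate.update_parallel_data(
    Name="my-parallel-data",  # placeholder; must match the name used by the translation job
    ParallelDataConfig={
        "S3Uri": "s3://my-translation-bucket/parallel_data/customizations.tmx",
        "Format": "TMX",
    },
    ClientToken=str(uuid.uuid4()),
)
```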

Deploy the solution with AWS CloudFormation

Launch the provided AWS CloudFormation template to deploy the solution in your account. This stack only works in the us-east-1 Region. If you want to deploy this solution in other Regions, refer to the following GitHub repo.

  1. Choose Launch Stack:
  2. Follow the instructions to populate the necessary parameters. If you’re running this stack for the first time, SNS Email is the only required parameter.
  3. On the Review page, in the Capabilities section, select the check box and choose Create stack.

The stack creates the following key components:

  • Customization data – A DynamoDB table (translate_parallel_data) to maintain the customization data. You migrate the existing customization data to this table. This table is used to continuously add and update customizations.
  • Parallel Data Refresher – The Lambda function to convert the customization data in the DynamoDB table to a parallel data format (CSV, TSV, or TMX) and store it in Amazon S3. It then creates or updates the Amazon Translate parallel data resource with the new parallel data file in Amazon S3.
  • Translation Job Invoker – The Lambda function to start the Amazon Translate batch job with parallel data.
  • Translation Job Completion Handler – This Lambda function is triggered when the Amazon Translate batch job is complete. The function creates one human loop per document (we’ll refine this in the future to create a human loop only for a select percentage of documents processed). It uses the original and translated documents to create the human loop.
  • Amazon A2I customized template – This template is used to render the translation pair for human review. The template has the Add option for every translation segment. Users can select this option to add the corrections to the customization data. The new customization data is used in the next batch translation job.
  • Workflow Completion Handler – This Lambda function is triggered when the human workflow is complete. The function updates the translated document with corrections and checks for parallel data updates. New parallel data is added to the DynamoDB table.
  • Amazon A2I private team – An Amazon A2I private team is created with a human worker using the email provided. Initial credentials are emailed upon successful creation of the private team. You use this email and credential to log in to the Amazon A2I worker portal.

Test the solution

The sample_text.txt file is created under the input prefix of the S3 bucket created by the stack. We use this file for our testing. It contains the following content:

Life insurance companies have the freedom to charge different premiums based on risk
factors that predict mortality. Purchasing a life insurance policy often entails a health 
status check or medical exam, and asking for vaccination status is not banned.

Health insurers are a different story. A slew of state and federal regulations in the 
last three decades have heavily restricted their ability to use health factors in issuing 
or pricing polices. The use of health status in any group health insurance policy is 
prohibited by law. The Affordable Care Act, passed in 2014, prevents insurers from pricing 
plans according to health – with one exception: smoking status.

To test the solution, complete the following steps:

  1. Invoke the Translation Job Invoker function manually, or wait for it to be triggered by CloudWatch based on the cron schedule you specified.
    This function triggers the Amazon Translate batch job. You can observe the progress of the job on the Amazon Translate console.
    This batch job takes approximately 30 minutes to complete. When it’s complete, the TextTranslationJob state change event triggers the Translation Job Completion Handler function. This function creates one human loop per translated document.
  2. Navigate to the Amazon A2I workforces page.
  3. Choose the Private tab.
  4. Log in to the Amazon A2I worker portal by choosing the link for Labelling portal sign-in URL.
  5. Select the Human review task in the jobs list.
  6. Choose Start working.

    You can see the following page displayed.
  7. Follow the instructions to make domain- and language-specific corrections.
    In the preceding screenshot, the phrase “The use of health status in any group health insurance policy is prohibited by law” has been translated to “La ley prohíbe el uso del estado de salud en cualquier póliza de seguro médico de grupo.” Although the translation is accurate, the phrases have been rearranged.
  8. Let’s modify this to “El uso del estado de salud en cualquier póliza de seguro de salud grupal está prohibido por ley” to make this a more direct translation reflecting the original phraseology.
  9. Select Add to add this to the dictionary.
  10. When you’re done, choose Submit.

This triggers the Workflow Completion Handler function, and the customization data is updated in the DynamoDB table. The function also stores the corrected translation under the post-edits prefix.

You can observe the customizations being added to the translate_parallel_data table on the DynamoDB console.

Command flow

The Parallel Data Refresher function is triggered every hour by a CloudWatch scheduled event. This function checks for new updates in the translate_parallel_data table, creates a new parallel data TMX file in Amazon S3 under the parallel_data prefix, and updates the Amazon Translate parallel data component. You can trigger this function manually if you don’t want to wait for the scheduled event trigger.

You can observe the parallel data being updated on the Amazon Translate console.

When it’s complete, the job status should be Active and the value for Updated records should reflect the number of customizations you added (in this case 1).

Now we can run the translation job again with the updated data. Trigger the Translation Job Invoker function again to observe the customization being added to the translation in the second iteration. Amazon Translate now uses the parallel data provided to customize the translation.

You can observe the change in the translation output in the labeling portal. Instead of the default translation, we see the customized translation being applied.

This workflow helps create a virtuous cycle to continuously improve translation output using Amazon A2I and Amazon Translate customization features.

Cost

With Amazon Translate and Amazon A2I, you pay as you go based on the number of text characters that you processed and for each human-reviewed object. We use DynamoDB on-demand mode for this example. DynamoDB charges you for the reads and writes performed on your tables. Refer to the pricing pages for Amazon Translate, Amazon A2I, and Amazon DynamoDB for actual costs.

Clean up

When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete all the resources deployed in this example. This helps you avoid continuing costs in your account.

Conclusion

You can use the solution presented in this post to build a multi-lingual translation workflow that uses and augments domain-specific customization incrementally to continuously improve translation results. We provided a simple mechanism to integrate your existing customization assets with managed AI services like Amazon Translate and Amazon A2I to build a robust translation service for your application. Amazon Translate can help you scale this solution to support over 5,550 translation pairs out of the box. Amazon A2I can help you easily integrate with your in-house linguistic expert or take advantage of an external workforce to scale the solution.

For more information about Amazon Translate, visit Amazon Translate resources to find video resources and blog posts, and refer to the Amazon Translate FAQs. Please share your thoughts with us in the comments section, or in the issues section of the project’s GitHub repository.


About the Authors

Sathya Balakrishnan is a Sr Customer Delivery Architect in the Professional Services team at AWS, specializing in Data/ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.

Paul W. Joireman is a Sr Customer Delivery Architect in Professional Services at AWS, specializing in Application Migration and working with US federal financial clients. Paul enjoys creating technology solutions, traveling with family and hiking in the Shenandoah National Park, as long as the hike finishes at a local craft brewery.


AWS Deep Learning Challenge sees innovative and impactful use of Amazon EC2 DL1 instances

In the AWS Deep Learning Challenge held from January 5, 2022, to March 1, 2022, participants from academia, startups, and enterprise organizations joined to test their skills and train a deep learning model of their choice using Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and Habana’s SynapseAI SDK. The EC2 DL1 instances powered by Gaudi accelerators from Habana Labs, an Intel company, are designed specifically for training deep learning models. Participants were able to realize the significant price/performance benefits that DL1 offers over GPU-based instances.

We are excited to announce the winners and showcase some of the machine learning (ML) models that were trained in this hackathon. You will learn about some of the deep learning use cases that are supported by EC2 DL1 instances, including computer vision, natural language processing, and acoustic modeling.

Winning models

Our first-place winner is a project submitted by Gustavo Zomer. It’s an implementation of multi-lingual CLIP (Contrastive Language-Image Pre-Training). CLIP was introduced by OpenAI in 2021 as a way to train a more generalizable image classifier across larger datasets through self-supervised learning. It’s trained on a large set of images with a wide variety of natural language supervision that’s abundantly available on the internet, but is limited to the English language. This project replaces the text encoder in CLIP with a multi-lingual text encoder called XLM-RoBERTa to broaden the model’s applicability to multiple languages. This modified implementation of CLIP is able to pair images with captions across multiple languages. The model was trained on 16 accelerators across two DL1 instances, showing how ML training can be scaled to use multiple Gaudi accelerators across multiple nodes to increase training throughput and reduce the time to train. The judges were impressed by the impactful use of deep learning to break down language barriers, and the technical implementation, which used distributed training.

In second place, we have a project submitted by Remco van Akker. It uses a GAN (Generative Adversarial Network) to generate synthetic retinal image data for medical applications. Synthetic data is used in model training in medical applications to overcome the scarcity of annotated medical data, which is labor-intensive and costly to produce. Synthetic data can be used as part of data augmentation to remove biases and make vision models in medical applications more generalizable. This project stood out because it implemented a generative model on DL1 to solve a real-world problem impacting the application of AI and ML in healthcare.

Rounding out our top three was a project submitted by Zohar Jackson that implemented a vision transformer model for semantic segmentation. This project uses the Ray Tune library to fine-tune hyperparameters and uses Horovod to parallelize training on 16 Gaudi accelerators across two DL1 instances.

In addition to the top three winners, participants won several other prizes, including best technical implementation, highest potential impact, and most creative project. We offer our congratulations to all the winners of this hackathon for building such a diverse set of impactful projects on Gaudi accelerator-based EC2 DL1 instances. We can’t wait to see what our participants will continue to build on DL1 instances going forward.

Get started with DL1 instances

As demonstrated by the various projects in this hackathon, you can use EC2 DL1 instances to train deep learning models for use cases such as natural language processing, object detection, and image recognition. With DL1 instances, you also get up to 40% better price/performance for training deep learning models compared to current generation GPU-based EC2 instances. Visit Amazon EC2 DL1 Instances to learn more about how DL1 instances can accelerate your training workloads.


About the authors

Dvij Bajpai is a Senior Product Manager at AWS. He works on developing EC2 instances for workloads in machine learning and high-performance computing.

Amr Ragab is a Principal Solutions Architect at AWS. He provides technical guidance to help customers run complex computational workloads at scale.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt EC2 accelerated computing infrastructure for their machine learning needs.


Conduct what-if analyses with Amazon Forecast, up to 80% faster than before

Now with Amazon Forecast, you can seamlessly conduct what-if analyses up to 80% faster to analyze and quantify the potential impact of business levers on your demand forecasts. Forecast is a service that uses machine learning (ML) to generate accurate demand forecasts, without requiring any ML experience. Simulating scenarios through what-if analyses is a powerful business tool to navigate through the uncertainty of future events by capturing possible outcomes from hypothetical scenarios. It’s a common practice to assess the impact of business decisions on revenue or profitability, quantify the risk associated with market trends, evaluate how to organize logistics and workforce to meet customer demand, and much more.

Conducting a what-if analysis for demand forecasting can be challenging because you first need accurate models to forecast demand and then a quick and easy way to reproduce the forecast across a range of scenarios. Until now, although Forecast provided accurate demand forecasts, conducting what-if analysis using Forecast could be cumbersome and time-consuming. For example, retail promotion planning is a common application of what-if analysis to identify the optimal price point for a product to maximize the revenue. Previously on Forecast, you had to prepare and import a new input file for each scenario you wanted to test. If you wanted to test three different price points, you first had to create three new input files by manually transforming the data offline and then importing each file into Forecast separately. In effect, you were doing the same set of tasks for each and every scenario. Additionally, to compare scenarios, you had to download the prediction from each scenario individually and then merge them offline.

With today’s launch, you can easily conduct what-if analysis up to 80% faster. We have made it easy to create new scenarios by removing the need for offline data manipulation and import for each scenario. Now, you can define a scenario by transforming your initial dataset through simple operations, such as multiplying the price for product A by 90% or decreasing the price for product B by $10. These transformations can also be combined with conditions to control the scope in which the scenario applies (for example, reducing product A’s price in one location only). With this launch, you can define and run multiple scenarios of the same type of analysis (such as promotion analysis) or different types of analyses (such as promotion analysis in geographical region 1 and inventory planning in geographical region 2) simultaneously. Lastly, you no longer need to merge and compare results of scenarios offline. Now, you can view the forecast predictions across all scenarios in the same graph or bulk export the data for offline review.

Solution overview

The steps in this post demonstrate how to use what-if analysis on the AWS Management Console. To directly use Forecast APIs for what-if analysis, follow the notebook in our GitHub repo that provides an analogous demonstration.

Import your training data

To conduct a what-if analysis, you must import two CSV files representing the target time series data (showing the prediction target) and the related time series data (showing attributes that impact the target). Our example target time series file contains the product item ID, timestamp, demand, store ID, city, and region, and our related time series file contains the product item ID, store ID, timestamp, city, region, and price.

To import your data, complete the following steps:

  1. On the Forecast console, choose View dataset groups.

Figure 1: View dataset group on the Amazon Forecast home page

  2. Choose Create dataset group.

Figure 2: Creating a dataset group

  3. For Dataset group name, enter a dataset name (for this post, my_company_consumer_sales_history).
  4. For Forecasting domain, choose a forecasting domain (for this post, Retail).
  5. Choose Next.

Figure 3: Provide a dataset name and select your forecasting domain

  6. On the Create target time series dataset page, provide the dataset name, frequency of your data, and data schema.
  7. Provide the dataset import details.
  8. Choose Start.

The following screenshot shows the information for the target time series page filled out for our example.

Figure 4: Sample information filled out for the target time series data import page

You will be taken to the dashboard that you can use to track progress.

  9. To import the related time series file, on the dashboard, choose Import.

Figure 5: Dashboard that allows you to track progress

  10. On the Create related time series dataset page, provide the dataset name and data schema.
  11. Provide the dataset import details.
  12. Choose Start.

The following screenshot shows the information filled out for our example.

Figure 6: Sample information filled out for the related time series data import page

Train a predictor

Next, we train a predictor.

  1. On the dashboard, choose Train predictor.

Figure 7: Dashboard of completed dataset import step and button to train a predictor

  2. On the Train predictor page, enter a name for your predictor, how long in the future you want to forecast and at what frequency, and the number of quantiles you want to forecast for.
  3. Enable AutoPredictor – this is required to use what-if analysis.
  4. Choose Create.

The following screenshot shows the information filled out for our example.

Figure 8: Sample information filled out to train a predictor

Create a forecast

After our predictor is trained (this can take approximately 2.5 hours), we create a forecast. You will know that your predictor is trained when you see the View Predictors button on your dashboard.

  1. Choose Create a forecast on the dashboard.

Figure 9: Dashboard of completed train predictor step and button to create a forecast

  2. On the Create a forecast page, enter a forecast name, choose the predictor that you created, and specify the forecast quantiles (optional) and the items to generate a forecast for.
  3. Choose Start.

Figure 10: Sample information filled out to create a forecast

After you complete these steps, you have successfully created a forecast. This represents your baseline forecast scenario that you use to do what-if analyses on.

If you need more help creating your baseline forecasts, refer to Getting Started (Console). We now move to the next steps of conducting a what-if analysis.

Create a what-if analysis

At this point, we have created our baseline forecast and will start the walkthrough of how to conduct a what-if analysis. There are three stages to conducting a what-if analysis: setting up the analysis, creating the what-if forecast by defining what is changed in the scenario, and comparing the results.

  1. To set up your analysis, choose Explore what-if analysis on the dashboard.

Figure 11: Dashboard of complete create forecast step and button to start what-if analysis

  2. Choose Create.

Figure 12: Page to create a new what-if analysis

  3. Enter a unique name and select the baseline forecast on the drop-down menu.
  4. Choose the items in your dataset you want to conduct a what-if analysis for. You have two options:
    1. Select all items is the default, which we choose in this post.
    2. If you want to pick specific items, choose Select items with a file and import a CSV file containing the unique identifier for the corresponding item and any associated dimension (such as region).
  5. Choose Create what-if analysis.

Figure 13: Option to specify items to conduct what-if analysis for and button to create the analysis

Create a what-if forecast

Next, we create a what-if forecast to define the scenario we want to analyze.

  1. Choose Create.

Figure 14: Creating a what-if forecast

  2. Enter a name for your scenario.

You can define your scenario through two options:

  • Use transformation functions – Use the transformation builder to transform the related time series data you imported. For this walkthrough, we evaluate how the demand for an item in our dataset changes when the price is reduced by 10% and then by 30% when compared to the price in the baseline forecast.
  • Define the what-if forecast with a replacement dataset – Replace the related time series dataset you imported.
Figure 15: Options to define a scenario

The transformation function builder provides the capability to transform the related time series data you imported earlier through simple operations to add, subtract, divide, and multiply features in your data (for example price) by a value you specify. For our example, we create a scenario where we reduce the price by 10%, and price is a feature in the dataset.

  3. For What-if forecast definition method, select Use transformation functions.
  4. Choose Multiply as our operator, price as our time series, and enter 0.9.

Figure 16: Using the transformation builder to reduce price by 10%

You can also add conditions to further refine your scenario. For example, if your dataset contained store information organized by region, you could limit the price reduction scenario by region. You could define a scenario of a 10% price reduction that's applicable to stores not in Region_1.

  5. Choose Add condition.
  6. Choose Not equals as the operation and enter Region_1.

Figure 17: Using the transformation builder to reduce price by 10% for stores that are not in region 1

Another option to modify your related time series is by importing a new dataset that already contains the data defining the scenario. For example, to define a scenario with 10% price reduction, we can upload a new dataset specifying the unique identifier for the items that are changing and the price change that is 10% lower. To do so, select Define the what-if forecast with a replacement dataset and import a CSV containing the price change.

Figure 18: Importing a replacement dataset to define a new scenario

  7. To complete the what-if forecast definition, choose Create.

Figure 19: Completing the what-if forecast creation

Repeat the process to create another what-if forecast with a 30% price reduction.

Figure 20: Showing the completed run of the two what-if forecasts

After the what-if analysis has run for each what-if forecast, the status will change to active. This concludes the second stage, and you can move on to comparing the what-if forecasts.
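
If you prefer the API route mentioned at the start of the walkthrough, the same two scenarios can be expressed in a few calls. The following is a hedged sketch with placeholder ARNs; the attribute names must match your dataset schema, and the GitHub notebook remains the authoritative reference.

```python
import boto3

forecast = boto3.client("forecast")

# Set up the analysis on top of the baseline forecast.
analysis = forecast.create_what_if_analysis(
    WhatIfAnalysisName="promo_analysis",
    ForecastArn="arn:aws:forecast:us-east-1:<account>:forecast/<baseline-forecast>",
)

# Scenario 1: reduce price by 10% for stores outside Region_1.
forecast.create_what_if_forecast(
    WhatIfForecastName="Scenario_1",
    WhatIfAnalysisArn=analysis["WhatIfAnalysisArn"],
    TimeSeriesTransformations=[
        {
            "Action": {"AttributeName": "price", "Operation": "MULTIPLY", "Value": 0.9},
            "TimeSeriesConditions": [
                {"AttributeName": "region", "AttributeValue": "Region_1", "Condition": "NOT_EQUALS"}
            ],
        }
    ],
)

# Scenario 2: reduce price by 30% for all items in the analysis.
forecast.create_what_if_forecast(
    WhatIfForecastName="Scenario_2",
    WhatIfAnalysisArn=analysis["WhatIfAnalysisArn"],
    TimeSeriesTransformations=[
        {"Action": {"AttributeName": "price", "Operation": "MULTIPLY", "Value": 0.7}}
    ],
)
```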

Compare the forecasts

We can now compare the what-if forecasts for both of our scenarios: a 10% price reduction and a 30% price reduction.

  1. On the analysis insights page, navigate to the Compare what-if forecasts section.

Figure 21: Inputs required to compare what-if forecasts

  2. For item_id, enter the item to analyze.
  3. For What-if forecasts, choose the scenarios to compare (for this post, Scenario_1 and Scenario_2).
  4. Choose Compare what-if.

Figure 22: Button to generate what-if forecast comparison graph

The following graph shows the resulting demand in both our scenarios.

Figure 23: What-if forecast comparison for scenario 1 and 2

By default, it showcases the P50 and the base case scenario. You can view all quantiles generated by selecting your preferred quantiles on the Choose forecasts drop-down menu.

Export your data

To export your data to CSV, complete the following steps:

  1. Choose Create export.

Figure 24: Creating a what-if forecast export

  2. Enter a name for your export file (for this post, my_scenario_export).
  3. Specify the scenarios to be exported by selecting the scenarios on the What-If Forecast drop-down menu. You can export multiple scenarios at once in a combined file.
  4. For Export location, specify the Amazon Simple Storage Service (Amazon S3) location.
  5. To begin the export, choose Create Export.

Figure 25: Specifying the scenario information and export location for the bulk export

  6. To download the export, first navigate to the S3 file path location from the AWS Management Console, then select the file and choose the download button. The export file will contain the timestamp, item ID, dimensions, and the forecasts for each quantile for all scenarios selected (including the base scenario). An equivalent API call is sketched below.
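
A hedged sketch of that export call via the API, with placeholder ARNs, bucket, and role:

```python
import boto3

forecast = boto3.client("forecast")

forecast.create_what_if_forecast_export(
    WhatIfForecastExportName="my_scenario_export",
    WhatIfForecastArns=[
        "arn:aws:forecast:us-east-1:<account>:what-if-forecast/Scenario_1",
        "arn:aws:forecast:us-east-1:<account>:what-if-forecast/Scenario_2",
    ],
    Destination={
        "S3Config": {
            "Path": "s3://my-forecast-bucket/exports/",
            "RoleArn": "arn:aws:iam::<account>:role/<forecast-s3-access-role>",
        }
    },
)
```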

Conclusion

Scenario analysis is a critical tool to help navigate through the uncertainties of business. It provides foresight and a mechanism to stress-test ideas, leaving businesses more resilient, better prepared, and in control of their future. Forecast now supports what-if scenario analyses. To conduct your scenario analysis, open the Forecast console and follow the steps outlined in this post, or refer to our GitHub notebook on how to access the functionality via the API.

To learn more, refer to the CreateWhatIfAnalysis page in the developer guide.


About the authors

Brandon Nair is a Sr. Product Manager for Amazon Forecast. His professional interest lies in creating scalable machine learning services and applications. Outside of work he can be found exploring national parks, perfecting his golf swing or planning an adventure trip.

Akhil Raj Azhikodan is a Software Development Engineer working on Amazon Forecast. His interests are in designing and building reliable systems that solve complex customer problems. Outside of work, he enjoys learning about history, hiking and playing video games.

Conner Smith is a Software Development Engineer working on Amazon Forecast. He focuses on building secure, scalable distributed systems that provide value to customers. Outside of work he spends time reading fiction, playing guitar, and watching random YouTube videos.

Shannon Killingsworth is the UX Designer for Amazon Forecast. He has been improving the user experience in Forecast for two years by simplifying processes as well as adding new features in ways that make sense to our users. Outside of work he enjoys running, drawing, and reading.


Intelligently search Alfresco content using Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). With Amazon Kendra, you can easily aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer. Many organizations use the content management platform Alfresco to store their content. One of the key requirements for many enterprise customers using Alfresco is the ability to easily and securely find accurate information across all the documents in the data source.

We are excited to announce the public preview of the Amazon Kendra Alfresco connector. You can index Alfresco content, filter the types of content you want to index, and easily search your data in Alfresco with Amazon Kendra intelligent search and its Alfresco OnPrem connector.

This post shows you how to use the Amazon Kendra Alfresco OnPrem connector to configure the connector as a data source for your Amazon Kendra index and search your Alfresco documents. Based on the configuration of the Alfresco connector, you can synchronize the connector to crawl and index different types of Alfresco content such as wikis and blogs. The connector also ingests the access control list (ACL) information for each file. The ACL information is used for user context filtering, where search results for a query are filtered by what a user has authorized access to.

Prerequisites

To try out the Amazon Kendra connector for Alfresco using this post as a reference, you need the following:

Configure the data source using the Amazon Kendra connector for Alfresco

To add a data source to your Amazon Kendra index using the Alfresco OnPrem connector, you can use an existing index or create a new index. Then complete the following steps. For more information on this topic, refer to the Amazon Kendra Developer Guide.

  1. On the Amazon Kendra console, open your index and choose Data sources in the navigation pane.
  2. Choose Add data source.
  3. Under Alfresco, choose Add connector.
  4. In the Specify data source details section, enter a name and description and choose Next.
  5. In the Define access and security section, for Alfresco site URL, enter the Alfresco host name.
  6. To configure the SSL certificates, you can create a self-signed certificate for this setup utilizing openssl x509 -in pattern.pem -out alfresco.crt and add this certificate to an Amazon Simple Storage Service (Amazon S3) bucket. Choose Browse S3 and choose the S3 bucket with the SSL certificate.
  7. For Site ID, enter the Alfresco site ID where you want to search documents.
  8. Under Authentication, you have two options:
    1. Use Secrets Manager to create new Alfresco authentication credentials. You need an Alfresco admin user name and password.
    2. Use an existing Secrets Manager secret that has the Alfresco authentication credentials you want the connector to access.
  9. Choose Save and add secret.
  10. For IAM role, choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
  11. Choose Next.
  12. In the Configure sync settings section, provide information about your sync scope and run schedule.
    You can include the files to be crawled using inclusion patterns or exclude them using exclusion patterns.
  13. Choose Next.
  14. In the Set field mappings section, you can optionally configure the field mappings to specify how the Alfresco field names are mapped to Amazon Kendra attributes or facets.
  15. Choose Next.
  16. Review your settings and confirm to add the data source.
  17. After the data source is added, choose Data sources in the navigation pane, select the newly added data source, and choose Sync now to start data source synchronization with the Amazon Kendra index.

    The sync process can take about 10–15 minutes. You can now search indexed Alfresco content using the search console or a search application. Optionally, you can search with ACL with the following additional steps.
  18. Go to the index page that you created and on the User access control tab, choose Edit settings.
  19. Under Access control settings, select Yes.
  20. For Token type, choose JSON.
  21. Choose Next.
  22. Choose Update.

Wait a few minutes for the index to get updated by the changes. Now let’s see how you can perform intelligent search with Amazon Kendra.

Perform intelligent search with Amazon Kendra

Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify if the last sync was successful.

  1. To start your search, on the Amazon Kendra console, choose Search indexed content in the navigation pane.
    You’re redirected to the Amazon Kendra Search console. Now you can search information from the Alfresco documents you indexed using Amazon Kendra.
  2. For this post, we search for a document stored in Alfresco, AWS.
  3. Expand Test query with an access token and choose Apply token.
  4. For Username, enter the email address associated with your Alfresco account.
  5. Choose Apply.

Now the user can only see the content they have access to. In our example, user test@amazon.com doesn’t have access to any documents on Alfresco, so none are visible.
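
The same token-based filtering applies when you query the index programmatically; a minimal sketch, with a placeholder index ID and token (the token format depends on the user access control settings you configured on the index):

```python
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<your-kendra-index-id>",
    QueryText="AWS",
    UserContext={"Token": "<token-for-test@amazon.com>"},  # token issued per your index settings
)

# If the user has no access to any indexed Alfresco documents, this list is empty.
for item in response["ResultItems"]:
    print(item.get("DocumentTitle", {}).get("Text"), item.get("DocumentURI"))
```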

Limitations

The connector has the following limitations:

  • As of this writing, we only support Alfresco OnPrem. Alfresco PaaS is not supported.
  • The connector doesn't crawl the following entities: calendars, discussions, data lists, links, and system files.
  • During public preview, we only support basic authentication. For support for other forms of authentication, please contact your Amazon representative.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Alfresco, delete that data source.

Conclusion

With the Amazon Kendra Alfresco connector, your organization can securely search Alfresco content using intelligent search powered by Amazon Kendra.

To learn more about the Amazon Kendra Alfresco connector, refer to the Amazon Kendra Developer Guide.

For more information on other Amazon Kendra built-in connectors to popular data sources, refer to Amazon Kendra native connectors.


About the author

Vikas Shah is an Enterprise Solutions Architect at Amazon web services. He is a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His areas of interest are ML, IoT, robotics and storage. In his spare time, Vikas enjoys building robots, hiking, and traveling.
