Amazon AWS – Page 203

Reduce the time taken to deploy your models to Amazon SageMaker for testing

September 27, 2022

by Kirit Thadaka Amazon AWS

Data scientists often train their models locally and look for a proper hosting service to deploy their models. Unfortunately, there’s no one set mechanism or guide to deploying pre-trained models to the cloud. In this post, we look at deploying trained models to Amazon SageMaker hosting to reduce your deployment time.

SageMaker is a fully managed machine learning (ML) service. With SageMaker, you can quickly build and train ML models and directly deploy them into a production-ready hosted environment. Additionally, you don’t need to manage servers. You get an integrated Jupyter notebook environment with easy access to your data sources. You can perform data analysis, train your models, and test them using your own algorithms or use SageMaker-provided ML algorithms that are optimized to run efficiently against large datasets spread across multiple machines. Training and hosting are billed by minutes of usage, with no minimum fees and no upfront commitments.

Solution overview

Data scientists sometimes train models locally using their IDE and either ship those models to the ML engineering team for deployment or just run predictions locally on powerful machines. In this post, we introduce a Python library that simplifies the process of deploying models to SageMaker for hosting on real-time or serverless endpoints.

This Python library gives data scientists a simple interface to quickly get started on SageMaker without needing to know any of the low-level SageMaker functionality.

If you have models trained locally using your preferred IDE and want to benefit from the scale of the cloud, you can use this library to deploy your model to SageMaker. With SageMaker, in addition to all the scaling benefits of a cloud-based ML platform, you have access to purpose-built training tools (distributed training, hyperparameter tuning), experiment management, model management, bias detection, model explainability, and many other capabilities that can help you in any aspect of the ML lifecycle. You can choose from the three most popular frameworks for ML: Scikit-learn, PyTorch, and TensorFlow, and can pick the type of compute you want. Defaults are provided along the way so users of this library can deploy their models without needing to make complex decisions or learn new concepts. In this post, we show you how to get started with this library and optimize deploying your ML models on SageMaker hosting.

The library can be found in the GitHub repository.

The SageMaker Migration Toolkit

The SageMakerMigration class is available through a Python library published to GitHub. Instructions to install this library are provided in the repository; make sure that you follow the README to properly set up your environment. After you install this library, the rest of this post talks about how you can use it.

The SageMakerMigration class consists of high-level abstractions over SageMaker APIs that significantly reduce the steps needed to deploy your model to SageMaker, as illustrated in the following figure. This is intended for experimentation so developers can quickly get started and test SageMaker. It is not intended for production migrations.

For Scikit-learn, PyTorch, and TensorFlow models, this library supports deploying trained models to a SageMaker real-time endpoint or serverless endpoint. To learn more about the inference options in SageMaker, refer to Deploy Models for Inference.

Real-time vs. serverless endpoints

Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling.

SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers.

Depending on your use case, you may want to quickly host your model on SageMaker without actually having an instance always on and incurring costs, in which case a serverless endpoint is a great solution.

Prepare your trained model and inference script

After you’ve identified the model you want to deploy on SageMaker, you must ensure the model is presented to SageMaker in the right format. SageMaker endpoints generally consist of two components: the trained model artifact (.pth, .pkl, and so on) and an inference script. The inference script is not always mandatory, but if not provided, the default handlers for the serving container that you’re using are applied. It’s essential to provide this script if you need to customize your input/output functionality for inference.

The trained model artifact is simply a saved Scikit-learn, PyTorch, or TensorFlow model. For Scikit-learn, this is typically a pickle file, for PyTorch this is a .pt or .pth file, and for TensorFlow this is a folder with assets, .pb files, and other variables.

Generally, you need to able to control how your model processes input and performs inference, and control the output format for your response. With SageMaker, you can provide an inference script to add this customization. Any inference script used by SageMaker must have one or more of the following four handler functions: model_fn, input_fn, predict_fn, and output_fn.

Note that these four functions apply to PyTorch and Scikit-learn containers specifically. TensorFlow has slightly different handlers because it’s integrated with TensorFlow Serving. For an inference script with TensorFlow, you have two model handlers: input_handler and output_handler. Again, these have the same preprocessing and postprocessing purpose that you can work with, but they’re configured slightly differently to integrate with TensorFlow Serving. For PyTorch models, model_fn is a compulsory function to have in the inference script.

model_fn

This is the function that is first called when you invoke your SageMaker endpoint. This is where you write your code to load the model. For example:

def model_fn(model_dir):
    model = Your_Model()
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model

Depending on the framework and type of model, this code may change, but the function must return an initialized model.

input_fn

This is the second function that is called when your endpoint is invoked. This function takes the data sent to the endpoint for inference and parses it into the format required for the model to generate a prediction. For example:

def input_fn(request_body, request_content_type):
    """An input_fn that loads a pickled tensor"""
    if request_content_type == 'application/python-pickle':
        return torch.load(BytesIO(request_body))
    else:
        # Handle other content-types here or raise an Exception
        # if the content type is not supported.
        pass

The request_body contains the data to be used for generating inference from the model and is parsed in this function so that it’s in the required format.

predict_fn

This is the third function that is called when your model is invoked. This function takes the preprocessed input data returned from input_fn and uses the model returned from model_fn to make the prediction. For example:

def predict_fn(input_data, model):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    with torch.no_grad():
        return model(input_data.to(device))

You can optionally add output_fn to parse the output of predict_fn before returning it to the client. The function signature is def output_fn(prediction, content_type).

Move your pre-trained model to SageMaker

After you have your trained model file and inference script, you must put these files in a folder as follows:

#SKLearn Model

model_folder/
    model.pkl
    inference.py
    
# Tensorflow Model
model_folder/
    0000001/
        assets/
        variables/
        keras_metadata.pb
        saved_model.pb
    inference.py
    
# PyTorch Model
model_folder/
    model.pth
    inference.py

After your model and inference script have been prepared and saved in this folder structure, your model is ready for deployment on SageMaker. See the following code:

from sagemaker_migration import frameworks as fwk

if __name__ == "__main__":
    ''' '''
    sk_model = fwk.SKLearnModel(
        version = "0.23-1", 
        model_data = 'model.joblib',
        inference_option = 'real-time',
        inference = 'inference.py',
        instance_type = 'ml.m5.xlarge'
    )
    sk_model.deploy_to_sagemaker()

After deployment of your endpoint, make sure to clean up any resources you won’t utilize via the SageMaker console or through the delete_endpoint Boto3 API call.

Conclusion

The goal of the SageMaker Migration Toolkit project is to make it easy for data scientists to onboard their models onto SageMaker to take advantage of cloud-based inference. The repository will continue to evolve and support more options for migrating workloads to SageMaker. The code is open source and we welcome community contributions through pull requests and issues.

Check out the GitHub repository to explore more on utilizing the SageMaker Migration Toolkit, and feel free to also contribute examples or feature requests to add to the project!

About the authors

Kirit Thadaka is an ML Solutions Architect working in the Amazon SageMaker Service SA team. Prior to joining AWS, Kirit spent time working in early stage AI startups followed by some time in consulting in various roles in AI research, MLOps, and technical leadership.

Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Alexa’s text-to-speech research at Interspeech 2022

September 27, 2022

by Amazon AWS

Highlighted papers focus on transference — of prosody, accent, and speaker identity.Read More

Scaling to trillion-parameter model training on AWS

September 26, 2022

by Amazon AWS

Contiguous parameter management and prefetched activation offloading expand the MiCS tool kit.Read More

Introducing self-service quota management and higher default service quotas for Amazon Textract

September 26, 2022

by Anjan Biswas Amazon AWS

Today, we’re excited to announce self-service quota management support for Amazon Textract via the AWS Service Quotas console, and higher default service quotas in select AWS Regions.

Customers tell us they need quick turnaround times to process their requests for quota increases and visibility into their service quotas so they may continue to scale their Amazon Textract usage. With this launch, we’re improving Amazon Textract support for service quotas by enabling you to self-manage your service quotas via the Service Quotas console. In addition to viewing the default service quotas, you can now view your account’s applied custom quotas for a specific Region, view your historical utilization metrics per applied quota, set up alarms to notify when utilization approaches a threshold, and add tags to your quotas for easier organization. Additionally, we’re launching the Amazon Textract Service Quota Calculator, which will help you quickly estimate service quota requirements for your workload prior to submitting a quota increase request.

In this post, we discuss the updated default service quotas, the new service quota management capabilities, and the service quota calculator for Amazon Textract.

Increased default service quotas for Amazon Textract

Amazon Textract now has higher service quotas for several asynchronous and synchronous APIs in multiple major AWS Regions. The updated default service quotas are available for US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), and Europe (Ireland) Regions. The following table summarizes the before and after default quota numbers for each of these Regions for the respective synchronous and asynchronous APIs. You can refer to Amazon Textract endpoints and quotas to learn more about the current default quotas.

Synchronous Operations	API	Region	Before	After
Transactions per second per account for synchronous operations	AnalyzeDocument	US East (Ohio)	1	10
		Asia Pacific (Mumbai)	1	5
		Europe (Ireland)	1	5
	DetectDocumentText	US East (Ohio)	1	10
		US East (N. Virginia)	10	25
		US West (Oregon)	10	25
		Asia Pacific (Mumbai)	1	5
		Europe (Ireland)	1	5
Asynchronous Operations	API	Region	Before	After
Transactions per second per account for all Start (asynchronous) operations	StartDocumentAnalysis	US East (Ohio)	2	10
		Asia Pacific (Mumbai)	2	5
		Europe (Ireland)	2	5
	StartDocumentTextDetection	US East (Ohio)	1	5
		US East (N. Virginia)	10	15
		US West (Oregon)	10	15
		Asia Pacific (Mumbai)	1	5
		Europe (Ireland)	1	5
Transactions per second per account for all Get (asynchronous) operations	GetDocumentAnalysis	US East (Ohio)	5	10
	GetDocumentTextDetection	US East (Ohio)	5	10
		US East (N. Virginia)	10	25
		US West (Oregon)	10	25

Improved service quota support for Amazon Textract

Starting today, you can manage your Amazon Textract service quotas via the Service Quotas console. Requests may now be processed automatically, speeding up approval times. After a quota request for a specific Region is approved, the new quota is immediately available for scaling your Amazon Textract usage and also visible on the Service Quotas console. You can see the default and applied quota values for your account in a given Region, and view the historical utilization metrics via an integrated Amazon CloudWatch graph. This enables you to make informed decisions about whether a quota increase is required to scale your workload. You can also use CloudWatch alarms to notify whenever a specified quota reaches a predefined threshold, which can help investigate issues with your applications or monitor spikey workloads. You can also add tags to the quotas, which allows better administration and monitoring.

The following sections discusses the features that are now available via the Service Quotas console for Amazon Textract.

Default and applied quotas

You can now have visibility into the AWS default quota value and applied quota value of a specific quota for Amazon Textract on the Service Quotas console. The default quota value is the default value of the quota in that specific Region, and the applied quota value is the currently applied value for that quota for the account in that Region.

Monitoring via CloudWatch graphs

The Service Quotas console also displays a utilization against the total applied quota value. You can also view the weekly, daily, and hourly trend in utilization of the applied quota through an integrated CloudWatch graph, right from the Service Quotas console, for a given quota. You can add this graph to a custom CloudWatch dashboard for better monitoring and reporting of service usage and overall utilization.

We have also added the capability to set up CloudWatch alarms to notify you automatically whenever a specified quota reaches a certain configurable threshold. This helps you monitor the usage of Amazon Textract from your applications, analyze spikey workloads, make informed decisions about the overall utilization, control costs, and make improvements to the application’s architecture.

Quota tagging

With quota tagging, you can now add tags to applied quotas to simplify administration. Tags help you identify and organize AWS resources. With quota tags, you can manage the applied service quotas for Amazon Textract along with other AWS service quotas, as part of your administration and governance practices. You can better manage and monitor quotas and quota utilization for different environments based on tags. For example, you can use production or development tags to logically separate and monitor dev environment and production environment quotas and quota utilization for accounts under AWS Organizations and unified reporting.

Amazon Textract Service Quota Calculator

We’re introducing a new quota calculator on the Amazon Textract console. The quota calculator helps forecast service quota requirements based on answers to questions about your workload and usage of Amazon Textract. With calculations based on your usage patterns, such as number of documents and number of pages per document, it provides actionable recommendations in the form of a required quota value for the workload.

As shown in the following screenshot, the quota calculator is now accessible directly from the Amazon Textract console. You can also navigate to the Service Quotas console directly from the calculator, where you can manage the service quotas based on the calculated recommendations.

Quota calculator for synchronous operations

To view the current quota values and recommended quota values for synchronous operations, you start by selecting Synchronous under Processing type. For example, if you’re interested in calculating the desired quota values for your workload that uses the DetectDocumentText API, you select the Synchronous processing type, and then choose Detect Document Text on the Use case type drop-down menu.

After you specify your desired options, the quota calculator prompts for additional inputs, which include the maximum number of documents you expect to process via the API per day or per hour. The corresponding numbers of documents to be processed shown under View calculation is automatically calculated based on the input. Because synchronous processing allows text detection and analysis of single-page documents, the number of pages per document defaults to 1. For multi-page documents, we recommend using asynchronous processing.

The output of this calculation is a current quota value applicable for that account in the current Region, and the recommended quota value, based on the quota type selected and the provided number of documents.

You can copy the recommended quota value within the calculator and use the Quota type (in this case, DetectDocumentText) deep link to navigate to the specific quota on the Service Quotas console to create a quota increase request.

Quota calculator for asynchronous operations

The way to view current quota values and recommended quota values for asynchronous operations is similar to that of the synchronous operations. Specify the use case type for your asynchronous operation usage, and answer a few questions relevant to your workload to view the current quotas and recommended quotas for all the asynchronous operations relevant to the use case.

For example, if you’re running asynchronous jobs using the StartDocumentTextDetection API and consecutively using the GetDocumentTextDetection API to get the results of the job in your workload, choose the Document Text Detection option as your use case. Because these two APIs are always used in conjunction to each other, the calculator provides recommendations for both the APIs. For asynchronous operations, there are limits on the total number of concurrent jobs that can be run per account in a given Region. Therefore, the calculator also calculates the recommended total number of concurrent asynchronous jobs recommended for your workload.

In addition to the processing type and use case type, you need to provide specific values relevant to your workload:

The maximum number of documents you expect to process
A processing time frame value in hours, which is the approximate length of time over which you expect to process the documents
The maximum number of pages per document, because asynchronous operations allow processing multi-page documents

Quota calculation for asynchronous operations generates recommended quota values for all the asynchronous APIs relevant to the selected use case. In our example, the quota values for the StartDocumentTextDetection API, GetDocumentTextDetection API, and number of concurrent text detection jobs are generated by the calculator, as shown in the following screenshot. You can then use the required quota value to request quota increases via the Service Quotas console using the corresponding deep links under Quota type.

It’s worth noting that the all the quota-related information within the calculator is shown for the current AWS Region for the AWS Management Console. To view the quota information for a different Region, you can change the Region manually from the top navigation bar of the console. Recommendations generated by the calculator are based on the current applied quota for that account for the current Region, the selected processing type (asynchronous and synchronous), and other information relevant to your workload. You can use these recommendations to submit quota increase requests via the Service Quotas console. Although most requests are processed automatically, some requests may need additional manual review prior to being approved.

Conclusion

In this post, we announced the updated default service quotas in select AWS Regions and the self-service quota management capabilities of Amazon Textract. We also announced the availability of a new quota calculator, available on the Amazon Textract console. You can start taking advantage of the new default service quotas, and use the Amazon Textract quota calculator to generate recommended quota values to quickly scale your workload. With the improved Service Quotas console for Amazon Textract, you can request quota increases, monitor quota utilization and service usage, and set up alarms. With the features announced in this post, you can now easily monitor your quota utilization, manage costs, and follow best practices to scale your Amazon Textract usage.

To learn more about the Amazon Textract service quota calculator and extended features for quota management, visit Quotas in Amazon Textract.

About the authors

Anjan Biswas is a Senior AI Services Solutions Architect with focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand, and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations and is actively helping customers get started and scale on AWS AI services.

Shashwat Sapre is a Senior Technical Product Manager with the Amazon Textract team. He is focused on building machine learning-based services for AWS customers. In his spare time, he likes reading about new technologies, traveling and exploring different cuisines.

Hackathon promotes language diversity in speech technology

September 26, 2022

by Amazon AWS

Two-day hackathon will provide a hands-on approach to building speech-technology applications for underrepresented languages; registration deadline is Sept. 30.Read More

Large-scale revenue forecasting at Bosch with Amazon Forecast and Amazon SageMaker custom models

September 23, 2022

by Goktug Cinar Amazon AWS

This post is co-written by Goktug Cinar, Michael Binder, and Adrian Horvath from Bosch Center for Artificial Intelligence (BCAI).

Revenue forecasting is a challenging yet crucial task for strategic business decisions and fiscal planning in most organizations. Often, revenue forecasting is manually performed by financial analysts and is both time consuming and subjective. Such manual efforts are especially challenging for large-scale, multinational business organizations that require revenue forecasts across a wide range of product groups and geographical areas at multiple levels of granularity. This requires not only accuracy but also hierarchical coherence of the forecasts.

Bosch is a multinational corporation with entities operating in multiple sectors, including automotive, industrial solutions, and consumer goods. Given the impact of accurate and coherent revenue forecasting on healthy business operations, the Bosch Center for Artificial Intelligence (BCAI) has been heavily investing in the use of machine learning (ML) to improve the efficiency and accuracy of financial planning processes. The goal is to alleviate the manual processes by providing reasonable baseline revenue forecasts via ML, with only occasional adjustments needed by the financial analysts using their industry and domain knowledge.

To achieve this goal, BCAI has developed an internal forecasting framework capable of providing large-scale hierarchical forecasts via customized ensembles of a wide range of base models. A meta-learner selects the best-performing models based on features extracted from each time series. The forecasts from the selected models are then averaged to obtain the aggregated forecast. The architectural design is modularized and extensible through the implementation of a REST-style interface, which allows continuous performance improvement via the inclusion of additional models.

BCAI partnered with the Amazon ML Solutions Lab (MLSL) to incorporate the latest advances in deep neural network (DNN)-based models for revenue forecasting. Recent advances in neural forecasters have demonstrated state-of-the-art performance for many practical forecasting problems. Compared to traditional forecasting models, many neural forecasters can incorporate additional covariates or metadata of the time series. We include CNN-QR and DeepAR+, two off-the-shelf models in Amazon Forecast, as well as a custom Transformer model trained using Amazon SageMaker. The three models cover a representative set of the encoder backbones often used in neural forecasters: convolutional neural network (CNN), sequential recurrent neural network (RNN), and transformer-based encoders.

One of the key challenges faced by the BCAI-MLSL partnership was to provide robust and reasonable forecasts under the impact of COVID-19, an unprecedented global event causing great volatility on global corporate financial results. Because neural forecasters are trained on historical data, the forecasts generated based on out-of-distribution data from the more volatile periods could be inaccurate and unreliable. Therefore, we proposed the addition of a masked attention mechanism in the Transformer architecture to address this issue.

The neural forecasters can be bundled as a single ensemble model, or incorporated individually into Bosch’s model universe, and accessed easily via REST API endpoints. We propose an approach to ensemble the neural forecasters through backtest results, which provides competitive and robust performance over time. Additionally, we investigated and evaluated a number of classical hierarchical reconciliation techniques to ensure that forecasts aggregate coherently across product groups, geographies, and business organizations.

In this post, we demonstrate the following:

How to apply Forecast and SageMaker custom model training for hierarchical, large-scale time-series forecasting problems
How to ensemble custom models with off-the-shelf models from Forecast
How to reduce the impact of disruptive events such as COVID-19 on forecasting problems
How to build an end-to-end forecasting workflow on AWS

Challenges

We addressed two challenges: creating hierarchical, large-scale revenue forecasting, and the impact of the COVID-19 pandemic on long-term forecasting.

Hierarchical, large-scale revenue forecasting

Financial analysts are tasked with forecasting key financial figures, including revenue, operational costs, and R&D expenditures. These metrics provide business planning insights at different levels of aggregation and enable data-driven decision-making. Any automated forecasting solution needs to provide forecasts at any arbitrary level of business-line aggregation. At Bosch, the aggregations can be imagined as grouped time series as a more general form of hierarchical structure. The following figure shows a simplified example with a two-level structure, which mimics the hierarchical revenue forecasting structure at Bosch. The total revenue is split into multiple levels of aggregations based on product and region.

The total number of time series that need to be forecasted at Bosch is at the scale of millions. Notice that the top-level time series can be split by either products or regions, creating multiple paths to the bottom level forecasts. The revenue needs to be forecasted at every node in the hierarchy with a forecasting horizon of 12 months into the future. Monthly historical data is available.

The hierarchical structure can be represented using the following form with the notation of a summing matrix S (Hyndman and Athanasopoulos):

In this equation, Y equals the following:

Here, b represents the bottom level time-series at time t.

Impacts of the COVID-19 pandemic

The COVID-19 pandemic brought significant challenges for forecasting due to its disruptive and unprecedented effects on almost all aspects of work and social life. For long-term revenue forecasting, the disruption also brought unexpected downstream impacts. To illustrate this problem, the following figure shows a sample time series where the product revenue experienced a significant drop at the start of the pandemic and gradually recovered afterwards. A typical neural forecasting model will take revenue data including the out-of-distribution (OOD) COVID period as the historical context input, as well as the ground truth for model training. As a result, the forecasts produced are no longer reliable.

Modeling approaches

In this section, we discuss our various modeling approaches.

Amazon Forecast

Forecast is a fully-managed AI/ML service from AWS that provides preconfigured, state-of-the-art time series forecasting models. It combines these offerings with its internal capabilities for automated hyperparameter optimization, ensemble modeling (for the models provided by Forecast), and probabilistic forecast generation. This allows you to easily ingest custom datasets, preprocess data, train forecasting models, and generate robust forecasts. The service’s modular design further enables us to easily query and combine predictions from additional custom models developed in parallel.

We incorporate two neural forecasters from Forecast: CNN-QR and DeepAR+. Both are supervised deep learning methods that train a global model for the entire time series dataset. Both CNNQR and DeepAR+ models can take in static metadata information about each time series, which are the corresponding product, region, and business organization in our case. They also automatically add temporal features such as month of the year as part of the input to the model.

Transformer with attention masks for COVID

The Transformer architecture (Vaswani et al.), originally designed for natural language processing (NLP), recently emerged as a popular architectural choice for time series forecasting. Here, we used the Transformer architecture described in Zhou et al. without probabilistic log sparse attention. The model uses a typical architecture design by combining an encoder and a decoder. For revenue forecasting, we configure the decoder to directly output the forecast of the 12-month horizon instead of generating the forecast month by month in an autoregressive manner. Based on the frequency of the time series, additional time related features such as month of the year are added as the input variable. Additional categorical variables describing the meta information (product, region, business organization) are fed into the network via a trainable embedding layer.

The following diagram illustrates the Transformer architecture and the attention masking mechanism. Attention masking is applied throughout all the encoder and decoder layers, as highlighted in orange, to prevent OOD data from affecting the forecasts.

We mitigate the impact of OOD context windows by adding attention masks. The model is trained to apply very little attention to the COVID period that contains outliers via masking, and performs forecasting with masked information. The attention mask is applied throughout every layer of the decoder and encoder architecture. The masked window can be either specified manually or through an outlier detection algorithm. Additionally, when using a time window containing outliers as the training labels, the losses are not back-propagated. This attention masking-based method can be applied to handle disruptions and OOD cases brought by other rare events and improve the robustness of the forecasts.

Model ensemble

Model ensemble often outperforms single models for forecasting—it improves model generalizability and is better at handling time series data with varying characteristics in periodicity and intermittency. We incorporate a series of model ensemble strategies to improve model performance and robustness of forecasts. One common form of deep learning model ensemble is to aggregate results from model runs with different random weight initializations, or from different training epochs. We utilize this strategy to obtain forecasts for the Transformer model.

To further build an ensemble on top of different model architectures, such as Transformer, CNNQR, and DeepAR+, we use a pan-model ensemble strategy that selects the top-k best performing models for each time series based on the backtest results and obtain their averages. Because backtest results can be exported directly from trained Forecast models, this strategy enables us to take advantage of turnkey services like Forecast with improvements gained from custom models such as Transformer. Such an end-to-end model ensemble approach doesn’t require training a meta-learner or calculating time series features for model selection.

Hierarchical reconciliation

The framework is adaptive to incorporate a wide range of techniques as postprocessing steps for hierarchical forecast reconciliation, including bottom-up (BU), top-down reconciliation with forecasting proportions (TDFP), ordinary least square (OLS), and weighted least square (WLS). All the experimental results in this post are reported using top-down reconciliation with forecasting proportions.

Architecture overview

We developed an automated end-to-end workflow on AWS to generate revenue forecasts utilizing services including Forecast, SageMaker, Amazon Simple Storage Service (Amazon S3), AWS Lambda, AWS Step Functions, and AWS Cloud Development Kit (AWS CDK). The deployed solution provides individual time series forecasts through a REST API using Amazon API Gateway, by returning the results in predefined JSON format.

The following diagram illustrates the end-to-end forecasting workflow.

Key design considerations for the architecture are versatility, performance, and user-friendliness. The system should be sufficiently versatile to incorporate a diverse set of algorithms during development and deployment, with minimal required changes, and can be easily extended when adding new algorithms in the future. The system should also add minimum overhead and support parallelized training for both Forecast and SageMaker to reduce training time and obtain the latest forecast faster. Finally, the system should be simple to use for experimentation purposes.

The end-to-end workflow sequentially runs through the following modules:

A preprocessing module for data reformatting and transformation
A model training module incorporating both the Forecast model and custom model on SageMaker (both are running in parallel)
A postprocessing module supporting model ensemble, hierarchical reconciliation, metrics, and report generation

Step Functions organizes and orchestrates the workflow from end to end as a state machine. The state machine run is configured with a JSON file containing all the necessary information, including the location of the historical revenue CSV files in Amazon S3, the forecast start time, and model hyperparameter settings to run the end-to-end workflow. Asynchronous calls are created to parallelize model training in the state machine using Lambda functions. All the historical data, config files, forecast results, as well as intermediate results such as backtesting results are stored in Amazon S3. The REST API is built on top of Amazon S3 to provide a queryable interface for querying forecasting results. The system can be extended to incorporate new forecast models and supporting functions such as generating forecast visualization reports.

Evaluation

In this section, we detail the experiment setup. Key components include the dataset, evaluation metrics, backtest windows, and model setup and training.

Dataset

To protect the financial privacy of Bosch while using a meaningful dataset, we used a synthetic dataset that has similar statistical characteristics to a real-world revenue dataset from one business unit at Bosch. The dataset contains 1,216 time series in total with revenue recorded in a monthly frequency, covering January 2016 to April 2022. The dataset is delivered with 877 time series at the most granular level (bottom time series), with a corresponding grouped time series structure represented as a summing matrix S. Each time series is associated with three static categorical attributes, which corresponds to product category, region, and organizational unit in the real dataset (anonymized in the synthetic data).

Evaluation metrics

We use median-Mean Arctangent Absolute Percentage Error (median-MAAPE) and weighted-MAAPE to evaluate the model performance and perform comparative analysis, which are the standard metrics used at Bosch. MAAPE addresses the shortcomings of the Mean Absolute Percentage Error (MAPE) metric commonly used in business context. Median-MAAPE gives an overview of the model performance by computing the median of the MAAPEs calculated individually on each time series. Weighted-MAAPE reports a weighted combination of the individual MAAPEs. The weights are the proportion of the revenue for each time series compared to the aggregated revenue of the entire dataset. Weighted-MAAPE better reflects downstream business impacts of the forecasting accuracy. Both metrics are reported on the entire dataset of 1,216 time series.

Backtest windows

We use rolling 12-month backtest windows to compare model performance. The following figure illustrates the backtest windows used in the experiments and highlights the corresponding data used for training and hyperparameter optimization (HPO). For backtest windows after COVID-19 starts, the result is affected by OOD inputs from April to May 2020, based on what we observed from the revenue time series.

Model setup and training

For Transformer training, we used quantile loss and scaled each time series using its historical mean value before feeding it into Transformer and computing the training loss. The final forecasts are rescaled back to calculate the accuracy metrics, using the MeanScaler implemented in GluonTS. We use a context window with monthly revenue data from the past 18 months, selected via HPO in the backtest window from July 2018 to June 2019. Additional metadata about each time series in the form of static categorical variables are fed into the model via an embedding layer before feeding it to the transformer layers. We train the Transformer with five different random weight initializations and average the forecast results from the last three epochs for each run, in total averaging 15 models. The five model training runs can be parallelized to reduce training time. For the masked Transformer, we indicate the months from April to May 2020 as outliers.

For all Forecast model training, we enabled automatic HPO, which can select the model and training parameters based on a user-specified backtest period, which is set to the last 12 months in the data window used for training and HPO.

Experiment results

We train masked and unmasked Transformers using the same set of hyperparameters, and compared their performance for backtest windows immediately after COVID-19 shock. In the masked Transformer, the two masked months are April and May 2020. The following table shows the results from a series of backtest periods with 12-month forecasting windows starting from June 2020. We can observe that the masked Transformer consistently outperforms the unmasked version.

We further performed evaluation on the model ensemble strategy based on backtest results. In particular, we compare the two cases when only the top performing model is selected vs. when the top two performing models are selected, and model averaging is performed by computing the mean value of the forecasts. We compare the performance of the base models and the ensemble models in the following figures. Notice that none of the neural forecasters consistently out-perform others for the rolling backtest windows.

The following table shows that, on average, ensemble modeling of the top two models gives the best performance. CNNQR provides the second-best result.

Conclusion

This post demonstrated how to build an end-to-end ML solution for large-scale forecasting problems combining Forecast and a custom model trained on SageMaker. Depending on your business needs and ML knowledge, you can use a fully managed service such as Forecast to offload the build, train, and deployment process of a forecasting model; build your custom model with specific tuning mechanisms with SageMaker; or perform model ensembling by combining the two services.

If you would like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.

References

Hyndman RJ, Athanasopoulos G. Forecasting: principles and practice. OTexts; 2018 May 8.

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in neural information processing systems. 2017;30.

Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W. Informer: Beyond efficient transformer for long sequence time-series forecasting. InProceedings of AAAI 2021 Feb 2.

About the Authors

Goktug Cinar is a lead ML scientist and the technical lead of the ML and stats-based forecasting at Robert Bosch LLC and Bosch Center for Artificial Intelligence. He leads the research of the forecasting models, hierarchical consolidation, and model combination techniques as well as the software development team which scales these models and serves them as part of the internal end-to-end financial forecasting software.

Michael Binder is a product owner at Bosch Global Services, where he coordinates the development, deployment and implementation of the company wide predictive analytics application for the large-scale automated data driven forecasting of financial key figures.

Adrian Horvath is a Software Developer at Bosch Center for Artificial Intelligence, where he develops and maintains systems to create predictions based on various forecasting models.

Panpan Xu is a Senior Applied Scientist and Manager with the Amazon ML Solutions Lab at AWS. She is working on research and development of Machine Learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interest includes model interpretability, causal analysis, human-in-the-loop AI and interactive data visualization.

Jasleen Grewal is an Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.

Selvan Senthivel is a Senior ML Engineer with the Amazon ML Solutions Lab at AWS, focusing on helping customers on machine learning, deep learning problems, and end-to-end ML solutions. He was a founding engineering lead of Amazon Comprehend Medical and contributed to the design and architecture of multiple AWS AI services.

Ruilin Zhang is an SDE with the Amazon ML Solutions Lab at AWS. He helps customers adopt AWS AI services by building solutions to address common business problems.

Shane Rai is a Sr. ML Strategist with the Amazon ML Solutions Lab at AWS. He works with customers across a diverse spectrum of industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services.

Lin Lee Cheong is an Applied Science Manager with the Amazon ML Solutions Lab team at AWS. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.

Amazon’s spoken-language-understanding research at Interspeech 2022

September 23, 2022

by Amazon AWS

Methods for learning from noisy data, using phonetic embeddings to improve entity resolution, and quantization-aware training are a few of the highlights.Read More

Detect population variance of endangered species using Amazon Rekognition

September 22, 2022

by Jyothi Goudar Amazon AWS

Our planet faces a global extinction crisis. UN Report shows a staggering number of more than a million species feared to be on the path of extinction. The most common reasons for extinction include loss of habitat, poaching, and invasive species. Several wildlife conservation foundations, research scientists, volunteers, and anti-poaching rangers have been working tirelessly to address this crisis. Having accurate and regular information about endangered animals in the wild will improve wildlife conservationists’ ability to study and conserve endangered species. Wildlife scientists and field staff use cameras equipped with infrared triggers, called camera traps, and place them in the most effective locations in forests to capture images of wildlife. These images are then manually reviewed, which is a very time-consuming process.

In this post, we demonstrate a solution using Amazon Rekognition Custom Labels along with motion sensor camera traps to automate this process to recognize engendered species and study them. Rekognition Custom Labels is a fully managed computer vision service that allows developers to build custom models to classify and identify objects in images that are specific and unique to their use case. We detail how to recognize endangered animal species from images collected from camera traps, draw insights about their population count, and detect humans around them. This information will be helpful to conservationists, who can make proactive decisions to save them.

Solution overview

The following diagram illustrates the architecture of the solution.

This solution uses the following AI services, serverless technologies, and managed services to implement a scalable and cost-effective architecture:

Amazon Athena – A serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
Amazon CloudWatch – A monitoring and observability service that collects monitoring and operational data in the form of logs, metrics, and events
Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale
AWS Lambda – A serverless compute service that lets you run code in response to triggers such as changes in data, shifts in system state, or user actions
Amazon QuickSight – A serverless, machine learning (ML)-powered business intelligence service that provides insights, interactive dashboards, and rich analytics
Amazon Rekognition – Uses ML to identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content
Amazon Rekognition Custom Labels – Uses AutoML to help train custom models to identify the objects and scenes in images that are specific to your business needs
Amazon Simple Queue Service (Amazon SQS) – A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications
Amazon Simple Storage Service (Amazon S3) – Serves as an object store for documents and allows for central management with fine-tuned access controls.

The high-level steps in this solution are as follows:

Train and build a custom model using Rekognition Custom Labels to recognize endangered species in the area. For this post, we train on images of rhinoceros.
Images that are captured through the motion sensor camera traps are uploaded to an S3 bucket, which publishes an event for every uploaded image.
A Lambda function is triggered for every event published, which retrieves the image from the S3 bucket and passes it to the custom model to detect the endangered animal.
The Lambda function uses the Amazon Rekognition API to identify the animals in the image.
If the image has any endangered species of rhinoceros, the function updates the DynamoDB database with the count of the animal, date of image captured, and other useful metadata that can be extracted from the image EXIF header.
QuickSight is used to visualize the animal count and location data collected in the DynamoDB database to understand the variance of the animal population over time. By looking at the dashboards regularly, conservation groups can identify patterns and isolate probable causes like diseases, climate, or poaching that could be causing this variance and proactively take steps to address the issue.

Prerequisites

A good training set is required to build an effective model using Rekognition Custom Labels. We have used the images from AWS Marketplace (Animals & Wildlife Data Set from Shutterstock) and Kaggle to build the model.

Implement the solution

Our workflow includes the following steps:

Train a custom model to classify the endangered species (rhino in our example) using the AutoML capability of Rekognition Custom Labels.

You can also perform these steps from the Rekognition Custom Labels console. For instructions, refer to Creating a project, Creating training and test datasets, and Training an Amazon Rekognition Custom Labels model.

In this example, we use the dataset from Kaggle. The following table summarizes the dataset contents.

Label	Training Set	Test Set
Lion	625	156
Rhino	608	152
African_Elephant	368	92

Upload the pictures captured from the camera traps to a designated S3 bucket.
Define the event notifications in the Permissions section of the S3 bucket to send a notification to a defined SQS queue when an object is added to the bucket.

The upload action triggers an event that is queued in Amazon SQS using the Amazon S3 event notification.

Add the appropriate permissions via the access policy of the SQS queue to allow the S3 bucket to send the notification to the queue.

Configure a Lambda trigger for the SQS queue so the Lambda function is invoked when a new message is received.

Modify the access policy to allow the Lambda function to access the SQS queue.

The Lambda function should now have the right permissions to access the SQS queue.

Set up the environment variables so they can be accessed in the code.

Lambda function code

The Lambda function performs the following tasks on receiving a notification from the SNS queue:

Make an API call to Amazon Rekognition to detect labels from the custom model that identify the endangered species:

exports.handler = async (event) => {
const id = AWS.util.uuid.v4();
const bucket = event.Records[0].s3.bucket.name;
const photo = decodeURIComponent(event.Records[0].s3.object.key.replace(/+/g, ' '));
const client = new AWS.Rekognition({ region: REGION });
const paramsCustomLabel = {
Image: {
S3Object: {
Bucket: bucket,
Name: photo
},
},
ProjectVersionArn: REK_CUSTOMMODEL,
MinConfidence: MIN_CONFIDENCE
}
let response = await client.detectCustomLabels(paramsCustomLabel).promise();
console.log("Rekognition customLabels response = ",response);

Fetch the EXIF tags from the image to get the date when the picture was taken and other relevant EXIF data. The following code uses the dependencies (package – version) exif-reader – ^1.0.3, sharp – ^0.30.7:

const getExifMetaData = async (bucket,key)=>{
return new Promise((resolve) => {
const s3 = new AWS.S3({ region: REGION });
const param = {
Bucket: bucket,
Key : key
};

s3.getObject(param, (error, data) => {
if (error) {
console.log("Error getting S3 file",error);
resolve({status:false,errorText: error.message});
} else {
sharp(data.Body)
.metadata()
.then(({ exif }) => {
const exifProperties = exifReader(exif);
resolve({status:true,exifProp: exifProperties});
}).catch(err => {console.log("Error Processing Exif ");resolve({status:false});})
}
});
});
}

var gpsData = "";
var createDate = "";
const imageS3 = await getExifMetaData(bucket, photo);
if(imageS3.status){
gpsData = imageS3.exifProp.gps;
createDate = imageS3.exifProp.image.CreateDate;
}else{
createDate = event.Records[0].eventTime;
console.log("No exif found in image, setting createDate as the date of event", createDate);
}

The solution outlined here is asynchronous; the images are captured by the camera traps and then at a later time uploaded to an S3 bucket for processing. If the camera trap images are uploaded more frequently, you can extend the solution to detect humans in the monitored area and send notifications to concerned activists to indicate possible poaching in the vicinity of these endangered animals. This is implemented through the Lambda function that calls the Amazon Rekognition API to detect labels for the presence of a human. If a human is detected, an error message is logged to CloudWatch Logs. A filtered metric on the error log triggers a CloudWatch alarm that sends an email to the conservation activists, who can then take further action.

Expand the solution with the following code:

const paramHumanLabel = {
Image: {
S3Object: {
Bucket: bucket,
Name: photo
},
},
MinConfidence: MIN_CONFIDENCE
}

let humanLabel = await client.detectLabels(paramHumanLabel).promise();
let humanFound = humanLabel.Labels.filter(obj => obj.Name === HUMAN);
var humanDetected = false;
if(humanFound.length > 0){
console.error("Human Face Detected");
humanDetected = true;
}

If any endangered species is detected, the Lambda function updates DynamoDB with the count, date and other optional metadata that is obtained from the image EXIF tags:

let dbresponse = await dynamo.putItem({
Item: {
id: { S: id },
type: { S: response.CustomLabels[0].Name },
image: {S : photo},
createDate: {S: createDate.toString()},
confidence: {S: response.CustomLabels[0].Confidence.toString()},
gps: {S: gpsData.toString()},
humanDetected: {BOOL: humanDetected}
},

TableName: ANIMAL_TABLENAME,
}).promise();

Query and visualize the data

You can now use Athena and QuickSight to visualize the data.

Set the DynamoDB table as the data source for Athena.

Add the data source details.

The next important step is to define a Lambda function that connects to the data source.

Chose Create Lambda function.

Enter names for AthenaCatalogName and SpillBucket; the rest can be default settings.
Deploy the connector function.

After all the images are processed, you can use QuickSight to visualize the data for the population variance over time from Athena.

On the Athena console, choose a data source and enter the details.
Choose Create Lambda function to provide a connector to DynamoDB.

On the QuickSight dashboard, choose New Analysis and New Dataset.
Choose Athena as the data source.

Enter the catalog, database, and table to connect to and choose Select.

Complete dataset creation.

The following chart shows the number of endangered species captured on a given day.

GPS data is presented as part of the EXIF tags of a captured image. Due to the sensitivity of the location of these endangered animals, our dataset didn’t have the GPS location. However, we created a geospatial chart using simulated data to show how you can visualize locations when GPS data is available.

Clean up

To avoid incurring unexpected costs, be sure to turn off the AWS services you used as part of this demonstration—the S3 buckets, DynamoDB table, QuickSight, Athena, and the trained Rekognition Custom Labels model. You should delete these resources directly via their respective service consoles if you no longer need them. Refer to Deleting an Amazon Rekognition Custom Labels model for more information about deleting the model.

Conclusion

In this post, we presented an automated system that identifies endangered species, records their population count, and provides insights about variance in population over time. You can also extend the solution to alert the authorities when humans (possible poachers) are in the vicinity of these endangered species. With the AI/ML capabilities of Amazon Rekognition, we can support the efforts of conservation groups to protect endangered species and their ecosystems.

For more information about Rekognition Custom Labels, refer to Getting started with Amazon Rekognition Custom Labels and Moderating content. If you’re new to Rekognition Custom Labels, you can use our Free Tier, which lasts 3 months and includes 10 free training hours per month and 4 free inference hours per month. The Amazon Rekognition Free Tier includes processing 5,000 images per month for 12 months.

About the Authors

Jyothi Goudar is Partner Solutions Architect Manager at AWS. She works closely with global system integrator partner to enable and support customers moving their workloads to AWS.

Jay Rao is a Principal Solutions Architect at AWS. He enjoys providing technical and strategic guidance to customers and helping them design and implement solutions on AWS.

How Amazon Search reduced ML inference costs by 85% with AWS Inferentia

September 22, 2022

by Joao Moura Amazon AWS

Amazon’s product search engine indexes billions of products, serves hundreds of millions of customers worldwide, and is one of the most heavily used services in the world. The Amazon Search team develops machine learning (ML) technology that powers the Amazon.com search engine and helps customers search effortlessly. To deliver a great customer experience and operate at the massive scale required by the Amazon.com search engine, this team is always looking for ways to build more cost-effective systems with real-time latency and throughput requirements. The team constantly explores hardware and compilers optimized for deep learning to accelerate model training and inference, while reducing operational costs across the board.

In this post, we describe how Amazon Search uses AWS Inferentia, a high-performance accelerator purpose built by AWS to accelerate deep learning inference workloads. The team runs low-latency ML inference with Transformer-based NLP models on AWS Inferentia-based Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, and saves up to 85% in infrastructure costs while maintaining strong throughput and latency performance.

Deep learning for duplicate and query intent prediction

Searching the Amazon Marketplace is a multi-task, multi-modal problem, dealing with several inputs such as ASINs (Amazon Standard Identification Number, a 10-digit alphanumeric number that uniquely identifies products), product images, textual descriptions, and queries. To create a tailored user experience, predictions from many models are used for different aspects of search. This is a challenge because the search system has thousands of models with tens of thousands of transactions per second (TPS) at peak load. We focus on two components of that experience:

Customer-perceived duplicate predictions – To show the most relevant list of products that match a user’s query, it’s important to identify products that customers have a hard time differentiating between
Query intent prediction – To adapt the search page and product layout to better suit what the customer is looking for, it’s important to predict the intent and type of the user’s query (for example, a media-related query, help query, and other query types)

Both of these predictions are made using Transformer model architectures, namely BERT-based models. In fact, both share the same BERT-based model as a basis, and each one stacks a classification/regression head on top of this backbone.

Duplicate prediction takes in various textual features for a pair of evaluated products as inputs (such as product type, title, description, and so on) and is computed periodically for large datasets. This model is trained end to end in a multi-task fashion. Amazon SageMaker Processing jobs are used to run these batch workloads periodically to automate their launch and only pay for the processing time that is used. For this batch workload use case, the requirement for inference throughput was 8,800 total TPS.

Intent prediction takes the user’s textual query as input and is needed in real time to dynamically serve everyday traffic and enhance the user experience on the Amazon Marketplace. The model is trained on a multi-class classification objective. This model is then deployed on Amazon Elastic Container Service (Amazon ECS), which enables quick auto scaling and easy deployment definition and management. Because this is a real-time use case, it required the P99 latency to be under 10 milliseconds to ensure a delightful user experience.

AWS Inferentia and the AWS Neuron SDK

EC2 Inf1 instances are powered by AWS Inferentia, the first ML accelerator purpose built by AWS to accelerate deep learning inference workloads. Inf1 instances deliver up to 2.3 times higher throughput and up to 70% lower cost per inference than comparable GPU-based EC2 instances. You can keep training your models using your framework of choice (PyTorch, TensorFlow, MXNet), and then easily deploy them on AWS Inferentia to benefit from the built-in performance optimizations. You can deploy a wide range of model types using Inf1 instances, from image recognition, object detection, natural language processing (NLP), and modern recommender models.

AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance of the EC2 Inf1 instances. Neuron is natively integrated with popular ML frameworks such as TensorFlow and PyTorch. Therefore, you can deploy deep learning models on AWS Inferentia with the same familiar APIs provided by your framework of choice, and benefit from the boost in performance and lowest cost-per-inference in the cloud.

Since its launch, the Neuron SDK has continued to increase the breadth of models it supports while continuing to improve performance and reduce inference costs. This includes NLP models (BERTs), image classification models (ResNet, VGG), and object detection models (OpenPose and SSD).

Deploy on Inf1 instances for low latency, high throughput, and cost savings

The Amazon Search team wanted to save costs while meeting their high throughput requirement on duplication prediction, and the low latency requirement on query intent prediction. They chose to deploy on AWS Inferentia-based Inf1 instances and not only met the high performance requirements, but also saved up to 85% on inference costs.

Customer-perceived duplicate predictions

Prior to the usage of Inf1, a dedicated Amazon EMR cluster was running using CPU-based instances. Without relying on hardware acceleration, a large number of instances were necessary to meet the high throughput requirement of 8,800 total transactions per second. The team switched to inf1.6xlarge instances, each with 4 AWS Inferentia accelerators, and 16 NeuronCores (4 cores per AWS Inferentia chip). They traced the Transformer-based model for a single NeuronCore and loaded one mode per NeuronCore to maximize throughput. By taking advantage of the 16 available NeuronCores, they decreased inference costs by 85% (based on the current public Amazon EC2 on-demand pricing).

Query intent prediction

Given the P99 latency requirement of 10 milliseconds or less, the team loaded the model to every available NeuronCore on inf1.6xlarge instances. You can easily do this with PyTorch Neuron using the torch.neuron.DataParallel API. With the Inf1 deployment, the model latency was 3 milliseconds, end-to-end latency was approximately 10 milliseconds, and maximum throughput at peak load reached 16,000 TPS.

Get started with sample compilation and deployment code

The following is some sample code to help you get started on Inf1 instances and realize the performance and cost benefits like the Amazon Search team. We show how to compile and perform inference with a PyTorch model, using PyTorch Neuron.

First, the model is compiled with torch.neuron.trace():

m = torch.jit.load(f="./cpu_model.pt", map_location=torch.device('cpu'))
m.eval()
model_neuron = torch.neuron.trace(
    m,
    inputs,
    compiler_workdir="work_" + str(cores) + "_" + str(batch_size),
    compiler_args=[
        '--fp32-cast=all', '--neuroncore-pipeline-cores=' + str(cores)
    ])
model_neuron.save("m5_batch" + str(batch_size) + "_cores" + str(cores) +
                  "_with_extra_op_and_fp32cast.pt")

For the full list of possible arguments to the trace method, refer to PyTorch-Neuron trace Python API. As you can see, compiler arguments can be passed to the torch.neuron API directly. All FP32 operators are cast to BF16 with --fp32-cast=all, providing the highest performance while preserving dynamic range. More casting options are available to let you control the performance to model precision trade-off. The models used for both use cases were compiled for a single NeuronCore (no pipelining).

We then load the model on Inferentia with torch.jit.load, and use it for prediction. The Neuron runtime automatically loads the model to NeuronCores.

cm_cpd_preprocessing_jit = torch.jit.load(f=CM_CPD_PROC,
                                          map_location=torch.device('cpu'))
cm_cpd_preprocessing_jit.eval()
m5_model = torch.jit.load(f=CM_CPD_M5)
m5_model.eval()

input = get_input()
with torch.no_grad():
    batch_cm_cpd = cm_cpd_preprocessing_jit(input)
    input_ids, attention_mask, position_ids, valid_length, token_type_ids = (
        batch_cm_cpd['input_ids'].type(torch.IntTensor),
        batch_cm_cpd['attention_mask'].type(torch.HalfTensor),
        batch_cm_cpd['position_ids'].type(torch.IntTensor),
        batch_cm_cpd['valid_length'].type(torch.IntTensor),
        batch_cm_cpd['token_type_ids'].type(torch.IntTensor))
    model_res = m5_model(input_ids, attention_mask, position_ids, valid_length,
                         token_type_ids)

Conclusion

The Amazon Search team was able to reduce their inference costs by 85% using AWS Inferentia-based Inf1 instances, under heavy traffic and demanding performance requirements. AWS Inferentia and the Neuron SDK provided the team the flexibility to optimize the deployment process separately from training, and put forth a shallow learning curve via well-rounded tools and familiar framework APIs.

You can unlock performance and cost benefits by getting started with the sample code provided in this post. Also, check out the end-to-end tutorials to run ML models on Inferentia with PyTorch and TensorFlow.

About the authors

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use cases and helping customers optimize deep learning model training and deployment. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Weiqi Zhang is a Software Engineering Manager at Search M5, where he works on productizing large-scale models for Amazon machine learning applications. His interests include information retrieval and machine learning infrastructure.

Jason Carlson is a Software Engineer for developing machine learning pipelines to help reduce the number of stolen search impressions due to customer-perceived duplicates. He mostly works with Apache Spark, AWS, and PyTorch to help deploy and feed/process data for ML models. In his free time, he likes to read and go on runs.

Shaohui Xi is an SDE at the Search Query Understanding Infra team. He leads the effort for building large-scale deep learning online inference services with low latency and high availability. Outside of work, he enjoys skiing and exploring good foods.

Zhuoqi Zhang is a Software Development Engineer at the Search Query Understanding Infra team. He works on building model serving frameworks to improve latency and throughput for deep learning online inference services. Outside of work, he likes playing basketball, snowboarding, and driving.

Haowei Sun is a software engineer in the Search Query Understanding Infra team. She works on designing APIs and infrastructure supporting deep learning online inference services. Her interests include service API design, infrastructure setup, and maintenance. Outside of work, she enjoys running, hiking, and traveling.

Jaspreet Singh is an Applied Scientist on the M5 team, where he works on large-scale foundation models to improve the customer shopping experience. His research interests include multi-task learning, information retrieval, and representation learning.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt EC2 accelerated computing infrastructure for their machine learning needs.

Alexa speech science developments at Interspeech 2022

September 22, 2022

by Amazon AWS

Research from Alexa Speech covers a range of topics related to end-to-end neural speech recognition and fairness.Read More

Solution overview

The SageMaker Migration Toolkit

Real-time vs. serverless endpoints

Prepare your trained model and inference script

model_fn

input_fn

predict_fn

Move your pre-trained model to SageMaker

Conclusion

About the authors

Increased default service quotas for Amazon Textract

Improved service quota support for Amazon Textract

Default and applied quotas

Monitoring via CloudWatch graphs

Quota tagging

Amazon Textract Service Quota Calculator

Quota calculator for synchronous operations

Quota calculator for asynchronous operations

Conclusion

About the authors

Challenges

Hierarchical, large-scale revenue forecasting

Impacts of the COVID-19 pandemic

Modeling approaches

Amazon Forecast

Transformer with attention masks for COVID

Model ensemble

Hierarchical reconciliation

Architecture overview

Evaluation

Dataset

Evaluation metrics

Backtest windows

Model setup and training

Experiment results

Conclusion

References

About the Authors

Solution overview

Prerequisites

Implement the solution

Lambda function code

Query and visualize the data

Clean up

Conclusion

About the Authors

Deep learning for duplicate and query intent prediction

AWS Inferentia and the AWS Neuron SDK

Deploy on Inf1 instances for low latency, high throughput, and cost savings

Customer-perceived duplicate predictions

Query intent prediction

Get started with sample compilation and deployment code

Conclusion

About the authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.