Amazon AWS – Page 204

Identifying the sponsored-product ads most useful to customers

August 25, 2022

by Amazon AWS

Large language models improve click-through-rate prediction for sponsored products on Amazon product pages.Read More

AWS Deep Learning Challenge sees innovative and impactful use of Amazon EC2 DL1 instances

August 24, 2022

by Dvij Bajpai Amazon AWS

In the AWS Deep Learning Challenge held from January 5, 2022, to March 1, 2022, participants from academia, startups, and enterprise organizations joined to test their skills and train a deep learning model of their choice using Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and Habana’s SynapseAI SDK. The EC2 DL1 instances powered by Gaudi accelerators from Habana Labs, an Intel company, are designed specifically for training deep learning models. Participants were able to realize the significant price/performance benefits that DL1 offers over GPU-based instances.

We are excited to announce the winners and showcase some of the machine learning (ML) models that were trained in this hackathon. You will learn about some of the deep learning use cases that are supported by EC2 DL1 instances, including computer vision, natural language processing, and acoustic modeling.

Winning models

Our first-place winner is a project submitted by Gustavo Zomer. It’s an implementation of multi-lingual CLIP (Contrastive Language-Image Pre-Training). CLIP was introduced by OpenAI in 2021 as a way to train a more generalizable image classifier across larger datasets through self-supervised learning. It’s trained on a large set of images with a wide variety of natural language supervision that’s abundantly available on the internet, but is limited to the English language. This project replaces the text encoder in CLIP with a multi-lingual text encoder called XLM-RoBERTa to broaden the model’s applicability to multiple languages. This modified implementation of CLIP is able to pair images with captions across multiple languages. The model was trained on 16 accelerators across two DL1 instances, showing how ML training can be scaled to use multiple Gaudi accelerators across multiple nodes to increase training throughput and reduce the time to train. The judges were impressed by the impactful use of deep learning to break down language barriers, and the technical implementation, which used distributed training.

In second place, we have a project submitted by Remco van Akker. It uses a GAN (Generative Adversarial Network) to generate synthetic retinal image data for medical applications. Synthetic data is used in model training in medical applications to overcome the scarcity of annotated medical data, which is labor-intensive and costly to produce. Synthetic data can be used as part of data augmentation to remove biases and make vision models in medical applications more generalizable. This project stood out because it implemented a generative model on DL1 to solve a real-world problem impacting the application of AI and ML in healthcare.

Rounding out our top three was a project submitted by Zohar Jackson that implemented a vision transformer model for semantic segmentation. This project uses the Ray Tune library to fine-tune hyperparameters and uses Horovod to parallelize training on 16 Gaudi accelerators across two DL1 instances.

In addition to the top three winners, participants won several other prizes, including best technical implementation, highest potential impact, and most creative project. We offer our congratulations to all the winners of this hackathon for building such a diverse set of impactful projects on Gaudi accelerator-based EC2 DL1 instances. We can’t wait to see what our participants will continue to build on DL1 instances going forward.

Get started with DL1 instances

As demonstrated by the various projects in this hackathon, you can use EC2 DL1 instances to train deep learning models for use cases such as natural language processing, object detection, and image recognition. With DL1 instances, you also get up to 40% better price/performance for training deep learning models compared to current generation GPU-based EC2 instances. Visit Amazon EC2 DL1 Instances to learn more about how DL1 instances can accelerate your training workloads.

About the authors

Dvij Bajpai is a Senior Product Manager at AWS. He works on developing EC2 instances for workloads in machine learning and high-performance computing.

Amr Ragab is a Principal Solutions Architect at AWS. He provides technical guidance to help customers run complex computational workloads at scale.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt EC2 accelerated computing infrastructure for their machine learning needs.

The science behind grouping package deliveries

August 24, 2022

by Amazon AWS

How Customer Order and Network Density OptimizeR (CONDOR) has led to improved delivery routes.Read More

Conduct what-if analyses with Amazon Forecast, up to 80% faster than before

August 23, 2022

by Brandon Nair Amazon AWS

Now with Amazon Forecast, you can seamlessly conduct what-if analyses up to 80% faster to analyze and quantify the potential impact of business levers on your demand forecasts. Forecast is a service that uses machine learning (ML) to generate accurate demand forecasts, without requiring any ML experience. Simulating scenarios through what-if analyses is a powerful business tool to navigate through the uncertainty of future events by capturing possible outcomes from hypothetical scenarios. It’s a common practice to assess the impact of business decisions on revenue or profitability, quantify the risk associated with market trends, evaluate how to organize logistics and workforce to meet customer demand, and much more.

Conducting a what-if analysis for demand forecasting can be challenging because you first need accurate models to forecast demand and then a quick and easy way to reproduce the forecast across a range of scenarios. Until now, although Forecast provided accurate demand forecasts, conducting what-if analysis using Forecast could be cumbersome and time-consuming. For example, retail promotion planning is a common application of what-if analysis to identify the optimal price point for a product to maximize the revenue. Previously on Forecast, you had to prepare and import a new input file for each scenario you wanted to test. If you wanted to test three different price points, you first had to create three new input files by manually transforming the data offline and then importing each file into Forecast separately. In effect, you were doing the same set of tasks for each and every scenario. Additionally, to compare scenarios, you had to download the prediction from each scenario individually and then merge them offline.

With today’s launch, you can easily conduct what-if analysis up to 80% faster. We have made it easy to create new scenarios by removing the need for offline data manipulation and import for each scenario. Now, you can define a scenario by transforming your initial dataset through simple operations, such as multiplying the price for product A by 90% or decreasing the price for product B by $10. These transformations can also be combined with conditions to control the parameters that the scenario applies in (for example, reducing product A’s price in one location only). With this launch, you can define and run multiple scenarios of the same type of analysis (such as promotion analysis) or different types of analyses (such as promotion analysis in geographical region 1 and inventory planning in geographical region 2) simultaneously. Lastly, you no longer need to merge and compare results of scenarios offline. Now, you can view the forecast predictions across all scenarios in the same graph or bulk export the data for offline review.

Solution overview

The steps in this post demonstrate how to use what-if analysis on the AWS Management Console. To directly use Forecast APIs for what-if analysis, follow the notebook in our GitHub repo that provides an analogous demonstration.

Import your training data

To conduct a what-if analysis, you must import two CSV files representing the target time series data (showing the prediction target) and the related time series data (showing attributes that impact the target). Our example target time series file contains the product item ID, timestamp, demand, store ID, city, and region, and our related time series file contains the product item ID, store ID, timestamp, city, region, and price.

To import your data, complete the following steps:

On the Forecast console, choose View dataset groups.

Figure 1: View dataset group on the Amazon Forecast home page

Choose Create dataset group.

Figure 2: Creating a dataset group

For Dataset group name, enter a dataset name (for this post, my_company_consumer_sales_history).
For Forecasting domain, choose a forecasting domain (for this post, Retail).
Choose Next.

Figure 3: Provide a dataset name and select your forecasting domain

On the Create target time series dataset page, provide the dataset name, frequency of your data, and data schema
Provide the dataset import details.
Choose Start.

The following screenshot shows the information for the target time series page filled out for our example.

Figure 4: Sample information filled out for the target time series data import page

You will be taken to the dashboard that you can use to track progress.

To import the related time series file, on the dashboard, choose Import.

Figure 5: Dashboard that allows you to track progress

On the Create related time series dataset page, provide the dataset name and data schema.
Provide the dataset import details.
Choose Start.

The following screenshot shows the information filled out for our example.

Figure 6: Sample information filled out for the related time series data import page

Train a predictor

Next, we train a predictor.

On the dashboard, choose Train predictor.

Figure 7: Dashboard of completed dataset import step and button to train a predictor

On the Train predictor page, enter a name for your predictor, how long in the future you want to forecast and at what frequency, and the number of quantiles you want to forecast for.
Enable AutoPredictor – this is required to use what-if analysis.
Choose Create.

The following screenshot shows the information filled out for our example.

Figure 8: Sample information filled out to train a predictor

Create a forecast

After our predictor is trained (this can take approximately 2.5 hours), we create a forecast. You will know that your predictor is trained when you see the View Predictors button on your dashboard.

Choose Create a forecast on the Dashboard

Figure 9: Dashboard of completed train predictor step and button to create a forecast

On the Create a forecast page, enter a forecast name, choose the predictor that you created, and specify the forecast quantiles (optional) and the items to generate a forecast for.
Choose Start.

Figure 10: Sample information filled out to create a forecast

After you complete these steps, you have successfully created a forecast. This represents your baseline forecast scenario that you use to do what-if analyses on.

If you need more help creating your baseline forecasts, refer to Getting Started (Console). We now move to the next steps of conducting a what-if analysis.

Create a what-if analysis

At this point, we have created our baseline forecast and will start the walkthrough of how to conduct a what-if analysis. There are three stages to conducting a what-if analysis: setting up the analysis, creating the what-if forecast by defining what is changed in the scenario, and comparing the results.

To set up your analysis, choose Explore what-if analysis on the dashboard.

Figure 11: Dashboard of complete create forecast step and button to start what-if analysis

Choose Create.

Figure 12: Page to create a new what-if analysis

Enter a unique name and select the baseline forecast on the drop-down menu.
Choose the items in your dataset you want to conduct a what-if analysis for. You have two options:
1. Select all items is the default, which we choose in this post.
2. If you want to pick specific items, choose Select items with a file and import a CSV file containing the unique identifier for the corresponding item and any associated dimension (such as region).
Choose Create what-if analysis.

Figure 13: Option to specify items to conduct what-if analysis for and button to create the analysis

Create a what-if forecast

Next, we create a what-if forecast to define the scenario we want to analyze.

Choose Create.

Figure 14: Creating a what-if forecast

Enter a name of your scenario.

You can define your scenario through two options:

Use transformation functions – Use the transformation builder to transform the related time series data you imported. For this walkthrough, we evaluate how the demand an item in our dataset changes when the price is reduced by 10% and then by 30% when compared to the price in the baseline forecast.
Define the what-if forecast with a replacement dataset – Replace the related time series dataset you imported.

Figure 15: Options to define a scenario

The transformation function builder provides the capability to transform the related time series data you imported earlier through simple operations to add, subtract, divide, and multiply features in your data (for example price) by a value you specify. For our example, we create a scenario where we reduce the price by 10%, and price is a feature in the dataset.

For What-if forecast definition method, select Use transformation functions.
Choose Multiply as our operator, price as our time series, and enter 0.9.

Figure 16: Using the transformation builder to reduce price by 10%

You can also add conditions to further refine your scenario. For example if your dataset contained store information organized by region, you could limit the price reduction scenario by region. You could define a scenario of a 10% price reduction that’s applicable to stores not in Region_1.

Choose Add condition.
Choose Not equals as the operation and enter Region_1.

Figure 17: Using the transformation builder to reduce price by 10% for stores that are not in region 1

Another option to modify your related time series is by importing a new dataset that already contains the data defining the scenario. For example, to define a scenario with 10% price reduction, we can upload a new dataset specifying the unique identifier for the items that are changing and the price change that is 10% lower. To do so, select Define the what-if forecast with a replacement dataset and import a CSV containing the price change.

Figure 18: Importing a replacement dataset to define a new scenario

To complete the what-if forecast definition, choose Create.

Figure 19: Completing the what-if forecast creation

Repeat the process to create another what-if forecast with a 30% price reduction.

Figure 20: Showing the completed run of the two what-if forecasts

After the what-if analysis has run for each what-if forecast, the status will change to active. This concludes the second stage, and you can move on to comparing the what-if forecasts.

Compare the forecasts

We can now compare the what-if forecasts for both our scenarios, comparing a 10% price reduction with a 30% price reduction.

On the analysis insights page, navigate to the Compare what-if forecasts section.

Figure 21: Inputs required to compare what-if forecasts

For item_id, enter the item to analyze.
For What-if forecasts, choose the scenarios to compare (for this post, Scenario_1 and Scenario_2).
Choose Compare what-if.

Figure 22: button to generate what-if forecast comparison graph

The following graph shows the resulting demand in both our scenarios.

Figure 23: What-if forecast comparison for scenario 1 and 2

By default, it showcases the P50 and the base case scenario. You can view all quantiles generated by selecting your preferred quantiles on the Choose forecasts drop-down menu.

Export your data

To export your data to CSV, complete the following steps:

Choose Create export.

Figure 24: Creating a what-if forecast export

Enter a name for your export file (for this post, my_scenario_export)
Specify the scenarios to be exported by selecting the scenarios on the What-If Forecast drop-down menu. You can export multiple scenarios at once in a combined file.
For Export location, specify the Amazon Simple Storage Service (Amazon S3) location.
To begin the export, choose Create Export.

Figure 25: specifying the scenario information and export location for the bulk export

To download the export, first navigate to S3 file path location from the AWS Management Console and the select the file and choose the download button. The export file will contain the timestamp, item ID, dimensions, and the forecasts for each quantile for all scenarios selected (including the base scenario).

Conclusion

Scenario analysis is a critical tool to help navigate through the uncertainties of business. It provides foresight and a mechanism to stress-test ideas, leaving businesses more resilient, better prepared, and in control of their future. Forecast now supports forecasting what-if scenario analyses. To conduct your scenario analysis, open the Forecast console and follow the steps outlined in this post, or refer to our GitHub notebook on how to access the functionality via API.

To learn more, refer to the CreateWhatIfAnalysis page in the developer guide.

About the authors

Brandon Nair is a Sr. Product Manager for Amazon Forecast. His professional interest lies in creating scalable machine learning services and applications. Outside of work he can be found exploring national parks, perfecting his golf swing or planning an adventure trip.

Akhil Raj Azhikodan is a Software Development Engineer working on Amazon Forecast. His interests are in designing and building reliable systems that solve complex customer problems. Outside of work, he enjoys learning about history, hiking and playing video games.

Conner Smith is a Software Development Engineer working on Amazon Forecast. He focuses on building secure, scalable distributed systems that provide value to customers. Outside of work he spends time reading fiction, playing guitar, and watching random YouTube videos.

Shannon Killingsworth is the UX Designer for Amazon Forecast. He has been improving the user experience in Forecast for two years by simplifying processes as well as adding new features in ways that make sense to our users. Outside of work he enjoys running, drawing, and reading.

Why Amazon Scholar Yossi Keshet remains “excited about speech”

August 23, 2022

by Amazon AWS

New speech representations and self-supervised learning are two of the recent trends that most intrigue him.Read More

Intelligently search Alfresco content using Amazon Kendra

August 22, 2022

by Vikas Shah Amazon AWS

Amazon Kendra is an intelligent search service powered by machine learning (ML). With Amazon Kendra, you can easily aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer. Many organizations use the content management platform Alfresco to store their content. One of the key requirements for many enterprise customers using Alfresco is the ability to easily and securely find accurate information across all the documents in the data source.

We are excited to announce the public preview of the Amazon Kendra Alfresco connector. You can index Alfresco content, filter the types of content you want to index, and easily search your data in Alfresco with Amazon Kendra intelligent search and its Alfresco OnPrem connector.

This post shows you how to use the Amazon Kendra Alfresco OnPrem connector to configure the connector as a data source for your Amazon Kendra index and search your Alfresco documents. Based on the configuration of the Alfresco connector, you can synchronize the connector to crawl and index different types of Alfresco content such as wikis and blogs. The connector also ingests the access control list (ACL) information for each file. The ACL information is used for user context filtering, where search results for a query are filtered by what a user has authorized access to.

Prerequisites

To try out the Amazon Kendra connector for Alfresco using this post as a reference, you need the following:

An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies and IAM roles for Alfresco data sources.
Basic knowledge of AWS and working knowledge of Alfresco administration.
Alfresco OnPrem set up with a user added to the Alfresco_Adminstrators group. We will store the admin user name and password in AWS Secrets Manager.

Configure the data source using the Amazon Kendra connector for Alfresco

To add a data source to your Amazon Kendra index using the Alfresco OnPrem connector, you can use an existing index or create a new index. Then complete the following steps. For more information on this topic, refer to the Amazon Kendra Developer Guide.

On the Amazon Kendra console, open your index and choose Data sources in the navigation pane.
Choose Add data source.
Under Alfresco, choose Add connector.
In the Specify data source details section, enter a name and description and choose Next.
In the Define access and security section, for Alfresco site URL, enter the Alfresco host name.
To configure the SSL certificates, you can create a self-signed certificate for this setup utilizing openssl x509 -in pattern.pem -out alfresco.crt and add this certificate to an Amazon Simple Storage Service (Amazon S3) bucket. Choose Browse S3 and choose the S3 bucket with the SSL certificate.
For Site ID, enter the Alfresco site ID where you want to search documents.
Under Authentication, you have two options:
1. Use Secrets Manager to create new Alfresco authentication credentials. You need an Alfresco admin user name and password.
2. Use an existing Secrets Manager secret that has the Alfresco authentication credentials you want the connector to access.
Choose Save and add secret.
For IAM role, choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
Choose Next.
In the Configure sync settings section, provide information about your sync scope and run schedule.
You can include the files to be crawled using inclusion patterns or exclude them using exclusion patterns.
Choose Next.
In the Set field mappings section, you can optionally configure the field mappings to specify how the Alfresco field names are mapped to Amazon Kendra attributes or facets.
Choose Next.
Review your settings and confirm to add the data source.
After the data source is added, choose Data sources in the navigation pane, select the newly added data source, and choose Sync now to start data source synchronization with the Amazon Kendra index.

The sync process can take about 10–15 minutes. You can now search indexed Alfresco content using the search console or a search application. Optionally, you can search with ACL with the following additional steps.
Go to the index page that you created and on the User access control tab, choose Edit settings.
Under Access control settings, select Yes.
For Token type, choose JSON.
Choose Next.
Choose Update.

Wait a few minutes for the index to get updated by the changes. Now let’s see how you can perform intelligent search with Amazon Kendra.

Perform intelligent search with Amazon Kendra

Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify if the last sync was successful.

To start your search, on the Amazon Kendra console, choose Search indexed content in the navigation pane.
You’re redirected to the Amazon Kendra Search console. Now you can search information from the Alfresco documents you indexed using Amazon Kendra.
For this post, we search for a document stored in Alfresco, AWS.
Expand Test query with an access token and choose Apply token.
For Username, enter the email address associated with your Alfresco account.
Choose Apply.

Now the user can only see the content they have access to. In our example, user test@amazon.com doesn’t have access to any documents on Alfresco, so none are visible.

Limitations

The connector has the following limitations:

As of this writing, we only support Alfresco OnPrem. Alfresco PAAS is not supported.
The connector doesn’t crawl the following entities: calendars, discussions, data lists, links, and system files.
During public preview, we only support basic authentication. For support for other forms of authentication please contact your Amazon representative.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Alfresco, delete that data source.

Conclusion

With the Amazon Kendra Alfresco connector, your organization can search contents securely using intelligent search powered by Amazon Kendra.

To learn more about the Amazon Kendra Alfresco connector, refer to the Amazon Kendra Developer Guide.

For more information on other Amazon Kendra built-in connectors to popular data sources, refer to Amazon Kendra native connectors.

About the author

Vikas Shah is an Enterprise Solutions Architect at Amazon web services. He is a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His areas of interest are ML, IoT, robotics and storage. In his spare time, Vikas enjoys building robots, hiking, and traveling.

Ozge Sahin on the art and science of studying consumer behavior

August 19, 2022

by Amazon AWS

The Johns Hopkins business school professor and Amazon Scholar focuses on enhancing customer experiences.Read More

Best practices for TensorFlow 1.x acceleration training on Amazon SageMaker

August 19, 2022

by Yuhui Liang Amazon AWS

Today, a lot of customers are using TensorFlow to train deep learning models for their clickthrough rate in advertising and personalization recommendations in ecommerce. As the behavior of their clients change, they can accumulate large amounts of new data every day. Model iteration is one of a data scientist’s daily jobs, but they face the problem of taking too long to train on large datasets.

Amazon SageMaker is a fully managed machine learning (ML) platform that could help data scientists focus on models instead of infrastructure, with native support for bring-your-own-algorithms and frameworks such as TensorFlow and PyTorch. SageMaker offers flexible distributed training options that adjust to your specific workflows. Because many data scientists may lack experience in the acceleration training process, in this post we show you the factors that matter for fast deep learning model training and the best practices of acceleration training for TensorFlow 1.x on SageMaker. We also have a sample code of DeepFM distributed training on SageMaker on the GitHub repo.

There are many factors you should consider to maximize CPU/GPU utilization when you run your TensorFlow script on SageMaker, such as infrastructure, type of accelerator, distributed training method, data loading method, mixed precision training, and more.

We discuss best practices in the following areas:

Accelerate training on a single instance
Accelerate training on multiple instances
Data pipelines
Automatic mixed precision training

Accelerate training on a single instance

When running your TensorFlow script on a single instance, you could choose a computer optimized series such as the Amazon Elastic Compute Cloud (Amazon EC2) C5 series, or an accelerated computing series with multiple GPU in a single instance such as p3.8xlarge, p3.16xlarge, p3dn.24xlarge, and p4d.24xlarge.

In this section, we discuss strategies for multiple CPUs on a single instance, and distributed training with multiple GPUs on a single instance.

Multiple CPUs on a single instance

In this section, we discuss manually setting operators’ parallelism on CPU devices, the tower method, TensorFlow MirroredStrategy, and Horovod.

Manually setting operators’ parallelism on CPU devices

TensorFlow automatically selects the appropriate number of threads to parallelize the operation calculation in the training process. However, you could set the intra_op threads pool and inter_op parallelism settings provided by TensorFlow and use environment variables of MKL-DNN to set binding for the OS thread. See the following code:

# Set parallelism of intra_op and inter_op
num_cpus = int(os.environ['SM_NUM_CPUS'])
config = tf.ConfigProto(allow_soft_placement=True, device_count={'CPU': num_cpus}, intra_op_parallelism_threads=num_cpus, inter_op_parallelism_threads=num_cpus)
run_config = tf.estimator.RunConfig().replace(session_config = config)

# Use Intel MKL-DNN Setting to accelerate training speed
os.environ["KMP_AFFINITY"]= "verbose,disabled"
os.environ['OMP_NUM_THREADS'] = str(num_cpus)
os.environ['KMP_SETTINGS'] = '1'

The environment variable KMP_AFFINITY of MKL-DNN is set to granularity=fine,compact,1,0 by default. After setting both intra and inter of TensorFlow to the maximum number of vCPUs of the current instance, the upper limit of CPU usage is almost the same as the number of physical cores of the training instance.

If you set os.environ["KMP_AFFINITY"]= "verbose,disabled", the OS thread isn’t bound to the hardware hyper thread, and CPU usage could exceed the number of physical cores.

Regarding the settings of TensorFlow intra parallelism, TensorFlow inter parallelism, and the number of MKL-DNN threads, different combinations of these three parameters result in different training speeds. Therefore, you need to test each case to find the best combination. A common situation is to set the three parameters (intra_op_parallelism_threads and inter_op_parallelism_threads for TensorFlow, os.environ['OMP_NUM_THREADS'] for MKL-DNN) to half the number of vCPUs (physical core) or the total number of vCPUs.

Tower method

To replicate a model over GPUs, each GPU gets its own instance of the forward pass. The instance of the forward pass is called a tower. The tower method is almost always used for GPU devices. To compare training speed with other methods, here we also use the tower method for our CPU device.

If you don’t set the CPU device manually, TensorFlow don’t use the tower method to average the gradients, so you don’t need to scale the batch size in such cases.

Set the CPU device manually:

device_list = []
if manual_CPU_device_set:
		cpu_prefix=’/cpu:’
		for I in range(1, num_cpus):
			devices_list.append(cpu_prefix + str(i))

Use replicate_model_fn to wrap model_fn:

DeepFM = tf.estimator.Estimator(model_fn=tf.contrib.estimator.replicate_model_fn(model_fn, devices=device_list), model_dir=FLAGS.model_dir, params=model_params, config=config)

Use TowerOptimizer to wrap optimizer:

optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)

Wrap your model_fn:

with tf.variable_scope(‘deepfm_model’, reuse=tf.AUTO_REUSE)

Scale batch size to (NUM_CPU – 1).

Let’s look at the difference of CPU utilization with tower mode enabled. The following figure shows ml.c5.18xlarge instance’s CPU utilization with the following configuration:

No Tower + LibSVM data + pipe mode + MKL-DNN disable binding + TensorFlow intra/inter op parallelism setting to max number of instance’s vCPUs

The following figure shows the ml.c5.18xlarge instance’s CPU utilization with the following configuration:

Tower with set CPU device + LibSVM data + pipe mode + MKL-DNN disable binding + TensorFlow intra/inter op parallelism setting to max number of instance’s vCPUs

The CPU usage is higher when using the tower method, and it exceeds the number of physical cores.

TensorFlow MirroredStrategy

TensorFlow MirroredStrategy means synchronous training across multiple replicas on one machine. This strategy is typically used for training on one machine with multiple GPUs. To compare training speed with another method, we use MirroredStrategy for our CPU device.

When using TensorFlow MirroredStrategy, if you don’t set the CPU device, TensorFlow just uses one CPU as single worker, which is a waste of resources. We recommend manually setting the CPU device, because it will do a reduce operation on /CPU:0, so the /CPU:0 device isn’t used as a replica here. See the following code:

device_list = []
if manual_CPU_device_set:
		cpu_prefix=’/cpu:’
		for I in range(1, num_cpus):
			devices_list.append(cpu_prefix + str(i))
mirrored_strategy = tf.distribute.MirroredStrategy(devices=devices_list)
	else:
mirrored_strategy = tf.distribute.MirroredStrategy()

# Set strategy to config:
config = tf.estimator.RunConfig(train_distribute=mirrored_strategy,
eval_distribute=mirrored_strategy,
session_config = config)

You need to scale batch size when using MirroredStrategy; for example, scale the batch size to a multiple of the number of GPU devices.

For the sub-strategy when you set CPU device, if you don’t set the cross_device_ops parameter in tf.distribute.MirroredStrategy(), TensorFlow uses the ReductionToOneDevice sub-strategy by default. However, if you set HierarchicalCopyAllReduce as the sub-strategy, TensorFlow just does the reduce work on /CPU:0. When you use the TensorFlow dataset API and distribute strategy together, the dataset object should be returned instead of features and labels in function input_fn.

Usually, TensorFlow MirroredStrategy is slower than the tower method on CPU training, so we don’t recommend using MirroredStrategy on a multi-CPU single host.

Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

There is a parameter of distribution in the SageMaker Python SDK Estimator API, which you could use to state the Horovod distributed training. SageMaker provisions the infrastructure and runs your script with MPI. See the following code:

hvd_processes_per_host = 4
distribution = {'mpi': { 
'enabled': True, 
'processes_per_host': hvd_processes_per_host,
'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none' 
} 
}

When choosing a GPU instance such as ml.p3.8xlarge, you need to pin each GPU for every worker:

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

To speed up model convergence, scale the learning rate by the number of workers according to the Horovod official documentation. However, in real-world projects, you should scale the learning rate to some extent, but not by the number of workers, which results in bad model performance. For example, if the original learning rate is 0.001, we scale the learning rate to 0.0015, even if number of workers is four or more.

Generally, only the primary (Horovod rank 0) saves the checkpoint and model as well as the evaluation operation. You don’t need to scale the batch size when using Horovod. SageMaker offers Pipe mode to stream data from Amazon Simple Storage Service (Amazon S3) into training instances. When you enable Pipe mode, be aware that different workers on the same host need to use different channels to avoid errors. This is because the first worker process reads the FIFO/channel data, and other worker processes on the same instance will hang because they can’t read data from the same FIFO/channel, so Horovod doesn’t work properly. To avoid this issue, set the channels according to the number of workers per instance. At least make sure that different workers on the same host consume different channels; the same channel can be consumed by workers on a different host.

When using Horovod, you may encounter the following error:

“One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.”

The possible cause for this issue is that a certain rank (such as rank 0) works slower or does more jobs than other ranks, and this causes other ranks to wait for a long time. Although rank 0 sometimes has to do more work than other ranks, it should be noted that rank 0 shouldn’t do much for a long time. For example, for the model evaluation on the validation set and saving checkpoints during training, if it’s inevitable that these operations will take a long time, which could cause errors, one workaround is to let all workers do the same work as rank 0 (checkpoints saving, evaluation, and so on).

Data sharding is one of the most important things to consider when using distributed training. You can use TensorFlow dataset.shard() in your script. SageMaker also offers a dataset shard feature in the inputs channel by setting distribution=S3shardbykey in the dataset channel. See the following code:

dataset = PipeModeDataset(channel, record_format='TFRecord')

number_host = len(FLAGS.hosts)

if FLAGS.enable_data_multi_path : # If there are multi channels mapping with different S3 path
    if FLAGS.enable_s3_shard == False :
        if number_host > 1:
            index = hvd.rank() // FLAGS.worker_per_host
            dataset = dataset.shard(number_host, index)
else :
    if FLAGS.enable_s3_shard :
        dataset = dataset.shard(FLAGS.worker_per_host, hvd.local_rank())
    else :
        dataset = dataset.shard(hvd.size(), hvd.rank())

The following figure shows the result when using Horovod (ml.c5.18xlarge, Horovod + LibSVM + default intra op and inter op setting), which you can compare to the tower method.

Distributed training with multiple GPUs on a single instance

It’s normal to start distributed training with multiple GPUs on a single instance because data scientists only need to manage one instance and take advantage of the high-speed interlink between GPUs. SageMaker training jobs support multiple instance types that have multiple GPUs on a single instance, such as ml.p3.8xlarge, ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge. The method is the same as multiple CPUs in a single instance, but with a few changes in the script.

Tower method

The tower method here is almost the same as in multi-CPU training. You need to scale the batch size according to the number of GPUs in use.

TensorFlow MirroredStrategy

The default sub-strategy of MirroredStrategy is NcclAllReduce. You need to scale the batch size according to the number of GPUs in use. See the following code:

mirrored_strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=mirrored_strategy,
				eval_distribute=mirrored_strategy)

Accelerate training on multiple instances

Scaling out is always an option to improve training speed. More and more data scientists choose this as a default option in regards to distributed training. In this section, we discuss strategies for distributed training with multiple hosts.

Multiple CPUs with multiple instances

There are four main methods for using multiple CPUs with multiple instances when enabling distributed training:

- Parameter server without manually setting operators’ parallelism on CPU devices
- Parameter server with manually setting operators’ parallelism on CPU devices
- Parameter server with tower (setting CPU devices manually, and set allow_soft_placement=True in tf.ConfigProto)
- Horovod

When using a parameter server in the tf.estimator API, the path of checkpoint must be a sharable path such as Amazon S3 or the local path of Amazon Elastic File Service (Amazon EFS) mapping to the container. For a parameter server in tf.keras, the checkpoint path can be set to the local path. For Horovod, the checkpoint path can be set to a local path of the training instance.

When using a parameter server and the tf.estimator API with the checkpoint path to Amazon S3, if the model is quite large, you might encounter an error of the primary is stuck at saving checkpoint to S3. You can use SageMaker built-in container TensorFlow 1.15 or TensorFlow 1.15.2 or use Amazon EFS as the checkpoint path of the share.

When using a parameter server for multiple hosts, the parameter load on each parameter server process may be unbalanced (especially when there are relatively large embedding table variables), which could cause errors. You could check the file size of each the shard’s checkpoint in Amazon S3 to determine whether the parameters on the parameter server are balanced, because each parameter server corresponds to a shard of the checkpoint file. To avoid such issues, you can use the partitioner function to try to make the parameters of each parameter server evenly distributed:

with tf.variable_scope('deepfm_model', reuse=tf.AUTO_REUSE, partitioner = tf.fixed_size_partitioner(num_shards=len(FLAGS.hosts))):

Single GPU with multiple instances

SageMaker training jobs support instances that only have one GPU, like the ml.p3.xlarge, ml.g4dn, and ml.g5 series. There are two main methods used in this scenario: parameter servers and Horovod.

The built-in parameter server distributed training method of SageMaker is to start a parameter server process and a worker process for each training instance (each parameter server is only responsible for part of the model parameters), so the default is multi-machine single-GPU training. The SageMaker built-in parameter server distributed training is an asynchronous gradient update method. To reduce the impact of asynchronous updates on training convergence, it’s recommended to reduce the learning rate. If you want to use all the GPUs on the instance, you need to use a combination of parameter servers and the tower method.

For Horovod, just set processes_per_host=1 in the distribution parameter of the SageMaker Python Estimator API.

Multiple GPUs with multiple instances

For parameter servers and the tower method, the code changes are basically the same as the tower method for a single instance with multiple GPUs, and there is no need to manually set the GPU devices.

For Horovod, set processes_per_host in the distribution parameter to the number of GPUs of each training instance. If you use Pipe mode, the number of workers per instance needs to match the number of channels.

Data pipelines

In addition to the infrastructure we have discussed, there is another important thing to consider: the data pipeline. A data pipeline refers to how you load data and transform data before it feeds into neural networks. CPU is used to prepare data, whereas GPU is used to calculate the data from CPU. Because GPU is an expensive resource, more GPU idle time is inefficient; a good data pipeline in your training job could improve GPU and CPU utilization.

When you’re trying to optimize your TensorFlow data input pipeline, consider the API order used in TensorFlow datasets, the training data size (a lot of small files or several large files), batch size, and so on.

Let’s look at the interaction between GPU and CPU during training. The following figures compare interactions with and without a pipeline.

A better pipeline could reduce GPU idle time. Consider the following tips:

Use simple function logic in extracting features and labels
Prefetch samples to memory
Reduce unnecessary disk I/O and networking I/O
Cache the processed features and labels in memory
Reduce the number of replication times between CPU and GPU
Have different workers deal with different parts of the training dataset
Reduce the times of calling the TensorFlow dataset API

TensorFlow provides a transform API related to dataset formats, and the order of the transformation API in TensorFlow affects training speed a lot. The best order of calling the TensorFlow dataset API needs to be tested. The following are some basic principles:

Use a vectorized map. This means call the TensorFlow dataset batch API first, then the dataset map API. The custom parsing function provided in the map function, such as decode_tfrecord in the sample code, parses a mini batch of data. On the contrary, map first and then batch is a scalar map, and the custom parser function processes just one sample.
Use the TensorFlow dataset cache API to cache features and labels. Put the TensorFlow dataset cache API before the TensorFlow dataset repeat API, otherwise RAM utilization increases linearly epoch by epoch. If the dataset is as large as RAM, don’t use the TensorFlow dataset cache API. If you need to use the TensorFlow dataset cache API and shuffle API, consider use the following order: create TensorFlow dataset object -> cache API -> shuffle API -> batch API -> map API -> repeat API -> prefetch API.
Use the tfrecord dataset format more than LibSVM format.
File mode or Pipe mode depends on your dataset format and amount of files. The tfrecorddataset API can set num_parallel_reads to read multiple files in parallel and set buffer_size to optimize data’s reading, whereas the pipemodedataset API doesn’t have such settings. Pipe mode is more suitable for situations where a single file is large and the total number of files is small. We recommend using a SageMaker processing job to do the preprocessing work, such as joining multiple files to a bigger file according to labels, using a sampling method to make the dataset more balanced, and shuffling the balanced dataset.

See the following code sample:

def decode_tfrecord(batch_examples):
        # The feature definition here should BE consistent with LibSVM TO TFRecord process.
        features = tf.parse_example(batch_examples,
                                           features={
                                               "label": tf.FixedLenFeature([], tf.float32),
                                               "ids": tf.FixedLenFeature(dtype=tf.int64, shape=[FLAGS.field_size]),
                                               "values": tf.FixedLenFeature(dtype=tf.float32, shape=[FLAGS.field_size]) 
                                           })
        
        batch_label = features["label"]
        batch_ids = features["ids"]
        batch_values = features["values"]
        
        return {"feat_ids": batch_ids, "feat_vals": batch_values}, batch_label


    def decode_libsvm(line):
        columns = tf.string_split([line], ' ')
        labels = tf.string_to_number(columns.values[0], out_type=tf.float32)
        splits = tf.string_split(columns.values[1:], ':')
        id_vals = tf.reshape(splits.values,splits.dense_shape)
        feat_ids, feat_vals = tf.split(id_vals,num_or_size_splits=2,axis=1)
        feat_ids = tf.string_to_number(feat_ids, out_type=tf.int32)
        feat_vals = tf.string_to_number(feat_vals, out_type=tf.float32)
        return {"feat_ids": feat_ids, "feat_vals": feat_vals}, labels

if FLAGS.pipe_mode == 0:
        dataset = tf.data.TFRecordDataset(filenames)
    else :
        # Enter Pipe mode
        dataset = PipeModeDataset(channel, record_format='TFRecord')
        
    if FLAGS.enable_s3_shard == False:
        host_rank = FLAGS.hosts.index(FLAGS.current_host)
        number_host = len(FLAGS.hosts)
        dataset = dataset.shard(number_host, host_rank)
    
    dataset = dataset.batch(batch_size, drop_remainder=True) # Batch size to use
    dataset = dataset.map(decode_tfrecord,
                          num_parallel_calls=tf.data.experimental.AUTOTUNE) 

    if num_epochs > 1:
        dataset = dataset.repeat(num_epochs)
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

For training on CPU instances, setting parallelism of intra op, inter op, and the environment variable of MKL-DNN is a good starting point.

Automatic mixed precision training

The last thing we discuss is automatic mixed precision training, which can accelerate speed and result in model performance. As of this writing, Nvidia V100 GPU (P3 instance) and A100 (P4dn instance) support Tensor core. You can enable mixed precision training in TensorFlow when using those types of instances. Starting from version 1.14, TensorFlow has supported automatic mixed precision training. You can use the following statement to wrap your original optimizer:

tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

If the model is small and utilization of GPU is low, there’s no advantage of automatic mixed precision training. If the model is large, automatic mixed precision training can accelerate training speed.

Conclusion

When you start your deep learning model training in SageMaker, consider the following tips to achieve a faster training speed:

Try the multi-CPU, single-instance method or single-GPU, single-instance method first. If CPU/GPU utilization is very high (for example more than 90%), move to the next step.
Try more CPUs in single host or more GPUs in single host. If utilization is near the maximum utilization of CPUs or GPUs, move to the next step.
Try multiple CPUs or multiple GPUs with multiple hosts.
You need to modify codes when using parameter servers or Horovod. The code modification isn’t the same for the TensorFlow session-based API, tf.estimator API, and tf.keras API. A parameter server or Horovod may show different training speeds in different training cases and tasks, so try both methods if you have the time and budget to determine the best one.

Keep in mind the following advice:

Check utilization before scaling, optimize your data pipeline, and make CPU and GPU overlap in the timeline.
First scale up, then scale out.
If you can’t increate your GPU utilization after all the methods, try CPU. There are many cases (especially for the clickthrough rate ranking model) where the total training time of CPU instance training is shorter and more cost-effective than GPU instance training.

We also have a code sample in the GitHub repo, where we show two samples of DeepFM distributed training on SageMaker. One is a TensorFlow parameter server on CPU instances, the other one is Horovod on GPU instances.

About the Authors

Yuhui Liang is a Sr. Machine Learning Solutions Architect. He’s focused on the promotion and application of machine learning, and deeply involved in many customers’ machine learning projects. He has a rich experience in deep learning distributed training, recommendation systems, and computational advertising.

Shishuai Wang is a Sr. Machine Learning Solutions Architect. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys watching movies and traveling around the world.

Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search

August 18, 2022

by Emily Webber Amazon AWS

So much data, so little time. Machine learning (ML) experts, data scientists, engineers and enthusiasts have encountered this problem the world over. From natural language processing to computer vision, tabular to time series, and everything in-between, the age-old problem of optimizing for speed when running data against as many GPUs as you can get has inspired countless solutions. Today, we’re happy to announce features for PyTorch developers using native open-source frameworks, like PyTorch Lightning and PyTorch DDP, that will streamline their path to the cloud.

Amazon SageMaker is a fully-managed service for ML, and SageMaker model training is an optimized compute environment for high-performance training at scale. SageMaker model training offers a remote training experience with a seamless control plane to easily train and reproduce ML models at high performance and low cost. We’re thrilled to announce new features in the SageMaker training portfolio that make running PyTorch at scale even easier and more accessible:

PyTorch Lightning can now be integrated to SageMaker’s distributed data parallel library with only one-line of code change.
SageMaker model training now has support for native PyTorch Distributed Data Parallel with NCCL backend, allowing developers to migrate onto SageMaker easier than ever before.

In this post, we discuss these new features, and also learn how Amazon Search has run PyTorch Lightning with the optimized distributed training backend in SageMaker to speed up model training time.

Before diving into the Amazon Search case study, for those who aren’t familiar we would like to give some background on SageMaker’s distributed data parallel library. In 2020, we developed and launched a custom cluster configuration for distributed gradient descent at scale that increases overall cluster efficiency, introduced on Amazon Science as Herring. Using the best of both parameter servers and ring-based topologies, SageMaker Distributed Data Parallel (SMDDP) is optimized for the Amazon Elastic Compute Cloud (Amazon EC2) network topology, including EFA. For larger cluster sizes, SMDDP is able to deliver 20–40% throughput improvements relative to Horovod (TensorFlow) and PyTorch Distributed Data Parallel. For smaller cluster sizes and supported models, we recommend the SageMaker Training Compiler, which is able to decrease overall job time by up to 50%.

Customer spotlight: PyTorch Lightning on SageMaker’s optimized backend with Amazon Search

Amazon Search is responsible for the search and discovery experience on Amazon.com. It powers the search experience for customers looking for products to buy on Amazon. At a high level, Amazon Search builds an index for all products sold on Amazon.com. When a customer enters a query, Amazon Search then uses a variety of ML techniques, including deep learning models, to match relevant and interesting products to the customer query. Then it ranks the products before showing the results to the customer.

Amazon Search scientists have used PyTorch Lightning as one of the main frameworks to train the deep learning models that power Search ranking due to its added usability features on top of PyTorch. SMDDP was not supported for deep learning models written in PyTorch Lightning before this new SageMaker launch. This prevented Amazon Search scientists who prefer using PyTorch Lightning from scaling their model training using data parallel techniques, significantly slowing down their training time and preventing them from testing new experiments that require more scalable training.

The team’s early benchmarking results show 7.3 times faster training time for a sample model when trained on eight nodes as compared to a single-node training baseline. The baseline model used in these benchmarking is a multi-layer perceptron neural network with seven dense fully connected layers and over 200 parameters. The following table summarizes the benchmarking result on ml.p3.16xlarge SageMaker training instances.

Number of Instances	Training Time (minutes)	Improvement
1	99	Baseline
2	55	1.8x
4	27	3.7x
8	13.5	7.3x

Next, we dive into the details on the new launches. If you like, you can step through our corresponding example notebook .

Run PyTorch Lightning with the SageMaker distributed training library

We are happy to announce that SageMaker Data Parallel now seamlessly integrates with PyTorch Lightning within SageMaker training.

PyTorch Lightning is an open-source framework that provides a simplification for writing custom models in PyTorch. In some ways similar to what Keras did for TensorFlow, or even arguably Hugging Face, PyTorch Lightning provides a high-level API with abstractions for much of the lower-level functionality of PyTorch itself. This includes defining the model, profiling, evaluation, pruning, model parallelism, hyperparameter configurations, transfer learning, and more.

Previously, PyTorch Lightning developers were uncertain about how to seamlessly migrate their training code on to high-performance SageMaker GPU clusters. In addition, there was no way for them to take advantage of efficiency gains introduced by SageMaker Data Parallel.

For PyTorch Lightning, generally speaking, there should be little-to-no code changes to simply run these APIs on SageMaker Training. In the example notebooks we use the DDPStrategy and DDPPlugin methods.

There are three steps to use PyTorch Lightning with SageMaker Data Parallel as an optimized backend:

Use a supported AWS Deep Learning Container (DLC) as your base image, or optionally create your own container and install the SageMaker Data Parallel backend yourself. Ensure that you have PyTorch Lightning included in your necessary packages, such as with a requirements.txt file.
Make a few minor code changes to your training script that enable the optimized backend. These include:
1. Import the SM DDP library:
```
import smdistributed.dataparallel.torch.torch_smddp
```
2. Set up the PyTorch Lightning environment for SageMaker:
```
from pytorch_lightning.plugins.environments.lightning_environment 
  import LightningEnvironment

env = LightningEnvironment()
env.world_size = lambda: int(os.environ["WORLD_SIZE"])
env.global_rank = lambda: int(os.environ["RANK"])
```
3. If you’re using a version of PyTorch Lightning older than 1.5.10, you’ll need to add a few more steps.
  1. First, add the environment variable:
```
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "smddp"
```
  2. Second, ensure you use DDPPlugin, rather than DDPStrategy. If you’re using a more recent version, which you can easily set by placing the requirements.txt in the source_dir for your job, then this isn’t necessary. See the following code:
```
ddp = DDPPlugin(parallel_devices=[torch.device("cuda", d) for d in range(num_gpus)], cluster_environment=env)
```
4. Optionally, define your process group backend as "smddp" in the DDPSTrategy object. However, if you’re using PyTorch Lightning with the PyTorch DDP backend, which is also supported, simply remove this `process_group_backend` parameter. See the following code:
```
ddp = DDPStrategy(
  cluster_environment=env, 
  process_group_backend="smddp", 
  accelerator="gpu")
```
Ensure that you have a distribution method noted in the estimator, such as distribution={"smdistributed":{"dataparallel":{"enabled":True} if you’re using the Herring backend, or distribution={"pytorchddp":{"enabled":True}.

For a full listing of suitable parameters in the distribution parameter, see our documentation here.

Now you can launch your SageMaker training job! You can launch your training job via the Python SDK, Boto3, the SageMaker console, the AWS Command Line Interface (AWS CLI), and countless other methods. From an AWS perspective, this is a single API command: create-training-job. Whether you launch this command from your local terminal, an AWS Lambda function, an Amazon SageMaker Studio notebook, a KubeFlow pipeline, or any other compute environment is completely up to you.

Please note that the integration between PyTorch Lightning and SageMaker Data Parallel is currently supported for only newer versions of PyTorch, starting at 1.11. In addition, this release is only available in the AWS DLCs for SageMaker starting at PyTorch 1.12. Make sure you point to this image as your base. In us-east-1, this address is as follows:

ecr_image = '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker'

Then you can extend your Docker container using this as your base image, or you can pass this as a variable into the image_uri argument of the SageMaker training estimator.

As a result, you’ll be able to run your PyTorch Lightning code on SageMaker Training’s optimized GPUs, with the best performance available on AWS.

Run PyTorch Distributed Data Parallel on SageMaker

The biggest problem PyTorch Distributed Data Parallel (DDP) solves is deceptively simple: speed. A good distributed training framework should provide stability, reliability, and most importantly, excellent performance at scale. PyTorch DDP delivers on this through providing torch developers with APIs to replicate their models over multiple GPU devices, in both single-node and multi-node settings. The framework then manages sharding different objects from the training dataset to each model copy, averaging the gradients for each of the model copies to synchronize them at each step. This produces one model at the total completion of the full training run. The following diagram illustrates this process.

PyTorch DDP is common in projects that use large datasets. The precise size of each dataset will vary widely, but a general guideline is to scale datasets, compute sizes, and model sizes in similar ratios. Also called scaling laws, the optimal combination of these three is very much up for debate and will vary based on applications. At AWS, based on working with multiple customers, we can clearly see benefits from data parallel strategies when an overall dataset size is at least a few tens of GBs. When the datasets get even larger, implementing some type of data parallel strategy is a critical technique to speed up the overall experiment and improve your time to value.

Previously, customers who were using PyTorch DDP for distributed training on premises or in other compute environments lacked a framework to easily migrate their projects onto SageMaker Training to take advantage of high-performance GPUs with a seamless control plane. Specifically, they needed to either migrate their data parallel framework to SMDDP, or develop and test the capabilities of PyTorch DDP on SageMaker Training manually. Today, SageMaker Training is happy to provide a seamless experience for customers onboarding their PyTorch DDP code.

To use this effectively, you don’t need to make any changes to your training scripts.

You can see this new parameter in the following code. In the distribution parameter, simply add pytorchddp and set enabled as true.

estimator = PyTorch(
    base_job_name="pytorch-dataparallel-mnist",
    source_dir="code",
    entry_point = "my_model.py",
    ... 
    # Training using SMDataParallel Distributed Training Framework
    distribution = {"pytorchddp": {"enabled": "true"}}
)

This new configuration starts at SageMaker Python SDK versions 2.102.0 and PyTorch DLC’s 1.11.

For PyTorch DDP developers who are familiar with the popular torchrun framework, it’s helpful to know that this isn’t necessary on the SageMaker training environment, which already provides robust fault tolerance. However, to minimize code rewrites, you can bring another launcher script that runs this command as your entry point.

Now PyTorch developers can easily move their scripts onto SageMaker, ensuring their scripts and containers can run seamlessly across multiple compute environments.

This prepares them to, in the future, take advantage of SageMaker’s distributed training libraries that provide AWS-optimized training topologies to deliver up to 40% speedup enhancements. For PyTorch developers, this is a single line of code! For PyTorch DDP code, you can simply set the backend to smddp in the initialization (see Modify a PyTorch Training Script), as shown in the following code:

import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist
dist.init_process_group(backend='smddp')

As we saw above, you can also set the backend of DDPStrategy to smddp when using Lightning. This can lead to up to 40% overall speedups for large clusters! To learn more about distributed training on SageMaker see our on-demand webinar, supporting notebooks, relevant documentation, and papers.

Conclusion

In this post, we introduced two new features within the SageMaker Training family. These make it much easier for PyTorch developers to use their existing code on SageMaker, both PyTorch DDP and PyTorch Lightning.

We also showed how Amazon Search uses SageMaker Training for training their deep learning models, and in particular PyTorch Lightning with the SageMaker Data Parallel optimized collective library as a backend. Moving to distributed training overall helped Amazon Search achieve 7.3x faster train times.

About the authors

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.

Karan Dhiman is a Software Development Engineer at AWS, based in Toronto, Canada. He is very passionate about Machine Learning space and building solutions for accelerating distributed computing workloads.

Vishwa Karia is a Software Development Engineer at AWS Deep Engine. Her interests lie at the intersection of Machine Learning and Distributed Systems and she is also passionate about empowering women in tech and AI.

Eiman Elnahrawy is a Principal Software Engineer at Amazon Search leading the efforts on Machine Learning acceleration, scaling, and automation. Her expertise spans multiple areas, including Machine Learning, Distributed Systems, and Personalization.

A billion SMT queries a day

August 18, 2022

by Amazon AWS

CAV keynote lecture by the director of applied science for AWS Identity explains how AWS is making the power of automated reasoning available to all customers.Read More

Winning models

Get started with DL1 instances

About the authors

Solution overview

Import your training data

Train a predictor

Create a forecast

Create a what-if analysis

Create a what-if forecast

Compare the forecasts

Export your data

Conclusion

About the authors

Prerequisites

Configure the data source using the Amazon Kendra connector for Alfresco

Perform intelligent search with Amazon Kendra

Limitations

Clean up

Conclusion

About the author

Accelerate training on a single instance

Multiple CPUs on a single instance

Manually setting operators’ parallelism on CPU devices

Tower method

TensorFlow MirroredStrategy

Horovod

Distributed training with multiple GPUs on a single instance

Tower method

TensorFlow MirroredStrategy

Accelerate training on multiple instances

Multiple CPUs with multiple instances

Single GPU with multiple instances

Multiple GPUs with multiple instances

Data pipelines

Automatic mixed precision training

Conclusion

About the Authors

Customer spotlight: PyTorch Lightning on SageMaker’s optimized backend with Amazon Search

Run PyTorch Lightning with the SageMaker distributed training library

Run PyTorch Distributed Data Parallel on SageMaker

Conclusion

About the authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.