Customizing and reusing models generated by Amazon SageMaker Autopilot

Amazon SageMaker Autopilot automatically trains and tunes the best machine learning (ML) models for classification or regression problems while allowing you to maintain full control and visibility. This not only allows data analysts, developers, and data scientists to train, tune, and deploy models with little to no code, but also lets you review a generated notebook that outlines all the steps that Autopilot took to generate the model. In some cases, you might also want to customize pipelines generated by Autopilot with your own custom components.

This post shows you how to create and use models with Autopilot in a couple of clicks, then outlines how to adapt the SageMaker Autopilot generated code with your own feature selectors and custom transformers to add domain-specific features. We also use the dry run capability of Autopilot, in which Autopilot only generates code for data preprocessors, algorithms, and algorithm parameter settings. This can be done by simply choosing the option run a pilot to create a notebook with candidate definitions.

Customizing Autopilot

Customizing Autopilot models is, in most cases, not necessary. Autopilot creates high-quality models that can be deployed without the need for customization. Autopilot automatically performs exploratory analysis of your data and decides which features may produce the best results. As such, it presents a low barrier of entry to ML for a wide range of users, from data analysts to developers, wishing to add AI/ML capabilities to their project.

However, more advanced users can take advantage of Autopilot’s transparent approach to AutoML to dramatically reduce the undifferentiated heavy lifting prevalent in ML projects. For example, you may want Autopilot to use custom feature transformations that your company uses, or custom imputation techniques that work better in the context of your data. You can preprocess your data before bringing it to SageMaker Autopilot, but that would involve going outside Autopilot and maintaining a separate preprocessing pipeline. Alternatively, you can use Autopilot’s data processing pipeline to direct Autopilot to use your custom transformations and imputations. The advantage to this approach is that you can focus on data collection, and let Autopilot do the heavy lifting to apply your desired feature transformations and imputations, and then find and deploy the best model.

Preparing your data and Autopilot job

Let’s start by creating an Autopilot experiment using the Forest Cover Type dataset.

  1. Download the dataset and upload it to Amazon Simple Storage Service (Amazon S3).

Make sure that you create your Amazon SageMaker Studio user in the same Region as the S3 bucket.

  2. Open SageMaker Studio.
  3. Create a job, providing the following information:
    1. Experiment name
    2. Training dataset location
    3. S3 bucket for saving Autopilot output data
    4. Type of ML problem

Your Autopilot job is now ready to run. Instead of running a complete experiment, we choose to let Autopilot generate a notebook with candidate definitions.
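
If you prefer to create the job programmatically instead of through the Studio UI, you can make the equivalent request with the CreateAutoMLJob API. The following is a minimal sketch only; the S3 paths and role ARN are placeholders, and the target column is assumed to be the dataset's Cover_Type field:

import boto3

sm = boto3.client("sagemaker")

# Minimal sketch; replace the placeholder S3 paths and role ARN with your own values
sm.create_auto_ml_job(
    AutoMLJobName="forest-cover-autopilot",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://<your-bucket>/forest-cover/train/"}},
        "TargetAttributeName": "Cover_Type",
    }],
    OutputDataConfig={"S3OutputPath": "s3://<your-bucket>/autopilot-output/"},
    ProblemType="MulticlassClassification",
    AutoMLJobObjective={"MetricName": "Accuracy"},
    RoleArn="<your-sagemaker-execution-role-arn>",
    # Generate only the candidate definition notebook instead of running a complete experiment
    GenerateCandidateDefinitionsOnly=True,
)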

Inspecting the Autopilot-generated pipelines

SageMaker Autopilot automates the key tasks in an ML pipeline. It explores hundreds of models composed of different features, algorithms, and hyperparameters to find the one that best fits your data. It also provides a leaderboard of 250 models so you can see how each model candidate performed and pick the best one to deploy. We explore this in more depth in the final section of this post.

When the experiment is complete, you can inspect your generated candidate pipelines. Candidate refers to the combination of data preprocessing steps and algorithm selection used to train the 250 models. The candidate generation notebook contains Python code that Autopilot used to generate these candidates.

  1. Choose Open candidate generation notebook.
  2. Open your notebook.
  3. Choose Import to import the notebook into your workspace.
  4. When prompted, choose Python 3 (Data Science) as the kernel.
  5. Inside the notebook, run all the cells in the SageMaker Setup section.

This copies the data preparation code that Autopilot generated into your workspace.

In your root SageMaker Studio directory, you should now see a folder with the name of your Autopilot experiment. The folder’s name should be <Your Experiment Name>artifacts. That directory contains two sub-directories: generated_module and sagemaker_automl. The generated_module directory contains the data processing artifacts that Autopilot generated.

So far, the Autopilot job has analyzed the dataset and generated ML candidate pipelines that contain a set of feature transformers and an ML algorithm. Navigate down into the generated_module folder to the candidate_data_processors directory, which contains 12 files:

  • dpp0.py–dpp9.py – Data processing candidates that Autopilot generated
  • trainer.py – Script that runs the data processing candidates
  • sagemaker_serve.py – Script for running the preprocessing pipeline at inference time

If you examine any of the dpp*.py files, you can observe that Autopilot generated code that builds scikit-learn pipelines, which you can easily extend with your own transformations. You can do this either by modifying the existing dpp*.py files directly or by extending the pipelines after they're instantiated in the trainer.py file, where you define a transformer that can be called inside the existing dpp*.py files. The second approach is recommended because it's more maintainable and allows you to extend all the proposed processing pipelines at once, as opposed to modifying each one individually.

Using specific transformers

You may wish to call a specific transformer from scikit-learn or use one implemented in the open-source package sagemaker-scikit-learn-extension. The latter provides a number of scikit-learn-compatible estimators and transformers that you can use. For instance, it implements the Weight of Evidence (WoE) encoder, an often-used encoding for categorical features in the context of binary classification.

To use additional transformers, first extend the import statements in the trainer.py file. For our use case, we add the following code:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sagemaker_sklearn_extension.preprocessing import RobustStandardScaler

If, upon modifying trainer.py, you encounter errors when running the notebook cell containing automl_interactive_runner.fit_data_transformers(...), you can get debugging information from Amazon CloudWatch under the log group /aws/sagemaker/TrainingJobs.

Implementing custom transformers

Going back to the forest cover type use case, we have features for the vertical and horizontal distance to hydrology. We want to extend this with an additional feature transform that calculates the straight line distance to hydrology. We can do this by adding an additional file into the candidate_data_processors directory where we define our custom transform. See the following code:

# additional_features.py
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class HydrologyDistance(BaseEstimator, TransformerMixin):
    """Computes the straight-line (Euclidean) distance to hydrology from the
    horizontal and vertical distance features."""

    def __init__(self, feature_index):
        # indices of the horizontal and vertical distance columns
        self._feature_index = feature_index

    def fit(self, X, y=None):
        # stateless transformer; nothing to learn from the data
        return self

    def transform(self, X, y=None):
        X = X.copy().astype(np.float32)
        # split the two distance columns and combine them into a single straight-line distance
        a, b = np.split(X[:, self._feature_index], 2, axis=1)
        return np.hypot(a, b).reshape(-1, 1)
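
Before wiring the transformer into the generated pipelines, you can sanity-check it on a small array; the values below are made up purely for illustration:

import numpy as np
from additional_features import HydrologyDistance

# Two sample rows; columns 0 and 1 hold the horizontal and vertical distances
sample = np.array([[30.0, 40.0, 7.0],
                   [120.0, 50.0, 3.0]])

transformer = HydrologyDistance(feature_index=[0, 1])
print(transformer.fit_transform(sample))  # [[ 50.] [130.]]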

Inside the trainer.py file, we then import the additional_features module and add our HydrologyDistance transformer as a parallel pipeline to the existing generated ones.

In addition to our additional feature transformer, we also add a feature selector to our pipeline to select only the features with the highest importance as determined by a RandomForestClassifier:

from additional_features import *

def update_feature_transformer(header, feature_transformer):
    """Customize the feature transformer. Default returns the feature transformer unchanged.

    header: sagemaker_sklearn_extension.externals.Header
        Object of class Header, used to map the column names to the appropriate index

    feature_transformer : obj
        transformer applied to the features

    Returns
    -------
    feature_transformer : obj
        updated transformer to be applied to the features
    """
    
    features_to_transform = header.as_feature_indices(
        [
        'Horizontal_Distance_To_Hydrology',
        'Vertical_Distance_To_Hydrology'
        ]
        )

    # new pipeline with custom transforms
    additional_pipeline = Pipeline([("distance", HydrologyDistance(features_to_transform)),
                                    ("scaleDistance",RobustStandardScaler())
                                   ])
    # combine with the AutoPilot generated pipeline
    combined_transformer = FeatureUnion([("additional", additional_pipeline),
                                         ("existing", feature_transformer)]
                                       )
    # perform feature selection on the combined pipeline
    feature_selector = SelectFromModel(RandomForestClassifier(n_estimators = 10))
    
    feature_transformer = Pipeline([("feature_engineering", combined_transformer),
                                    ("feature_selection",feature_selector)]
                                  )
    return feature_transformer

Running inferences

Next, we need to copy our additional_features.py file into the model directory to make it available at inference time. A serialize_code function is provided specifically for this. Modify the function as shown in the following example code to make sure that the file is included with the model artifact. The line that requires modification is marked with a comment.

def serialize_code(dest_dir, processor_file):
    """Copies the code required for inference to the destination directory
    By default, sagemaker_serve.py and the processor module's file are copied.
    To serialize any additional .py file for custom transformer, add it to the
    list files_to_serialize.

    dest_dir: str
        destination where the python files would be serialized

    """
    files_to_serialize = [
        os.path.join(os.path.dirname(__file__), 'sagemaker_serve.py'),
        processor_file]
    
    # Include the custom transformer code in the model directory
    files_to_serialize.append(os.path.join(os.path.dirname(__file__), 'additional_features.py'))

    os.makedirs(dest_dir, exist_ok=True)
    for source in files_to_serialize:
        shutil.copy(source, os.path.join(dest_dir, os.path.basename(source)))

Finally, we need to modify the model_fn function in sagemaker_serve.py to copy the additional_features.py file into the current working directory so that the scikit-learn pipeline can import the file at inference time:

import shutil # make sure this is imported so that file can be copied
def model_fn(model_dir):
    """Loads the model.

    The SageMaker Scikit-learn model server loads model by invoking this method.

    Parameters
    ----------
    model_dir: str
        the directory where the model files reside

    Returns
    -------
    : AutoMLTransformer
        deserialized model object that can be used for model serving

    """
    
    shutil.copyfile(os.path.join(model_dir, 'additional_features.py'), 'additional_features.py')
    
    return load(filename=os.path.join(model_dir, 'model.joblib'))

When you finish all these steps, you can return to the candidate definition notebook and run the remaining cells. The additional transforms you defined are applied across all the selected data processing pipeline candidates and are also included in the inference pipeline.

Deploying the best model

As Autopilot runs the candidate pipelines, it iterates over 250 combinations of processing pipelines, algorithm types, and model hyperparameters. When the process is complete, you can navigate to the final section of the notebook (Model Selection and Deployment) and view a leaderboard of the models Autopilot generated. Running the remaining notebook cells automatically deploys the model that produced the best results and exposes it as a REST API endpoint.

Conclusions

In this post, we demonstrated how to customize an Autopilot training and inference pipeline with your own feature engineering code. We first let Autopilot generate candidate definitions without running the actual training and hyperparameter tuning. Then we implemented custom transformers that represent custom feature engineering that we want to bring to Autopilot. For more information about Autopilot, see Amazon SageMaker Autopilot.


About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

 

 

 

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

 

 

Piali Das is a Senior Software Engineer in the AWS SageMaker Autopilot team. She previously contributed to building SageMaker Algorithms. She enjoys scientific programming in general and has developed an interest in machine learning and distributed systems.

Making sense of your health data with Amazon HealthLake

We’re excited to announce Amazon HealthLake, a new HIPAA-eligible service for healthcare providers, health insurance companies, and pharmaceutical companies to securely store, transform, query, analyze, and share health data in the cloud, at petabyte scale. HealthLake uses machine learning (ML) models trained to automatically understand and extract meaningful medical data from raw, disparate data, such as medications, procedures, and diagnoses. This revolutionizes a process that is traditionally manual, error-prone, and costly. HealthLake tags and indexes all the data and structures it in Fast Healthcare Interoperability Resources (FHIR) to provide a complete view of each patient and a consistent way to query and share the data. It integrates with services like Amazon QuickSight and Amazon SageMaker to visualize and understand relationships in the data, identify trends, and make predictions. Because HealthLake automatically structures all of a healthcare organization’s data into the FHIR industry format, the information can be easily and securely shared between health systems and with third-party applications, enabling providers to collaborate more effectively and allowing patients unfettered access to their medical information.

Every healthcare provider, payer, and life sciences company is trying to solve the problem of organizing and structuring their data in order to make better patient support decisions, design better clinical trials, operate more efficiently, understand population health trends, and share data securely. It all starts with making sense of health data.

Let’s look at one specific example—imagine you have a diabetic patient whom you’re trying to manage, and 2 months later their glucose level is still not responding to the treatment that you prescribed. With HealthLake, you can easily create a cohort of diabetic patients and their demographics, treatments, blood glucose readings, tests, and clinical observations and export this data. You can then create an interactive dashboard with QuickSight and compare that patient to a population with similar treatment options to see what helped improve their health outcome. You can use SageMaker to train and tune the best ML models to help you identify which subset of these diabetic patients are at increased risk of complications like high blood pressure so you can intervene early and introduce a second line of medications in addition to preventive measures, like special diets.

Health data is complex

Healthcare organizations are doing some amazing things with ML today, but health data remains complex and difficult to work with (data is siloed, spread out across multiple systems in incompatible formats). Over the past decade, we’ve witnessed a digital transformation in healthcare, with organizations capturing huge volumes of patient data every day, from family history and clinical observations to diagnoses and medications. The vast majority of this data is contained in unstructured medical records such as clinical notes, laboratory reports (PDFs), insurance claims (forms), recorded conversations (audio), X-rays (images), and more.

Before leveraging healthcare data for effective care, it all needs to be securely ingested, stored, and aggregated. Relevant attributes need to be extracted, tagged, indexed, and structured before you can start analyzing it. The cost and operational complexity of doing all this work well is prohibitive to most healthcare organizations and takes weeks, or even months. The FHIR standard is a start toward the goal of standardizing a data structure and exchange for healthcare, but the data still needs to be transformed to enable advanced analytics via queries, visualizations, and ML tools and techniques. This means analysis effectively remains hard to reach for almost all providers.

Create a complete view of a patient’s medical history, in minutes

With HealthLake, we’re demystifying a set of challenges for our healthcare and life sciences customers by removing the heavy lifting needed to tag, index, structure, and organize this data, providing a complete view of each patient’s medical history in minutes, instead of weeks or months. HealthLake makes it easy for you to copy your on-premises data to AWS. HealthLake transforms raw, disparate data with integrated medical natural language processing (NLP), which uses specialized ML models that have been trained to automatically understand and extract meaningful medical information, such as medications, procedures, and diagnoses. HealthLake tags each patient’s record, indexes every data element using standardized labels, structures each data element in interoperable standards, and organizes the data in a timeline view for each patient. HealthLake presents data on each patient in chronological order of medical events so that you can look at trends like disease progression over time, giving you new tools to improve care and intervene earlier.

Your data in HealthLake is secure, compliant, and auditable. Data versioning is enabled to protect data against accidental deletion; per the FHIR specification, if you delete a piece of data, it’s versioned and hidden from analysis and results rather than removed from the service. Your data is encrypted using customer managed keys (CMKs) in a single-tenant architecture to provide an additional level of protection when data is accessed or searched, so that the same key isn’t shared by multiple customers. You retain ownership and control of your data, along with the ability to encrypt it, protect it, move it, and delete it in alignment with your organization’s security policies.

Identify trends and make predictions to manage your entire population

Today, the most widely used clinical models to predict disease risk lack personalization and often use a very limited number of commonly collected data points, which is problematic because the resulting models may produce imprecise predictions. However, if you look at an individual’s medical record, there may be hundreds of thousands of data points, and the majority of that is untapped data stored in doctors’ notes. With your health data structured and organized chronologically by medical events, you can easily query, perform analytics, and build ML models to observe health trends across an entire population.

You can use other AWS services that work seamlessly with HealthLake, such as QuickSight or SageMaker. For example, you can create an interactive dashboard with QuickSight to observe population health trends, and zoom in on a smaller group of patients with a similar state to compare their treatments and health outcomes. You can also build, train, and deploy your own ML models with SageMaker to track the progression of at-risk patients over the course of many years against a similar cohort of patients. This enables you to identify early warning signs that need to be addressed proactively and would be missed without the complete clinical picture provided by HealthLake.

Bringing it all together

Now, your health data is tagged, indexed, structured, and organized in chronological order of medical events, so it can be easily searched and analyzed. You can securely share patient data across health systems and multiple applications in a consistent, compatible FHIR format. You now have the ability to make point-of-care or population health decisions that are driven by evidence from the overall data.

AWS customers are excited about the innovation that HealthLake offers and the opportunity to make sense of their health data to deliver personalized treatments, understand population health trends, and identify patients for clinical trial enrollment. This offers an unprecedented opportunity to close gaps in care and provide the high quality and personalized care every patient deserves.

Cerner Corporation, a global healthcare technology company, is focused on using data to help solve issues at the speed of innovation—evolving healthcare to enhance clinical and operational outcomes, help resolve clinician burnout, and improve health equity.

“At Cerner, we are committed to transforming the future of healthcare through cloud delivery, machine learning, and AI. Working alongside AWS, we are in a position to accelerate innovation in healthcare. That starts with data. We are excited about the launch of HealthLake and its potential to quickly ingest patient data from diverse sources, unlock new insights through advanced analytics, and serve many of our initiatives across population health.”

—Ryan Hamilton, SVP & Chief Architect, Cerner

Konica Minolta Precision Medicine (KMPM) is a life science company dedicated to the advancement of precision medicine to more accurately predict, detect, treat, and ultimately cure disease.

“We are building a multi-modal platform at KMPM to handle a significant amount of health data inclusive of pathology, imaging, and genetic information. HealthLake will allow us to unlock the real power of this multi-modal approach to find novel associations and signals in our data. It will provide our expert team of data scientists and developers the ability to integrate, label, and structure this data faster and discover insights that our clinicians and pharmaceutical partners require to truly drive precision medicine.”

—Kiyotaka Fujii, President, Global Healthcare, Konica Minolta, & Chairman, Ambry Genetics

Orion Health is a global, award-winning provider of health information technology, advancing population health and precision medicine solutions for the delivery of care across the entire health ecosystem.

“At Orion Health, we believe that there is significant untapped potential to transform the healthcare sector by improving how technology is used and providing insights into the data being generated. Data is frequently messy and incomplete, which is costly and time consuming to clean up. We are excited to work alongside AWS to use HealthLake to help deliver new ways for patients to interact with the healthcare system, supporting initiatives such as the 21st Century Cures Act, designed to make healthcare more accessible and affordable, and Digital Front Door, which aims to improve health outcomes by helping patients receive the perfect care for them from the comfort of their home.”

—Anne O’Hanlon, Product Director, Orion Health

Conclusion

What was once just a pile of disparate and unstructured data looking like a patchwork quilt—an incomplete health history stitched together with limited data—is now structured to be easily read and searched. For every healthcare provider, health insurer, and life sciences company, there is now a purpose-built service enabled by ML they can use to aggregate and organize previously unusable health data, so that it can be analyzed in a secure and compliant single-tenant location in the cloud. HealthLake represents a significant leap forward for these organizations to learn from all their data to proactively manage their patients and population, improve the quality of patient care, optimize hospital efficiency, and reduce cost.

 


About the Authors

Dr. Taha Kass-Hout is director of machine learning and chief medical officer at Amazon Web Services (AWS), where he leads initiatives such as Amazon HealthLake and Amazon Comprehend Medical. A physician and bioinformatician, Taha has previously pioneered the use of emerging technologies and cloud at both the CDC (in electronic disease surveillance) and the FDA, where he was the Agency’s first Chief Health Informatics Officer, and established both the OpenFDA and PrecisionFDA data sharing initiatives.

 

Dr. Matt Wood is Vice President of Product Management and leads our vertical AI efforts on the ML team, including Personalize, Forecast, Poirot, and Colossus, along with our thought leadership projects such as DeepRacer. In his spare time Matt also serves as the chief science geek for the scalable COVID testing initiative at Amazon; providing guidance on scientific and technical development, including test design, lab sciences, regulatory oversight, and the evaluation and implementation of emerging testing technologies.

Identify bottlenecks, improve resource utilization, and reduce ML training costs with the deep profiling feature in Amazon SageMaker Debugger

Machine learning (ML) has shown great promise across domains such as predictive analysis, speech processing, image recognition, recommendation systems, bioinformatics, and more. Training ML models is a time- and compute-intensive process, requiring multiple training runs with different hyperparameters before a model yields acceptable accuracy. CPU- and GPU-based distributed training with frameworks such as Horovod and Parameter Servers addresses this issue by allowing training to be easily scalable to a cluster of resources. However, distributed training makes it harder to identify and debug resource bottlenecks. Gaining insight into the training in progress, both at the ML framework level and the underlying compute resources level, is a critical step towards understanding resource usage patterns and reducing resource wastage. Analyzing bottleneck issues is necessary to maximize the utilization of compute resources and optimize model training performance to deliver state-of-the-art ML models with target accuracy.

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at scale. Amazon SageMaker Debugger is a feature of SageMaker training that makes it easy to train ML models faster by capturing real-time metrics such as learning gradients and weights. This provides transparency into the training process, so you can correct anomalies such as losses, overfitting, and overtraining. Debugger provides built-in rules to easily analyze emitted data, including tensors that are critical for the success of training jobs.

With the newly introduced profiling capability, Debugger now automatically monitors system resources such as CPU, GPU, network, I/O, and memory, providing a complete resource utilization view of training jobs. You can also profile your entire training job or portions thereof to emit detailed framework metrics during different phases of the training job. Framework metrics are metrics that are captured from within the training script, such as step duration, data loading, preprocessing, and operator runtime on CPU and GPU.

Debugger correlates system and framework metrics, which helps you identify possible root causes. For example, if utilization on GPU drops to zero, you can inspect what has been happening within the training script at this particular time. You can right-size resources and quickly identify bottlenecks and fix them using insights from the profiler.

You can re-allocate resources based on recommendations from the profiling capability. Metrics and insights are captured and monitored programmatically using the SageMaker Python SDK or visually through Amazon SageMaker Studio.

In this post, we demonstrate Debugger profiling capabilities using a TensorFlow-based sentiment analysis use case. In the notebook included in this post, we set up a Convolutional Neural Network (CNN) using TensorFlow script mode on SageMaker. For our dataset, we use the IMDB dataset, which consists of movie reviews labeled as positive or negative sentiment. We use Debugger to showcase how to gain visibility into the system resource utilization of the training instances, profile framework metrics, and identify an underutilized training resource due to resource bottlenecks. We further demonstrate how to improve resource utilization after implementing the recommendations from Debugger.

Walkthrough overview

The remainder of this post details how to use the Debugger profiler capability to gain visibility into ML training jobs and analysis of profiler recommendations. The notebook includes details of using TensorFlow Horovod distributed training where the profiling capability enabled us to improve resource utilization up to 36%. The first training run was on three p3.8xlarge instances for 503 seconds, and the second training run after implementing the profiler recommendations took 502 seconds on two p3.2xlarge instances, resulting in 83% cost savings. Profiler analysis of the second training run provided additional recommendations highlighting the possibility of further cost savings and better resource utilization.

The walkthrough includes the following high-level steps:

  1. Train a TensorFlow sentiment analysis CNN model using SageMaker distributed training with custom profiler configuration.
  2. Visualize the system and framework metrics generated to analyze the profiler data.
  3. Access Debugger Insights in Studio.
  4. Analyze the profiler report generated by Debugger.
  5. Analyze and implement recommendations from the profiler report.

Additional steps such as importing the necessary libraries and examining the dataset are included in the notebook. Review the notebook for complete details.

Training a CNN model using SageMaker distributed training with custom profiler configuration

In this step, you train the sentiment analysis model using the TensorFlow estimator with the profiler enabled.

First ensure that Debugger libraries are imported. See the following code:

# import debugger libraries
from sagemaker.debugger import ProfilerConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs, FrameworkProfile

Next, set up Horovod distribution for TensorFlow distributed training. Horovod is a distributed deep learning training framework for TensorFlow, Keras, and PyTorch. The objective is to take a single-GPU training script and successfully scale it to train across many GPUs in parallel. After a training script has been written for scale with Horovod, it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. In addition to being easy to use, Horovod is fast. For more information, see the Horovod GitHub page.

We can set up hyperparameters such as number of epochs, batch size, and data augmentation:

hyperparameters = {'epoch': 25, 
                   'batch_size': 256,
                   'data_augmentation': True}

Changing these hyperparameters might impact resource utilization with your training job.

For our training, we start off using three p3.8xlarge instances and change our training configuration based on profiling recommendations from Debugger:

distributions = {
                    "mpi": {
                        "enabled": True,
                        "processes_per_host": 3,
                        "custom_mpi_options": "-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
                    }
                }

model_dir = '/opt/ml/model'
train_instance_type='ml.p3.8xlarge'
instance_count = 3

The p3.8xlarge instance comes with 4 GPUs and 32 vCPU cores with 10 Gbps networking performance. For more information, see Amazon EC2 Instance Types. Take your AWS account limits into consideration while setting up the instance_type and instance_count of the cluster.

Then we define the profiler configuration. With the following profiler_config parameter configuration, Debugger calls the default settings of monitoring and profiling. Debugger monitors system metrics every 500 milliseconds. You specify additional details on when to start and how long to run profiling. You can set different profiling settings to profile target steps and target time intervals in detail.

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(start_step=2, num_steps=7)
)

For a complete list of parameters, see Amazon SageMaker Debugger.

Then we configure a training job using TensorFlow estimator and pass in the profiler configuration. For framework_version and py_version, specify the TensorFlow framework version and supported Python version, respectively:

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name= 'tf-keras-silent',
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04",
    model_dir=model_dir,
    instance_count=instance_count,
    instance_type=train_instance_type,
    entry_point= 'sentiment-distributed.py',
    source_dir='./tf-sentiment-script-mode',
    profiler_config=profiler_config,
    script_mode=True,
    hyperparameters=hyperparameters,
    distribution=distributions
)

For a complete list of the supported framework versions and the corresponding Python versions to use, see Amazon SageMaker Debugger.

Finally, start the training job:

estimator.fit(inputs, wait=False)

Visualizing the system and framework metrics generated

Now that our training job is running, we can perform interactive analysis of the data captured by Debugger. The analysis is organized in order of training phases: initialization, training, and finalization. The profiling data results are categorized as system metrics and algorithm (framework) metrics. After the training job initiates, Debugger starts collecting system and framework metrics. The smdebug library provides profiler analysis tools that enable you to access and analyze the profiling data.

First, we collect the system and framework metrics using the S3SystemMetricsReader library:

from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader
import time

path = estimator.latest_job_profiler_artifacts_path()
system_metrics_reader = S3SystemMetricsReader(path)

Check if we have metrics available for analysis:

while system_metrics_reader.get_timestamp_of_latest_available_file() == 0:
    system_metrics_reader.refresh_event_file_list()
    client = sagemaker_client.describe_training_job(
        TrainingJobName=training_job_name
    )
    if 'TrainingJobStatus' in client:
        training_job_status = f"TrainingJobStatus: {client['TrainingJobStatus']}"
    if 'SecondaryStatus' in client:
        training_job_secondary_status = f"TrainingJobSecondaryStatus: {client['SecondaryStatus']}"

When the data is available, we can query and inspect it:

system_metrics_reader.refresh_event_file_list()
last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()
events = system_metrics_reader.get_events(0, last_timestamp)

Along with the notebook, the smdebug SDK contains several utility classes that can be used for visualizations. From the data collected, you can visualize the CPU and GPU utilization values as a histogram using the utility class MetricsHistogram. MetricsHistogram computes a histogram on GPU and CPU utilization values. Bins are between 0–100. Good system utilization means that the center of the distribution should be between 80–90. In case of multi-GPU training, if distributions of GPU utilization values aren’t similar, it indicates an issue with workload distribution.

The following code plots the histograms per metric. To only plot specific metrics, define the lists select_dimensions and select_events. A dimension can be CPUUtilization, GPUUtilization, GPUMemoryUtilization, or IOPS. If no event is specified, then for the CPU utilization, a histogram for each single core and total CPU usage is plotted.

from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram

system_metrics_reader.refresh_event_file_list()
metrics_histogram = MetricsHistogram(system_metrics_reader)
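
To actually render the histograms, call the object's plot method; the following is a sketch, assuming the plot signature used by the smdebug notebook utilities:

metrics_histogram.plot(
    starttime=0,
    endtime=system_metrics_reader.get_timestamp_of_latest_available_file(),
    select_dimensions=["CPU", "GPU"],  # optional filter
    select_events=["total"]            # optional filter
)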

The following screenshot shows our histograms.

Similar to system metrics, let’s retrieve all the events emitted from the framework or algorithm metrics using the following code:

from smdebug.profiler.algorithm_metrics_reader import S3AlgorithmMetricsReader

framework_metrics_reader = S3AlgorithmMetricsReader(path)

events = []
while framework_metrics_reader.get_timestamp_of_latest_available_file() == 0 or len(events) == 0:
    framework_metrics_reader.refresh_event_file_list()
    last_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()
    events = framework_metrics_reader.get_events(0, last_timestamp)

framework_metrics_reader.refresh_event_file_list()
last_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()
events = framework_metrics_reader.get_events(0, last_timestamp)

We can inspect one of the recorded events to get the following:

print("Event name:", events[0].event_name, 
      "nStart time:", timestamp_to_utc(events[0].start_time/1000000000), 
      "nEnd time:", timestamp_to_utc(events[0].end_time/1000000000), 
      "nDuration:", events[0].duration, "nanosecond")

	Event name: Step:ModeKeys.TRAIN 
	Start time: 2020-12-04 22:44:14 
	End time: 2020-12-04 22:44:25 
	Duration: 10966842000 nanosecond
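
The timestamp_to_utc helper used in the print statement above is defined elsewhere in the notebook; a minimal sketch of such a helper could look like the following:

from datetime import datetime

def timestamp_to_utc(timestamp_seconds):
    # Convert a POSIX timestamp in seconds to a readable UTC string
    return datetime.utcfromtimestamp(timestamp_seconds).strftime("%Y-%m-%d %H:%M:%S")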

For more information about system and framework metrics, see documentation.

Next, we use the StepHistogram utility class to create a histogram of step duration values. Significant outliers in step durations are an indication of a bottleneck. It allows you to easily identify clusters of step duration values.

from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram
framework_metrics_reader.refresh_event_file_list()
step_histogram = StepHistogram(framework_metrics_reader)
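
As with the system metrics histogram, the step histogram is rendered with a plot call; a sketch, assuming StepHistogram exposes the same starttime/endtime interface as MetricsHistogram:

step_histogram.plot(
    starttime=0,
    endtime=framework_metrics_reader.get_timestamp_of_latest_available_file()
)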

The following screenshot shows our visualization.

For an alternative view of CPU and GPU utilizations, the following code creates a heat map where each row corresponds to one metric (CPU core and GPU utilizations) and the x-axis is the duration of the training job. It allows you to more easily spot CPU bottlenecks, for example, if utilization on GPU is low but utilization of one or more CPU cores is high.

from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap

view_heatmap = Heatmap(
    system_metrics_reader,
    framework_metrics_reader,
    select_dimensions=["CPU", "GPU", "I/O"], # optional
    select_events=["total"],                 # optional
    plot_height=450
)

The following screenshot shows the heat map of a training job that has been using 4 GPUs and 32 CPU cores. The first few rows show the GPUs’ utilization, and the remaining rows show the utilization on CPU cores. Yellow indicates maximum utilization, and purple means that utilization was 0. GPUs have frequent stalled cycles where utilization drops to 0, whereas at the same time, utilization on CPU cores is at a maximum. This is a clear indication of a CPU bottleneck where GPUs are waiting for the data to arrive. Such a bottleneck can be caused by overly compute-heavy preprocessing.

Accessing Debugger Insights in Studio

You can also use Studio to perform training with our existing notebook. Studio provides built-in visualizations to analyze profiling insights. Alternatively, you can move to the next section in this post to directly analyze the generated profiler report.

If you trained in a SageMaker notebook instance, you can still find the Debugger insights for that training in Studio if the training happened in the same Region.

  1. On the navigation pane, choose Components and registries.
  2. Choose Experiments and trials.
  3. Choose your training job (right-click).
  4. Choose Debugger Insights.

For more information about setting up Studio, see Set up Amazon SageMaker.

Reviewing Debugger reports

After you have set up and run this notebook in Studio, you can access Debugger Insights.

  1. On the navigation pane, choose Components and registries.
  2. Choose Experiments and trials.
  3. Choose your training job (right-click).
  4. Choose View Debugger for insights.

A Debugger tab opens for this training job. For more information, see Debugger Insights.

Training job summary

This section of the report shows details of the training job, such as the start time, end time, duration, and time spent in individual phases of the training. The pie chart visualization of these delays shows the time spent in initialization, training, and finalization phases relative to each other.

System usage statistics

This portion of the report gives detailed system usage statistics for the training instances involved in training, along with analysis and suggestions for improvements. The following text is an excerpt from the report:

The 95th quantile of the total GPU utilization on node algo-1 is only 13%. The 95th quantile of the total CPU utilization is only 24%. Node algo-1 is under-utilized. You may want to consider switching to a smaller instance type. The 95th quantile of the total GPU utilization on node algo-2 is only 13%. The 95th quantile of the total CPU utilization is only 24%. Node algo-2 is under-utilized. You may want to consider switching to a smaller instance type. The 95th quantile of the total GPU utilization on node algo-3 is only 13%. The 95th quantile of the total CPU utilization is only 24%. Node algo-3 is under-utilized. You may want to consider switching to a smaller instance type.

The following table shows usage statistics per worker node, such as total CPU and GPU utilization, total CPU, and memory footprint. The table also includes total I/O wait time and total sent and received bytes. The table shows minimum and maximum values as well as p99, p90, and p50 percentiles.

Framework metrics summary

In this section, the following pie charts show the breakdown of framework operations on CPUs and GPUs.

Insights

Insights provides suggestions and additional details, such as the number of times each rule triggered, the rule parameters, and the default threshold values to evaluate your training job performance. According to the insights for our TensorFlow training job, profiler rules were run for three out of the eight insights. The following screenshot shows the insights.

If you choose an insight, you can view the profiler recommendations.

By default, we are showing the overview report, but you could choose Nodes to show the dashboard.

You can expand each algorithm to get deep dive information such as CPU utilization, network utilization, and system metrics per algorithm used during training.

Furthermore, you can scroll down to analyze GPU memory utilization over time and system utilization over time for each algorithm.

Analyzing the profiler report generated by Debugger

Download the profiler report by choosing Download report.

Alternatively, if you’re not using Studio, you can download your report directly from Amazon Simple Storage Service (Amazon S3) at s3://<your bucket>/tf-keras-sentiment-<job id>/profiler-output/.

Next, we review a few sections of the generated report. For additional details, see SageMaker Debugger report. You can also use the SMDebug client library for performing data analysis.

Framework metrics summary

In this section of the report, you see a pie chart that shows the time the training job spent in the training phase, validation phase, or “others.” “Others” represents the accumulated time between steps; that is, the time between when a step has finished but the next step hasn’t started. Ideally, most time should be spent in training steps.

Identifying the most expensive CPU operator

This section provides detailed information about the CPU operators. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called CPU operators.

The following table shows a list of operators that your training job ran on CPU. The most expensive operator on CPU was ExecutorState::Process with 16%.

Identifying the most expensive GPU operator

This section provides detailed information about the GPU operators. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called GPU operators.

The following table shows a list of operators that your training job ran on GPU. The most expensive operator on GPU was Adam with 29%.

Rules summary

In this section, Debugger aggregates all the rule evaluation results, analysis, rule descriptions, and suggestions. The following table shows a summary of the profiler rules that ran, sorted by the rules that triggered most frequently. In this training job, the most frequently triggered rule was LowGPUUtilization. It processed 1,001 data points and was triggered 8 times.

 

Because the rules were triggered for LowGPUUtilization, BatchSize, and CPUBottleneck, let’s dive deeper into each one to understand the profiler recommendations.

LowGPUUtilization

The LowGPUUtilization rule checks for low and fluctuating GPU usage. If usage is consistently low, it might be caused by bottlenecks or by a batch size or model that is too small. If usage is heavily fluctuating, it can be caused by bottlenecks or blocking calls.

The rule computed the 95th and 5th quantile of GPU utilization on 500 continuous data points and found eight cases where p95 was above 70% and p5 was below 10%. If p95 is high and p5 is low, it indicates that the usage is highly fluctuating. If both values are very low, it means that the machine is under-utilized. During initialization, utilization is likely 0, so the rule skipped the first 1,000 data points. The rule analyzed 1,001 data points and was triggered eight times. Moreover it also provides the time when this rule was last triggered.

BatchSize

The BatchSize rule helps detect if GPU is under-utilized because of the batch size being too small. To detect this, the rule analyzes the GPU memory footprint and CPU and GPU utilization. The rule analyzed 1,000 data points and was triggered four times. Your training job is under-utilizing the instance. You may want to consider switching to a smaller instance type or increasing the batch size of your model training. Moreover it also provides the time when this rule was last triggered.

The following boxplot is a snapshot from this timestamp that shows for each node the total CPU utilization and the utilization and memory usage per GPU.

CPUBottleneck

The CPUBottleneck rule checks when CPU utilization was above cpu_threshold of 90% and GPU utilization was below gpu_threshold of 10%. During initialization, utilization is likely 0, so the rule skipped the first 1,000 data points. With this configuration, the rule found 2,129 CPU bottlenecks, which is 70% of the total time. This is above the threshold of 50%. The rule analyzed 3,019 data points and was triggered four times.

The following chart (left) shows how many data points were below the gpu_threshold of 10% and how many of those data points were likely caused by a CPU bottleneck. The rule found 3,000 out of 3,019 data points that had a GPU utilization below 10%. Out of those data points, 70.52% were likely caused by CPU bottlenecks. The second chart (right) shows whether CPU bottlenecks mainly happened during the train or validation phase.

Analyzing and implementing recommendations from the profiler report

Let’s now analyze and implement the profiling recommendations for our training job to improve resource utilization and make our training efficient. First let’s review the configuration of our training job and check the three rules that were triggered by Debugger during the training run.

The following table summarizes the training job configuration.

Instance Type Instance Count Number of processes per host Profiling Configuration Number of Epochs Batch Size
P3.8xlarge 3 3 FrameworkProfile(start_step=2, num_steps=7), Monitoring Interval = 500 milliseconds 25 256

The following table summarizes the Debugger profiling recommendations.

Rule Triggered Reason Recommendations
BatchSize Checks if GPU is under-utilized because of the batch size being too small. Run on a smaller instance type or increase batch size.
LowGPUUtilization Checks if GPU utilization is low or suffers from fluctuations. This can happen if there are bottlenecks, many blocking calls due to synchronizations, or batch size being too small. Check for bottlenecks, minimize blocking calls, change distributed training strategy, increase batch size.
CPUBottleneck Checks if CPU usage is high but GPU usage is low at the same time, which may indicate a CPU bottleneck where GPU is waiting for data to arrive from CPU. CPU bottlenecks can happen when data preprocessing is very compute intensive. Consider increasing the number of data-loader processes or applying pre-fetching.
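
The CPUBottleneck recommendation targets the input pipeline. In a TensorFlow training script, this is typically addressed with parallel preprocessing and prefetching; the following is a sketch only, where preprocess_fn, features, and labels stand in for whatever the actual sentiment-distributed.py script uses:

import tensorflow as tf

# Parallelize preprocessing across CPU cores and prefetch batches so the GPU isn't starved
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .map(preprocess_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(512)
           .prefetch(tf.data.experimental.AUTOTUNE))

In this walkthrough, however, the bottleneck is addressed by right-sizing the training resources, as described next.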

Based on the recommendation to consider switching to a smaller instance type and to increase the batch size, we change the training configuration settings and rerun the training. In the notebook, the training instances are changed from p3.8xlarge to p3.2xlarge instances, the number of instances is reduced to two, and only one process per host is configured for MPI to increase the number of data loaders. The batch size is also increased to 512.
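
In code, these changes amount to updating the configuration cells shown earlier; the following is a sketch with the values used in the walkthrough:

# Revised settings based on the profiler recommendations
hyperparameters = {'epoch': 25,
                   'batch_size': 512,
                   'data_augmentation': True}

distributions = {
    "mpi": {
        "enabled": True,
        "processes_per_host": 1,
        "custom_mpi_options": "-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
    }
}

train_instance_type = 'ml.p3.2xlarge'
instance_count = 2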

The following table summarizes the revised training job configuration. 

Instance Type Instance Count Number of processes per host Profiling Configuration Number of Epochs Batch Size
P3.2xlarge 2 1 FrameworkProfile(start_step=2, num_steps=7), Monitoring Interval = 500 milliseconds 25 512

After running the second training job with the new settings, a new report is generated, but with no rules triggered, indicating all the issues identified in the earlier run were resolved. Now let’s compare the report analysis from the two training jobs and understand the impact of the configuration changes made.

The training job summary shows that the training time was almost similar, with 502 seconds in the revised run compared to 503 seconds in the first run. The amount of time spent in the training loop for both jobs was also comparable at 45%.

Examining the system usage statistics shows that both CPU and GPU utilization of the two training instances increased when compared to the original run. For the first training run, GPU utilization was constant at 13.5% across the three instances for the 95th quantile of GPU utilization, and the CPU utilization was constant at 24.4% across the three instances for the 95th quantile of CPU utilization. For the second training run, GPU utilization increased to 46% for the 95th quantile, and the CPU utilization increased to 61% for the 95th quantile.

Although no rules were triggered during this run, there is still room for improvement in resource utilization.

The following screenshot shows the rules summary for our revised training run.

You can continue to tune your training job, change the training parameters, rerun the training, and compare the results against previous training runs. Repeat this process to fine-tune your training strategy and training resources to achieve the optimal combination of training cost and training performance according to your business needs.

Optimizing costs

The following table shows a cost comparison of the two training runs.

Training Run Instance Count Instance Type Training Time (in Seconds) Instance Hourly Cost (us-west-2) Training Cost Cost Savings
First training run 3 p3.8xlarge 503 $14.688 $6.16 N/A
Second training run with Debugger profiling recommendations 2 p3.2xlarge 502 $3.825 $1.07 82.6%

Considering the cost of the training instances in a specific Region at the time of this writing, for example us-west-2, training with three ml.p3.8xlarge instances for 503 seconds costs $6.16, and training with two ml.p3.2xlarge instances for 502 seconds costs $1.07. That is 83% cost savings achieved simply by implementing the profiler recommendation to right-size the training instances.
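
As a quick check of these numbers, the job cost is simply the instance count multiplied by the hourly price and the training time in hours:

# Rough cost check using the hourly prices quoted above (us-west-2)
first_run = 3 * 14.688 * (503 / 3600)   # ~$6.16 on three ml.p3.8xlarge
second_run = 2 * 3.825 * (502 / 3600)   # ~$1.07 on two ml.p3.2xlarge
savings = (first_run - second_run) / first_run
print(f"${first_run:.2f} vs ${second_run:.2f} -> {savings:.0%} savings")  # ~83% savings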

Conclusion

The profiling feature of SageMaker Debugger is a powerful tool to gain visibility into ML training jobs. In this post, we provided insight into training resource utilization to identify bottlenecks, analyze the various phases of training, and identify expensive framework functions. We also showed how to analyze and implement profiler recommendations. We applied profiler recommendations to a TensorFlow Horovod distributed training for a sentiment analysis model and achieved resource utilization improvement up to 60% and cost savings of 83%. Debugger provides profiling capabilities for all leading deep learning frameworks, including TensorFlow, PyTorch, and Keras.

Give Debugger profiling a try and leave your feedback in the comments. For additional information on SageMaker Debugger, check out the announcement post.

 


About the Authors

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. Prior to joining Amazon, she worked as an IT Consultant and completed her masters in Computer Information Systems from Georgia State University, with a focus in big data analytics. She is passionate about NLP and ML explainability in AI/ML.

 

Prem Ranga is an Enterprise Solutions Architect based out of Houston, Texas. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an Autonomous Vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

 

Sireesha Muppala is an AI/ML Specialist Solutions Architect at AWS, providing guidance to customers on architecting and implementing machine learning solutions at scale. She received her Ph.D. in Computer Science from the University of Colorado, Colorado Springs. In her spare time, Sireesha loves to run and hike Colorado trails.

Read More

Amazon Forecast Weather Index – automatically include local weather to increase your forecasting model accuracy

Amazon Forecast Weather Index – automatically include local weather to increase your forecasting model accuracy

We’re excited to announce the Amazon Forecast Weather Index, which can increase your forecasting accuracy by automatically including local weather information in your demand forecasts with one click and at no extra cost. Weather conditions influence consumer demand patterns, product merchandizing decisions, staffing requirements, and energy consumption needs. However, acquiring, cleaning, and effectively using live weather information for demand forecasting is challenging and requires ongoing maintenance. With this launch, you can now include 14-day weather forecasts for US and Europe locations in your demand forecasts with one click.

The Amazon Forecast Weather Index combines multiple weather metrics from historical weather events and current forecasts at a given location to increase your demand forecast model accuracy. Amazon Forecast uses machine learning (ML) to generate more accurate demand forecasts, without requiring any prior ML experience. Forecast brings the same technology used at Amazon.com to developers as a fully managed service, removing the need to manage resources or rebuild your systems.

Changes in local weather conditions can impact short-term demand for products and services at particular locations for many organizations in retail, hospitality, travel, entertainment, insurance, and energy domains. Although historical demand patterns show seasonal demand, advance planning for day-to-day variation is harder.

In retail inventory management use cases, day-to-day weather variation impacts foot traffic and product mix. Typical demand forecasting systems don’t take expected weather conditions into account, leading to stock-outs or excess inventory at some locations and the need to transfer inventory mid-week. For example, a retailer that knows a heat wave is expected may choose to move additional air conditioners from distribution centers to specific store locations, or prepare different types of grab-and-go food items depending on the weather conditions.

Outside of product demand, weather conditions also impact staffing needs. For example, restaurants can better balance staff dependent on dine-in vs. take-out orders, or businesses with warehouses can better predict the number of workers that may come into work because of disrupted transportation. Although store managers may be able to make one-off stocking decisions based on weather conditions using their intuition and judgment, making buying, inventory placement, and workforce management decisions at scale becomes more challenging.

Day-to-day weather variation also impacts hyper-local on-demand services that rely on efficient matching of supply and demand at scale. A looming storm can lead to high demand for local ride hailing or food delivery services, while also impacting the number of drivers available. Having the information of upcoming weather changes enables you to better meet customer demand. Programmatically applying local weather information at scale can help you preemptively match supply and demand.

Weather forecasts are widely available, and although it’s possible to use these predictions to forecast demand for products and services more accurately, doing so in practice can be a struggle. Acquiring your own historical weather data and weather forecasts is expensive, and requires constant data collation, aggregation, and cleaning. Additionally, without weather domain expertise, transforming raw weather metrics into predictive features is challenging.

With today’s launch, you can account for local day-to-day weather changes to better predict demand, with only one click and at no additional cost, using Forecast. When you use the Weather Index, Forecast trains a model with historical weather information for the locations of your operations and uses the latest 14-day weather forecasts on items that are influenced by day-to-day variations to create more accurate demand forecasts.

Tom Summerfield is the Director of Retail at Peak.AI, an accessible AI system that harnesses the power of data to assist—not displace—humans to improve business efficiency and productivity. Summerfield says, “At Peak, we work with retail, CPG, and manufacturing customers who all know that weather plays a strong role in dictating consumer buying habits. Variation in weather ultimately impacts their product demand and product basket mix. Our customers frequently ask us to include weather in their demand forecasts. With Amazon Forecast adding a weather feature, we are now able to seamlessly integrate these insights and improve the accuracy of our demand planning models.”

The Weather Index is currently optimized for in-store retail demand planning and local on-demand services, but may still add value to scenarios where weather impacts demand such as power and utilities. As of this writing, the Weather Index is only available for US and Europe Regions. Other Regions will become available soon. For more information about latitude-longitude bounding boxes and US zip codes supported, see Weather Index.

Using the Weather Index for your forecasting use case

You can add local weather information to your model by adding the Weather Index during training. In this section, we walk through the steps to use the Weather Index on the Forecast console. For this post, we use the New York City Taxi dataset. To review the steps through the APIs, refer to the following notebook in our GitHub repo, where we have a cleaned version of the New York Taxi dataset ready to be used.

The New York Taxi dataset has 260 locations and is used to predict the demand for taxis per location per hour for the next 7 days (168 hours).

  1. On the Forecast console, create a dataset group.

  2. Upload the historical demand dataset as the target time series. This dataset must include geolocation information for you to use the Weather Index.
  3. Select Schema builder.
  4. Choose your location format (for this post, the dataset includes latitude and longitude coordinates).

Forecast also supports postal codes for the US only.

  5. For Dataset import details, select Select time zone.
  6. Choose your time zone (for this post, we choose America/New_York).

You can apply a single time zone to the entire dataset, or ask Forecast to derive a time zone from the geolocation of each item ID in the target time series dataset.

  7. In the navigation pane, under your dataset, choose Predictors.
  8. Choose Train predictor.

  9. For Forecast horizon, choose 168.
  10. For Forecast frequency, choose hour.
  11. For Number of backtest windows, choose 3.
  12. For Backtest window offset, choose 168.
  13. For Forecast types, choose p50, p60, and p70.
  14. For Algorithm, you can either select AutoML for Forecast to find the best algorithm for your dataset or select a specific algorithm. For this post, we select DeepAR+ with Hyperparameter optimization turned on for Forecast to optimize the model.
  15. Under Built-in datasets, select Enable Weather Index to apply the Weather Index to your training model. For this post, we have also selected Enable Holidays for US, as we hypothesize that holidays will have an impact on the demand for taxis.

If you’re following the notebook in our GitHub repo, we call this predictor nyctaxi_demo_weather_deepar. While training the model, Forecast uses the historical weather to apply the Weather Index to only those items that are impacted by weather to improve item-level accuracy.
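
If you’re working through the APIs rather than the console, the following is a hedged boto3 sketch that mirrors the settings above; the dataset group ARN is a placeholder, and the notebook in our GitHub repo shows the complete flow.

    # Sketch: training a DeepAR+ predictor with the Weather Index and US holidays enabled.
    # The dataset group ARN is a placeholder.
    import boto3

    forecast = boto3.client("forecast")

    forecast.create_predictor(
        PredictorName="nyctaxi_demo_weather_deepar",
        AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
        ForecastHorizon=168,                           # 7 days at hourly frequency
        ForecastTypes=["0.5", "0.6", "0.7"],
        PerformHPO=True,                               # let Forecast tune the DeepAR+ hyperparameters
        EvaluationParameters={
            "NumberOfBacktestWindows": 3,
            "BackTestWindowOffset": 168,
        },
        InputDataConfig={
            "DatasetGroupArn": "arn:aws:forecast:us-west-2:111122223333:dataset-group/nyctaxi_demo",
            "SupplementaryFeatures": [
                {"Name": "holiday", "Value": "US"},
                {"Name": "weather", "Value": "true"},  # enables the Weather Index
            ],
        },
        FeaturizationConfig={"ForecastFrequency": "H"},
    )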

  16. After your predictor is trained, choose your predictor on the Predictors page to view the details of the accuracy metrics.

  17. On the predictor’s details page, you can review the model accuracy numbers and choose Export backtest results in the Predictor metrics section.

Forecast provides different model accuracy metrics for you to assess the strength of your forecasting models. We provide the weighted quantile loss (wQL) metric for each selected distribution point, also called quantiles, and weighted absolute percentage error (WAPE) and root mean square error (RMSE), calculated at the mean forecast. For each metric, a lower value indicates a smaller error and therefore a more accurate model. All these accuracy metrics are non-negative. Quantiles are specified when choosing your forecast type. For more information about how each metric is calculated and recommendations for the best use case for each metric, see Measuring forecast model accuracy to optimize your business objectives with Amazon Forecast.

  18. For S3 predictor backtest export location, enter the details of your Amazon Simple Storage Service (Amazon S3) location for exporting the CSV files.

Exporting the backtest results downloads the forecasts from the backtesting for each item and the accuracy metrics for each item. This helps you measure the accuracy of forecasts for individual items, allowing you to better understand your forecasting model’s performance for the items that most impact your business. For more information about the benefits of exporting backtest results, see Amazon Forecast now supports accuracy measurements for individual items.
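
Through the API, the console’s Export backtest results action corresponds to a predictor backtest export job; the sketch below is illustrative, with placeholder ARNs and S3 paths.

    # Sketch: exporting per-item backtest results to S3 (the ARNs and path are placeholders).
    import boto3

    forecast = boto3.client("forecast")

    forecast.create_predictor_backtest_export_job(
        PredictorBacktestExportJobName="nyctaxi_demo_weather_deepar_backtest",
        PredictorArn="arn:aws:forecast:us-west-2:111122223333:predictor/nyctaxi_demo_weather_deepar",
        Destination={
            "S3Config": {
                "Path": "s3://your-bucket/backtest-exports/weather/",
                "RoleArn": "arn:aws:iam::111122223333:role/ForecastS3AccessRole",
            }
        },
    )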

In the next section of this post, we use these backtest results to assess the accuracy improvement from enabling the Weather Index, by comparing the item-level accuracy of a model trained with the Weather Index against one trained without it.

  19. After you evaluate the model accuracy, you can start creating forecasts by choosing Forecasts in the navigation pane.
  20. Choose Create a forecast.

To create these forecasts, Forecast automatically pulls in the weather forecasts for the next 14 days and applies the weather prediction to only those item IDs that are influenced by weather. In our example, we create forecasts for the next 7 days with hourly frequency.

Assessing the impact of the Weather Index

To assess the impact of adding weather information to your forecasting models, we can create another predictor with the same dataset and settings, but this time without enabling the Weather Index. If you’re following the notebook in our GitHub repo, we call this predictor nyctaxi_demo_baseline_deepar.

When creating this predictor, don’t select Hyperparameter optimization for DeepAR+. Instead, use the winning training parameters from the hyperparameter optimization of the nyctaxi_demo_weather_deepar DeepAR+ model as the training parameters, for a fair comparison between the two models. You can find the winning training parameters on the predictor details page under the Predictor metrics section. For this post, these are as follows.

"context_length": "63",
"epochs": "500",
"learning_rate": "0.014138165570842774",
"learning_rate_decay": "0.5",
"likelihood": "student-t",
"max_learning_rate_decays": "0",
"num_averaged_models": "1",
"num_cells": "40",
"num_layers": "2",
"prediction_length": "168"

You can then go to the Predictors page to review the predictor metrics for nyctaxi_demo_baseline_deepar.
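
If you create the baseline through the API rather than the console, one hedged way to reproduce this setup is to pass the winning values through the TrainingParameters argument with HPO turned off, as in the sketch below; the ARNs are placeholders, and prediction_length is implied by the forecast horizon rather than passed explicitly.

    # Sketch: training the baseline predictor with the winning hyperparameters fixed and HPO off.
    # The dataset group ARN is a placeholder; only the TrainingParameters values come from above.
    import boto3

    forecast = boto3.client("forecast")

    forecast.create_predictor(
        PredictorName="nyctaxi_demo_baseline_deepar",
        AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
        ForecastHorizon=168,                          # plays the role of prediction_length
        ForecastTypes=["0.5", "0.6", "0.7"],
        PerformHPO=False,                             # reuse the tuned values instead of re-running HPO
        TrainingParameters={
            "context_length": "63",
            "epochs": "500",
            "learning_rate": "0.014138165570842774",
            "learning_rate_decay": "0.5",
            "likelihood": "student-t",
            "max_learning_rate_decays": "0",
            "num_averaged_models": "1",
            "num_cells": "40",
            "num_layers": "2",
        },
        EvaluationParameters={
            "NumberOfBacktestWindows": 3,
            "BackTestWindowOffset": 168,
        },
        InputDataConfig={
            # No "weather" supplementary feature here, so the Weather Index stays disabled.
            "DatasetGroupArn": "arn:aws:forecast:us-west-2:111122223333:dataset-group/nyctaxi_demo",
        },
        FeaturizationConfig={"ForecastFrequency": "H"},
    )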

The following screenshot shows the predictor details page for the nyctaxi_demo_baseline_deepar model that is trained without enabling the Weather Index. The predictor metrics for nyctaxi_demo_weather_deepar with weather enabled are shown earlier, after the create predictor steps.

The following table summarizes the predictor metrics for the two models. Forecast provides the weighted quantile loss (wQL) metric for each quantile, and weighted absolute percentage error (WAPE) metric and root mean square error (RMSE) metric, calculated at the mean forecast. For each metric, a lower value indicates a smaller error and therefore a more accurate model. The model with the Weather Index is more accurate, with lower values for each metric.

Predictor wQL[0.5] wQL[0.6] wQL[0.7] WAPE RMSE
nyctaxi_demo_baseline_deepar 0.2637 0.2769 0.2679 0.2625 31.3986
nyctaxi_demo_weather_deepar 0.1646 0.1620 0.1498 0.1647 19.7874

You can now export the backtest results for both predictors to assess the forecasting accuracy at an item level. With the backtest results, you can also use a visualization tool like Amazon QuickSight to create graphs that help you visualize and compare the model accuracy of both the predictors by plotting the forecasts against actuals for items that are important for you. The following graph visualizes the comparison of the models with and without the Weather Index to the actual demand for a few items in the dataset at the 0.60 quantile.
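
As a rough illustration of that item-level comparison, the sketch below joins the per-item accuracy files from the two backtest exports with pandas; the file and column names are assumptions, so adjust them to match the CSV files you actually download.

    # Sketch: comparing item-level backtest accuracy between the two predictors.
    # File names and column names are assumptions; match them to your exported CSV files.
    import pandas as pd

    baseline = pd.read_csv("baseline_accuracy.csv")   # per-item accuracy export, baseline predictor
    weather = pd.read_csv("weather_accuracy.csv")     # per-item accuracy export, Weather Index predictor

    comparison = baseline.merge(weather, on="item_id", suffixes=("_baseline", "_weather"))

    # Count the items where the Weather Index reduced the weighted quantile loss at the 0.6 quantile.
    improved = comparison[comparison["wQL_0.6_weather"] < comparison["wQL_0.6_baseline"]]
    print(f"{len(improved)} of {len(comparison)} items improved with the Weather Index")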

For Feb 27, we have zoomed in to better assess the difference in accuracies at an hourly level.

Here we show the magnitude of error for each item ID for the two models. Lower error values correspond to a more accurate model. Most items in the model with the Weather Index have errors below 0.05.

Tips and best practices

When using the Weather Index, consider the following best practices:

  • Before using the Weather Index, define your use case and the forecasting challenge. Evaluate whether your business problem is impacted by day-to-day weather, because the Weather Index is only available for short-term use cases of 14-day forecasts. Weekly, monthly, and yearly frequencies aren’t supported, so use cases where you are forecasting for the next season don’t benefit from the Weather Index. Only daily, hourly, and minute frequencies are supported with the Weather Index.
  • For experimentation, start by identifying the most important item IDs for your business that you want to improve your forecasting accuracy. Measure the accuracy of your existing forecasting methodology as a baseline and compare that to the accuracy of those items with Forecast.
  • Incrementally add the Weather Index, related time series, or item metadata to train your model to assess whether additional information improves accuracy. Different combinations of related time series, item metadata and built-in datasets can give you different results.
  • To assess the impact of the Weather Index, first train a model with only your target time series, and then create another model with the Weather Index enabled. We recommend using the same predictor settings for this comparison, because different hyperparameters and combinations of related time series can give you different results.
  • You may see an increase in training costs when using the Weather Index, because the index is applied and optimized for only those items that are impacted by day-to-day weather variation. However, there is no extra cost to access the weather information or use the Weather Index for creating forecasts. The cost for training continues to be $0.24 per training hour and $0.60 per 1,000 forecasts.
  • Experiment with multiple distribution points to optimize your forecast model to balance the costs associated with under-forecasting and over-forecasting. Choose a higher quantile if you want to over-forecast to meet demand.
  • If you’re comparing different models, use the weighted quantile loss metric at the same quantile for comparison. The lower the value, the more accurate the forecasting model.
  • Forecast allows you to select up to five backtest windows. Forecast uses backtesting to tune predictors and produce accuracy metrics. To perform backtesting, Forecast automatically splits your time series datasets into two sets: training and testing. The training set is used to train your model, and the testing set to evaluate the model’s predictive accuracy. We recommend choosing more than one backtest window to minimize selection bias that may make one window more or less accurate by chance. Assessing the overall model accuracy from multiple backtest windows provides a better measure of the strength of the model.

Conclusion

With the Amazon Forecast Weather Index, you can now automatically include local weather information to your demand forecasts with one click and at no extra cost. The Weather Index combines multiple weather metrics from historical weather events and current forecasts at a given location to increase your demand forecast model accuracy. To get started with this capability, see Weather Index and go through the notebook in our GitHub repo that walks you through how to use the Forecast APIs to enable the Weather Index. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.


About the Authors

Namita Das is a Sr. Product Manager for Amazon Forecast. Her current focus is to democratize machine learning by building no-code/low-code ML services. On the side, she frequently advises startups and is raising a puppy named Imli.

 

 

Gunjan Garg is a Sr. Software Development Engineer in the AWS Vertical AI team. In her current role at Amazon Forecast, she focuses on engineering problems and enjoys building scalable systems that provide the most value to end-users. In her free time, she enjoys playing Sudoku and Minesweeper.

 

 

Christy Bergman is working as an AI/ML Specialist Solutions Architect at AWS. Her work involves helping AWS customers be successful using AI/ML services to solve real-world business problems. Prior to joining AWS, Christy worked as a data scientist in banking and software industries. In her spare time, she enjoys hiking and bird watching.

Read More

New Amazon SageMaker Neo features to run more models faster and more efficiently on more hardware platforms

New Amazon SageMaker Neo features to run more models faster and more efficiently on more hardware platforms

Amazon SageMaker Neo enables developers to train machine learning (ML) models once and optimize them to run on any Amazon SageMaker endpoints in the cloud and supported devices at the edge. Since Neo was first announced at re:Invent 2018, we have been continuously working with the Neo-AI open-source communities and several hardware partners to increase the types of ML models Neo can compile, the types of target hardware Neo can compile for, and to add new inference performance optimization techniques.

As of this writing, Neo optimizes models trained in DarkNet, Gluon, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, ONNX, and XGBoost for inference on Android, iOS, Linux, and Windows machines based on processors from Ambarella, Apple, ARM, Intel, NVIDIA, NXP, Qualcomm, Texas Instruments, and Xilinx. Models optimized by Neo can perform up to 25 times faster with no loss in accuracy.

Over the past few months, Neo has added a number of key new features:

  • Expanded support for PC and mobile devices
  • Heterogeneous execution with NVIDIA TensorRT
  • Bring Your Own Codegen (BYOC) framework
  • Inference optimized containers
  • Compilation for dynamic models

In this post, we summarize how these new features allow you to run more models on more hardware platforms both faster and more efficiently.

Expanded support for PC and mobile devices

Earlier in 2020, Neo launched support for Windows on x86 processor-based devices, allowing you to run your models faster and more efficiently on personal computers and other Windows devices. In addition, Neo launched support for Android on ARM-based processors and Qualcomm processors with Hexagon DSP.

Most recently, Apple and AWS partnered to automate model conversion to Core ML format using Neo. As a result, ML app developers can now train models in SageMaker, convert them to Core ML format with the click of a button, and deploy the models on iOS and macOS devices.

Heterogeneous execution with NVIDIA TensorRT

Neo uses the NVIDIA TensorRT acceleration library to speed up ML models on NVIDIA Jetson devices at the edge and on G4dn and P3 instances in the AWS Cloud. The TensorRT library supports a subset of operators commonly used in deep learning models.

Previously, Neo used TensorRT only when the entire computational graph of the model and all its operators could be accelerated by the library. As a result, few models could take advantage of TensorRT acceleration.

Recently, Neo added the capability to partition a model into sub-graphs, so that one part of the model can be handled by TensorRT while the other part is compiled by Apache TVM. To run the compiled model, the Neo runtime uses a heterogeneous execution mechanism to run both parts on the hardware. With this approach, Neo can provide the best available performance for a broader range of frameworks and models.

Bring your own codegen

We also expanded the heterogeneous execution approach to other hardware targets. Neo partnered with chip vendors to use the Bring Your Own Codegen (BYOC) mechanism in TVM to plug in partners’ proprietary toolchains for their ML accelerators, such as Ambarella’s CV Tools and Texas Instruments’ TIDL, with the Neo compilation API.

When you compile, Neo partitions the model so that the portion supported by the ML accelerator runs on the accelerator and the rest runs on the host CPU. With this approach, Neo maximizes the utilization of the ML accelerator on the chip, increases the types of models that you can compile for the chip, and makes it easier for you to take advantage of new ML accelerators from chip vendors.

Inference optimized containers

Like all deep learning compilers, Neo supports a subset of operators and models in a given framework. Before adding this feature, Neo could only compile a model if all the operators in the model were supported by Neo. Now, when you use Neo to compile an MXNet, PyTorch, or TensorFlow model for CPU or GPU inference in SageMaker hosted endpoints on AWS, Neo partitions the model, compiles a portion of it to accelerate performance, and leaves the un-compiled part to continue running natively in the framework. You can use Neo’s inference optimized containers to deploy on SageMaker hosted endpoints. As a result, you can optimize any MXNet, PyTorch, or TensorFlow model with Neo for any SageMaker hosted endpoint.
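
As a hedged sketch of that flow with the SageMaker Python SDK, you might compile a trained TensorFlow model and deploy the result to a hosted endpoint as follows; the artifact path, input tensor name, shape, and framework version are placeholders, not values from a specific model.

    # Sketch: compiling a TensorFlow model with Neo and deploying it to a SageMaker hosted endpoint.
    # The model artifact, input name/shape, and framework version are placeholders.
    import sagemaker
    from sagemaker.tensorflow import TensorFlowModel

    role = sagemaker.get_execution_role()

    model = TensorFlowModel(
        model_data="s3://your-bucket/models/model.tar.gz",  # hypothetical trained model artifact
        role=role,
        framework_version="1.15",
    )

    # Neo compiles the supported portion of the graph; the rest keeps running natively in the framework.
    compiled_model = model.compile(
        target_instance_family="ml_c5",
        input_shape={"input_tensor": [1, 224, 224, 3]},     # hypothetical input name and shape
        output_path="s3://your-bucket/compiled-models/",
        role=role,
        framework="tensorflow",
        framework_version="1.15",
        job_name="neo-compile-example",
    )

    predictor = compiled_model.deploy(
        initial_instance_count=1,
        instance_type="ml.c5.xlarge",
    )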

Compilation for dynamic models

Deep learning models contain dynamic features, such as control flow, dynamic operations, dynamic data structures, and dynamic input and output shapes, that pose significant challenges to existing deep learning compilers. These models, including some object detection and semantic segmentation models, are becoming increasingly popular. Recently, we added the ability in Neo to compile these dynamic models. You can now use Neo to optimize models with dynamic features and get a performance speedup of up to 2 times.

Summary

We continually make improvements and add supported hardware endpoints, models, and frameworks to Neo based on your feedback. We encourage you to sign in to the SageMaker console or use the Neo compilation API to compile your trained models for the target hardware of your interest. For more information about Neo, see the following:

 


About the Authors

Tingwei Huang is a product management leader at AWS AI Service.

 

 

 

 

Vin Sharma is an Engineering Leader for AWS Deep Learning. He leads the team building Neo, which helps ML models train once and run anywhere in the cloud and at the edge.

Read More

Model dynamism Support in Amazon SageMaker Neo

Model dynamism Support in Amazon SageMaker Neo

Amazon SageMaker Neo was launched at AWS re:Invent 2018. It delivered notable performance improvements on models with statically known input and output data shapes, typically image classification models. These models are usually composed of a stack of blocks that contain compute-intensive operators, such as convolution and matrix multiplication. Neo applies a series of optimizations to boost the model’s performance and reduce memory usage. The static nature of these models significantly simplifies compilation: runtime concerns such as memory sizes can be decided ahead of time by a dedicated analysis pass, and the runtime simply acts as a topological graph walker that invokes each operator sequentially.

However, we have been seeing a growing number of customers requiring more advanced models to fulfill tasks like object detection. These models contain dynamic features, such as control flow, dynamic operations, dynamic data structures, and dynamic input and output shapes. This poses significant challenges to existing deep learning compilers, because they have mainly been confined to static models. To address this problem, existing solutions either use just-in-time compilation to compile and run the dynamic portion (XLA), which causes extra compilation overhead, or convert the dynamic model into a static representation first (TFLite). To meet your requirements, we designed and implemented a suite of techniques ranging from the front-end parser to the backend runtime to handle object detection and segmentation models trained by TensorFlow, PyTorch, and MXNet. In this post, we walk you through how Neo supports object detection and semantic segmentation models. We also compare inference performance improvements for Neo object detection and segmentation models on both instance and edge devices.

Methodology

This section describes how object detection and semantic segmentation models are supported in Neo. We discuss the following:

  • How the front end handles popular frameworks differently
  • How the backend is designed to support dynamism
  • An example using the AWS Command Line Interface (AWS CLI) to demonstrate how easy it is to perform inference for an object detection model in Neo

Frontend

The approaches vary for each framework because they handle dynamism, particularly control flow, differently. For example, MXNet doesn’t use any control flow to implement the object detection and segmentation models, which allows us to have a quick one-to-one operator mapping from MXNet to Relay operators. PyTorch has control flow primitives, such as If and Loop, which largely simplifies the conversion because we can create Relay If statements and recursion functions correspondingly.

Among the most popular frameworks, TensorFlow is the most difficult to support because it doesn’t directly employ conditional and looping operators to implement control flow. Instead, low-level data flow primitives, such as Merge, Exit, Switch, NextIteration, and Enter, are used to express complex control flow logic to better support parallel and distributed execution. For more information, see Implementation of Control Flow in TensorFlow.

To decompile these primitives into the original control flow operators, we proposed dedicated analysis and pattern matching techniques that have been contributed back to the Apache TVM. For more information, see the RFC Decompile TensorFlow Control Flow Primitives to Relay and Enhance TensorFlow Frontend Control Flow Support.

Backend

The backend compiler has worked well in supporting static models, where the input data type and shape for each tensor are known at compile time. However, this assumption doesn’t hold for dynamic models, such as TensorFlow SSD, because the data shapes can only be determined at runtime.

To support dynamic data shapes, we introduced a special dimension called Any to represent statically unknown dimensions. For instance, a tensor type could be represented as Tensor[(5, Any), float32], where the second dimension was unknown. Accordingly, we defined some type inference rules to infer the type of the tensor when Any shape is involved.
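
To make this concrete, here is a small illustrative snippet of how the Any dimension appears in Apache TVM’s Relay IR, which Neo builds on; it assumes a local TVM installation and is not a Neo API call.

    # Illustrative only: declaring a tensor with a statically unknown dimension in Apache TVM Relay.
    # Requires a local installation of Apache TVM; this is not part of the Neo API.
    import tvm
    from tvm import relay

    # A tensor whose second dimension is unknown until runtime: Tensor[(5, ?), float32]
    x = relay.var("x", shape=(5, relay.Any()), dtype="float32")

    # Type inference still succeeds with Any involved; concrete shapes are resolved at runtime
    # by the shape functions described next.
    func = relay.Function([x], relay.nn.relu(x))
    mod = relay.transform.InferType()(tvm.IRModule.from_expr(func))
    print(mod)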

To get the data shape of a tensor at runtime, we defined shape functions to compute the output shape of the tensor to determine the size of required memory. Based on the categories of the operators, shape functions were classified into three patterns:

  • Data-independent shapes – Are used for operators whose output shape is only determined by the shapes of the inputs, such as 2D convolution.
  • Data-dependent shapes – Require the real input value instead of the shape to compute the output shapes. For example, arange needs the value of start, stop, and step to compute the output shape.
  • Upper bound shapes – Are used to quickly estimate an upper bound shape for the output in order to avoid redundant computation. This is useful because operators, such as Non Maximum Suppression (NMS), involve non-trivial computation to infer the output shape at runtime, and the amount of computation for the shape function could be on par with that of running the operator.

To effectively run the dynamic models, we designed a virtual machine as an execution engine to invoke runtime type inference, handle control flow, and dispatch operator kernels. We compiled the model into machine-dependent kernel code and machine-independent bytecode, which are then loaded and run by the virtual machine.

Because each instruction works on coarse-grained data, such as tensor, the instructions are compactly organized, meaning the dispatching overhead isn’t a concern. We designed the virtual machine in a register-based manner to simplify the design and allow users to read and modify the code easily. We designed a set of instructions to control running each type of bytecode, such as storage allocation, tensor memory allocation on the storage, control flow, and kernel invocation.

After the virtual machine loads the compiled bytecode and kernels, it interprets the bytecode in a dispatching loop by checking its op-code and invoking the appropriate logic. For more information, see Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference.

Performing inference and object detection in Neo

This section provides an example to illustrate how you can compile a Faster R-CNN model from TensorFlow 1.15 and deploy it on an AWS C5 instance using Neo.

  1. Prepare the pre-trained model by downloading it from the TensorFlow Detection Model Zoo and extracting it:
    $ wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet50_coco_2018_01_28.tar.gz
    $ tar -xzf faster_rcnn_resnet50_coco_2018_01_28.tar.gz

  2. Get the frozen protobuf file and upload it to Amazon Simple Storage Service (Amazon S3):
    $ tar -czf tf_frcnn.tar.gz -C faster_rcnn_resnet50_coco_2018_01_28 frozen_inference_graph.pb
    $ aws s3 cp tf_frcnn.tar.gz s3://<your-bucket>/<your-input-folder>

We can now compile the model using Neo. For this post, we use the AWS CLI. We first create a configuration JSON file that provides the required information, such as the input size, framework, location of the output artifacts, and target platform that we compile the model for:

  3. Create the configuration file with the following code:
    {
        "CompilationJobName": "compile-tf-ssd",
        "RoleArn": "arn:aws:iam::<your-account>:role/service-role/AmazonSageMaker-ExecutionRole-yyyymmddThhmmss",
        "InputConfig": {
            "S3Uri": "s3://<your-bucket>/<your-input-folder>/tf_frcnn.tar.gz",
            "DataInputConfig":  "{'image_tensor': [1,512,512,3]}",
            "Framework": "TENSORFLOW"
        },
        "OutputConfig": {
            "S3OutputLocation": "s3://<your-bucket>/<your-output-folder>",
            "TargetPlatform": {
                "Os": "LINUX",    
                "Arch": "X86_64"
            },
            "CompilerOptions": "{'mcpu': 'skylake-avx512'}"
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 1800}
    }

  4. Compile it with the SageMaker CLI:
    $ aws sagemaker create-compilation-job --cli-input-json file://config.json --region us-west-2

Finally, we’re ready to deploy the compiled model with DLR.

  5. Before deployment, download the compiled artifacts from the S3 bucket where they were saved:
    $ aws s3 cp s3://<your-bucket>/<output-folder>/output_artifacts.tar.gz tf_frcnn_compiled.tar.gz
    $ mkdir compiled_model
    $ tar -xzf tf_frcnn_compiled.tar.gz -C compiled_model

  6. Install DLR for inference:
    $ pip install dlr

  7. Perform inference as follows:
    import cv2
    import dlr
    import numpy as np

    if __name__ == "__main__":
        data = cv2.imread("input_image.jpg")
        data = cv2.resize(data, (512, 512), interpolation=cv2.INTER_AREA)
        data = np.expand_dims(data, 0)
        model = dlr.DLRModel('compiled_model', 'cpu', 0)
        result = model.run(data)

Performance comparison

In this section, we compare the performance of the most widely used TF object detection and segmentation models on a variety of EC2 server platforms and NVIDIA Jetson based edge devices. We use the models from the TensorFlow Detection Model Zoo. As discussed earlier, these models show dynamism and are significantly more complex than the static models like ResNet50. We use Neo to compile these models and generate high-performance machine code for a variety of target platforms. Here, we show the performance comparison for these models across many hardware devices against the best baseline available for the hardware platforms.

EC2 c5.9xlarge server instance

C5 instances are Intel Xeon server instances suitable for compute-intensive deep learning applications. For this comparison, we report the average latency for the TensorFlow baseline and the Neo-compiled model. All the reported latency numbers are in milliseconds. We observe that Neo outperforms TensorFlow for all three models, and by up to 20% for the Mask R-CNN ResNet-50 model.

Model name TF 1.15.0 Neo Speedup
ssd_mobilenet_v1_coco 17.96 16.39 1.09579
faster_rcnn_resnet50_coco 152.62 142.3 1.07252
mask_rcnn_resnet50_atrous_coco 391.91 326.44 1.20056

EC2 m6g.8xlarge server instance

M6g instances are Arm-based AWS Graviton2 server instances suitable for compute-intensive deep learning applications. To get a baseline, we use the TensorFlow packages provided by ARM Tool-Solutions. Our observations are similar to those for the C5 instances: Neo outperforms TensorFlow, and we observe significant speedups for large models like Faster R-CNN and Mask R-CNN.

Model name TF 1.15.0 Neo Speedup
ssd_mobilenet_v1_coco 29.04 28.75 1.01009
faster_rcnn_resnet50_coco 290.64 202.71 1.43377
mask_rcnn_resnet50_atrous_coco 623.98 368.81 1.69187

NVIDIA server instance and edge devices

Finally, we compare the performance of the MobileNet SSD model on NVIDIA Jetson based edge devices—Jetson Xavier and Jetson Nano. MobileNet SSD is a popular object detection model for edge devices. This is because it has low compute and memory requirements, and is suitable for already resource-constrained edge devices. To have a performance baseline, we use the TF-TRT package, where TensorFlow is integrated with NVIDIA TensorRT as the backend. We present the comparison in the following table. We observe that Neo achieves significant speedup for both Xavier and Nano edge devices.

Performance comparison for ssd_mobilenet_v1_coco
Hardware device TF 1.15 Neo Speedup
NVIDIA Jetson Nano 163 140 1.16429
Jetson Xavier 109 56 1.94643

Summary

This post described how Neo supports model dynamism. Multiple techniques were proposed, from the front-end parser to the backend runtime, to enable this support. We compared the inference performance of Neo-compiled object detection and segmentation models against the native TensorFlow framework and TensorFlow backed by TensorRT. We observed that Neo obtained speedups for these models on both instances and edge devices.

This solution doesn’t require any service API changes, so you can still use the original API to compile new models. All code has been contributed back to Apache TVM. For more information about compiling a model using Apache TVM, see Compile PyTorch Object Detection Models.

Acknowledgements: We sincerely thank the following engineers and applied scientists who have contributed to the support of dynamic models: Haichen Shen, Wei Chen, Yong Wu, Yao Wang, Animesh Jain, Trevor Morris, Rohan Mukherjee, Ricky Das

 


About the Author

Zhi Chen is a Senior Software Engineer at AWS AI who leads the deep learning compiler development in Amazon SageMaker Neo. He helps customers deploy the pre-trained deep learning models from different frameworks on various platforms. Zhi obtained his PhD from University of California, Irvine in Computer Science, where he focused on compilers and performance optimization.

Read More