The collaboration includes Amazon funding for faculty research projects, with an initial focus on machine learning and natural-language processing.
Optimize your machine learning deployments with auto scaling on Amazon SageMaker
Machine learning (ML) has become ubiquitous. Our customers are employing ML in every aspect of their business, including the products and services they build, and for drawing insights about their customers.
To build an ML-based application, you have to first build the ML model that serves your business requirement. Building ML models involves preparing the data for training, extracting features, and then training and fine-tuning the model using the features. Next, the model has to be put to work so that it can generate inference (or predictions) from new data, which can then be used in the application. Although you can integrate the model directly into an application, the approach that works well for production-grade applications is to deploy the model behind an endpoint and then invoke the endpoint via a RESTful API call to obtain the inference. In this approach, the model is typically deployed on an infrastructure (compute, storage, and networking) that suits the price-performance requirements of the application. These requirements include the number of inferences that the endpoint is expected to return in a second (called the throughput), how quickly the inference must be generated (the latency), and the overall cost of hosting the model.
Amazon SageMaker makes it easy to deploy ML models for inference at the best price-performance for any use case. It provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. It is a fully managed service, so you can scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden. One of the ways to minimize your costs is to provision only as much compute infrastructure as needed to serve the inference requests to the endpoint (also known as the inference workload) at any given time. Because the traffic pattern of inference requests can vary over time, the most cost-effective deployment system must be able to scale out when the workload increases and scale in when the workload decreases in real time. SageMaker supports automatic scaling (auto scaling) for your hosted models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your inference workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.
With SageMaker, you can choose when to auto scale and how many instances to provision or remove to achieve the right availability and cost trade-off for your application. SageMaker supports three auto scaling options. The first and commonly used option is target tracking. In this option, you select an ideal value of an Amazon CloudWatch metric of your choice, such as the average CPU utilization or throughput that you want to achieve as a target, and SageMaker will automatically scale in or scale out the number of instances to achieve the target metric. The second option is to choose step scaling, which is an advanced method for scaling based on the size of the CloudWatch alarm breach. The third option is scheduled scaling, which lets you specify a recurring schedule for scaling your endpoint in and out based on anticipated demand. We recommend that you combine these scaling options for better resilience.
In this post, we provide a design pattern for deriving the right auto scaling configuration for your application. In addition, we provide a list of steps to follow, so even if your application has a unique behavior, such as different system characteristics or traffic patterns, this systematic approach can be applied to determine the right scaling policies. The procedure is further simplified with the use of Inference Recommender, a right-sizing and benchmarking tool built inside SageMaker. However, you can use any other benchmarking tool.
You can review the notebook we used to run this procedure to derive the right deployment configuration for our use case.
SageMaker hosting real-time endpoints and metrics
SageMaker real-time endpoints are ideal for ML applications that need to handle a variety of traffic and respond to requests in real time. The application setup begins with defining the runtime environment, including the containers, ML model, environment variables, and so on in the create-model API, and then defining the hosting details such as instance type and instance count for each variant in the create-endpoint-config API. The endpoint configuration API also allows you to split or duplicate traffic between variants using production and shadow variants. However, for this example, we define scaling policies using a single production variant. After setting up the application, you set up scaling, which involves registering the scaling target and applying scaling policies. Refer to Configuring autoscaling inference endpoints in Amazon SageMaker for more details on the various scaling options.
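As a minimal sketch of that setup with boto3 (all names, URIs, and the role ARN below are placeholders, not values from this post), the flow looks like the following; scaling is then configured on the endpoint variant as described later in this post:

import boto3

sm_client = boto3.client("sagemaker")

# 1. Define the runtime environment: container image, model artifact, environment variables
sm_client.create_model(
    ModelName="xgboost-classifier",                         # placeholder name
    PrimaryContainer={
        "Image": "<inference-container-image-uri>",         # placeholder
        "ModelDataUrl": "s3://<your-bucket>/model.tar.gz",   # placeholder
    },
    ExecutionRoleArn="<execution-role-arn>",                 # placeholder
)

# 2. Define the hosting details for the production variant
sm_client.create_endpoint_config(
    EndpointConfigName="xgboost-classifier-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "xgboost-classifier",
        "InstanceType": "ml.c5.large",
        "InitialInstanceCount": 1,
    }],
)

# 3. Create the real-time endpoint
sm_client.create_endpoint(
    EndpointName="xgboost-classifier-endpoint",
    EndpointConfigName="xgboost-classifier-config",
)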
The following diagram illustrates the application and scaling setup in SageMaker.
Endpoint metrics
In order to understand the scaling exercise, it’s important to understand the metrics that the endpoint emits. At a high level, these metrics are categorized into three classes: invocation metrics, latency metrics, and utilization metrics.
The following diagram illustrates these metrics and the endpoint architecture.
The following tables elaborate on the details of each metric.
Invocation metrics
Metrics | Overview | Period | Units | Statistics |
Invocations | The number of InvokeEndpoint requests sent to a model endpoint. | 1 minute | None | Sum |
InvocationsPerInstance | The number of invocations sent to a model, normalized by InstanceCount in each variant. 1/numberOfInstances is sent as the value on each request, where numberOfInstances is the number of active instances for the variant behind the endpoint at the time of the request. | 1 minute | None | Sum |
Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. | 1 minute | None | Average, Sum |
Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. | 1 minute | None | Average, Sum |
Latency metrics
Metrics | Overview | Period | Units | Statistics |
ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. | 1 minute | Microseconds | Average, Sum, Min, Max, Sample Count |
OverheadLatency | The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. Overhead latency can vary depending on multiple factors, including request and response payload sizes, request frequency, and authentication or authorization of the request. | 1 minute | Microseconds | Average, Sum, Min, Max, Sample Count |
Utilization metrics
Metrics | Overview | Period | Units |
CPUUtilization | The sum of each individual CPU core's utilization. The CPU utilization of each core ranges from 0–100. For example, if there are four CPUs, the CPUUtilization range is 0–400%. | 1 minute | Percent
MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value range is 0–100%. | 1 minute | Percent |
GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0–100 and is multiplied by the number of GPUs. | 1 minute | Percent |
GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the GPUMemoryUtilization range is 0–400%. | 1 minute | Percent |
DiskUtilization | The percentage of disk space used by the containers on an instance. This value range is 0–100%. | 1 minute | Percent |
Use case overview
We use a simple XGBoost classifier model for our application and have decided to host it on the ml.c5.large instance type. However, the following procedure is independent of the model or deployment configuration, so you can adopt the same approach for your own application and deployment choice. We assume that you already have a desired instance type at the start of this process. If you need assistance in determining the ideal instance type for your application, you can use an Inference Recommender default job to get instance type recommendations.
Scaling plan
The scaling plan is a three-step procedure, as illustrated in the following diagram:
- Identify the application characteristics – Knowing the bottlenecks of the application on the selected hardware is an essential part of this.
- Set scaling expectations – This involves determining the maximum number of requests per second, and how the request pattern will look (whether it will be smooth or spiky).
- Apply and evaluate – Scaling policies should be developed based on the application characteristics and scaling expectations. As part of this final step, evaluate the policies by running the load that they are expected to handle. In addition, we recommend iterating on the last step until the scaling policy can handle the request load.
Identify application characteristics
In this section, we discuss the methods to identify application characteristics.
Benchmarking
To derive the right scaling policy, the first step in the plan is to determine application behavior on the chosen hardware. This can be achieved by running the application on a single host and increasing the request load to the endpoint gradually until it saturates. In many cases, after saturation, the endpoint can no longer handle any more requests and performance begins to deteriorate. This can be seen in the endpoint invocation metrics. We also recommend that you review hardware utilization metrics and understand the bottlenecks, if any. For CPU instances, the bottleneck can be in the CPU, memory, or disk utilization metrics, while for GPU instances, the bottleneck can be in GPU utilization and its memory. We discuss invocations and utilization metrics on ml.c5.large hardware in the following section. It’s also important to remember that CPU utilization is aggregated across all cores, therefore it is at 200% scale for an ml.c5.large two-core machine.
For benchmarking, we use the Inference Recommender default job. Inference Recommender default jobs benchmark with multiple instance types by default. However, you can narrow down the search to your chosen instance type by passing it in the supported instances field. The service then provisions the endpoint, gradually increases the request load, and stops when the benchmark reaches saturation or when the endpoint invoke API call fails for 1% of the results. The hosting metrics can be used to determine the hardware bounds and set the right scaling limit. If there is a hardware bottleneck, we recommend that you scale up the instance size in the same family or change the instance family entirely.
The following diagram illustrates the architecture of benchmarking using Inference Recommender.
Use the following code:
def trigger_inference_recommender(model_url, payload_url, container_url, instance_type, execution_role, framework,
                                  framework_version, domain="MACHINE_LEARNING", task="OTHER", model_name="classifier",
                                  mime_type="text/csv"):
    # Register the model, sample payload, and container as a SageMaker model package
    model_package_arn = create_model_package(model_url, payload_url, container_url, instance_type,
                                             framework, framework_version, domain, task, model_name, mime_type)
    # Start the Inference Recommender default job and wait for it to complete
    job_name = create_inference_recommender_job(model_package_arn, execution_role)
    wait_for_job_completion(job_name)
    return job_name
Analyze the result
We then analyze the results of the recommendation job using endpoint metrics. From the following hardware utilization graph, we confirm that the hardware limits are within the bounds. Furthermore, the CPUUtilization line increases proportional to request load, so it is necessary to have scaling limits on CPU utilization as well.
From the following figure, we confirm that the invocation flattens after it reaches its peak.
Next, we move on to the invocations and latency metrics for setting the scaling limit.
Find scaling limits
In this step, we run various scaling percentages to find the right scaling limit. As a general scaling rule, the hardware utilization percentage should be around 40% if you’re optimizing for availability, around 70% if you’re optimizing for cost, and around 50% if you want to balance availability and cost. The guidance gives an overview of the two dimensions: availability and cost. The lower the threshold, the better the availability. The higher the threshold, the better the cost. In the following figure, we plotted the graph with 55% as the upper limit and 45% as the lower limit for invocation metrics. The top graph shows invocations and latency metrics; the bottom graph shows utilization metrics.
You can use the following sample code to change the percentages and see what the limits are for the invocations, latency, and utilization metrics. We highly recommend that you play around with percentages and find the best fit based on your metrics.
def analysis_inference_recommender_result(job_name, index=0,
                                          upper_threshold=80.0, lower_threshold=65.0):
    ...  # see the sample notebook for the full implementation
Because we want to optimize for both availability and cost in this example, we decided to use 50% aggregate CPU utilization. Because we selected a two-core machine, our aggregated CPU utilization scale is 200%. We therefore set a threshold of 100% for CPU utilization, because we're targeting 50% per core across two cores. In addition to the utilization threshold, we also set the InvocationsPerInstance threshold to 5000. The value for InvocationsPerInstance is derived by overlaying CPUUtilization = 100% over the invocations graph.
As part of step 1 of the scaling plan (shown in the following figure), we benchmarked the application using the Inference Recommender default job, analyzed the results, and determined the scaling limit based on cost and availability.
Set scaling expectations
The next step is to set expectations and develop scaling policies based on these expectations. This step involves defining the maximum and minimum requests to be served, as well as additional details, such as the maximum request growth the application should handle and whether the traffic pattern is smooth or spiky. Data like this helps define the expectations and develop a scaling policy that meets your demand.
The following diagram illustrates an example traffic pattern.
For our application, the expectations are maximum requests per second (max) = 500, and minimum request per second (min) = 70.
Based on these expectations, we define MinCapacity and MaxCapacity using the following formulas. For these calculations, we normalize InvocationsPerInstance to seconds because it is reported per minute. Additionally, we define a growth factor, which is the amount of additional capacity that you are willing to add when demand exceeds the maximum requests per second. The growth_factor should always be greater than 1, and it is essential in planning for additional growth.
MinCapacity = ceil(min / (InvocationsPerInstance/60) )
MaxCapacity = ceil(max / (InvocationsPerInstance/60)) * Growth_factor
In the end, we arrive at MinCapacity = 1 and MaxCapacity = 8 (with 20% as growth factor), and we plan to handle a spiky traffic pattern.
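The following is a minimal sketch of this arithmetic, assuming the ceiling is applied after the growth factor for MaxCapacity (which is what yields 8 for these numbers):

import math

invocations_per_instance = 5000                      # per-minute limit derived in step 1
per_instance_tps = invocations_per_instance / 60     # ~83 requests per second per instance

min_tps, max_tps = 70, 500                           # expected minimum and maximum requests per second
growth_factor = 1.2                                  # 20% headroom for unexpected growth

min_capacity = math.ceil(min_tps / per_instance_tps)                    # -> 1
max_capacity = math.ceil((max_tps / per_instance_tps) * growth_factor)  # -> 8

print(min_capacity, max_capacity)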
Define scaling policies and verify
The final step is to define a scaling policy and evaluate its impact. The evaluation serves to validate the results of the calculations made so far. In addition, it helps us adjust the scaling setting if it doesn’t meet our needs. The evaluation is done using the Inference Recommender advanced job, where we specify the traffic pattern, MaxInvocations, and endpoint to benchmark against. In this case, we provision the endpoint and set the scaling policies, then run the Inference Recommender advanced job to validate the policy.
Target tracking
It is recommended to set up target tracking based on InvocationsPerInstance. The threshold has already been defined in step 1, so we set the CPUUtilization threshold to 100 and the InvocationsPerInstance threshold to 5000. First, we define a scaling policy based on the number of InvocationsPerInstance, and then we create a scaling policy that relies on CPU utilization.
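Before applying these policies, the endpoint variant must be registered as a scalable target with the MinCapacity and MaxCapacity derived earlier. The following is a minimal sketch (the endpoint and variant names are assumptions carried over from the setup sketch above):

import boto3

aas_client = boto3.client("application-autoscaling")

endpoint_name = "xgboost-classifier-endpoint"   # assumption: the endpoint created earlier
variant_name = "AllTraffic"                     # assumption: the production variant name

# Register the endpoint variant as a scalable target with the capacity limits from step 2
aas_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/{}/variant/{}".format(endpoint_name, variant_name),
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)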
As in the sample notebook, we use the following functions to register and set scaling policies:
import time

# The Application Auto Scaling client (aas_client) defined above is reused here.

def set_target_scaling_on_invocation(endpoint_name, variant_name, target_value,
                                     scale_out_cool_down=10,
                                     scale_in_cool_down=100):
    # Target tracking policy on the predefined SageMakerVariantInvocationsPerInstance metric
    policy_name = 'target-tracking-invocations-{}'.format(str(round(time.time())))
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    response = aas_client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': target_value,
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
            },
            'ScaleOutCooldown': scale_out_cool_down,
            'ScaleInCooldown': scale_in_cool_down,
            'DisableScaleIn': False
        }
    )
    return policy_name, response
def set_target_scaling_on_cpu_utilization(endpoint_name, variant_name, target_value,
                                          scale_out_cool_down=10,
                                          scale_in_cool_down=100):
    policy_name = 'target-tracking-cpu-util-{}'.format(str(round(time.time())))
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    response = aas_client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': target_value,
            'CustomizedMetricSpecification': {
                'MetricName': 'CPUUtilization',
                'Namespace': '/aws/sagemaker/Endpoints',
                'Dimensions': [
                    {'Name': 'EndpointName', 'Value': endpoint_name},
                    {'Name': 'VariantName', 'Value': variant_name}
                ],
                'Statistic': 'Average',
                'Unit': 'Percent'
            },
            'ScaleOutCooldown': scale_out_cool_down,
            'ScaleInCooldown': scale_in_cool_down,
            'DisableScaleIn': False
        }
    )
    return policy_name, response
Because we need to handle spiky traffic patterns, the sample notebook uses ScaleOutCooldown = 10 and ScaleInCooldown = 100 as the cooldown values. As we evaluate the policy in the next step, we plan to adjust the cooldown period (if needed).
Evaluate target tracking
As described in the previous section, we provision the endpoint, apply the scaling policies, and then run the Inference Recommender advanced job with the expected traffic pattern and MaxInvocations to validate the policy.
from inference_recommender import trigger_inference_recommender_evaluation_job
from result_analysis import analysis_evaluation_result

# Benchmark the endpoint (with scaling policies applied) against the expected load
eval_job = trigger_inference_recommender_evaluation_job(model_package_arn=model_package_arn,
                                                        execution_role=role,
                                                        endpoint_name=endpoint_name,
                                                        instance_type=instance_type,
                                                        max_invocations=max_tps*60,
                                                        max_model_latency=10000,
                                                        spawn_rate=1)
print("Evaluation job = {}, EndpointName = {}".format(eval_job, endpoint_name))

# In the next step, we visualize the CloudWatch metrics and verify that we reach 30,000 invocations.
max_value = analysis_evaluation_result(endpoint_name, variant_name, job_name=eval_job)
print("Max invocation realized = {}, and the expectation is {}".format(max_value, 30000))
Following benchmarking, we visualized the invocations graph to understand how the system responds to scaling policies. The scaling policy that we established can handle the requests and can reach up to 30,000 invocations without error.
Now, let's consider what happens if we triple the rate of new users. Does the same policy still apply? We can rerun the same evaluation with a higher request rate by setting the spawn rate (the number of additional users per minute) to 3.
With this result, we confirm that the current auto scaling policy can handle even this more aggressive traffic pattern.
Step scaling
In addition to target tracking, we also recommend using step scaling for better control over aggressive traffic. Therefore, we defined an additional step scaling policy with scaling adjustments to handle spiky traffic.
def set_step_scaling(endpoint_name, variant_name):
    policy_name = 'step-scaling-{}'.format(str(round(time.time())))
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    response = aas_client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='StepScaling',
        StepScalingPolicyConfiguration={
            'AdjustmentType': 'ChangeInCapacity',
            'StepAdjustments': [
                {
                    'MetricIntervalLowerBound': 0.0,
                    'MetricIntervalUpperBound': 5.0,
                    'ScalingAdjustment': 1
                },
                {
                    'MetricIntervalLowerBound': 5.0,
                    'MetricIntervalUpperBound': 80.0,
                    'ScalingAdjustment': 3
                },
                {
                    'MetricIntervalLowerBound': 80.0,
                    'ScalingAdjustment': 4
                },
            ],
            'MetricAggregationType': 'Average'
        },
    )
    return policy_name, response
Evaluate step scaling
We then follow the same step to evaluate, and after the benchmark we confirm that the scaling policy can handle a spiky traffic pattern and reach 30,000 invocations without any errors.
Therefore, defining the scaling policies and evaluating the results using the Inference Recommender is a necessary part of validation.
Further tuning
In this section, we discuss further tuning options.
Multiple scaling options
As shown in our use case, you can pick multiple scaling policies that meet your needs. In addition to the options mentioned previously, you should also consider scheduled scaling if you forecast traffic for a period of time. The combination of scaling policies is powerful and should be evaluated using benchmarking tools like Inference Recommender.
Scale up or down
SageMaker Hosting offers over 100 instance types to host your model. Your traffic load may be limited by the hardware you have chosen, so consider other hosting hardware. For example, if you want a system to handle 1,000 requests per second, scale up instead of out. Accelerator instances such as G5 and Inf1 can process higher numbers of requests on a single host. Scaling up and down can provide better resilience for some traffic needs than scaling in and out.
Custom metrics
In addition to InvocationsPerInstance and other SageMaker hosting metrics, you can also define metrics for scaling your application. However, any custom metrics that are used for scaling should depict the load of the system. The metrics should increase in value when utilization is high, and decrease otherwise. The custom metrics could bring more granularity to the load and help in defining custom scaling policies.
Adjusting scaling alarm
When you define a scaling policy, CloudWatch alarms are created on your behalf, and these alarms trigger the scale-in and scale-out actions. These alarms alert on a default number of data points. If you want to alter the number of data points required to trigger the alarm, you can do so. Nevertheless, after any update to scaling policies, we recommend evaluating the policy by using a benchmarking tool with the load it should handle.
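As a rough sketch of how this could be done with the CloudWatch API (the alarm name prefix is an assumption about how Application Auto Scaling names target tracking alarms, and the data point count is illustrative):

import boto3

cw = boto3.client("cloudwatch")

# Assumption: target tracking alarms created by Application Auto Scaling are named
# "TargetTracking-<resource-id>-AlarmHigh-..." / "-AlarmLow-..."
prefix = "TargetTracking-endpoint/xgboost-classifier-endpoint/variant/AllTraffic"
alarms = cw.describe_alarms(AlarmNamePrefix=prefix)["MetricAlarms"]

for alarm in alarms:
    # Copy the existing alarm definition and change only the number of data points
    # that must breach the threshold before the alarm fires.
    params = {
        key: alarm[key]
        for key in ["AlarmName", "AlarmActions", "MetricName", "Namespace", "Statistic",
                    "ExtendedStatistic", "Dimensions", "Period", "Unit", "Threshold",
                    "ComparisonOperator", "EvaluationPeriods", "TreatMissingData"]
        if key in alarm and alarm[key] not in (None, [], "")
    }
    params["DatapointsToAlarm"] = 2  # assumption: require 2 breaching data points
    cw.put_metric_alarm(**params)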
Conclusion
The process of defining the scaling policy for your application can be challenging. You must understand the characteristics of the application, determine your scaling needs, and iterate scaling policies to meet those needs. This post has reviewed each of these steps and explained the approach you should take at each step. You can find your application characteristics and evaluate scaling policies by using the Inference Recommender benchmarking system. The proposed design pattern can help you create a scalable application within hours, rather than days, that takes into account the availability and cost of your application.
About the Authors
Mohan Gandhi is a Senior Software Engineer at AWS. He has been with AWS for the last 10 years and has worked on various AWS services like EMR, EFA and RDS. Currently, he is focused on improving the SageMaker Inference Experience. In his spare time, he enjoys hiking and marathons.
Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia USA. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.
Venkatesh Krishnan leads Product Management for Amazon SageMaker in AWS. He is the product owner for a portfolio of SageMaker services that enable customers to deploy machine learning models for Inference. Earlier he was the Head of Product, Integrations and the lead product manager for Amazon AppFlow, a new AWS service that he helped build from the ground up. Before joining Amazon in 2018, Venkatesh served in various research, engineering, and product roles at Qualcomm, Inc. He holds a PhD in Electrical and Computer Engineering from Georgia Tech and an MBA from UCLA's Anderson School of Management.
Pai-Ling Yin brings an academic’s lens to the study of buying and selling at Amazon
How her background helps her manage a team charged with assisting internal partners to answer questions about the economic impacts of their decisions.
Share medical image research on Amazon SageMaker Studio Lab for free
This post is co-written with Stephen Aylward, Matt McCormick, and Brianna Major from Kitware, and Justin Kirby from the Frederick National Laboratory for Cancer Research (FNLCR).
Amazon SageMaker Studio Lab provides no-cost access to a machine learning (ML) development environment to everyone with an email address. Like the fully featured Amazon SageMaker Studio, Studio Lab allows you to customize your own Conda environment and create CPU- and GPU-scalable JupyterLab version 3 notebooks, with easy access to the latest data science productivity tools and open-source libraries. Moreover, Studio Lab free accounts include a minimum of 15 GB of persistent storage, enabling you to continuously maintain and expand your projects across multiple sessions, instantly pick up where you left off, and even share your ongoing work and work environments with others.
A key issue faced by the medical image community is how to enable researchers to experiment and explore with these essential tools. To solve this challenge, AWS teams worked with Kitware and Frederick National Laboratory for Cancer Research (FNLCR) to bring together three major medical imaging AI resources for Studio Lab and the entire open-source JupyterLab community:
- MONAI core, an open-source PyTorch library for medical image deep learning
- Clinical data from The Cancer Imaging Archive (TCIA), a large, open-access database of medical imaging studies funded by the National Cancer Institute
- itkWidgets, an open-source Jupyter/Python library that provides interactive, 3D medical image visualizations directly within Jupyter Notebooks
These tools and data combine to allow medical imaging AI researchers to quickly develop and thoroughly evaluate clinically ready deep learning algorithms in a comprehensive and user-friendly environment. Team members from FNLCR and Kitware collaborated to create a series of Jupyter notebooks that demonstrate common workflows to programmatically access and visualize TCIA data. These notebooks use Studio Lab to allow researchers to run the notebooks without the need to set up their own local Jupyter development environment—you can quickly explore new ideas or integrate your work into presentations, workshops, and tutorials at conferences.
The following example illustrates Studio Lab running a Jupyter notebook that downloads TCIA prostate MRI data, segments it using MONAI, and displays the results using itkWidgets.
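The following is a minimal sketch of the visualization step such a notebook performs, assuming the DICOM slices of a series have already been downloaded to a local folder (the path is a placeholder); the actual notebooks in the TCIA repository also handle the data download and MONAI segmentation steps:

import glob

import itk
from itkwidgets import view

# Assumption: the DICOM slices for one MRI series were downloaded to this folder
dicom_files = sorted(glob.glob("tcia_downloads/prostate_mri_series/*.dcm"))

# Read the slices as a single 3D volume
image = itk.imread(dicom_files)

# Interactive 3D rendering directly in the notebook (requires the ImJoy extension)
view(image)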
Although you can easily carry out smaller experiments and demos with the sample notebooks presented in this post on Studio Lab for free, it is recommended to use Amazon SageMaker Studio when you train your own medical image models at scale. Amazon SageMaker Studio is an integrated web-based development environment (IDE) with enterprise-grade security, governance, and monitoring features from which you can access purpose-built tools to perform all ML development steps. Open-source libraries like MONAI Core and itkWidgets also run on Amazon SageMaker Studio.
Install the solution
To run the TCIA notebooks on Studio Lab, you need to register an account using your email address on the Studio Lab website. Account requests may take 1–3 days to get approved.
After that, you can follow the installation steps to get started:
- Log in to Studio Lab and start a CPU runtime.
- In a separate tab, navigate to the TCIA notebooks GitHub repo and choose a notebook in the root folder of the repository.
- Choose Open Studio Lab to open the notebook in Studio Lab.
- Back in Studio Lab, choose Copy to project.
- In the new JupyterLab pop-up that opens, choose Clone Entire Repo.
- In the next window, keep the defaults and choose Clone.
- Choose OK when prompted to confirm building the new Conda environment (medical-image-ai). Building the Conda environment will take up to 5 minutes.
- In the terminal that opened in the previous step, run the following command to install NodeJS in the studiolab Conda environment, which is required to install the ImJoy JupyterLab 3 extension next:
conda install -y -c conda-forge nodejs
We now install the ImJoy Jupyter extension using the Studio Lab Extension Manager to enable interactive visualizations. The ImJoy extension allows itkWidgets and other data-intensive processes to communicate with local and remote Jupyter environments, including Jupyter notebooks, JupyterLab, Studio Lab, and so on.
- In the Extension Manager, search for "imjoy" and choose Install.
- Confirm to rebuild the kernel when prompted.
- Choose Save and Reload when the build is complete.
After the installation of the ImJoy extension, you will be able to see the ImJoy icon in the top menu of your notebooks.
To verify this, navigate to the file browser, choose the TCIA_Image_Visualalization_with_itkWidgets notebook, and choose the medical-image-ai kernel to run it.
The ImJoy icon will be visible in the upper left corner of the notebook menu.
With these installation steps, you have successfully installed the medical-image-ai Python kernel and the ImJoy extension as the prerequisite to run the TCIA notebooks together with itkWidgets on Studio Lab.
Test the solution
We have created a set of notebooks and a tutorial that showcases the integration of these AI technologies in Studio Lab. Make sure to choose the medical-image-ai Python kernel when running the TCIA notebooks in Studio Lab.
The first SageMaker notebook shows how to download DICOM images from TCIA and visualize those images using the cinematic volume rendering capabilities of itkWidgets.
The second notebook shows how the expert annotations that are available for hundreds of studies on TCIA can be downloaded as DICOM SEG and RTSTRUCT objects, visualized in 3D or as overlays on 2D slices, and used for training and evaluation of deep learning systems.
The third notebook shows how pre-trained MONAI deep learning models available on MONAI’s Model Zoo can be downloaded and used to segment TCIA (or your own) DICOM prostate MRI volumes.
Choose Open Studio Lab in these and other JupyterLab notebooks to launch those notebooks in the freely available Studio Lab environment.
Clean up
After you have followed the installation steps in this post and created the medical-image-ai Conda environment, you may want to delete it to save storage space. To do so, use the following command:
conda remove --name medical-image-ai --all
You can also uninstall the ImJoy extension via the Extension Manager. Be aware that you will need to recreate the Conda environment and reinstall the ImJoy extension if you want to continue working with the TCIA notebooks in your Studio Lab account later.
Close your tab and don’t forget to choose Stop Runtime on the Studio Lab project page.
Conclusion
SageMaker Studio Lab is accessible to medical image AI research communities at no cost and can be used for medical image AI modeling and interactive medical image visualization in combination with MONAI and itkWidgets. You can use the TCIA open data and sample notebooks with Studio Lab at training events, like hackathons and workshops. With this solution, scientists and researchers can quickly experiment, collaborate, and innovate with medical image AI. If you have an AWS account and have set up a SageMaker Studio domain, you can also run these notebooks on Studio using the default Data Science Python kernel (with the ImJoy-jupyter-extension installed) while selecting from a variety of compute instance types.
Studio Lab also launched a new feature at AWS re:Invent 2022 to take the notebooks developed in Studio Lab and run them as batch jobs on a recurring schedule in your AWS accounts. Therefore, you can scale your ML experiments beyond the free compute limitations of Studio Lab and use more powerful compute instances with much bigger datasets on your AWS accounts.
If you’re interested in learning more about how AWS can help your healthcare or life sciences organization, please contact an AWS representative. For more information on MONAI and itkWidgets, please contact Kitware. New data is being added to TCIA on an ongoing basis, and your suggestions and contributions are welcome by visiting the TCIA website.
Further reading
- Now in Preview – Amazon SageMaker Studio Lab, a Free Service to Learn and Experiment with ML
- Amazon SageMaker Studio Lab continues to democratize ML with more scale and functionality
- Run notebooks as batch jobs in Amazon SageMaker Studio Lab
About the Authors
Stephen Aylward is Senior Director of Strategic Initiatives at Kitware, an Adjunct Professor of Computer Science at The University of North Carolina at Chapel Hill, and a fellow of the MICCAI Society. Dr. Aylward founded Kitware's office in North Carolina, has been a leader of several open-source initiatives, and is now Chair of the MONAI advisory board.
Matt McCormick, PhD, is a Distinguished Engineer at Kitware, where he leads development of the Insight Toolkit (ITK), a scientific image analysis toolkit. He has been a principal investigator and a co-investigator of several research grants from the National Institutes of Health (NIH), led engagements with United States national laboratories, and led various commercial projects providing advanced software for medical devices. Dr. McCormick is a strong advocate for community-driven open-source software, open science, and reproducible research.
Brianna Major is a Research and Development Engineer at Kitware with a passion for developing open source software and tools that will benefit the medical and scientific communities.
Justin Kirby is a Technical Project Manager at the Frederick National Laboratory for Cancer Research (FNLCR). His work is focused on methods to enable data sharing while preserving patient privacy to improve reproducibility and transparency in cancer imaging research. His team founded The Cancer Imaging Archive (TCIA) in 2010, which the research community has leveraged to publish over 200 datasets related to manuscripts, grants, challenge competitions, and major NCI research initiatives. These datasets have been discussed in over 1,500 peer reviewed publications.
Gang Fu is a Healthcare Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over ten years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.
Alex Lemm is a Business Development Manager for Medical Imaging at AWS. Alex defines and executes go-to-market strategies with imaging partners and drives solutions development to accelerate AI/ML-based medical imaging research in the cloud. He is passionate about integrating open source ML frameworks with the AWS AI/ML stack.
Amazon SageMaker Automatic Model Tuning now supports three new completion criteria for hyperparameter optimization
Amazon SageMaker has announced the support of three new completion criteria for Amazon SageMaker automatic model tuning, providing you with an additional set of levers to control the stopping criteria of the tuning job when finding the best hyperparameter configuration for your model.
In this post, we discuss these new completion criteria, when to use them, and some of the benefits they bring.
SageMaker automatic model tuning
Automatic model tuning, also called hyperparameter tuning, finds the best version of a model as measured by the metric we choose. It spins up many training jobs on the dataset provided, using the algorithm chosen and hyperparameters ranges specified. Each training job can be completed early when the objective metric isn’t improving significantly, which is known as early stopping.
Until now, there were limited ways to control the overall tuning job, such as specifying the maximum number of training jobs. However, the selection of this parameter value is heuristic at best. A larger value increases tuning costs, and a smaller value may not yield the best version of the model at all times.
SageMaker automatic model tuning solves these challenges by giving you multiple completion criteria for the tuning job. It’s applied at the tuning level rather than at each individual training job level, which means it operates at a higher abstraction layer.
Benefits of tuning job completion criteria
With better control over when the tuning job will stop, you get the benefit of cost savings by not having the job run for extended periods and being computationally expensive. It also means you can ensure that the job doesn’t stop too early and you get a sufficiently good quality model that meets your objectives. You can choose to stop the tuning job when the models are no longer improving after a set of iterations or when the estimated residual improvement doesn’t justify the compute resources and time.
In addition to the existing maximum number of training jobs completion criteria (MaxNumberOfTrainingJobs), automatic model tuning introduces the option to stop tuning based on a maximum tuning time, improvement monitoring, and convergence detection.
Let’s explore each of these criteria.
Maximum tuning time
Previously, you had the option to define a maximum number of training jobs as a resource limit setting to control the tuning budget in terms of compute resources. However, this can lead to longer or shorter tuning times than needed or desired.
With the addition of the maximum tuning time criteria, you can now allocate your training budget in terms of amount of time to run the tuning job and automatically terminate the job after a specified amount of time defined in seconds.
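For example, a tuning job configuration along the following lines caps the tuning time through the ResourceLimits settings (a minimal boto3-style sketch; the values and objective metric are illustrative):

tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:auc",       # illustrative objective metric
    },
    "ResourceLimits": {
        "MaxRuntimeInSeconds": 3600,          # stop the tuning job after 1 hour
        "MaxNumberOfTrainingJobs": 50,
        "MaxParallelTrainingJobs": 5,
    },
    # ParameterRanges and other settings omitted
}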
As seen above, we use MaxRuntimeInSeconds to define the tuning time in seconds. Setting the tuning time limit helps you limit the duration of the tuning job and also the projected cost of the experiment.
The total cost before any contractual discount can be estimated with the following formula, where InstanceCost is the per-second price of the chosen instance:
EstimatedCost = MaxRuntimeInSeconds * MaxParallelTrainingJobs * InstanceCost
The max runtime in seconds could be used to bound cost and runtime. In other words, it’s a budget control completion criteria.
This feature is part of a resource control criteria and doesn’t take into account the convergence of the models. As we see later in this post, this criteria can be used in combination with other stopping criteria to achieve cost control without sacrificing accuracy.
Desired target metric
Another previously introduced criteria is to define the target objective goal upfront. The criteria monitors the performance of the best model based on a specific objective metric and stops tuning when the models reach the defined threshold in relation to a specified objective metric.
With the TargetObjectiveMetricValue criteria, we can instruct SageMaker to stop tuning the model after the objective metric of the best model has reached the specified value:
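A minimal sketch of the corresponding completion criteria block (the value is illustrative); it is passed as TuningJobCompletionCriteria in the tuning job configuration:

completion_criteria = {
    "TargetObjectiveMetricValue": 0.95   # stop once the best model's objective metric reaches 0.95
}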
In this example, we instruct SageMaker to stop tuning the model when the objective metric of the best model has reached 0.95.
This method is useful when you have a specific target that you want your model to reach, such as a certain level of accuracy, precision, recall, F1-score, AUC, log-loss, and so on.
A typical use case for this criteria would be for a user who is already familiar with the model performance at given thresholds. A user in the exploration phase may first tune the model with a small subset of a larger dataset to identify a satisfactory evaluation metric threshold to target when training with the full dataset.
Improvement monitoring
This criteria monitors the models’ convergence after each iteration and stops the tuning if the models don’t improve after a defined number of training jobs. See the following configuration:
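A minimal sketch of this completion criteria block (the value is illustrative):

completion_criteria = {
    "BestObjectiveNotImproving": {
        "MaxNumberOfTrainingJobsNotImproving": 10   # stop if the best objective hasn't improved over 10 jobs
    }
}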
In this case, we set the MaxNumberOfTrainingJobsNotImproving to 10, which means if the objective metric stops improving after 10 training jobs, the tuning will be stopped and the best model and metric reported.
Improvement monitoring should be used to tune a tradeoff between model quality and overall workflow duration in a way that is likely transferable between different optimization problems.
Convergence detection
Convergence detection is a completion criteria that lets automatic model tuning decide when to stop tuning. Generally, automatic model tuning will stop tuning when it estimates that no significant improvement can be achieved. See the following configuration:
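A minimal sketch of this completion criteria block:

completion_criteria = {
    "ConvergenceDetected": {
        "CompleteOnConvergence": "Enabled"   # let SageMaker decide when the tuning has converged
    }
}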
The criteria is best suited when you initially don’t know what stopping settings to select.
It’s also useful if you don’t know what target objective metric is reasonable for a good prediction given the problem and dataset in hand, and would rather have the tuning job complete when it is no longer improving.
Experiment with a comparison of completion criteria
In this experiment, given a regression task, we run three tuning experiments to find the optimal model within a search space of two hyperparameters (200 hyperparameter configurations in total), using the direct marketing dataset.
With everything else being equal, the first model was tuned with the BestObjectiveNotImproving completion criteria, the second model was tuned with CompleteOnConvergence, and the third model was tuned with no completion criteria defined.
When describing each job, we can observe that setting the BestObjectiveNotImproving criteria has led to the most optimal use of resources and time relative to the objective metric, with significantly fewer jobs run.
The CompleteOnConvergence criteria was also able to stop tuning halfway through the experiment, resulting in fewer training jobs and shorter training time compared to not setting a criteria.
While not setting a completion criteria resulted in a costly experiment, defining the MaxRuntimeInSeconds as part of the resource limits would be one way of minimizing the cost.
The results above show that when defining a completion criteria, Amazon SageMaker is able to intelligently stop the tuning process when it detects that the model is less likely to improve beyond the current result.
Note that the completion criteria supported in SageMaker automatic model tuning are not mutually exclusive and can be used concurrently when tuning a model.
When more than one completion criteria is defined, the tuning job completes when any of the criteria is met.
For example, a combination of a resource limit criteria like maximum tuning time with a convergence criteria, such as improvement monitoring or convergence detection, may produce an optimal cost control and an optimal objective metrics.
Conclusion
In this post, we discussed how you can now intelligently stop your tuning job by selecting a set of completion criteria newly introduced in SageMaker, such as maximum tuning time, improvement monitoring, or convergence detection.
We demonstrated with an experiment that intelligent stopping based on improvement observation across iteration may lead to a significantly optimized budget and time management compared to not defining a completion criteria.
We also showed that these criteria are not mutually exclusive and can be used concurrently when tuning a model, to take advantage of both, budget control and optimal convergence.
For more details on how to configure and run automatic model tuning, refer to Specify the Hyperparameter Tuning Job Settings.
About the Authors
Doug Mbaya is a Senior Partner Solution architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud.
Chaitra Mathur is a Principal Solutions Architect at AWS. She guides customers and partners in building highly scalable, reliable, secure, and cost-effective solutions on AWS. She is passionate about Machine Learning and helps customers translate their ML needs into solutions using AWS AI/ML services. She holds 5 certifications including the ML Specialty certification. In her spare time, she enjoys reading, yoga, and spending time with her daughters.
Iaroslav Shcherbatyi is a Machine Learning Engineer at AWS. He works mainly on improvements to the Amazon SageMaker platform and helping customers best use its features. In his spare time, he likes to go to gym, do outdoor sports such as ice skating or hiking, and to catch up on new AI research.
On a mission to demystify artificial intelligence
Parmida Beigi, an Amazon senior research scientist, shares a lifetime worth of experience, and uses her skills to help others grow into machine learning career paths.Read More
Create powerful self-service experiences with Amazon Lex on Talkdesk CX Cloud contact center
This blog post is co-written with Bruno Mateus, Jonathan Diedrich and Crispim Tribuna at Talkdesk.
Contact centers are using artificial intelligence (AI) and natural language processing (NLP) technologies to build a personalized customer experience and deliver effective self-service support through conversational bots.
This is the first of a two-part series dedicated to the integration of Amazon Lex with the Talkdesk CX Cloud contact center. In this post, we describe a solution architecture that combines the powerful resources of Amazon Lex and Talkdesk CX Cloud for the voice channel. In the second part of this series, we describe how to use the Amazon Lex chatbot UI with Talkdesk CX Cloud to allow customers to transition from a chatbot conversation to a live agent within the same chat window.
The benefits of Amazon Lex and Talkdesk CX Cloud are exemplified by WaFd Bank, a full-service commercial US bank with 200 locations and $20 billion in assets under management. The bank has invested in a digital transformation of its contact center to provide exceptional service to its clients. WaFd has pioneered an omnichannel banking experience that combines the advanced conversational AI capabilities of Amazon Lex voice and chat bots with Talkdesk Financial Services Experience Cloud for Banking.
“We wanted to combine the power of Amazon Lex’s conversational AI capabilities with the Talkdesk modern, unified contact center solution. This gives us the best of both worlds, enabling WaFd to serve its clients in the best way possible.”
-Dustin Hubbard, Chief Technology Officer at WaFd Bank.
To support WaFd’s vision, Talkdesk has extended its self-service virtual agent voice and chat capabilities with an integration with Amazon Lex and Amazon Polly. Additionally, the combination of Talkdesk Identity voice authentication with an Amazon Lex voicebot allows WaFd clients to resolve common banking transactions on their own. Tasks like account balance lookups are completed in seconds, a 90% reduction in time compared to WaFd’s legacy system. The newly designed Amazon Lex website chatbot has led to a substantial decrease in voicemail volume as its chatbot UI seamlessly integrates with Talkdesk systems.
In the following sections, we provide an overview of the components that make this integration possible. We then present the solution architecture, highlight its main components, and describe the customer journey from interacting with Amazon Lex to escalation to an agent. We end by explaining how contact centers can keep AI models up to date using Talkdesk AI Trainer.
Solution overview
The solution consists of the following key components:
- Amazon Lex – Amazon Lex combines with Amazon Polly to automate customer service interactions by adding conversational AI capabilities to your contact center. Amazon Lex delivers fast responses to customers’ most common questions and seamlessly hands over complex cases to a human agent. Augmenting your contact center operations with Amazon Lex bots provides an enhanced customer experience and helps you build an omnichannel experience, allowing customers to engage across phone lines, websites, and messaging platforms.
- Talkdesk CX Cloud contact center – Talkdesk, Inc. is a global cloud contact center leader for customer-obsessed companies. Talkdesk CX Cloud offers enterprise scale with consumer simplicity to deliver speed, agility, reliability, and security. As an AWS Partner, Talkdesk is using AI capabilities like Amazon Transcribe, a speech-to-text service, with the Talkdesk Agent Assist and Talkdesk Customer Experience Analytics products across a number of languages and accents. Talkdesk has extended its self-service virtual agent voice and chat capabilities with an integration with Amazon Lex and Amazon Polly. These virtual agents can automate routine tasks as well as seamlessly elevate complex interactions to a live agent.
- Authentication and voice biometrics with Talkdesk Identity – Talkdesk Identity provides fraud protection through self-service authentication using voice biometrics. Voice biometrics solutions provide contact centers with improved levels of security while streamlining the authentication process for the customer. This secure and efficient authentication experience allows contact centers to handle a wide range of self-service functionalities. For example, customers can check their balance, schedule a funds transfer, or activate/deactivate a card using a banking bot.
The following diagram illustrates our solution architecture.
The voice authentication call flow implemented in Talkdesk interacts with Amazon Lex as follows:
- When a phone call is initiated, a customer lookup is performed using the incoming caller’s phone number. If multiple customers are retrieved, further information, like date of birth, is requested in order to narrow down the list to a unique customer record.
- If the caller is identified and has previously enrolled in voice biometrics, the caller will be prompted to say their voice pass code. If successful, the caller is offered an authenticated Amazon Lex experience.
- If a caller is identified and not enrolled in voice biometrics, they can work with an agent to verify their identity and record their voice print as the password. For more information, visit the Talkdesk Voice Biometric documentation.
- If the caller is not identified or not enrolled in voice biometrics, the caller can interact with Amazon Lex to perform tasks that don’t require authentication, or they can request a transfer to an agent.
How Talkdesk integrates with Amazon Lex
When the call reaches Talkdesk Virtual Agent, Talkdesk uses the continuous streaming capability of the Amazon Lex API to enable conversation with the Amazon Lex bot. Talkdesk Virtual Agent has an Amazon Lex adapter that initiates an HTTP/2 bidirectional event stream through the StartConversation API operation. Talkdesk Virtual Agent and the Amazon Lex bot start exchanging information in real time following the sequence of events for an audio conversation. For more information, refer to Starting a stream to a bot.
All the context data from Talkdesk Studio is sent to Amazon Lex through session attributes established on the initial ConfigurationEvent. The Amazon Lex voicebot has been equipped with a welcome intent, which is invoked by Talkdesk to initiate the conversation and play a welcome message. In Amazon Lex, a session attribute is set to ensure the welcome intent and its message are used only once in any conversation. The greeting message can be customized to include the name of the authenticated caller, if provided from the Talkdesk system in session attributes.
The following diagram shows the basic components and events used to enable communications.
Agent escalation from Amazon Lex
If a customer requests agent assistance, all necessary information to ensure the customer is routed to the correct agent is made available by Amazon Lex to Talkdesk Studio through session attributes.
Examples of session attributes include:
- A flag to indicate the customer requests agent assistance
- The reason for the escalation, used by Talkdesk to route the call appropriately
- Additional data regarding the call to provide the agent with contextual information about the customer and their earlier interaction with the bot
- The sentiment of the interaction
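As a purely hypothetical sketch (the attribute names below are invented for illustration and are not prescribed by Amazon Lex or Talkdesk), the session attributes returned to Talkdesk Studio might look like the following:

session_attributes = {
    "transfer_to_agent": "true",           # hypothetical flag: the caller asked for an agent
    "escalation_reason": "loan_inquiry",   # hypothetical reason used for routing
    "bot_summary": "balance check completed; caller asked about loan rates",   # hypothetical context for the agent
    "sentiment": "NEUTRAL",                # hypothetical sentiment of the interaction
}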
Training
Talkdesk AI Trainer is a human-in-the-loop tool that is included in the operational flow of Talkdesk CX Cloud. It performs the continuous training and improvement of AI models by real agents without the need for specialized data science teams.
Talkdesk developed a connector that allows AI Trainer to automatically collect intent data from Amazon Lex intent models. Non-technical users can easily fine-tune these models to support Talkdesk AI products such as Talkdesk Virtual Agent. The connector was built by using the Amazon Lex Model Building API with the AWS SDK for Java 2.x.
It is possible to train intent data from Amazon Lex using real-world conversations between customers and (virtual) agents by:
- Requesting feedback of intent classifications with a low confidence level
- Adding new training phrases to intents
- Adding synonyms or regular expressions to slot types
AI Trainer receives data from Amazon Lex, namely intents and slot types. This data is then displayed and managed on Talkdesk AI Trainer, along with all the events that are part of the conversational orchestration taking place in Talkdesk Virtual Agent. Through the AI Trainer quality system or agreement, supervisors or administrators decide which improvements will be introduced in the Amazon Lex model and reflected in Talkdesk Virtual Agent.
Adjustments to production can be easily published on AI Trainer and sent to Amazon Lex. Continuously training AI models ensures that AI products reflect the evolution of the business and the latest needs of customers. This in turn helps increase the automation rate via self-service and resolve cases faster, resulting in higher customer satisfaction.
Conclusion
In this post, we presented how the power of Amazon Lex conversational AI capabilities can be combined with the Talkdesk modern, unified contact center solution through the Amazon Lex API. We explained how Talkdesk voice biometrics offers the caller a self-service authenticated experience and how Amazon Lex provides contextual information to the agent to assist the caller more efficiently.
We are excited about the new possibilities that the integration of Amazon Lex and Talkdesk CX Cloud solutions offers to our clients. We at AWS Professional Services and Talkdesk are available to help you and your team implement your vision of an omnichannel experience.
The next post in this series will provide guidance on how to integrate an Amazon Lex chatbot to Talkdesk Studio, and how to enable customers to interact with a live agent from the chatbot.
About the authors
Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specializes in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and family.
Cecil Patterson is a Natural Language AI consultant with AWS Professional Services based in North Texas. He has many years of experience working with large enterprises to enable and support global infrastructure solutions. Cecil uses his experience and diverse skill set to build exceptional conversational solutions for customers of all types.
Bruno Mateus is a Principal Engineer at Talkdesk. With over 20 years of experience in the software industry, he specializes in large-scale distributed systems. When not working, he enjoys spending time outside with his family, trekking, mountain bike riding, and motorcycle riding.
Jonathan Diedrich is a Principal Solutions Consultant at Talkdesk. He works on enterprise and strategic projects to ensure technical execution and adoption. Outside of work, he enjoys ice hockey and games with his family.
Crispim Tribuna is a Senior Software Engineer at Talkdesk currently focusing on the AI-based virtual agent project. He has over 17 years of experience in computer science, with a focus on telecommunications, IPTV, and fraud prevention. In his free time, he enjoys spending time with his family, running (he has completed three marathons), and riding motorcycles.
Image classification model selection using Amazon SageMaker JumpStart
Researchers continue to develop new model architectures for common machine learning (ML) tasks. One such task is image classification, where images are accepted as input and the model attempts to classify the image as a whole with object label outputs. With many models available today that perform this image classification task, an ML practitioner may ask questions like: “What model should I fine-tune and then deploy to achieve the best performance on my dataset?” And an ML researcher may ask questions like: “How can I generate my own fair comparison of multiple model architectures against a specified dataset while controlling training hyperparameters and compute specifications, such as GPUs, CPUs, and RAM?” The former question addresses model selection across model architectures, while the latter question concerns benchmarking trained models against a test dataset.
In this post, you will see how the TensorFlow image classification algorithm of Amazon SageMaker JumpStart can simplify the implementations required to address these questions. Together with the implementation details in a corresponding example Jupyter notebook, you will have tools available to perform model selection by exploring pareto frontiers, where improving one performance metric, such as accuracy, is not possible without worsening another metric, such as throughput.
Solution overview
The following figure illustrates the model selection trade-off for a large number of image classification models fine-tuned on the Caltech-256 dataset, which is a challenging set of 30,607 real-world images spanning 256 object categories. Each point represents a single model, point sizes are scaled with respect to the number of parameters comprising the model, and the points are color-coded based on their model architecture. For example, the light green points represent the EfficientNet architecture; each light green point is a different configuration of this architecture with unique fine-tuned model performance measurements. The figure shows the existence of a pareto frontier for model selection, where higher accuracy is exchanged for lower throughput. Ultimately, the selection of a model along the pareto frontier, or the set of pareto efficient solutions, depends on your model deployment performance requirements.
For the test accuracy and test throughput frontiers of interest, the set of pareto efficient solutions in the preceding figure is extracted into the following table. Rows are sorted so that test throughput increases and test accuracy decreases.
Model Name | Number of Parameters | Test Accuracy | Test Top-5 Accuracy | Throughput (images/s) | Duration per Epoch (s) |
--- | --- | --- | --- | --- | --- |
swin-large-patch4-window12-384 | 195.6M | 96.4% | 99.5% | 0.3 | 2278.6 |
swin-large-patch4-window7-224 | 195.4M | 96.1% | 99.5% | 1.1 | 698.0 |
efficientnet-v2-imagenet21k-ft1k-l | 118.1M | 95.1% | 99.2% | 4.5 | 1434.7 |
efficientnet-v2-imagenet21k-ft1k-m | 53.5M | 94.8% | 99.1% | 8.0 | 769.1 |
efficientnet-v2-imagenet21k-m | 53.5M | 93.1% | 98.5% | 8.0 | 765.1 |
efficientnet-b5 | 29.0M | 90.8% | 98.1% | 9.1 | 668.6 |
efficientnet-v2-imagenet21k-ft1k-b1 | 7.3M | 89.7% | 97.3% | 14.6 | 54.3 |
efficientnet-v2-imagenet21k-ft1k-b0 | 6.2M | 89.0% | 97.0% | 20.5 | 38.3 |
efficientnet-v2-imagenet21k-b0 | 6.2M | 87.0% | 95.6% | 21.5 | 38.2 |
mobilenet-v3-large-100-224 | 4.6M | 84.9% | 95.4% | 27.4 | 28.8 |
mobilenet-v3-large-075-224 | 3.1M | 83.3% | 95.2% | 30.3 | 26.6 |
mobilenet-v2-100-192 | 2.6M | 80.8% | 93.5% | 33.5 | 23.9 |
mobilenet-v2-100-160 | 2.6M | 80.2% | 93.2% | 40.0 | 19.6 |
mobilenet-v2-075-160 | 1.7M | 78.2% | 92.8% | 41.8 | 19.3 |
mobilenet-v2-075-128 | 1.7M | 76.1% | 91.1% | 44.3 | 18.3 |
mobilenet-v1-075-160 | 2.0M | 75.7% | 91.0% | 44.5 | 18.2 |
mobilenet-v1-100-128 | 3.5M | 75.1% | 90.7% | 47.4 | 17.4 |
mobilenet-v1-075-128 | 2.0M | 73.2% | 90.0% | 48.9 | 16.8 |
mobilenet-v2-075-96 | 1.7M | 71.9% | 88.5% | 49.4 | 16.6 |
mobilenet-v2-035-96 | 0.7M | 63.7% | 83.1% | 50.4 | 16.3 |
mobilenet-v1-025-128 | 0.3M | 59.0% | 80.7% | 50.8 | 16.2 |
This post provides details on how to implement large-scale Amazon SageMaker benchmarking and model selection tasks. First, we introduce JumpStart and the built-in TensorFlow image classification algorithms. We then discuss high-level implementation considerations, such as JumpStart hyperparameter configurations, metric extraction from Amazon CloudWatch Logs, and launching asynchronous hyperparameter tuning jobs. Finally, we cover the implementation environment and parameterization leading to the pareto efficient solutions in the preceding table and figure.
Introduction to JumpStart TensorFlow image classification
JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that solve common business problems. These features remove the heavy lifting from each step of the ML process, making it easier to develop high-quality models and reducing time to deployment. The JumpStart APIs allow you to programmatically deploy and fine-tune a vast selection of pre-trained models on your own datasets.
The JumpStart model hub provides access to a large number of TensorFlow image classification models that enable transfer learning and fine-tuning on custom datasets. As of this writing, the JumpStart model hub contains 135 TensorFlow image classification models across a variety of popular model architectures from TensorFlow Hub, including residual networks (ResNet), MobileNet, EfficientNet, Inception, Neural Architecture Search Networks (NASNet), Big Transfer (BiT), shifted window (Swin) transformers, Class-Attention in Image Transformers (CaiT), and Data-Efficient Image Transformers (DeiT).
Each model architecture is composed of vastly different internal structures. For instance, ResNet models utilize skip connections to allow for substantially deeper networks, whereas transformer-based models use self-attention mechanisms that eliminate the intrinsic locality of convolution operations in favor of more global receptive fields. In addition to the diverse feature sets these different structures provide, each model architecture has several configurations that adjust the model size, shape, and complexity within that architecture. This results in hundreds of unique image classification models available on the JumpStart model hub. Combined with built-in transfer learning and inference scripts that encompass many SageMaker features, the JumpStart API is a great launching point for ML practitioners to get started training and deploying models quickly.
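If you want to enumerate these models programmatically, the SageMaker Python SDK includes a JumpStart listing utility. The following sketch assumes the TensorFlow image classification model IDs carry the tensorflow-ic- prefix, which is the convention at the time of writing.

```python
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# List all JumpStart image classification models, then keep the TensorFlow ones.
# The "task == ic" filter string and the "tensorflow-ic-" prefix may change in later SDK releases.
ic_model_ids = list_jumpstart_models(filter="task == ic")
tf_ic_model_ids = [m for m in ic_model_ids if m.startswith("tensorflow-ic-")]
print(f"Found {len(tf_ic_model_ids)} TensorFlow image classification models")
```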
Refer to Transfer learning for TensorFlow image classification models in Amazon SageMaker and the following example notebook to learn about SageMaker TensorFlow image classification in more depth, including how to run inference on a pre-trained model as well as fine-tune the pre-trained model on a custom dataset.
Large-scale model selection considerations
Model selection is the process of selecting the best model from a set of candidate models. This process may be applied across models of the same type with different parameter weights and across models of different types. Examples of model selection across models of the same type include fitting the same model with different hyperparameters (for example, learning rate) and early stopping to prevent the overfitting of model weights to the train dataset. Model selection across models of different types includes selecting the best model architecture (for example, Swin vs. MobileNet) and selecting the best model configurations within a single model architecture (for example, mobilenet-v1-025-128 vs. mobilenet-v3-large-100-224).
The considerations outlined in this section enable all of these model selection processes on a validation dataset.
Select hyperparameter configurations
TensorFlow image classification in JumpStart has a large number of available hyperparameters that adjust the transfer learning script behavior uniformly for all model architectures. These hyperparameters relate to data augmentation and preprocessing, optimizer specification, overfitting controls, and trainable layer indicators. You are encouraged to adjust the default values of these hyperparameters as necessary for your application.
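As a starting point, the default hyperparameters for a given model ID can be retrieved through the SageMaker Python SDK. The model ID below is a placeholder; substitute the JumpStart model you intend to fine-tune.

```python
from sagemaker import hyperparameters

# Placeholder model ID and version for one of the JumpStart TensorFlow image classification models.
model_id, model_version = "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4", "*"

# Retrieve the default hyperparameters of the model's transfer learning script.
default_hps = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
print(default_hps)
```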
For this analysis and the associated notebook, all hyperparameters are set to default values except for learning rate, number of epochs, and early stopping specification. Learning rate is adjusted as a categorical parameter by the SageMaker automatic model tuning job. Because each model has unique default hyperparameter values, the discrete list of possible learning rates includes the default learning rate as well as one-fifth the default learning rate. This launches two training jobs for a single hyperparameter tuning job, and the training job with the best reported performance on the validation dataset is selected. Because the number of epochs is set to 10, which is greater than the default hyperparameter setting, the selected best training job doesn’t always correspond to the default learning rate. Finally, an early stopping criterion is used with a patience of three epochs, where patience is the number of epochs to continue training without improvement.
One default hyperparameter setting of particular importance is train_only_on_top_layer, where, if set to True, the model’s feature extraction layers are not fine-tuned on the provided training dataset. The optimizer only trains the parameters in the top fully connected classification layer, whose output dimensionality equals the number of class labels in the dataset. By default, this hyperparameter is set to True, which targets transfer learning on small datasets. If you have a custom dataset for which the features learned during pre-training on ImageNet are not sufficient, set train_only_on_top_layer to False. Although this setting increases training time, the model extracts features that are more meaningful for your problem of interest, thereby increasing accuracy.
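To make this concrete, the following is a minimal sketch of overriding these hyperparameters and sweeping the learning rate as a categorical parameter with SageMaker automatic model tuning, as described previously. It assumes the hyperparameter keys learning_rate, epochs, and train_only_on_top_layer used by the TensorFlow image classification script, and an estimator variable already configured for the chosen JumpStart model (as in the companion notebook).

```python
from sagemaker.tuner import CategoricalParameter, HyperparameterTuner

# default_hps comes from hyperparameters.retrieve_default(...) shown earlier.
default_hps["epochs"] = "10"
# Optionally fine-tune the feature extraction layers as well; the analysis in this
# post keeps the default (top layer only).
# default_hps["train_only_on_top_layer"] = "False"

# Sweep the learning rate over the default value and one-fifth of the default value.
default_lr = float(default_hps["learning_rate"])
hyperparameter_ranges = {"learning_rate": CategoricalParameter([default_lr, default_lr / 5.0])}

# estimator is assumed to be a sagemaker.estimator.Estimator built for the chosen
# JumpStart model ID with default_hps as its hyperparameters.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="val_accuracy",
    metric_definitions=[{"Name": "val_accuracy", "Regex": r"- val_accuracy: ([0-9\.]+)"}],
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=2,
    max_parallel_jobs=2,
)
```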
Extract metrics from CloudWatch Logs
The JumpStart TensorFlow image classification algorithm logs a variety of metrics during training that are accessible to SageMaker Estimator and HyperparameterTuner objects. The constructor of a SageMaker Estimator has a metric_definitions keyword argument, which can be used to evaluate the training job by providing a list of dictionaries with two keys: Name, the name of the metric, and Regex, the regular expression used to extract the metric from the training logs. The accompanying notebook shows the implementation details. The following table lists the available metrics and associated regular expressions for all JumpStart TensorFlow image classification models, and a short code sketch after the table shows how a few of them can be wired into metric_definitions.
Metric Name | Regular Expression |
--- | --- |
number of parameters | "- Number of parameters: ([0-9\.]+)" |
number of trainable parameters | "- Number of trainable parameters: ([0-9\.]+)" |
number of non-trainable parameters | "- Number of non-trainable parameters: ([0-9\.]+)" |
train dataset metric | f"- {metric}: ([0-9\.]+)" |
validation dataset metric | f"- val_{metric}: ([0-9\.]+)" |
test dataset metric | f"- Test {metric}: ([0-9\.]+)" |
train duration | "- Total training duration: ([0-9\.]+)" |
train duration per epoch | "- Average training duration per epoch: ([0-9\.]+)" |
test evaluation latency | "- Test evaluation latency: ([0-9\.]+)" |
test latency per sample | "- Average test latency per sample: ([0-9\.]+)" |
test throughput | "- Average test throughput: ([0-9\.]+)" |
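As a rough sketch of how these definitions translate into code, the following list could be passed to the metric_definitions argument of an Estimator or HyperparameterTuner to capture the test metrics reported in this post. The Name values are arbitrary labels, and the test metric strings assume the {metric} substitution shown in the table (for example, accuracy and top_5_accuracy).

```python
# Metric definitions built from the regular expressions in the preceding table.
metric_definitions = [
    {"Name": "test_accuracy", "Regex": r"- Test accuracy: ([0-9\.]+)"},
    {"Name": "test_top_5_accuracy", "Regex": r"- Test top_5_accuracy: ([0-9\.]+)"},
    {"Name": "test_throughput", "Regex": r"- Average test throughput: ([0-9\.]+)"},
    {"Name": "train_duration_per_epoch", "Regex": r"- Average training duration per epoch: ([0-9\.]+)"},
    {"Name": "number_of_parameters", "Regex": r"- Number of parameters: ([0-9\.]+)"},
]
# Pass this list as metric_definitions=metric_definitions when constructing the Estimator.
```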
The built-in transfer learning script provides a variety of train, validation, and test dataset metrics within these definitions, as represented by the f-string replacement values. The exact metrics available vary based on the type of classification being performed. All compiled models have a loss metric, which is a cross-entropy loss for either a binary or categorical classification problem; the former is used when there is one class label, and the latter is used if there are two or more class labels. If there is only a single class label, then the following metrics are computed, logged, and extractable via the f-string regular expressions in the preceding table: number of true positives (true_pos), number of false positives (false_pos), number of true negatives (true_neg), number of false negatives (false_neg), precision, recall, area under the receiver operating characteristic (ROC) curve (auc), and area under the precision-recall (PR) curve (prc). Similarly, if there are six or more class labels, a top-5 accuracy metric (top_5_accuracy) is also computed, logged, and extractable via the preceding regular expressions.
During training, the metrics specified to a SageMaker Estimator are emitted to CloudWatch Logs. When training is complete, you can invoke the SageMaker DescribeTrainingJob API and inspect the FinalMetricDataList key in the JSON response.
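The following is a minimal sketch of that lookup with boto3; the training job name is a placeholder.

```python
import boto3

sm_client = boto3.client("sagemaker")

# Placeholder name; in practice this is the best training job of a tuning job.
response = sm_client.describe_training_job(TrainingJobName="example-training-job-name")

# FinalMetricDataList holds the final value emitted for each metric definition, for example:
# [{"MetricName": "test_accuracy", "Value": 0.951, "Timestamp": ...}, ...]
final_metrics = {m["MetricName"]: m["Value"] for m in response.get("FinalMetricDataList", [])}
print(final_metrics)
```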
This API requires only the training job name, so metrics can be obtained in future analyses as long as the training job name is appropriately logged and recoverable. For this model selection task, hyperparameter tuning job names are stored, and subsequent analyses reattach a HyperparameterTuner object given the tuning job name, extract the best training job name from the attached tuner, and then invoke the DescribeTrainingJob API as described earlier to obtain metrics associated with the best training job.
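A short sketch of that reattachment pattern, assuming the tuning job name was logged earlier and reusing the sm_client from the previous snippet:

```python
from sagemaker.tuner import HyperparameterTuner

# Placeholder tuning job name recovered from earlier logs.
tuner = HyperparameterTuner.attach("example-tuning-job-name")
best_job_name = tuner.best_training_job()

# Reuse the DescribeTrainingJob lookup from the previous sketch on the best training job.
best_metrics = sm_client.describe_training_job(TrainingJobName=best_job_name)["FinalMetricDataList"]
```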
Launch asynchronous hyperparameter tuning jobs
Refer to the corresponding notebook for implementation details on asynchronously launching hyperparameter tuning jobs using the Python standard library’s concurrent.futures module, a high-level interface for asynchronously running callables. Several SageMaker-related considerations are implemented in this solution:
- Each AWS account is subject to SageMaker service quotas. Review your current limits to fully utilize your resources, and request quota increases as needed.
- Frequent API calls to create many simultaneous hyperparameter tuning jobs may exceed the API request rate and throw throttling exceptions. One resolution is to create a SageMaker Boto3 client with a custom retry configuration (see the sketch after this list).
- What happens if your script encounters an error or is stopped before completion? For a large model selection or benchmarking study, you can log tuning job names and provide convenience functions that reattach hyperparameter tuning jobs that already exist, as shown in the sketch after this list.
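The following sketch combines these ideas: a retry-configured SageMaker client, a helper that reattaches an existing tuning job or launches a new one, and asynchronous submission with concurrent.futures. The job_specs mapping and the tuner factories are assumptions standing in for your own bookkeeping.

```python
from concurrent import futures

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError
import sagemaker
from sagemaker.tuner import HyperparameterTuner

# SageMaker Boto3 client with a custom retry configuration to absorb throttling exceptions.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
sagemaker_session = sagemaker.Session(sagemaker_client=boto3.client("sagemaker", config=retry_config))

def launch_or_attach(tuning_job_name, make_tuner, inputs):
    """Reattach a tuning job if it already exists; otherwise create it and launch asynchronously."""
    try:
        return HyperparameterTuner.attach(tuning_job_name, sagemaker_session=sagemaker_session)
    except ClientError:
        tuner = make_tuner()  # user-supplied factory that builds a HyperparameterTuner
        tuner.fit(inputs=inputs, job_name=tuning_job_name, wait=False)
        return tuner

# job_specs is assumed to map tuning job names to (tuner factory, training inputs) pairs.
with futures.ThreadPoolExecutor(max_workers=8) as executor:
    submitted = {
        executor.submit(launch_or_attach, name, make_tuner, inputs): name
        for name, (make_tuner, inputs) in job_specs.items()
    }
    tuners = {submitted[f]: f.result() for f in futures.as_completed(submitted)}
```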
Analysis details and discussion
The analysis in this post performs transfer learning on the Caltech-256 dataset for model IDs in the JumpStart TensorFlow image classification algorithm. All training jobs were performed on the ml.g4dn.xlarge SageMaker training instance, which contains a single NVIDIA T4 GPU.
The test dataset is evaluated on the training instance at the end of training. Model selection is performed prior to the test dataset evaluation to set model weights to the epoch with the best validation set performance. Test throughput is not optimized: the dataset batch size is set to the default training hyperparameter batch size, which isn’t adjusted to maximize GPU memory usage; reported test throughput includes data loading time because the dataset isn’t pre-cached; and distributed inference across multiple GPUs isn’t utilized. For these reasons, this throughput is a good relative measurement, but actual throughput would depend heavily on your inference endpoint deployment configurations for the trained model.
Although the JumpStart model hub contains many image classification architecture types, this pareto frontier is dominated by select Swin, EfficientNet, and MobileNet models. Swin models are larger and relatively more accurate, whereas MobileNet models are smaller, relatively less accurate, and suitable for resource constraints of mobile devices. It’s important to note that this frontier is conditioned on a variety of factors, including the exact dataset used and the fine-tuning hyperparameters selected. You may find that your custom dataset produces a different set of pareto efficient solutions, and you may desire longer training times with different hyperparameters, such as more data augmentation or fine-tuning more than just the top classification layer of the model.
Conclusion
In this post, we showed how to run large-scale model selection or benchmarking tasks using the JumpStart model hub. This solution can help you choose the best model for your needs. We encourage you to try out and explore this solution on your own dataset.
References
More information is available at the following resources:
- Image Classification – TensorFlow
- Run image classification with Amazon SageMaker JumpStart
- Build high performing image classification models using Amazon SageMaker JumpStart
About the authors
Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
AAAI: Prompt engineering and reasoning in the spotlight
Methods for controlling the outputs of large generative models and integrating symbolic reasoning with machine learning are among the conference’s hot topics.Read More
Computer vision for automated quality inspection