Modular functions design for Advanced Driver Assistance Systems (ADAS) on AWS

Over the last 10 years, a number of players have developed autonomous vehicle (AV) systems using deep neural networks (DNNs). These systems have evolved from simple rule-based systems to Advanced Driver Assistance Systems (ADAS) and fully autonomous vehicles. These systems require petabytes of data and thousands of compute units (vCPUs and GPUs) to train.

This post covers build approaches, different functional units of ADAS, design approaches to building a modular pipeline, and the challenges of building an ADAS system.

DNN training methods and design

AV systems are built with deep neural networks. When it comes to the design of an AV system, there are two main approaches. The difference is based on how the DNNs are trained and the system boundary.

  • Modular training – With a modular pipeline design, the system is split into individual functional units (for example, perception, localization, prediction, and planning). This is a common design paradigm used by many AV system providers. With the whole system split into individual modules, they can be built and trained independently.
  • End-to-end training – This approach involves training a DNN model that takes raw sensor data as input and outputs the driving command. This is a monolithic architecture and is mainly explored by researchers. The DNN architecture is typically based on reinforcement learning (RL) based on a reward/penalty system or imitation learning (IL) by observing a human driving the vehicle. Although the overall architecture is simple, it’s hard to interpret and diagnose the monolith. However, annotations are cheap because the system learns from the data collected through human behavior.

In addition to these two approaches, researchers are also exploring a hybrid approach that trains two different DNNs that are connected by an intermediate representation.

This post explains the functions based on a modular pipeline approach.

Automation levels

The SAE International (formerly the Society of Automotive Engineers) J3016 standard defines six levels of driving automation and is the most cited source on the topic. The levels range from Level 0 (no driving automation) to Level 5 (full driving automation), as shown in the following table.

Level | Name | Who drives
0 | No driving automation | Human drives
1 | Driver assistance | Human drives
2 | Partial driving automation | Human drives
3 | Conditional driving automation | System drives with human as backup
4 | High driving automation | System drives
5 | Full driving automation | System drives

Modular functions

The following diagram provides an overview of a modular functions design.

At the higher levels of automation (Level 2 and above), the AD system performs multiple functions:

  • Data collection – The AV system gathers information about the vehicle’s surroundings in real time with centimeter accuracy. The vehicle is equipped with various devices whose functions vary and intersect in a number of ways. AV is still an evolving space, and there is no consensus or standardization on the types of sensors and devices attached. In addition to the devices listed here, vehicles might also have GPS for navigation, and use maps and Inertial Measurement Units (IMUs) to measure linear and angular acceleration. Depending on the type of ADAS system, you will see a combination of the following devices:
    • Cameras – Visual devices conceptually similar to human perception. They support high resolution but are poor at depth estimation and at handling extreme weather conditions.
    • LiDAR – Expensive devices that provide data about the surroundings as a 3D point cloud, with accurate depth and speed estimation.
    • Ultrasonics – Small, inexpensive sensors that work well only at short range.
    • Radar – Supports long and short ranges and works well in low visibility and extreme weather conditions.
  • Data fusion – The devices that are part of the AV system each provide signals with individual limitations, but the signals across devices provide complementary information. AV systems fuse data from these integrated devices to build a comprehensive perception of the environment. This integrated dataset is used to train the DNN.
  • Perception – AV systems analyze the raw data collected from the devices to construct information about the environment around the vehicle, including obstacles, traffic signs, and other objects. This is called road scene perception, or simply perception. It involves detecting objects and classifying them as nearby vehicles, pedestrians, traffic lights, and traffic signs. This function measures depth and performs lane detection, lane curvature estimation, curb detection, and occlusion handling. This information is key for path planning and route optimization.
  • Localization and mapping – To operate the vehicle safely, AV systems need an understanding of the location of the objects detected by perception. The AV system constructs a 3D map and updates the position of the host vehicle (ego vehicle) and its surroundings in the map. It tracks the detected objects and their current location. Advanced systems predict the kinematics of the objects that are in motion.
  • Prediction – With the information collected from other modules, AV systems predict how the immediate future of the environment is going to change. The DNN running on the vehicle predicts the position of the ego vehicle and the surrounding object interactions by projecting the kinematic states over time (position, velocity, acceleration, jerk); a simplified sketch of this projection follows this list. It can predict potential traffic violations and collisions or near collisions.
  • Path planning – This function is responsible for drawing out the possible routes the vehicle can take as the next action based on inputs from perception, localization, and prediction. To plan the best possible route, the AV system takes localization, maps, GPS data, and predictions as input. Some AV systems construct a bird’s-eye view by projecting kinematics of the ego vehicle and other objects onto a static route to provide a 3D map. Some also fuse data from other vehicles. Overall, the planning function finds the optimal route from all the possible ones with a goal to maximize driver comfort (for example, smooth turns vs. sharp turns, slow down vs. stop abruptly at stop signs).
  • Control and execution – Takes the input from the route planner to perform actions to accelerate, decelerate, stop, and rotate the steering wheel. The goal of the controller is to maintain the planned trajectory.
  • Training pipeline – DNNs providing predictions on the vehicle need to be trained. They are typically trained in an offline fashion with data collected from the vehicles. Training requires thousands of compute units for an extended period of time. The amount of data and compute power required varies based on the model architecture and the AV system provider. To train the DNNs, the AV system provider requires labeled data that is partly annotated by humans and partly automated. Typically, personally identifiable information (PII) such as license plate numbers and faces is anonymized via blurring. Many providers augment the labeled data with simulation, which provides the ability to generate data for specific scenarios and augment real-world data. AV system providers also utilize tools to mine relevant data for training, fine-tuning, and handling edge cases. The trained models are validated for accuracy with offline simulation. Some providers use a dormant model strategy and deploy candidate (dormant) models side by side with the production models. Although predictions from the dormant models aren’t used to control the vehicle, they help the providers validate the models’ accuracy in real-world scenarios.
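
To make the kinematic projection used by the prediction function concrete, the following is a minimal sketch (not taken from any AV provider's stack) that propagates an object's position forward with a constant-acceleration motion model. The function name, state layout, and example values are illustrative assumptions.

# Minimal illustration (not from any AV provider's codebase): projecting an object's
# kinematic state forward in time with a constant-acceleration motion model.
import numpy as np

def project_state(position, velocity, acceleration, dt, steps):
    """Return predicted positions over `steps` intervals of length `dt` seconds."""
    positions = []
    p = np.asarray(position, dtype=float)
    v = np.asarray(velocity, dtype=float)
    a = np.asarray(acceleration, dtype=float)
    for _ in range(steps):
        p = p + v * dt + 0.5 * a * dt ** 2  # x(t+dt) = x + v*dt + 0.5*a*dt^2
        v = v + a * dt
        positions.append(p.copy())
    return positions

# Example: an object 20 m ahead, closing at 5 m/s, decelerating at 1 m/s^2
print(project_state([20.0, 0.0], [-5.0, 0.0], [1.0, 0.0], dt=0.1, steps=10))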

Challenges

DNNs for AV workloads need to be trained with huge volumes of data. You need a compute infrastructure that can scale to train the DNNs, handle large volumes of training data, and optimize training with model and data parallelism.

Training with large volumes of data

AV systems collect a large volume of data from the devices attached to the vehicle. Depending on the AV system provider, the vehicle fleet ranges from a handful to thousands of vehicles. The following are some typical challenges an AV system provider may encounter:

  • Collection, preprocessing, and storage of petabytes of data – Each vehicle collects more than 40 TB of data for every 8 hours of driving.
  • Identification of relevant, representative data from a huge volume of data – This is essential to reduce biases in the datasets so that common scenarios (driving at normal speed with no obstruction) don’t create class imbalance. To yield better accuracy, DNNs require large volumes of diverse, good-quality data.
  • Volume of corner cases – ML models need to handle a wide range of corner cases. This is essential to ensure the safety of the AV system.
  • Training time – Given a huge volume of data, training time is often in multiple days or even weeks. This reduces the development velocity and ability to fail fast.

To address the data volume challenge, you can utilize the Amazon SageMaker distributed data parallelism feature (SMDDP). SageMaker is a fully managed machine learning (ML) service. With data parallelism, a large volume of data is split into batches. Blocks of data are sent to multiple CPUs or GPUs, called nodes, and the results are combined. Each node has a copy of the DNN. SageMaker has developed the distributed data parallel library, which splits data per node and optimizes the communication between the nodes. You can use the SageMaker Python SDK to trigger a job with data parallelism with minimal modifications to the training script. Data parallelism supports popular deep learning frameworks such as PyTorch, PyTorch Lightning, TensorFlow, and Hugging Face Transformers.
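
As a minimal sketch of how a data parallel training job could be launched with the SageMaker Python SDK, the following example enables the SageMaker data parallel library on a PyTorch estimator. The entry point script, S3 path, instance count, and hyperparameters are placeholders, not a specific customer setup.

# Hypothetical sketch: enabling SageMaker distributed data parallelism on a PyTorch estimator.
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",              # your training script, adapted for SMDDP
    role=sagemaker.get_execution_role(),
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",     # SMDDP requires supported multi-GPU instance types
    instance_count=2,
    hyperparameters={"epochs": 10, "batch-size": 256},
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"training": "s3://my-bucket/av-training-data/"})  # placeholder S3 path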

Hyundai Motor Company utilized SageMaker data parallelism to reduce training time for their autonomous driving models and achieved more than 90% scaling efficiency with eight instances, each having 8 GPUs. The following diagram illustrates this architecture.

For more details, refer to Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker.

For more information about distributed training with SageMaker, refer to the AWS re:Invent 2020 video Fast training and near-linear scaling with DataParallel in Amazon SageMaker and The science behind Amazon SageMaker’s distributed-training engines.

Labeling a large volume of data

The training pipeline requires a large volume of labeled datasets. One of the common challenges faced by our customers is developing annotation tools to label image, video, and sensor data (for example, 3D point clouds), along with custom workflows for object detection and semantic segmentation tasks. You also need the ability to customize your workflows.

Amazon SageMaker Ground Truth is a fully managed data labeling service that provides flexibility to build and manage custom workflows. With Ground Truth, you can label image, video, and point cloud data for object detection, object tracking, and semantic segmentation tasks. You can transfer data collected from the vehicles and stored on premises to AWS using a data transfer mechanism such as AWS Storage Gateway, AWS Direct Connect, AWS DataSync, AWS Snowball, or AWS Transfer Family. After the data is preprocessed (such as blurring faces and license plates), the cleaned dataset is ready for labeling. Ground Truth supports sensor fusion of LiDAR data with video inputs from cameras. You can choose to use human annotators through Amazon Mechanical Turk, trusted third-party vendors, or your own private workforce.

In the following figure, we provide a reference architecture to preprocess data using AWS Batch and using Ground Truth to label the datasets.

For more information, refer to Field Notes: Automating Data Ingestion and Labeling for Autonomous Vehicle Development and Labeling data for 3D object tracking and sensor fusion in Amazon SageMaker Ground Truth.

For more information on using Ground Truth to label 3D point cloud data, refer to Use Ground Truth to Label 3D Point Clouds.

Training infrastructure

As AV systems mature, the DNNs need to be trained to handle more edge cases (for example, humans walking on highways), and the models become larger and more complex. This results in training the DNNs with more data, mined from the recorded data or generated through simulation, to handle newer scenarios. This demands more compute capacity and a scalable compute infrastructure.

To support the computing needs for ML workloads, SageMaker provides multiple instance types for training. Each family is designed for a few specific workloads; you can choose based on the vCPU, GPU, memory, storage, and networking configurations of the instances. For full, end-to-end AV development, companies largely rely on the m, c, g, and p families.

Some of our customers use our Deep Learning AMIs (DLAMI) to launch NVIDIA GPU-based Amazon Elastic Compute Cloud (Amazon EC2) instances in the p family. Each EC2 p family instance generation integrates the latest NVIDIA technology, including the p2 instances (Tesla K80), p3 instances (Volta V100), and p4d instances (Ampere A100).

The following figure summarizes the available instances:

When a DNN is complex and can’t fit in the memory of one GPU, you can use the SageMaker model parallelism library. This splits the layers across GPUs and instances. You can use the library to automatically partition your TensorFlow and PyTorch models across multiple GPUs and multiple nodes with minimal code changes.
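
The following is a hedged sketch of how model parallelism could be enabled on a PyTorch estimator. The entry point, partition count, and data path are assumptions for illustration; the training script itself must be instrumented with the smdistributed.modelparallel library.

# Hypothetical sketch: partitioning a large model with the SageMaker model parallelism library.
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {"partitions": 4, "microbatches": 8, "ddp": True},
}
estimator = PyTorch(
    entry_point="train_smp.py",          # script instrumented with smdistributed.modelparallel
    role=sagemaker.get_execution_role(),
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit("s3://my-bucket/av-training-data/")  # placeholder S3 path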

MLOps

When it comes to operationalizing, from data scientists conducting experiments on revised models to deploying across thousands of vehicles, AV system providers need a set of tools that work end to end seamlessly for various needs:

  • Data collection and transformation at scale
  • Automated analysis and evaluation of models
  • Standardization of data pipelines
  • The ability to define and conduct experiments for data scientists
  • Monitoring model performance
  • Establishing a repeatable process and eliminating human intervention with end-to-end automation
  • Automated model deployment, which enables you to quickly deploy a trained model across millions of vehicles

SageMaker provides comprehensive MLOps tools. Data scientists can use Amazon SageMaker Experiments, which automatically tracks the inputs, parameters, configurations, and results of iterations as trials. You can further assign, group, and organize these trials into experiments. Amazon SageMaker Model Monitor helps continuously monitor the quality of your ML models in real time. You can set up automated alerts to notify you when there are deviations in model quality, such as data drift and anomalies. When it comes to orchestration, you can choose from a number of options, including the SageMaker Pipelines SDK, AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and open-source tools like Kubeflow.
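
As a minimal sketch of experiment tracking with the SageMaker Python SDK, the following logs parameters and a metric for a training iteration. The experiment name, run name, parameters, and metric are placeholders.

# Hypothetical sketch: tracking a training iteration with SageMaker Experiments.
from sagemaker.experiments.run import Run
from sagemaker.session import Session

with Run(
    experiment_name="perception-model-experiments",
    run_name="resnet-backbone-trial-1",
    sagemaker_session=Session(),
) as run:
    run.log_parameter("learning_rate", 0.001)
    run.log_parameter("batch_size", 256)
    # ... launch or run training here ...
    run.log_metric(name="validation:mAP", value=0.81, step=1)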

Conclusion

In this post, we covered the build approaches and different functional units of ADAS, design approaches for building a modular pipeline, and the challenges of building an ADAS system. We provided reference architectures and links to case studies and blog posts that explain how our customers use SageMaker and other AWS services to build a scalable AV system. The proposed solutions can help customers address these challenges. In a later post, we will do a deep dive into the DNNs used by ADAS systems.


About the Authors

Shreyas Subramanian is a Principal AI/ML Specialist Solutions Architect who helps customers solve their business challenges using machine learning on the AWS platform. Shreyas has a background in large-scale optimization and machine learning, and in the use of machine learning and reinforcement learning to accelerate optimization tasks.

Gopi Krishnamurthy is a Senior AI/ML Solutions Architect at Amazon Web Services based in New York City. He works with large Automotive customers as their trusted advisor to transform their Machine Learning workloads and migrate to the cloud. His core interests include deep learning and serverless technologies. Outside of work, he likes to spend time with his family and explore a wide range of music.

Read More

Boomi uses BYOC on Amazon SageMaker Studio to scale custom Markov chain implementation

This post is co-written with Swagata Ashwani, Senior Data Scientist at Boomi.

Boomi is an enterprise-level software as a service (SaaS) independent software vendor (ISV) that creates developer enablement tooling for software engineers. These tools integrate via API into Boomi’s core service offering.

In this post, we discuss how Boomi used the bring-your-own-container (BYOC) approach to develop a new AI/ML enabled solution for their customers to tackle the “blank canvas” problem. Boomi’s machine learning (ML)-powered solution facilitates the rapid development of integrations on their platform, and enables faster time to market for their customers. Boomi funded this solution using the AWS PE ML FastStart program, a customer enablement program meant to take ML-enabled solutions from idea to production in a matter of weeks. Boomi built this solution using Amazon SageMaker Studio, an end-to-end browser-based IDE for AI/ML workloads, and Amazon Elastic Container Registry (Amazon ECR).

The blank canvas problem describes productivity and creativity issues faced by developers when starting a new task. An experienced developer knows at the onset of a new task what their code base will look like generally, but the process of building this code base is extensive and there’s no clear starting point. As the developer begins making progress on the blank canvas, their productivity is still low. The code written is usually boilerplate code providing the foundation for the business logic that can’t be written until most of the foundation is laid.

Boomi built a novel solution for the blank canvas problem using traditional development techniques. Boomi’s ML and data engineering teams needed the solution to be deployed quickly, in a repeatable and consistent way, at scale. The Boomi team uses the SageMaker BYOC paradigm to support their custom model. The Boomi team then used SageMaker Projects and SageMaker Pipelines to automate the training, testing, monitoring, and deployment of their custom model solution.

Customer use case

Markov chains are specialized structures for making predictive recommendations in a state machine. Markov chains are best known for their applications in web crawling and search engines. Boomi’s data science team implemented a Markov chain model that could be applied to common integration sequences, or steps, on their platform, hence the name Step Suggest.

Markov chains are built using a state machine and a probability matrix describing the likelihood of state transitions. Given a starting state, a Markov chain calculates the probability of a transition to another state allowed by the state machine. The data science team at Boomi applied the Markov Chain approach to the Step Suggest problem by treating integration steps as states in a state machine. Boomi’s Markov chain implementation takes the previous integration step and predicts the next integration step with significant accuracy.
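
As a simplified illustration of the idea (not Boomi's proprietary implementation), the following sketch builds a transition matrix from observed step sequences and suggests the most probable next step; the step names are invented.

# Simplified illustration: a first-order Markov chain that predicts the most likely
# next integration step from observed step sequences.
from collections import defaultdict

def build_transition_matrix(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    # Normalize counts into transition probabilities
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

def suggest_next_step(matrix, current_step):
    candidates = matrix.get(current_step, {})
    return max(candidates, key=candidates.get) if candidates else None

sequences = [
    ["connector", "map", "transform", "send"],
    ["connector", "map", "send"],
    ["connector", "transform", "send"],
]
matrix = build_transition_matrix(sequences)
print(suggest_next_step(matrix, "map"))  # most probable next step after "map"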

Boomi had significant success with their application of Markov chains. However, the underlying algorithm for Step Suggest is complicated and proprietary. SageMaker has built-in support for several popular ML algorithms, but Boomi already had a working solution. Instead of starting from scratch, Boomi used the BYOC approach to import their existing models to SageMaker. As a result, Boomi’s team was able to use SageMaker for inference, CI/CD, and monitoring, without having to rebuild their Markov chain from scratch.

Solution details

The most important criteria for this solution were the reusability of existing models and the ease of deployment of those models to production. Boomi’s Step Suggest solution needs automated training and inference pipelines. At the time of the migration to SageMaker’s BYOC deployment model, Boomi’s solution was largely built and tested on individual laptops.

Boomi used Amazon ECR to store versions of their Step Suggest model. Amazon ECR stores and versions containerized applications in a container registry. The Boomi team built a Docker container with the model developed on individual laptops and uploaded that container image to an Amazon ECR repository. When the upload was complete, Boomi made the image available to their SageMaker domain, where it could be imported and used for additional ML tasks like inference deployments to a hosted endpoint.
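
The following is a hedged sketch of the BYOC deployment step: registering a custom container image from Amazon ECR as a SageMaker model and deploying it to a real-time endpoint. The image URI, model artifact path, instance type, and endpoint name are placeholders, not Boomi's actual resources.

# Hypothetical sketch: deploying a custom (BYOC) container image as a SageMaker endpoint.
import sagemaker
from sagemaker.model import Model

model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/step-suggest:latest",
    model_data="s3://my-bucket/step-suggest/model.tar.gz",
    role=sagemaker.get_execution_role(),
    sagemaker_session=sagemaker.Session(),
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="step-suggest-endpoint",
)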

The exact steps to replicate this process are outlined in Train and deploy deep learning models using JAX with Amazon SageMaker. That post discusses how to bring the JAX framework into your SageMaker domain. JAX is an up-and-coming ML framework for which SageMaker has no built-in support. Boomi implemented a similar workflow for their proprietary framework, extending the capabilities of their SageMaker deployment to satisfy the requirements of the Step Suggest project. There are a few prerequisites; complete the next steps before following the guide in the JAX post to practice the BYOC deployment paradigm with SageMaker.

Alternatives to SageMaker

Boomi was already an AWS customer before the AWS PE ML FastStart program. In fact, most of their data science team was using SageMaker notebook instances for model development. Data stored in Amazon Simple Storage Service (Amazon S3) was used to train models on notebook instances, which come pre-installed with the Jupyter Notebook software. This worked for model development, but Boomi needed a more robust solution to scale this workload to their customers.

The AWS PE ML FastStart program conducted a deep-dive session with Boomi’s data science and engineering teams. We decided SageMaker Studio would be a better solution for Boomi’s team to scale this solution quickly to their customers.

Why SageMaker?

SageMaker Studio brought several key advantages that SageMaker notebook instances couldn’t offer alone. First and foremost, Studio makes it easier to share notebook assets across a large team of data scientists like the one at Boomi. Boomi’s analysts were free to use SageMaker Data Wrangler for data preparation tasks, while Boomi’s data scientists could continue to use Jupyter notebooks. Most importantly, Studio maintained BYOC functionality. This was absolutely crucial because it meant the team could reuse the model assets they had already built.

Secondly, SageMaker Pipelines made it easy for Boomi’s team to visualize and modify their complex CI/CD requirements. The BYOC deployment paradigm requires additional integrations with Amazon ECR. To that end, the exact training and inference pipelines used by Boomi’s MLOps team necessitated additional steps for automated deployment, rollback, and monitoring. SageMaker Pipelines and the AWS Step Functions Data Science SDK addressed this requirement.

Finally, SageMaker Projects presented the team with the ability to create AWS CloudFormation templates that standardized their ML development environments. Infrastructure as code (IaC) solutions like AWS CloudFormation reduce digital waste and standardize resource deployments in an AWS account. When CloudFormation templates are deployed through the AWS Service Catalog, as is done with SageMaker Projects, data science teams can operate freely without fear of violating any organization guardrails or best practices. Boomi’s cloud engineering team agreed this would be an important factor in scaling their data science team.

Feature deep dive

The following diagram illustrates the solution architecture and workflow.

Solution Architecture

The SageMaker BYOC paradigm enabled Boomi’s team to reuse a highly customized implementation of a proprietary ML algorithm. Boomi also had several custom preprocessing and postprocessing steps for their models. These proprietary steps allowed Boomi to bridge the gap between their data science and core product engineering teams. Implementing the processing logic within Studio, although possible, would be better suited for a built-in algorithm. The Studio BYOC paradigm enabled Boomi’s data science team to do what they did best without sacrificing speed and agility in their product’s development.

Because Boomi is a large organization with a strong cloud governance team, and because there are so many teams actively contributing to this project, having robust CI/CD is necessary. The CI/CD enabled by SageMaker Pipelines made it possible for the various contributing parties to collaborate. Boomi’s analysts contributed to preprocessing and postprocessing; the data science team customized, tuned, and built the model within a container; and the MLOps and systems engineering team were able to integrate Step Suggest into their core platform.

Conclusion

By using Amazon SageMaker Studio, Amazon SageMaker Projects, and Amazon SageMaker Pipelines, Boomi made it easier to build MLOps solutions at scale.

“AWS SageMaker Pipeline based solution has reduced the time needed to build, deploy, and manage our model by ~30%, thanks to its intuitive and user-friendly interface. By using this solution, we were able to deploy our model in just 4 weeks, 2x faster than if we had used traditional infrastructure.”

Boomi has an active relationship with their AWS account team. AWS account teams connect customers like Boomi with programs designed to address their business and technology needs. Connect with your AWS account team to learn more about programs like AWS PE ML FastStart to improve your time to market for new, innovative products built on or with AWS.


About the Authors

Dan Ferguson is an AI/ML Specialist Solutions Architect (SA) on the Private Equity Solutions Architecture team at Amazon Web Services. Dan helps private equity-backed portfolio companies use AI/ML technologies to achieve their business objectives.

Swagata Ashwani is a Senior Data Scientist at Boomi with over six years of experience in data science. Her interests include MLOps, natural language processing, and data visualization. She also actively volunteers for Women in Data/AI, spreading awareness and outreach within the AI community.
In her spare time she can be found plucking strums of her guitar, sipping masala chai and enjoying spicy Indian street food.

Read More

MLOps deployment best practices for real-time inference model serving endpoints with Amazon SageMaker

After you build, train, and evaluate your machine learning (ML) model to ensure it’s solving the intended business problem, you want to deploy that model to enable decision-making in business operations. Models that support business-critical functions are deployed to a production environment where a model release strategy is put in place. Given the nature of ML models, where the data is continuously changing, you also want to ensure that a deployed model is still relevant to new data and that the model is updated when this is not the case. This includes choosing a deployment strategy that minimizes risks and downtime. This optimal deployment strategy should maintain high availability of the model, consider the business cost of deploying a model inferior to what is already in production, and contain functionality to easily roll back to a previous model version. Many of these recommended considerations and deployment patterns are also covered within the AWS Well-Architected Framework – Machine Learning Lens.

In addition to choosing the right deployment strategy, that strategy should be implemented using a reliable mechanism that includes MLOps practices. MLOps includes practices that integrate ML workloads into release management, CI/CD, and operations, accounting for the unique aspects of ML projects, including considerations for deploying and monitoring models. Amazon SageMaker for MLOps provides purpose-built tools to automate and standardize steps across the ML lifecycle, including capabilities to deploy and manage new models using advanced deployment patterns.

In this post, we discuss how to deploy ML models with Amazon SageMaker in a repeatable and automated way, integrating the production variants and deployment guardrails capabilities of SageMaker with MLOps solutions. We give you an introduction of how to integrate the MLOps tools of SageMaker with SageMaker model deployment patterns, focusing on real-time single-model endpoints.

Solution overview

We explore the following model testing and guardrail patterns and their integration with SageMaker MLOps tools:

  • Model testing – We compare different model versions in production before replacing the current model version. This post compares the following model testing capabilities:
    • A/B testing – With A/B testing, you compare different versions of your model in production by distributing the endpoint traffic between your model variants. A/B testing is used in scenarios where closed loop feedback can directly tie model outputs to downstream business metrics. This feedback is then used to determine the statistical significance of changing from one model to another, helping you select the best model through live production testing.
    • Shadow tests – With shadow tests, you test a new version of your model in production by sending requests to the production model and the new model in parallel. The prediction response data from the production model is served to the application, while the new model version predictions are stored for testing but not served to the production application. Shadow testing is used in situations where there is no closed loop feedback mapping a business metric back to a model’s predictions. In this scenario, you use model quality and operational metrics to compare multiple models instead of any impact on downstream business metrics.
  • Shifting traffic – After you have tested the new version of the model and are satisfied with its performance, the next step is to shift traffic from the current model to the new one. The blue/green deployment guardrails in SageMaker allow you to switch from the current model in production (blue fleet) to a new one (green fleet) in a controlled way. Blue/green deployments avoid the downtime during model updates that you would have in an in-place deployment scenario. To maximize model availability, as of this writing, blue/green deployments are the default option for model updates in SageMaker. We discuss the following traffic shifting methods in this post:
    • All at once traffic shifting – 100% of your endpoint traffic is shifted from your blue fleet to your green fleet after the green fleet becomes available. Alarms in Amazon CloudWatch monitor your green fleet for a set amount of time (the baking period), and if no alarm is triggered, SageMaker deletes the blue fleet after the baking period.
    • Canary traffic shifting – Your green fleet is first exposed to a smaller proportion of your traffic (a canary) and validated for any issues using CloudWatch alarms for a baking period, while the blue fleet keeps receiving most of the endpoint traffic. After the green fleet is validated, all traffic is shifted to the new fleet and the blue fleet is then deleted by SageMaker.
    • Linear traffic shifting – You gradually shift traffic from your blue fleet to your green fleet in a step approach. Your model is then monitored with CloudWatch alarms for a baking period in each step before the blue fleet is completely replaced.

This post focuses on describing architectures that utilize SageMaker MLOps features to perform controlled deployments of models via the deployment guardrails and modeling testing strategies we’ve listed. For general information on these patterns, refer to Take advantage of advanced deployment strategies using Amazon SageMaker deployment guardrails and Deployment guardrails.

Deploy a model with SageMaker

SageMaker offers a broad range of deployment options that vary from low latency and high throughput to long-running inference jobs. These options include considerations for batch, real-time, or near real-time inference. Each option offers different advanced features, such as the ability to run multiple models on a single endpoint. However, as previously mentioned, for this post, we only cover MLOps deployment patterns using single-model endpoints. To dive further into more advanced SageMaker deployment features for real-time inference, refer to Model hosting patterns in Amazon SageMaker, Part 2: Getting started with deploying real time models on SageMaker.

To understand the implementation of advanced deployment patterns using continuous delivery (CD) pipelines, let’s first discuss a key concept within SageMaker called model variants.

SageMaker model variants

Model variants allow you to deploy multiple versions of your model to the same endpoint to test your model. Model variants are deployed to separate instances, so there is no impact on other variants when one is updated. In SageMaker, model variants are implemented as production and shadow variants.

Production variants allow you to A/B test multiple versions of your model to compare their performance. In this scenario, all versions of your model return responses to the model requests. Your endpoint traffic is distributed between the existing variants either by traffic distribution, where you assign a weight to each variant, or by target variant, where a certain parameter (for instance, Region or market) decides which model should be invoked.
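
The following is a hedged sketch of an endpoint configuration with two production variants that split traffic 80/20 by variant weight; the model, config, and endpoint names are placeholders.

# Hypothetical sketch: an endpoint config with two production variants for A/B testing.
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-model-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "my-model-v1",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,
        },
        {
            "VariantName": "model-b",
            "ModelName": "my-model-v2",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,
        },
    ],
)
sm.create_endpoint(EndpointName="my-model-endpoint", EndpointConfigName="my-model-ab-test-config")

At inference time, you can also set the TargetVariant parameter on invoke_endpoint to route an individual request to a specific variant instead of relying on the weighted distribution.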

Shadow variants allow you to shadow test a new version of your model. In this scenario, your model has a production variant and a shadow variant deployed in parallel to the same endpoint. The shadow variant receives the full (or sampled) data traffic from your endpoint. However, only the predictions of the production variants are sent back to the users of your application, and the predictions from the shadow variants are logged for analysis. Because shadow variants are launched on separate instances from the production variant, there is no performance impact to your production variant in this test. With this option, you are testing the new model and minimizing the risks of a low-performing model, and you can compare both models’ performance with the same data.
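
As a hedged sketch of the shadow testing setup, the following endpoint configuration adds a shadow variant alongside the production variant; the model and config names are placeholders.

# Hypothetical sketch: an endpoint config that adds a shadow variant. The production
# variant serves responses; the shadow variant receives a copy of the traffic.
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-model-shadow-config",
    ProductionVariants=[
        {
            "VariantName": "production",
            "ModelName": "my-model-v1",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
        }
    ],
    ShadowProductionVariants=[
        {
            "VariantName": "shadow",
            "ModelName": "my-model-v2-candidate",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,  # fraction of production traffic copied to the shadow
        }
    ],
)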

SageMaker deployment guardrails

Guardrails are an essential part of software development. They protect your application and minimize the risk of deployment of a new version of your application. Similarly, SageMaker deployment guardrails allow you to switch from one model version to another in a controlled way. As of December 2022, SageMaker guardrails provide implementation for blue/green, canary, and linear traffic shifting deployment options. When combined with model variants, deployment guardrails can be applied both to production and shadow variants of your model, ensuring no downtime during the update of a new variant, with the traffic shifting being controlled according to the option selected.

MLOps foundations for model deployment

In the broader context of an ML model building and deploying workflow, we want to employ CI/CD practices purpose built for the ML workflow. Similar to traditional CI/CD systems, we want to automate software tests, integration testing, and production deployments. However, we also need to include specific operations around the ML lifecycle that aren’t present in the traditional software development lifecycle such as model training, model experimentation, model testing, and model monitoring.

To achieve those ML-specific capabilities, MLOps foundations such as automated model testing, deployment guardrails, multi-account deployments, and automated model rollback are added to the model deployment process. This ensures that the already described capabilities allow for model testing and avoid downtime during the process of a model update. It also provides the reliability and traceability necessary for the continuous improvement of a production-ready model. Additionally, capabilities like the ability to package existing solutions into reusable templates and deploy models in a multi-account setup ensure the scalability of the model deployment patterns discussed in the post to several models across an organization.

The following figure demonstrates a common pattern for the connection of SageMaker capabilities to create an end-to-end model building and deployment pipeline. In this example, a model is developed in SageMaker using SageMaker Processing jobs to run data processing code that is used to prepare data for an ML algorithm. SageMaker Training jobs are then used to train an ML model on the data produced by the processing job. The model artifacts and associated metadata are stored in the SageMaker Model Registry as the last step of the training process. This is orchestrated by SageMaker Pipelines, which is a purpose-built CI/CD service for ML that helps automate and manage ML workflows at scale.
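
To illustrate the processing, training, and registration pattern described above, the following is a condensed, hypothetical pipeline sketch. The script names, training image URI, instance types, and model package group are placeholders, and the condition and deployment steps are omitted for brevity.

# Hypothetical sketch: a SageMaker pipeline with processing, training, and model registration steps.
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.step_collections import RegisterModel

role = sagemaker.get_execution_role()

processor = SKLearnProcessor(framework_version="1.0-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)
process_step = ProcessingStep(name="PrepareData", processor=processor, code="preprocess.py")

estimator = Estimator(image_uri="<training-image-uri>", role=role,
                      instance_type="ml.m5.xlarge", instance_count=1,
                      output_path="s3://my-bucket/model-artifacts/")
train_step = TrainingStep(name="TrainModel", estimator=estimator)

register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="my-model-group",
    approval_status="PendingManualApproval",
)

pipeline = Pipeline(name="model-build-pipeline", steps=[process_step, train_step, register_step])
pipeline.upsert(role_arn=role)
pipeline.start()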

After the model is approved, it is tested in production with either an A/B testing or a shadow deployment. After the model is validated in production, we use the model registry to approve the model for production rollout to a SageMaker endpoint using one of the deployment guardrails options.

When the model update process is complete, SageMaker Model Monitor continually monitors the model performance for drift in model and data quality. This process is automated across multiple use cases using SageMaker Projects templates that map the infrastructure deployment to a multi-account setup in order to ensure complete resource isolation and easier cost control.

Single-model endpoint deployment patterns

When deploying models to a production environment for the first time, you don’t have a model running to compare with, and the deployed model will be the one used by your business application. After the model is deployed and monitored in a production environment, you might want to update the model, either on a regular basis or on demand, when new data is available or when your model has a performance gap detected. When updating an existing model, you want to ensure that the new model performs better than the current one and can handle the prediction request traffic from your business applications. During this validation period, you want the current model to still be available for a possible rollback to minimize the risk of downtime to your applications.

In a broader model development picture, models are typically trained in a data science development account. This includes experimentation workflows often used in the development of models as well as retraining workflows used in production-ready pipelines. All of the metadata for these experiments can be tracked using Amazon SageMaker Experiments during development. After the workflow is incorporated into a pipeline for production use, the metadata is automatically tracked through SageMaker Pipelines. To keep track of viable production models in one place, after experimentation has brought a model’s performance metrics (precision, recall, and so on) to an acceptable level for production, a condition step in the SageMaker pipeline allows the model to be registered into the model registry.

The model registry allows you to trigger the deployment of this model with a manual or automated approval process. This deployment takes place in an ML test account where operational tests such as integration tests, unit tests, model latency, and any additional model validation can be performed against the new model version. Note that A/B testing and shadow testing are not performed in the ML test account, but rather in the ML production account.

After the model passes all validations in the test account, it’s ready to be deployed to a production environment. A new approval process triggers this deployment, and SageMaker deployment guardrails allow for a controlled release and transparent model update process according to the traffic shifting mode selected.

The following diagram illustrates this solution architecture.

All at once traffic shifting

The all at once traffic shifting mode allows you to update a new model version (green fleet) by completely shifting 100% of the traffic from your current model (blue fleet) to your new model. With this option, you can configure a baking period during which both versions of your model are still running, and you can quickly and automatically roll back to the current version if your new model doesn’t perform as expected. The downside of this option is that all your data traffic is affected at once, so if there is an issue with your model deployment, all users using the application during the deployment process are affected. The following architecture shows how the all at once traffic shifting option handles model updates.

All at once traffic shifting can be incorporated into your MLOps tooling by defining an endpoint deployment configuration with BlueGreenUpdatePolicy set to ALL_AT_ONCE. In your MLOps pipeline, after a new model is approved for deployment to the ML production account, SageMaker checks if your model endpoint already exists. If so, the ALL_AT_ONCE configuration triggers an endpoint update that follows the architecture. Your endpoint rollback is controlled based on CloudWatch alarms defined by your endpoint AutoRollbackConfiguration, which when triggered automatically starts the model rollback to your current model version.
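
The following is a hedged sketch of an all at once endpoint update with automatic rollback; the endpoint, config, alarm names, and wait times are placeholders.

# Hypothetical sketch: an all at once blue/green endpoint update with automatic rollback.
import boto3

sm = boto3.client("sagemaker")
sm.update_endpoint(
    EndpointName="my-model-endpoint",
    EndpointConfigName="my-model-config-v2",   # config that points to the new (green) model
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "ALL_AT_ONCE",
                "WaitIntervalInSeconds": 300,   # baking period before the blue fleet is terminated
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "my-endpoint-5xx-alarm"}]
        },
    },
)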

Canary traffic shifting

The canary traffic shifting mode allows you to test your new model (green fleet) with a small portion of the data traffic before either updating the running model (blue fleet) to the new version or rolling back the new version, depending on the outcome of the canary testing. The portion of the traffic used to test the new model is called the canary, and in this option your risk of a problematic new model is minimized to the canary traffic while the update time is still minimized.

Canary deployments allow you to minimize the risk of implementing a new model version by exposing the new model version to a smaller group of users to monitor effectiveness over a period of time. The downside is managing multiple versions for a period of time that allows for gathering performance metrics that are meaningful enough to determine performance impact. The benefit is the ability to isolate risk to a smaller group of users.

Canary traffic shifting can be incorporated into your MLOps tooling by defining an endpoint deployment configuration with a BlueGreenUpdatePolicy set to CANARY and defining the CanarySize to determine how much of your endpoint traffic should be redirected to a new model version. Similar to the all at once option, in your MLOps pipeline, after a new model is approved for deployment to the ML production account, SageMaker checks if your model endpoint already exists. If so, the CANARY configuration triggers an endpoint update that follows the architecture outlined in the following diagram. Your endpoint rollback is controlled based on CloudWatch alarms defined by your endpoint AutoRollbackConfiguration, which, when triggered, automatically start the model rollback to your current model version. Useful alarm types to deploy here are 500 status codes and model latency; however, these alarm settings should be customized to your specific business use case and ML technology.
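
The following is a hedged sketch of a canary update that sends 10% of capacity to the green fleet and bakes it against CloudWatch alarms before shifting the rest; resource names and values are placeholders.

# Hypothetical sketch: a canary blue/green endpoint update.
import boto3

sm = boto3.client("sagemaker")
sm.update_endpoint(
    EndpointName="my-model-endpoint",
    EndpointConfigName="my-model-config-v2",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,   # baking period for the canary
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": "my-endpoint-5xx-alarm"},
                {"AlarmName": "my-endpoint-model-latency-alarm"},
            ]
        },
    },
)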

Linear traffic shifting

In the linear traffic shifting mode, you gradually shift the traffic from your current model (blue fleet) to your new model version (green fleet) by increasing the data traffic sent to the new model in steps. This way, the proportion of traffic used to test your new model version gradually increases with each step, and a baking time for each step ensures that your model is still operational with the new traffic. With this option, you minimize the risk of deploying a low-performing model and gradually expose the new model to more data traffic. The downside of this approach is that your update time is longer and the costs of running both models in parallel are increased.

Linear traffic shifting can be incorporated into your MLOps tooling by defining an endpoint deployment configuration with BlueGreenUpdatePolicy set to LINEAR and defining the LinearStepSize to determine how much of your traffic should be redirected to the new model in each step. Similar to the all at once option, in your MLOps pipeline, after a new model is approved for deployment to the ML production account, SageMaker checks if your model endpoint already exists. If so, the LINEAR configuration triggers an endpoint update that follows the architecture indicated in the following diagram. Your endpoint rollback is controlled based on CloudWatch alarms defined by your endpoint AutoRollbackConfiguration, which, when triggered, automatically start the model rollback to your current model version.
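
The following is a hedged sketch of a linear update that shifts traffic in 20% increments, baking each step against CloudWatch alarms; resource names and values are placeholders.

# Hypothetical sketch: a linear blue/green endpoint update.
import boto3

sm = boto3.client("sagemaker")
sm.update_endpoint(
    EndpointName="my-model-endpoint",
    EndpointConfigName="my-model-config-v2",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 20},
                "WaitIntervalInSeconds": 600,   # baking period per step
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "my-endpoint-5xx-alarm"}]
        },
    },
)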

Deployment patterns with model production variants

Independent of the deployment pattern that you choose for your application, you can also utilize production variants to validate your model performance before updating your endpoint, or implement additional deployment patterns such as shadow deployments. In this case, you want to add a manual or automated process to select the best model to be deployed before updating your endpoint. The following architecture shows how your endpoint traffic and response behave in a shadow deployment scenario. In this scenario, each prediction request is submitted to both the new and deployed models; however, only the currently deployed model serves the prediction response to the business application, while the prediction served from the new model is retained only for analysis of its performance against the currently deployed model. After model performance is evaluated, the new model version can be deployed to serve prediction response traffic to business applications.

Rollback

Independent of the deployment strategy that you choose for your model deployment, you want to be able to roll back to the previous model version if your new model’s performance is lower than your current model’s performance. To do so while minimizing the downtime of your application, you need to keep your current model running in parallel to the new one until you are confident that your new model performs better than the current one.

SageMaker deployment guardrails allow you to set alarms and automatically roll back to previous model versions during the model validation period. After the validation period is over, you might still need to roll back to a previous model version to solve a new problem that is discovered after the model update is complete. To do so, you can take advantage of the SageMaker model registry to reject previously approved models and trigger a rollback process.
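
As a minimal, hedged sketch of that rollback trigger, the following rejects a previously approved model package in the model registry; the model package ARN and description are placeholders, and the downstream rollback automation is assumed to react to the status change.

# Hypothetical sketch: rejecting an approved model package to trigger a rollback workflow.
import boto3

sm = boto3.client("sagemaker")
sm.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:123456789012:model-package/my-model-group/7",
    ModelApprovalStatus="Rejected",
    ApprovalDescription="Production performance regression; rolling back to the previous version",
)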

Conclusion

In this post, you learned how to combine SageMaker endpoint model variants and deployment guardrails with MLOps capabilities in order to create end-to-end patterns for model development. We provided an example implementation for canary and linear shifting deployment guardrails connected with SageMaker pipelines and the model registry via a SageMaker custom project. As a next step, try adapting the following template to implement the deployment strategy of your organization.


About the authors

Maira Ladeira Tanke is an ML Specialist Solutions Architect at AWS. With a background in data science, she has 9 years of experience architecting and building ML applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through emerging technologies and innovative solutions. In her free time, Maira enjoys traveling and spending time with her family someplace warm.

Clay Elmore is an AI/ML Specialist Solutions Architect at AWS. After spending many hours in a materials research lab, his background in chemical engineering was quickly left behind to pursue his interest in machine learning. He has worked on ML applications in many different industries ranging from energy trading to hospitality marketing. Clay has a special interest in bringing software development practices to ML and guiding customers towards repeatable, scalable solutions by using these principles. In his spare time, Clay enjoys skiing, solving Rubik’s cubes, reading, and cooking.

Shelbee Eigenbrode is a Principal AI and Machine Learning Specialist Solutions Architect at AWS. She has been in technology for 24 years spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background into the domain of MLOps to help customers deliver and manage ML workloads at scale. With over 35 patents granted across various technology domains, she has a passion for continuous innovation and using data to drive business outcomes. Shelbee is a co-creator and instructor of the Practical Data Science specialization on Coursera. She is also the Co-Director of Women In Big Data (WiBD), Denver chapter. In her spare time, she likes to spend time with her family, friends, and overactive dogs.

Qiyun Zhao is a Senior Software Development Engineer with the Amazon SageMaker Inference Platform team. He is the lead developer of the deployment guardrails and shadow deployments, and he focuses on helping customers to manage ML workloads and deployments at scale with high availability. He also works on platform architecture evolutions for fast and secure ML jobs deployment and running ML online experiments at ease. In his spare time, he enjoys reading, gaming, and traveling.

Read More

AWS and Hugging Face collaborate to make generative AI more accessible and cost efficient

We’re thrilled to announce an expanded collaboration between AWS and Hugging Face to accelerate the training, fine-tuning, and deployment of large language and vision models used to create generative AI applications. Generative AI applications can perform a variety of tasks, including text summarization, answering questions, code generation, image creation, and writing essays and articles.

AWS has a deep history of innovation in generative AI. For example, Amazon uses AI to deliver a conversational experience with Alexa that customers interact with billions of times each week, and is increasingly using generative AI as part of new experiences like Create with Alexa. In addition, M5, a group within Amazon Search that helps teams across Amazon bring large models to their applications, trained large models to improve search results on Amazon.com. AWS is constantly innovating across all areas of ML, including infrastructure, tools on Amazon SageMaker, and AI services such as Amazon CodeWhisperer, a service that improves developer productivity by generating code recommendations based on the code and comments in an IDE. AWS also created purpose-built ML accelerators for the training (AWS Trainium) and inference (AWS Inferentia) of large language and vision models on AWS.

Hugging Face selected AWS because it offers flexibility across state-of-the-art tools to train, fine-tune, and deploy Hugging Face models including Amazon SageMaker, AWS Trainium, and AWS Inferentia. Developers using Hugging Face can now easily optimize performance and lower cost to bring generative AI applications to production faster.

High-performance and cost-efficient generative AI

Building, training, and deploying large language and vision models is an expensive and time-consuming process that requires deep expertise in machine learning (ML). Since the models are very complex and can contain hundreds of billions of parameters, generative AI is largely out of reach for many developers.

To close this gap, Hugging Face is now collaborating with AWS to make it easier for developers to access AWS services and deploy Hugging Face models specifically for generative AI applications. The benefits include faster training and scaling of low-latency, high-throughput inference. For example, Amazon EC2 Trn1 instances powered by AWS Trainium deliver faster time to train while offering up to 50% cost-to-train savings over comparable GPU-based instances. Amazon EC2’s new Inf2 instances, powered by the latest generation of AWS Inferentia, are purpose-built to deploy the latest generation of large language and vision models and raise the performance of Inf1 by delivering up to 4x higher throughput and up to 10x lower latency. Developers can use AWS Trainium and AWS Inferentia through managed services such as Amazon SageMaker, a service with tools and workflows for ML, or they can self-manage on Amazon EC2.

Get started today

Customers can start using Hugging Face models on AWS in three ways: through SageMaker JumpStart, the Hugging Face AWS Deep Learning Containers (DLCs), or the tutorials to deploy your models to AWS Trainium or AWS Inferentia. The Hugging Face DLC is packed with optimized transformers, datasets, and tokenizers libraries to enable you to fine-tune and deploy generative AI applications at scale in hours instead of weeks, with minimal code changes. SageMaker JumpStart and the Hugging Face DLCs are available in all Regions where Amazon SageMaker is available and come at no additional cost. Read the documentation and discussion forums to learn more, or try the sample notebooks today.
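
As a minimal sketch of the DLC path, the following deploys a Hugging Face Hub model to a SageMaker real-time endpoint; the model ID, framework versions, and instance type are illustrative assumptions, not a specific recommendation.

# Hypothetical sketch: deploying a Hugging Face Hub model with the Hugging Face DLC on SageMaker.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

hub = {"HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english", "HF_TASK": "text-classification"}
model = HuggingFaceModel(
    env=hub,
    role=sagemaker.get_execution_role(),
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
print(predictor.predict({"inputs": "Generative AI on AWS is exciting!"}))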

Read More

Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart

In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models in Amazon SageMaker JumpStart. Stable Diffusion is a deep learning model that allows you to generate realistic, high-quality images and stunning art in just a few seconds. Although creating impressive images can find use in industries ranging from art to NFTs and beyond, today we also expect AI to be personalizable. Today, we announce that you can personalize the image generation model to your use case by fine-tuning it on your custom dataset in Amazon SageMaker JumpStart. This can be useful when creating art, logos, custom designs, NFTs, and so on, or fun stuff such as generating custom AI images of your pets or avatars of yourself.

In this post, we provide an overview of how to fine-tune the Stable Diffusion model in two ways: programmatically through JumpStart APIs available in the SageMaker Python SDK, and JumpStart’s user interface (UI) in Amazon SageMaker Studio. We also discuss how to make design choices including dataset quality, size of training dataset, choice of hyperparameter values, and applicability to multiple datasets. Finally, we discuss the over 80 publicly available fine-tuned models with different input languages and styles recently added in JumpStart.

Stable Diffusion and transfer learning

Stable Diffusion is a text-to-image model that enables you to create photorealistic images from just a text prompt. A diffusion model trains by learning to remove noise that was added to a real image. This de-noising process generates a realistic image. These models can also generate images from text alone by conditioning the generation process on the text. For instance, Stable Diffusion is a latent diffusion model in which the model learns to recognize shapes in a pure noise image and gradually brings these shapes into focus if the shapes match the words in the input text. The text must first be embedded into a latent space using a language model. Then, a series of noise addition and noise removal operations are performed in the latent space with a U-Net architecture. Finally, the de-noised output is decoded into the pixel space.

In machine learning (ML), the ability to transfer the knowledge learned in one domain to another is called transfer learning. You can use transfer learning to produce accurate models on your smaller datasets, with much lower training costs than the ones involved in training the original model. With transfer learning, you can fine-tune the stable diffusion model on your own dataset with as little as five images. For example, on the left are training images of a dog named Doppler used to fine-tune the model, in the middle and right are images generated by the fine-tuned model when asked to predict Doppler’s image on the beach and a pencil sketch.

On the left are images of a white chair used to fine-tune the model and an image of the chair in red generated by the fine-tuned model. On the right are images of an ottoman used to fine-tune the model and an image of a cat sitting on an ottoman.

Fine-tuning large models like Stable Diffusion usually requires you to provide training scripts. You can run into a host of issues, including out-of-memory errors and payload size limits. Furthermore, you have to run end-to-end tests to make sure that the script, the model, and the desired instance work together efficiently. JumpStart simplifies this process by providing ready-to-use scripts that have been robustly tested. The JumpStart fine-tuning script for Stable Diffusion models builds on the fine-tuning script from DreamBooth. You can access these scripts with a single click through the Studio UI or with very few lines of code through the JumpStart APIs.

Note that by using the Stable Diffusion model, you agree to the CreativeML Open RAIL++-M License.

Use JumpStart programmatically with the SageMaker SDK

This section describes how to train and deploy the model with the SageMaker Python SDK. We choose an appropriate pre-trained model in JumpStart, train this model with a SageMaker training job, and deploy the trained model to a SageMaker endpoint. Furthermore, we run inference on the deployed endpoint, all using the SageMaker Python SDK. The following examples contain code snippets. For the full code with all of the steps in this demo, see the Introduction to JumpStart – Text to Image example notebook.

Train and fine-tune the Stable Diffusion model

Each model is identified by a unique model_id. The following code shows how to fine-tune a Stable Diffusion 2.1 base model identified by model_id model-txt2img-stabilityai-stable-diffusion-v2-1-base on a custom training dataset. For a full list of model_id values and which models are fine-tunable, refer to Built-in Algorithms with pre-trained Model Table. For each model_id, in order to launch a SageMaker training job through the Estimator class of the SageMaker Python SDK, you need to fetch the Docker image URI, training script URI, and pre-trained model URI through the utility functions provided in SageMaker. The training script URI contains all the necessary code for data processing, loading the pre-trained model, model training, and saving the trained model for inference. The pre-trained model URI contains the pre-trained model architecture definition and the model parameters. The pre-trained model URI is specific to the particular model. The pre-trained model tarballs have been pre-downloaded from Hugging Face and saved with the appropriate model signature in Amazon Simple Storage Service (Amazon S3) buckets, such that the training job runs in network isolation. See the following code:

from sagemaker import image_uris, model_uris, script_uris

# Currently, not all the stable diffusion models in jumpstart support finetuning. Thus, we manually select a model
# which supports finetuning.
train_model_id, train_model_version, train_scope = (
    "model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    "*",
    "training",
)

# Tested with ml.g4dn.2xlarge (16GB GPU memory) and ml.g5.2xlarge (24GB GPU memory) instances. Other instances may work as well.
# If ml.g5.2xlarge instance type is available, please change the following instance type to speed up training.
training_instance_type = "ml.g4dn.2xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type,
)

# Retrieve the training script. This contains all the necessary files including data processing, model training etc.
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)

# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

With these model-specific training artifacts, you can construct an object of the Estimator class:

# Create SageMaker Estimator instance
sd_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",  # Entry-point file in source_dir and present in train_source_uri.
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    base_job_name=training_job_name,
)

# Launch a SageMaker Training job by passing s3 path of the training data
sd_estimator.fit({"training": training_dataset_s3_path}, logs=True)
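The preceding snippets reference a few variables (aws_role, s3_output_location, training_job_name, and training_dataset_s3_path) that are defined elsewhere in the example notebook; hyperparameters is covered in the following sections. The following is a minimal sketch of how you might define them. The S3 prefixes and names are illustrative, not prescribed by JumpStart.

import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

aws_role = get_execution_role()  # IAM role assumed by the SageMaker training job
session = sagemaker.Session()

# Illustrative output location and job name
s3_output_location = f"s3://{session.default_bucket()}/jumpstart-example-sd-training/output"
training_job_name = name_from_base("jumpstart-sd-fine-tune")

# S3 prefix that holds your training images and dataset_info.json (see the next section)
training_dataset_s3_path = f"s3://{session.default_bucket()}/jumpstart-example-sd-training/input/"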

Training dataset

The following are the instructions for how the training data should be formatted:

  • Input – A directory containing the instance images and dataset_info.json, with the following configuration:
    • Images may be of .png, .jpg, or .jpeg format
    • The dataset_info.json file must be of the format {'instance_prompt':<<instance_prompt>>}
  • Output – A trained model that can be deployed for inference

The S3 path should look like s3://bucket_name/input_directory/. Note the trailing / is required.

The following is an example format of the training data:

input_directory
    |---instance_image_1.png
    |---instance_image_2.png
    |---instance_image_3.png
    |---instance_image_4.png
    |---instance_image_5.png
    |---dataset_info.json
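The following is a minimal sketch of creating dataset_info.json and uploading the directory to Amazon S3 with boto3; the bucket name, prefix, and local folder are illustrative.

import json
from pathlib import Path

import boto3

bucket_name = "my-training-bucket"   # illustrative bucket you own
prefix = "input_directory"           # the S3 path passed to the training job must end with /
local_dir = Path("training_images")  # local folder holding instance_image_*.png files

# Write the required dataset_info.json next to the images
(local_dir / "dataset_info.json").write_text(
    json.dumps({"instance_prompt": "a photo of a riobugger cat"})
)

# Upload every file to s3://bucket_name/input_directory/
s3 = boto3.client("s3")
for path in local_dir.iterdir():
    s3.upload_file(str(path), bucket_name, f"{prefix}/{path.name}")

training_dataset_s3_path = f"s3://{bucket_name}/{prefix}/"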

For instructions on how to format the data while using prior preservation, refer to the section Prior Preservation in this post.

We provide a default dataset of cat images. It consists of eight images (instance images corresponding to instance prompt) of a single cat with no class images. It can be downloaded from GitHub. If using the default dataset, try the prompt “a photo of a riobugger cat” while doing inference in the demo notebook.

License: MIT.

Hyperparameters

Next, for transfer learning on your custom dataset, you might need to change the default values of the training hyperparameters. You can fetch a Python dictionary of these hyperparameters with their default values by calling hyperparameters.retrieve_default, update them as needed, and then pass them to the Estimator class. See the following code:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["max_steps"] = "400"

The following hyperparameters are supported by the fine-tuning algorithm:

  • with_prior_preservation – Flag to add prior preservation loss. Prior preservation is a regularizer that avoids overfitting. (Choices: [“True”,“False”], default: “False”.)
  • num_class_images – The minimum class images for prior preservation loss. If with_prior_preservation = True and there aren’t enough images already present in class_data_dir, additional images will be sampled with class_prompt. (Values: positive integer, default: 100.)
  • epochs – The number of passes that the fine-tuning algorithm takes through the training dataset. (Values: positive integer, default: 20.)
  • max_steps – The total number of training steps to perform. If not “None”, it overrides epochs. (Values: “None” or a string of an integer, default: “None”.)
  • batch_size – The number of training examples that are worked through before the model weights are updated. This is the same as the batch size during class image generation if with_prior_preservation = True. (Values: positive integer, default: 1.)
  • learning_rate – The rate at which the model weights are updated after working through each batch of training examples. (Values: positive float, default: 2e-06.)
  • prior_loss_weight – The weight of prior preservation loss. (Values: positive float, default: 1.0.)
  • center_crop – Whether to crop the images before resizing to the desired resolution. (Choices: [“True”/“False”], default: “False”.)
  • lr_scheduler – The type of learning rate scheduler. (Choices: ["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], default: "constant".) For more information, see Learning Rate Schedulers.
  • adam_weight_decay – The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in AdamW optimizer. (Value: float, default: 1e-2.)
  • adam_beta1 – The beta1 hyperparameter (exponential decay rate for the first moment estimates) for the AdamW optimizer. (Value: float, default: 0.9.)
  • adam_beta2 – The beta2 hyperparameter (exponential decay rate for the second moment estimates) for the AdamW optimizer. (Value: float, default: 0.999.)
  • adam_epsilon – The epsilon hyperparameter for the AdamW optimizer. It is usually set to a small value to avoid division by 0. (Value: float, default: 1e-8.)
  • gradient_accumulation_steps – The number of updates steps to accumulate before performing a backward/update pass. (Value: integer, default: 1.)
  • max_grad_norm – The maximum gradient norm (for gradient clipping). (Value: float, default: 1.0.)
  • seed – Fix the random state to achieve reproducible results in training. (Value: integer, default: 0.)

Deploy the fine-tuned model

After model training is finished, you can directly deploy the model to a persistent, real-time endpoint. We fetch the required Docker Image URIs and script URIs and deploy the model. See the following code:

inference_instance_type = "ml.g4dn.2xlarge"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=train_model_id,
    model_version=train_model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri. This includes scripts for model loading, inference handling etc.
deploy_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope="inference"
)

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = sd_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=endpoint_name,
)

On the left are the training images of a cat named riobugger used to fine-tune the model (default parameters except max_steps = 400). In the middle and right are the images generated by the fine-tuned model when asked to predict riobugger’s image on the beach and a pencil sketch.

For more details on inference, including supported parameters, response format, and so on, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.
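For reference, the following is a minimal sketch of how querying the endpoint might look. The content type, Accept header, and the generated_image and prompt response keys follow the JumpStart text-to-image example notebook but should be treated as assumptions; the linked post documents the exact request and response contract.

import json

import numpy as np
from PIL import Image

def query_endpoint(predictor, text):
    # Assumption: the endpoint accepts a plain-text prompt and returns a JSON body
    response = predictor.predict(
        text.encode("utf-8"),
        {"ContentType": "application/x-text", "Accept": "application/json"},
    )
    response_dict = json.loads(response)
    return response_dict["generated_image"], response_dict["prompt"]

generated_image, prompt = query_endpoint(
    finetuned_predictor, "a photo of a riobugger cat on the beach"
)
Image.fromarray(np.uint8(generated_image)).save("riobugger_beach.png")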

Access JumpStart through the Studio UI

In this section, we demonstrate how to train and deploy JumpStart models through the Studio UI. The following video shows how to find the pre-trained Stable Diffusion model on JumpStart, train it, and then deploy it. The model page contains valuable information about the model and how to use it. After configuring the SageMaker training instance, choose Train. After the model is trained, you can deploy it by choosing Deploy. After the endpoint reaches the InService state, it’s ready to respond to inference requests.

To accelerate the time to inference, JumpStart provides a sample notebook that shows how to run inference on the newly created endpoint. To access the notebook in Studio, choose Open Notebook in the Use Endpoint from Studio section of the model endpoint page.

JumpStart also provides a simple notebook that you can use to fine-tune the Stable Diffusion model and deploy the resulting fine-tuned model, for example to generate fun images of your dog. To access the notebook, search for “Generate Fun images of your dog” in the JumpStart search bar. You can use as few as five training images, uploaded to the local Studio folder; if you have more than five images, you can upload them as well. The notebook uploads the training images to Amazon S3, trains the model on your dataset, and deploys the resulting model. Training may take about 20 minutes to finish; you can reduce the number of steps to speed up training. The notebook provides some sample prompts to try with the deployed model, but you can try any prompt you like. You can also adapt the notebook to create avatars of yourself or your pets. For instance, instead of your dog, you can upload images of your cat in the first step, change the prompts from dogs to cats, and the model will generate images of your cat.

Fine-tuning considerations

Training Stable Diffusion models tends to overfit quickly. To get good-quality images, we must find a good balance between training hyperparameters such as the number of training steps and the learning rate. In this section, we show some experimental results and provide guidance on how to set these parameters.

Recommendations

Consider the following recommendations:

  • Start with high-quality training images (4–20). If training on human faces, you may need more images.
  • Train for 200–400 steps when training on dogs, cats, and other non-human subjects. If training on human faces, you may need more steps. If overfitting happens, reduce the number of steps. If under-fitting happens (the fine-tuned model can’t generate the target subject’s image), increase the number of steps.
  • If training on non-human subjects, you may set with_prior_preservation = False because it doesn’t significantly impact performance. For human faces, you may need to set with_prior_preservation=True.
  • If setting with_prior_preservation=True, use the ml.g5.2xlarge instance type.
  • When training on multiple subjects sequentially, if the subjects are very similar (for example, all dogs), the model retains the last subject and forgets the previous subjects. If subjects are different (for example, first a cat then a dog), the model retains both subjects.
  • We recommend using a low learning rate and progressively increasing the number of steps until the results are satisfactory.

Training dataset

The quality of the fine-tuned model is directly impacted by the quality of the training images. Therefore, you need to collect high-quality images to get good results. Blurred or low-resolution images will impact the quality of the fine-tuned model. Keep in mind the following additional parameters:

  • Number of training images – You may fine-tune the model on as few as four training images. We experimented with training datasets as small as 4 images and as large as 16 images. In both cases, fine-tuning was able to adapt the model to the subject.
  • Dataset formats – We tested the fine-tuning algorithm on images of format .png, .jpg, and .jpeg. Other formats may also work.
  • Image resolution – Training images may be of any resolution. The fine-tuning algorithm will resize all training images before starting fine-tuning. That said, if you want more control over the cropping and resizing of the training images, we recommend resizing the images yourself to the base resolution of the model (in this example, 512×512 pixels), as in the sketch that follows this list.
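The following is a minimal pre-processing sketch using Pillow, assuming your original photos live in a local raw_images directory; the folder names are illustrative.

from pathlib import Path
from PIL import Image

src_dir, dst_dir = Path("raw_images"), Path("training_images")  # illustrative folder names
dst_dir.mkdir(exist_ok=True)

for path in sorted(src_dir.glob("*.jpg")):
    img = Image.open(path).convert("RGB")
    # Center-crop to a square, then resize to the model's base resolution of 512x512
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((512, 512))
    img.save(dst_dir / path.name)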

Experiment settings

In the experiments in this post, we use the default hyperparameter values while fine-tuning unless specified otherwise. Furthermore, we use one of the following four datasets:

  • Dog1-8 – Dog 1 with 8 images
  • Dog1-16 – Dog 1 with 16 images
  • Dog2-4 – Dog 2 with 4 images
  • Cat-8 – Cat with 8 images

To reduce clutter, we show only one representative image from each dataset in each section, along with the dataset name. You can find the full training sets in the Experiment datasets appendix of this post.

Overfitting

Stable Diffusion models tend to overfit when fine-tuning on a few images. Therefore, you need to select parameters such as epochs, max_steps, and the learning rate carefully. In this section, we used the Dog1-16 dataset.

To evaluate the model’s performance, we evaluate the fine-tuned model for four tasks:

  • Can the fine-tuned model generate images of the subject (Doppler dog) in the same setting as it was trained on?
    • Observation – Yes it can. It’s worth noting that model performance increases with the number of training steps.
  • Can the fine-tuned model generate images of the subject in a different setting than it was trained on? For example, can it generate images of Doppler on a beach?
    • Observation – Yes it can. It’s worth noting that model performance increases with the number of training steps up to a certain point. If the model is being trained for too long, however, the model performance degrades as the model tends to overfit.
  • Can the fine-tuned model generate images of the class that the training subject belongs to? For example, can it generate an image of a generic dog?
    • Observation – As we increase the number of training steps, the model starts to overfit. As a result, it forgets the generic class of a dog and will only produce images related to the subject.
  • Can the fine-tuned model generate images of a class or subject not in the training dataset? For example, can it generate an image of a cat?
    • Observation – As we increase the number of training steps, the model starts to overfit. As a result, it will only produce images related to the subject, regardless of the class specified.

We fine-tune the model for different numbers of steps (by setting the max_steps hyperparameter), and for each fine-tuned model, we generate images for each of the following four prompts (shown in the following examples from left to right):

  • “A photo of a Doppler dog”
  • “A photo of a Doppler dog on a beach”
  • “A photo of a dog”
  • “A photo of a cat”

The following images are from the model trained with 50 steps.

The following model was trained with 100 steps.

We trained the following model with 200 steps.

The following images are from a model trained with 400 steps.

Lastly, the following images are the result of 800 steps.

Train on multiple datasets

While fine-tuning, you may want to fine-tune on multiple subjects and have the fine-tuned model be able to generate images of all the subjects. Unfortunately, JumpStart is currently limited to training on a single subject. You can’t fine-tune the model on multiple subjects at the same time. Furthermore, fine-tuning the model for different subjects sequentially results in the model forgetting the first subject if the subjects are similar.

We consider the following experimentation in this section:

  1. Fine-tune the model for Subject A.
  2. Fine-tune the resulting model from Step 1 for Subject B.
  3. Generate images of Subject A and Subject B using the output model from Step 2.

In the following experiments, we observe that:

  • If A is dog 1 and B is dog 2, then all images generated in Step 3 resemble dog 2
  • If A is dog 2 and B is dog 1, then all images generated in Step 3 resemble dog 1
  • If A is dog 1 and B is cat, then images generated with dog prompts resemble dog 1 and images generated with cat prompts resemble cat

Train on dog 1 and then dog 2

In Step 1, we fine-tune the model for 200 steps on eight images of dog 1. In Step 2, we fine-tune the model further for 200 steps on four images of dog 2.

The following are the images generated by the fine-tuned model at the end of Step 2 for different prompts.

Train on dog 2 and then dog 1

In Step 1, we fine-tune the model for 200 steps on four images of dog 2. In Step 2, we fine-tune the model further for 200 steps on eight images of dog 1.

The following are the images generated by the fine-tuned model at the end of Step 2 with different prompts.

Train on dogs and cats

In Step 1, we fine-tune the model for 200 steps on eight images of a cat. In Step 2, we fine-tune the model further for 200 steps on eight images of dog 1.

The following are the images generated by the fine-tuned model at the end of Step 2. Images with cat-related prompts look like the cat in Step 1 of the fine-tuning, and images with dog-related prompts look like the dog in Step 2 of the fine-tuning.

Prior preservation

Prior preservation is a technique that uses additional images of the same class that we are trying to train on. For instance, if the training data consists of images of a particular dog, with prior preservation, we incorporate class images of generic dogs. It tries to avoid overfitting by showing images of different dogs while training for a particular dog. The class prompt omits the tag that identifies the specific subject in the instance prompt. For instance, the instance prompt may be “a photo of a riobugger cat” and the class prompt may be “a photo of a cat.” You can enable prior preservation by setting the hyperparameter with_prior_preservation = True. If setting with_prior_preservation = True, you must include class_prompt in dataset_info.json and may include any class images available to you. The following is the training dataset format when setting with_prior_preservation = True:

  • Input – A directory containing the instance images, dataset_info.json, and (optionally) the directory class_data_dir. Note the following:
    • Images may be of .png, .jpg, or .jpeg format.
    • The dataset_info.json file must be of the format {'instance_prompt':<<instance_prompt>>,'class_prompt':<<class_prompt>>}.
    • The class_data_dir directory must have class images. If class_data_dir is not present or there aren’t enough images already present in class_data_dir, additional images will be sampled with class_prompt.
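The following is a minimal sketch of staging such a dataset locally before uploading it to Amazon S3; the directory names and prompts are illustrative.

import json
from pathlib import Path

input_dir = Path("training_images")       # contains the instance images
class_dir = input_dir / "class_data_dir"  # optional generic class images
class_dir.mkdir(parents=True, exist_ok=True)

dataset_info = {
    "instance_prompt": "a photo of a riobugger cat",  # identifies the specific subject
    "class_prompt": "a photo of a cat",               # generic class prompt for prior preservation
}
(input_dir / "dataset_info.json").write_text(json.dumps(dataset_info))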

For datasets such as cats and dogs, prior preservation doesn’t significantly impact the performance of the fine-tuned model and therefore can be avoided. However, when training on faces, this is necessary. For more information, refer to Training Stable Diffusion with Dreambooth using Diffusers.

Instance types

Fine-tuning Stable Diffusion models requires the accelerated computation provided by GPU-backed instances. We experimented with fine-tuning on ml.g4dn.2xlarge (16 GB CUDA memory, 1 GPU) and ml.g5.2xlarge (24 GB CUDA memory, 1 GPU) instances. The memory requirement is higher when generating class images. Therefore, if setting with_prior_preservation=True, use the ml.g5.2xlarge instance type, because training runs into a CUDA out-of-memory issue on the ml.g4dn.2xlarge instance. The JumpStart fine-tuning script currently uses a single GPU, so fine-tuning on multi-GPU instances will not yield a performance gain. For more information on different instance types, refer to Amazon EC2 Instance Types.
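One way to encode this guidance in the notebook is a small conditional, assuming the hyperparameters dictionary retrieved earlier in this post (the string values mirror the hyperparameter choices listed above).

# Class-image generation for prior preservation needs more GPU memory,
# so pick the larger single-GPU instance when it is enabled
use_prior_preservation = hyperparameters.get("with_prior_preservation", "False") == "True"
training_instance_type = "ml.g5.2xlarge" if use_prior_preservation else "ml.g4dn.2xlarge"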

Limitations and bias

Even though Stable Diffusion has impressive performance in generating images, it suffers from several limitations and biases. These include but are not limited to:

  • The model may not generate accurate faces or limbs because the training data doesn’t include sufficient images with these features
  • The model was trained on the LAION-5B dataset, which has adult content and may not be fit for product use without further considerations
  • The model may not work well with non-English languages because the model was trained on English language text
  • The model can’t generate good text within images

For more information on limitations and bias, see Stable Diffusion v2-1-base Model Card. These limitations for the pre-trained model can also carry over to the fine-tuned models.

Clean up

After you’re done running the notebook, make sure to delete all resources created in the process to ensure that the billing is stopped. Code to clean up the endpoint is provided in the associated Introduction to JumpStart – Text to Image example notebook.
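For reference, cleaning up typically amounts to deleting the model and the endpoint created by the deploy call; the following is a minimal sketch, and the example notebook contains the authoritative version.

# Delete the model and the endpoint to stop incurring charges
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()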

Publicly available fine-tuned models in JumpStart

Even though the Stable Diffusion models released by StabilityAI have impressive performance, they have limitations in terms of the language or domain they were trained on. For instance, Stable Diffusion models were trained on English text, but you may need to generate images from non-English text. Alternatively, Stable Diffusion models were trained to generate photorealistic images, but you may need to generate animated or artistic images.

JumpStart provides over 80 publicly available models with various languages and themes. These models are often fine-tuned versions of the Stable Diffusion models released by StabilityAI. If your use case matches one of the fine-tuned models, you don’t need to collect your own dataset and fine-tune it. You can simply deploy one of these models through the Studio UI or using the easy-to-use JumpStart APIs. To deploy a pre-trained Stable Diffusion model in JumpStart, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.

The following are some of the examples of images generated by the different models available in JumpStart.

Note that these models are not fine-tuned using JumpStart scripts or DreamBooth scripts. You can download the full list of publicly available fine-tuned models with example prompts from here.

For more example generated images from these models, please see section Open Sourced Fine-tuned models in the Appendix.

Conclusion

In this post, we showed how to fine-tune the Stable Diffusion model for text-to-image and then deploy it using JumpStart. Furthermore, we discussed some of the considerations you should make while fine-tuning the model and how it can impact the fine-tuned model’s performance. We also discussed the over 80 ready-to-use fine-tuned models available in JumpStart. We showed code snippets in this post—for the full code with all of the steps in this demo, see the Introduction to JumpStart – Text to Image example notebook. Try out the solution on your own and send us your comments.


About the Authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning with a special focus on natural language processing (NLP), large language models (LLMs), and generative AI. Prior to this role, he was the Head of Data Science for Amazon’s EU Customer Service. Heiko helps our customers be successful in their AI/ML journey on AWS and has worked with organizations in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. In his spare time, Heiko travels as much as possible.


Appendix: Experiment datasets

This section contains the datasets used in the experiments in this post.

Dog1-8

Dog1-16

Dog2-4

Dog3-8

Appendix: Open Sourced Fine-tuned models

The following are some examples of images generated by the different models available in JumpStart. Each image is captioned with its model_id (which starts with the prefix huggingface-txt2img-), followed on the next line by the prompt used to generate the image.


Scaling Large Language Model (LLM) training with Amazon EC2 Trn1 UltraClusters

Modern model pre-training often calls for larger cluster deployment to reduce time and cost. At the server level, such training workloads demand faster compute and increased memory allocation. As models grow to hundreds of billions of parameters, they require a distributed training mechanism that spans multiple nodes (instances).

In October 2022, we launched Amazon EC2 Trn1 instances, powered by AWS Trainium, the second-generation machine learning accelerator designed by AWS. Trn1 instances are purpose-built for high-performance deep learning model training while offering up to 50% cost-to-train savings over comparable GPU-based instances. To bring training time down from weeks to days, or days to hours, and to distribute a large model’s training job, we can use an EC2 Trn1 UltraCluster, which consists of densely packed, co-located racks of Trn1 compute instances, all interconnected by non-blocking, petabyte-scale networking. It is our largest UltraCluster to date, offering 6 exaflops of compute power on demand with up to 30,000 Trainium chips.

In this post, we use a Hugging Face BERT-Large model pre-training workload as a simple example to explain how to use Trn1 UltraClusters.

Trn1 UltraClusters

A Trn1 UltraCluster is a placement group of Trn1 instances in a data center. As part of a single cluster run, you can spin up a cluster of Trn1 instances with Trainium accelerators. The following diagram shows an example.

Trn1 UltraCluster

UltraClusters of Trn1 instances are co-located in a data center and interconnected using Elastic Fabric Adapter (EFA), a petabyte-scale, non-blocking network interface with up to 800 Gbps of networking bandwidth, twice the bandwidth supported by AWS P4d instances (and up to 1.6 Tbps, four times greater, with the upcoming Trn1n instances). These EFA interfaces help run model training workloads that use the Neuron Collective Communication Libraries at scale. Trn1 UltraClusters also include co-located, network-attached storage services like Amazon FSx for Lustre to enable high-throughput access to large datasets, ensuring clusters operate efficiently. A Trn1 UltraCluster can host up to 30,000 Trainium devices and deliver up to 6 exaflops of compute in a single cluster, literally an on-demand supercomputer with a pay-as-you-go usage model. In this post, we use HPC tools like Slurm to ramp up an UltraCluster and manage workloads.

Solution overview

AWS offers a wide variety of services for distributed model training or inferencing workloads at scale, including AWS Batch, Amazon Elastic Kubernetes Service (Amazon EKS), and UltraClusters. This post focuses on model training in an UltraCluster. Our solution uses the AWS ParallelCluster management tool to create the necessary infrastructure and environment to spin up a Trn1 UltraCluster. The infrastructure consists of a head node and multiple Trn1 compute nodes within a virtual private cloud (VPC). We use Slurm as the cluster management and job scheduling system. The following diagram illustrates our solution architecture.

Solution overview

For more details and how to deploy this solution, see Train a model on AWS Trn1 ParallelCluster.

Let’s look at some important steps of this solution:

  1. Create a VPC and subnets.
  2. Configure the compute fleet.
  3. Create the cluster.
  4. Inspect the cluster.
  5. Launch your training job.

Prerequisites

To follow along with this post, a broad familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) is assumed, and basic familiarity with deep learning and PyTorch is helpful.

Create VPC and subnets

An easy way to create the VPC and subnets is through the Amazon Virtual Private Cloud (Amazon VPC) console. Complete instructions can be found on GitHub. After the VPC and subnets are created, you need to configure the instances in the compute fleet. Briefly, this is made possible by an installation script specified by CustomActions in the YAML file used for creating the ParallelCluster (see Create ParallelCluster). A ParallelCluster requires a VPC that has two subnets and a Network Address Translation (NAT) gateway, as shown in the preceding architecture diagram. This VPC has to reside in an Availability Zone where Trn1 instances are available. Also, in this VPC, you need a public subnet and a private subnet to hold the head node and the Trn1 compute nodes, respectively. You also need a NAT gateway for internet access, so that the Trn1 compute nodes can download AWS Neuron packages. In general, the compute nodes receive updates for the OS packages, the Neuron driver and runtime, and the EFA driver for multi-instance training.

As for the head node, in addition to the aforementioned components for the compute nodes, it also receives the PyTorch-NeuronX and NeuronX compiler, which enables the model compilation process in XLA devices such as Trainium.

Configure the compute fleet

In the YAML file for creating the Trn1 UltraCluster, InstanceType is specified as trn1.32xlarge. MaxCount and MinCount are used to indicate your compute fleet size range. You may use MinCount to keep some or all Trn1 instances available at all times. MinCount may be set to zero so that if there is no running job, the Trn1 instances are released from this cluster.

Trn1 may also be deployed in an UltraCluster with multiple queues. In the following example, there is only one queue being set up for Slurm job submission:

InstanceType: trn1.32xlarge
MaxCount: 16
MinCount: 0
Name: queue1-i1

If you need more than one queue, you can specify multiple entries, each with its own InstanceType, MaxCount, MinCount, and Name:

InstanceType: trn1.32xlarge
MaxCount: 8
MinCount: 0
Name: queue-0
InstanceType: trn1.32xlarge
MaxCount: 8
MinCount: 0
Name: queue-1

Here, two queues are set up, so that users have the flexibility to choose the resources for their Slurm jobs.

Create the cluster

To launch a Trn1 UltraCluster, use the following pcluster command from where your ParallelCluster tool is installed:

pcluster create-cluster --cluster-configuration <YAML FILE NAME> -n <CLUSTER NAME> 

We use the following options in this command:

  • --cluster-configuration – This option expects a YAML file that describes the cluster configuration
  • -n (or --cluster-name) – The name of this cluster

This command creates a Trn1 cluster in your AWS account. You can check the progress of cluster creation on the AWS CloudFormation console. For more information, refer to Using the AWS CloudFormation console.

Alternatively, you can use the following command to see the status of your request:

pcluster describe-cluster -n <CLUSTER NAME>

The command indicates the status, for example:

{
  "creationTime": "2023-01-09T03:26:17.235Z",
  "headNode": {
    "launchTime": "2023-01-09T03:29:23.000Z",
    "instanceId": "XXXXX",
    "publicIpAddress": "XX.XX.XXX.XXX",
    "instanceType": "c5.4xlarge",
    "state": "running",
    "privateIpAddress": "XX.XX.XX.XXX"
  },
  "version": "3.3.0",
  "clusterConfiguration": {
    "url": "XXXX...."
  },
  "tags": [
    {
      "value": "3.2.1",
      "key": "parallelcluster:version"
    },
    {
      "value": "PC16Trn1",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_IN_PROGRESS",
  "clusterName": "PC16Trn1",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": "arn:aws:cloudformation:us-west-2:...:stack/PC16Trn1/...",
  "lastUpdatedTime": "2023-01-09T03:26:17.235Z",
  "region": "us-west-2",
  "clusterStatus": "CREATE_IN_PROGRESS",
  "scheduler": {
    "type": "slurm"
  }
}

The following are parameters of interest from the output:

  • instanceId – This is the instance ID of the head node, which will be listed on the Amazon EC2 console
  • computeFleetStatus – This attribute indicates readiness of the compute nodes
  • tags – This attribute indicates the version of the pcluster tool used to create this cluster

Inspect the cluster

You can use the aforementioned pcluster describe-cluster command to check the cluster. After the cluster is created, you will observe the following in the output:

"clusterStatus": "CREATE_COMPLETE"

At this point, you may SSH into the head node (identified by instance ID on the Amazon EC2 console). The following is a logical diagram of the cluster.

Logical diagram of cluster

After you SSH into the head node, you can verify the compute fleet and their status with a Slurm command such as sinfo to view the node information for the system. The following is an example output:

PARTITION     AVAIL     TIMELIMIT     NODES     STATE     NODELIST
compute1*     up         infinite      16       alloc     compute1-st-queue1-i1-[1-16]

This indicates that there is one queue as shown by a single partition. There are 16 nodes available, and resources are allocated. From the head node, you can SSH into any given compute node:

ssh compute1-st-queue1-i1-16

Use exit to get back to the head node.

Likewise, you can SSH into a compute node from another compute node. Each compute node has Neuron tools installed, such as neuron-top. You can invoke neuron-top during the training script run to inspect NeuronCore utilization at each node.

Launch your training job

We use the Hugging Face BERT-Large Pretraining Tutorial as an example to run on this cluster. After the training data and scripts are downloaded to the cluster, we use the Slurm controller to manage and orchestrate our workload. We submit the compilation job with the sbatch command; the shell script invokes the Python script via the neuron_parallel_compile API to compile the model into graphs without a full training run. See the following code:

sbatch --exclusive --nodes=16 \
--wrap "srun neuron_parallel_compile ./run_dp_bert_large_hf_pretrain_bf16_s128.sh"

We use the following options in this command:

  • --exclusive – This job will use all nodes and will not share nodes with other jobs while running the current job.
  • --nodes – The number of nodes for this job.
  • --wrap – This defines a command string that is run by the Slurm controller. In this case, it simply compiles the model in parallel using all nodes.

After the model is compiled successfully, you may start the full training job with the following command:

sbatch --exclusive --nodes=16 \
--wrap "srun ./run_dp_bert_large_hf_pretrain_bf16_s128.sh"

This command launches the training job for the Hugging Face BERT-Large model. With 16 trn1.32xlarge nodes, you can expect it to complete in less than 8 hours.

At this point, you can use a Slurm command such as squeue to inspect the submitted job. An example output is as follows:

JOBID    PARTITION     NAME     USER    ST     TIME     NODES     NODELIST(REASON)
3        compute1      wrap     ubuntu   R     45:27    16        compute1-st-queue1-i1-[1-16]

This output shows the job is running (R) on 16 compute nodes.

As the job is running, outputs are captured and appended to a Slurm log file. From the head node's terminal, you can inspect it in real time:

tail -f slurm-3.out

Also, in the same directory as the Slurm log file, there is a corresponding directory for this job. This directory includes the following (for example):

-rw-rw-r-- 1 ubuntu ubuntu       3772 Jan 10 21:41 results.json
-rw-rw-r-- 1 ubuntu ubuntu 4160336620 Jan 10 21:42 ckpt_2593.pt
-rw-rw-r-- 1 ubuntu ubuntu     106712 Jan 10 21:43 log_ph1_bf16_1_2
-rw-rw-r-- 1 ubuntu ubuntu     429325 Jan 10 21:58 log_ph1_bf16_0_2
.....

This directory is accessible to all compute nodes. results.json captures the metadata of this particular job run, such as the model’s configuration, batch size, total steps, gradient accumulation steps, and training dataset name. The model checkpoint and output log for each compute node are also captured in this directory.
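As a quick way to inspect that metadata, you could load the file from the head node with a few lines of Python; the job directory name here is illustrative, and the exact keys depend on the training script.

import json
from pathlib import Path

job_dir = Path("slurm-job-3")  # illustrative: use the directory created for your job
metadata = json.loads((job_dir / "results.json").read_text())

# Print the captured run metadata (model configuration, batch size, total steps, and so on)
for key in sorted(metadata):
    print(f"{key}: {metadata[key]}")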

Consider scalability of the cluster

In a Trn1 UltraCluster, multiple interconnected Trn1 instances run a large model training workload in parallel and reduce total computation time, or time to convergence. There are two measures of scalability of a cluster: strong scaling and weak scaling. Typically, for model training, the need is to speed up the training run, because usage cost is determined by sample throughput for rounds of gradient updates. Strong scaling refers to the scenario where the total problem size stays the same as the number of processors increases, and it is an important measure of scalability for model training. In evaluating strong scaling (that is, the impact of parallelization), we want to keep the global batch size the same and see how much time it takes to reach convergence. In such a scenario, we need to adjust the gradient accumulation micro-steps according to the number of compute nodes. This is achieved with the following line in the training shell script run_dp_bert_large_hf_pretrain_bf16_s128.sh:

GRAD_ACCUM_USTEPS=$(($GRAD_ACCUM_USTEPS/$WORLD_SIZE_JOB))

On the other hand, if you want to evaluate how many more workloads can be run in a fixed time by adding more nodes, use weak scaling to measure scalability. In weak scaling, the problem size increases at the same rate as the number of NeuronCores, thereby keeping the amount of work per NeuronCore the same. To evaluate weak scaling, or the effect of adding more nodes on the increased workload, simply remove the preceding line from the training script and keep the number of gradient accumulation steps constant at the default value (32) provided in the training script.
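To make the strong-scaling bookkeeping concrete, the following sketch mirrors the shell arithmetic above and shows that the global batch size stays constant as nodes are added, assuming WORLD_SIZE_JOB equals the number of nodes in the job. The per-worker micro batch size is an illustrative number, not a value taken from the training script.

def grad_accum_usteps(base_usteps: int, world_size_job: int) -> int:
    """Python equivalent of GRAD_ACCUM_USTEPS=$(($GRAD_ACCUM_USTEPS/$WORLD_SIZE_JOB))."""
    return base_usteps // world_size_job

micro_batch = 8      # illustrative per-worker micro batch size
cores_per_node = 32  # NeuronCores per trn1.32xlarge node acting as data-parallel workers
base_usteps = 32     # default gradient accumulation steps in the training script

for nodes in (1, 4, 16):
    usteps = grad_accum_usteps(base_usteps, nodes)
    global_batch = micro_batch * cores_per_node * nodes * usteps
    print(f"nodes={nodes:2d}  grad_accum_usteps={usteps:2d}  global_batch={global_batch}")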

Evaluate your results

We provide some benchmark results in the Neuron performance page to demonstrate the effect of scaling. The data demonstrates the benefit of using multiple instances to parallelize the training job for many different large models to train at scale.

Clean up your infrastructure

To delete all the infrastructure of this UltraCluster, use the pcluster command to delete the cluster and its resources:

pcluster delete-cluster -n <CLUSTER NAME>

Conclusion

In this post, we discussed how scaling your training job across a Trn1 UltraCluster, powered by AWS Trainium accelerators, reduces the time to train a model. We also provided a link to the Neuron samples repository, which contains instructions on how to deploy a distributed training job for a BERT-Large model. A Trn1 UltraCluster runs distributed training workloads to train ultra-large deep learning models at scale. A distributed training setup results in much faster model convergence compared to training on a single Trn1 instance.

To learn more about how to get started with Trainium-powered Trn1 instances, visit the Neuron documentation.


About the Authors

K.C. Tung is a Senior Solution Architect in AWS Annapurna Labs. He specializes in large deep learning model training and deployment at scale in the cloud. He has a Ph.D. in molecular biophysics from the University of Texas Southwestern Medical Center in Dallas. He has spoken at AWS Summits and AWS re:Invent. Today he helps customers train and deploy large PyTorch and TensorFlow models in the AWS Cloud. He is the author of two books: Learn TensorFlow Enterprise and TensorFlow 2 Pocket Reference.

Jeffrey Huynh is a Principal Engineer in AWS Annapurna Labs. He is passionate about helping customers run their training and inference workloads on Trainium and Inferentia accelerator devices using the AWS Neuron SDK. He is a Caltech/Stanford alumnus with degrees in Physics and EE. He enjoys running, tennis, cooking, and reading about science and technology.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt EC2 accelerated computing infrastructure for their machine learning needs.
