New performance improvements in Amazon SageMaker model parallel library

New performance improvements in Amazon SageMaker model parallel library

Foundation models are large deep learning models trained on a vast quantity of data at scale. They can be further fine-tuned to perform a variety of downstream tasks and form the core backbone of enabling several AI applications. The most prominent category is large-language models (LLM), including auto-regressive models such as GPT variants trained to complete natural text. LLMs typically contain billions of parameters, making them rarely fit on one single accelerator, and require model parallelism techniques. Another category is diffusion models, notably Stable Diffusion, that has pushed AI image generation to an unprecedented milestone where remarkable visuals can be generated from a simple text description. Diffusion models are typically much smaller than LLMs and distributed training remains to play a critical role in facilitating development.

SageMaker model parallel (SMP) library is a large-model training solution available on Amazon SageMaker platform. It can be integrated with PyTorch models to easily apply a range of state-of-the-art large-model distributed training techniques to train at scale. Earlier this year, SMP launched sharded data parallelism, a distributed training technique powered by Amazon in-house MiCS technology under the hood. Sharded data parallel shards model parameters, gradients, and optimizer states across data-parallel workers. MiCS performs a number of optimizations including scale-aware partitioning to provide near-linear scalability.  In Train gigantic models with near-linear scaling using sharded data parallelism, we shared that sharded data parallel in SMP achieved  39.7% speed up compared to DeepSpeed ZeRO-3 on a 30B parameter GPT-2 model with sequence length 2048.

To help our customers further minimize training costs and accelerate time-to-market, we are thrilled to introduce two new performance improvements in SageMaker model parallel — SMDDP Collectives and FlashAttention. SMDDP Collectives is the most performant collective library on AWS infrastructure for large model training offered by SageMaker distributed data parallel library. FlashAttention is introduced in Dao et al., which re-implements the attention mechanism in an IO-aware manner, reducing the memory bandwidth requirement and saving on attention speed and memory footprint. These two components collectively push our sharded data parallel technique to be 30.58% faster when training a 100B parameter GPT-NeoX model on 32 p4d.24xlarge instances. For customers who are already using sharded data parallel on supported models, no code changes are necessary to benefit from the performance boost offered by these latest features. Stability AI, the inventor of the Stable Diffusion family of models that showed unparalleled image generation abilities, chose to use SMP to build foundation models. With SMP,  Stability AI achieved 163 TFLOPs per GPU for a 13B-parameter GPT-NeoX on 32 p4d.24xlarge instances, a 58% speed up compared to DeepSpeed. You can learn more about Stability AI’s mission and partnership with AWS in the talk of Stability AI CEO at AWS re:Invent 2022 or in this blog post.

“Our mission at Stability AI is to build the foundation to activate humanity’s potential through AI. To achieve this mission, we need to efficiently train open-source foundation models on hundreds of accelerated compute instances. We rely on SageMaker and its distributed training libraries to optimize performance and implement state-of-the-art strategies to shard models and data across our training cluster. These optimizations reduce our training costs, help us meet customer needs faster, and speed up the development of new models.”

— Emad Mostaque, Founder and CEO of Stability AI.

In this blog post, we’ll first present our latest performance improvements in the SageMaker model parallel library. Then, we’ll revisit how to train foundational models using sharded data parallel.  Finally, we’ll benchmark performance of 13B, 50B, and 100B parameter auto-regressive models and wrap up with future work.

New performance improvements in  SageMaker model parallel library

Starting from AWS Deep Learning Containers (DLC) PyTorch 1.12.1, SageMaker model parallel library v1.13 comes with the following two new components that are critical in improving training performance. They are currently available on ml.p4d.24xlarge instance with Elastic Fabric Adapter (EFA) enabled:

1. AWS-optimized AllGather from SMDDP Collectives

In sharded data parallel, since only a shard of the model state is present on a GPU, an AllGather collective is needed to gather the full set of parameters from across all GPUs in the sharding group during forward or backward pass computations. In the previous versions of SageMaker model parallel, we used NVIDIA Collective Communications Library (NCCL) for these collectives. However, NCCL is a general purpose collective communications library not designed for AWS infrastructure, which leads to sub-optimal performance even with EFA enabled.

Previously, we had developed the SMDDP Collectives library that provided an AWS-optimized implementation of the All-Reduce collective to speedup performance of pure data parallel training. To improve the performance of large model training with sharded data parallelism, we expanded the SMDDP Collectives library to include an optimized implementation of the AllGather collective. The key advantage of SMDDP Collectives AllGather is that it adopts an all-to-all-type communication pattern for inter-node communication, enabling our collective to have high-throughput and be less latency-sensitive. In addition, our AllGather collective offloads the communication-related processing to the CPU, thereby freeing up valuable GPU cycles for gradient computation, leading to significant performance improvement especially on large models.

2. FlashAttention

In modern transformer architecture, one of the largest sources of memory consumption is the activation footprint in the self-attention layer. This is because each attention head computes an SxS attention matrix for each input, where S is the sequence length, and this matrix goes through several operations, such as dropout, softmax, and matrix multiplication, with each intermediate output requiring memory space for use in back-propagation.

FlashAttention (Dao et al.) is a recent innovation from HazyResearch in Stanford that re-implements the self-attention mechanism in an I/O-aware manner. The main insight behind FlashAttention is that the self-attention mechanism is bottlenecked by memory bandwidth to and from GPU high bandwidth memory (HBM). This means that the self-attention layer can be computed in chunks across the sequence dimension, with each chunk going through the entire self-attention pipeline at a time. The intermediate results for a chunk are stored at the high-bandwidth SRAM, avoiding the expensive round-trip to the HBM for every iteration. Although a naive implementation would run into the issue of the cross-chunk dependency at the softmax layer, FlashAttention introduces a clever implementation that side-steps this dependency. Combined with re-computation in backward pass, FlashAttention results in substantial memory savings and performance improvement (25% faster training for GPT-NeoX 13B over 16 p4d nodes), due to avoidance of the HBM round-trip and storage of SxS matrices. You can find visuals and more explanations in HazyResearch’s FlashAttention repository.

Train foundation models at scale with SageMaker model parallel

To train foundation models with SMP powered by SMDDP Collectives, there’s no additional changes required in your sharded data parallel training jobs. If you’re new to using sharded data parallel, follow this complete tutorial notebook and blog post that will walk you through the entire process, from data processing, defining and submitting training jobs, to monitoring training logs. A ready-to-use training script for GPT-2 model can be found at train_gpt_simple.py. For training a different model type, you can follow the API document to learn about how to apply SMP APIs.

We highlight the key hyperparameters in the PyTorch Estimator of a sharded data parallel training job as below. The hyperparameter ddp_dist_backend in smp_options now has a new option, "auto" , as its default value. With "auto", SMP will use AWS-optimized AllGather for sharded data parallelism jobs and fall back to NCCL otherwise. You can refer to this document for supported configurations. If you want to run sharded data parallel in SMP specifically with NCCL as the communication backend of choice, you can set “ddp_dist_backend" to "nccl" in smp_options.

import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "ddp_dist_backend": "auto", #OR "nccl" to disable SMDDP Collectives
        # To enable sharded data parallelism.
        # Here we shard model states across 128 GPUs.
        "sharded_data_parallel_degree": 128,  
    }
}

smp_estimator = PyTorch(
    entry_point="train_gpt_simple.py",
    role=sagemaker.get_execution_role(),
    instance_type='ml.p4d.24xlarge',
    instance_count=32,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        ...
    },
    ...
)

smp_estimator.fit(inputs=data_channels)

With the latest SMPv1.13 release, the sharded data parallel training technique supports FlashAttention for popular models including BERT, RoBERTa, GPT-2, GPT-J, GPT-Neo and GPT-NeoX out-of-the-box. This is enabled by passing tensor_parallelism=True during model creation without setting tensor_parallel_degree. You can find an example in the same training script train_gpt_simple.py .

Benchmarking performance

We benchmarked sharded data parallelism in the SageMaker model parallel library on three different scales of models to understand how the two new features, FlashAttention and AWS-optimized AllGather, contribute to performance improvement. Placement group is not required to reproduce these benchmarks on SageMaker.

13B parameter GPT-NeoX

In this setting, we focus on understanding the performance gain contributed by FlashAttention and we leave AWS-optimized AllGather out of the picture. Using flash attention saves substantial GPU memory, which helps us increase batch size or reduce sharding degree, thereby improving performance. As the below results show, we observed an average of about 20.4% speedup in SMP with flash attention for 13B parameter GPT-NeoX model on various configurations across 16-64 p4d nodes. Memory usage during standard attention computation scales in a quadratic manner with an increase in sequence length, but FlashAttention has memory usage linear in sequence length. Hence FlashAttention is even more helpful as sequence length increases and makes it possible to use larger sequence lengths. Being memory-efficient without trading off model quality, FlashAttention has gained traction quickly in the large model training community in the past months including integration with Hugging Face Diffusers and Mosaic ML.

Configuration Performance
Model/Training Cluster SMP Without FlashAttention
(TFLOPs/GPU)
 With FlashAttention
(TFLOPs/GPU)
% Speedup
13B GPT-NeoX
Seq length: 2048
Global batch size: 1024
FP16
16 p4d.24xlarge nodes Activation checkpointing
sharded_data_parallel_degree:64
gradient_accumulation: 1
130 159 22.31
13B GPT-NeoX
Seq length: 2048
Global batch size: 2048
FP16
32 p4d.24xlarge nodes Activation checkpointing
sharded_data_parallel_degree:64
gradient_accumulation: 1
131 157 19.85
13B GPT-NeoX
Seq length: 2048
Global batch size: 4096
FP16
64 p4d.24xlarge nodes Activation checkpointing
sharded_data_parallel_degree:64
gradient_accumulation: 1
131 156 19.08

50B parameter Bloom

Now, we look at how AWS-optimized AllGather from SMDDP Collectives speedup large model training with SMP. We benchmark a 50B-parameter Bloom model and compare the performance with and without AWS-optimized AllGather collective. We observe that SMDDP collectives speeds up model training by upto 40% across 32 nodes to 64 nodes training jobs. SMDDP collectives help achieve better performance due to better utilization of the 400 Gbps network bandwidth available with p4d.24xlarge instances. This coupled with the design choice to offload communication-related processing to the CPU, helps achieve good compute-to-network overlap leading to optimized performance. Compute-to-network overlap especially becomes important in large models since the size of data communicated across nodes scales linearly with an increase in the model size.

Configuration Performance
Model/Training Cluster SMP Without AWS-optimized AllGather
(TFLOPs/GPU)
 With AWS-optimized AllGather
(TFLOPs/GPU)
% Speedup
50B Bloom
Seq length: 2048
Global batch size: 2048
BF16
32 p4d.24xlarge nodes Activation checkpointing
sharded_data_parallel_degree:128
gradient_accumulation: 1
102 143 40.20
50B Bloom
Seq length: 2048
Global batch size: 4096
BF16
64 p4d.24xlarge nodes Activation checkpointing
sharded_data_parallel_degree:128
gradient_accumulation: 1
101 140 38.61

100B parameter GPT-NeoX

Finally, we benchmark SMP with both of the latest features enabled. It shows that this new release of SMP v1.13 is 30% faster than the previous version on a 100B-parameter GPT-NeoX model.

Configuration Performance
Model/Training Cluster SMP Without FlashAttention and without AWS-optimized AllGather
(TFLOPs/GPU)
With FlashAttention + AWS-optimized AllGather
(TFLOPs/GPU)
% Speedup
100B GPT-NeoX
Seq length: 2048
Global batch size: 2048
FP16
32 p4d.24xlarge nodes Activation checkpointing
sharded_data_parallel_degree:256
offload_activations

  • Without FlashAttention: batch size is 4 with gradient accumulation of 2 steps.
  • With FlashAttention: batch size is 8 with no gradient accumulation
121 158 30.58
100B GPT-NeoX
Seq length: 2048
Global batch size: 4096
FP16
64 p4d.24xlarge nodes Activation checkpointing
sharded_data_parallel_degree:256
offload_activations

  • Without FlashAttention: batch size is 4 with gradient accumulation of 2 steps.
  • With FlashAttention: batch size is 8 with no gradient accumulation
122 158 29.51

For future work, we’ll be working on supporting an AWS-optimized Reduce-Scatter in SMDDP Collectives. The Reduce-Scatter collective is critical in averaging and sharding gradients computed in the backward pass. We expect this to further speed up SMP library in the future releases.

Conclusion

In this post, we discuss the two latest performance improvements for sharded data parallel technique in SageMaker model parallel library. LLMs show great promise in improving the quality and re-usability of ML models. AWS teams are working closely with customers to keep reducing their training costs and time-to-market. You can find more SageMaker model parallel examples in Amazon SageMaker Examples GitHub repo or attend our next distributed training workshops. If you are interested in speeding up large model training, check out these features and let us know what you build!


About the authors

Arjun Balasubramanian is a Senior Software Engineer at AWS focused on building high-performance, hardware accelerated collective communication algorithms for distributed deep learning. He is broadly interested in systems for large-scale machine learning and networking. Outside of work, he enjoys traveling and playing various sports.

Zhaoqi Zhu is a Software Development Engineer at AWS, specializing in distributed deep learning systems and working on the SageMaker Distributed Data Parallel library. Outside of work, Zhaoqi is passionate about soccer and hopes to not receive any red card in the upcoming season.

Can Karakus is a Senior Applied Scientist at AWS, optimizing large-scale distributed deep learning on AWS. His research interests cover deep learning, distributed optimization, distributed systems, and information theory. Outside of work, he enjoys cycling, traveling, reading and learning.

Rahul Huilgol is a Senior Software Engineer at AWS. He works on distributed deep learning systems, towards making it easy and performant to train large deep learning models in the cloud. In his spare time, he enjoys photography, biking and gardening.

Suhit Kodgule is a Software Development Engineer with AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling and cooking.

Fei Wu is a Software Engineer at AWS. He works on distributed training for large-scale deep learning models on cloud. Outside of work, he enjoys basketball, gaming and cooking.

Read More

Next generation Amazon SageMaker Experiments – Organize, track, and compare your machine learning trainings at scale

Next generation Amazon SageMaker Experiments – Organize, track, and compare your machine learning trainings at scale

Today, we’re happy to announce updates to our Amazon SageMaker Experiments capability of Amazon SageMaker that lets you organize, track, compare and evaluate machine learning (ML) experiments and model versions from any integrated development environment (IDE) using the SageMaker Python SDK or boto3, including local Jupyter Notebooks.

Machine learning (ML) is an iterative process. When solving a new use case, data scientists and ML engineers iterate through various parameters to find the best model configurations (aka hyperparameters) that can be used in production to solve the identified business challenge. Over time, after experimenting with multiple models and hyperparameters, it becomes difficult for ML teams to efficiently manage model runs to find the optimal one without a tool to keep track of the different experiments. Experiment tracking systems streamline the processes to compare different iterations and helps simplify collaboration and communication in a team, thereby increasing productivity and saving time. This is achieved by organizing and managing ML experiments in an effortless way to draw conclusions from them, for example, finding the training run with the best accuracy.

To solve this challenge, SageMaker provides SageMaker Experiments, a fully integrated SageMaker capability. It provides the flexibility to log your model metrics, parameters, files, artifacts, plot charts from the different metrics, capture various metadata, search through them and support model reproducibility. Data scientists can quickly compare the performance and hyperparameters for model evaluation through visual charts and tables. They can also use SageMaker Experiments to download the created charts and share the model evaluation with their stakeholders.

With the new updates to SageMaker Experiments, it is now a part of the SageMaker SDK, simplifying the data scientist work and eliminating the need to install an extra library to manage multiple model executions. We are introducing the following new core concepts:

  • Experiment: A collection of runs that are grouped together. An experiment includes runs for multiple types that can be initiated from anywhere using the SageMaker Python SDK.
  • Run: Each execution step of a model training process. A run consists of all the inputs, parameters, configurations, and results for one iteration of model training. Custom parameters and metrics can be logged using the log_parameter, log_parameters, and log_metric functions. Custom input and output can be logged using the log_file function.

The concepts that are implemented as part of a Run class are made available from any IDE where the SageMaker Python SDK is installed. For SageMaker Training, Processing and

Transform Jobs, the SageMaker Experiment Run is automatically passed to the job if the job is invoked within a run context. You can recover the run object using load_run() from your job. Finally, with the new functionalities’ integration, data scientists can also automatically log a confusion matrix, precision and recall graphs, and a ROC curve for classification use cases using the run.log_confusion_matrix, run.log_precision_recall, and run.log_roc_curve functions, respectively.

In this blog post, we will provide examples of how to use the new SageMaker Experiments functionalities in a Jupyter notebook via the SageMaker SDK. We will demonstrate these capabilities using a PyTorch example to train an MNIST handwritten digits classification example. The experiment will be organized as follow:

  1. Creating experiment’s runs and logging parameters: We will first create a new experiment, start a new run for this experiment, and log parameters to it.
  2. Logging model performance metrics:We will log model performance metrics and plot metric graphs.
  3. Comparing model runs:We will compare different model runs according to the model hyperparameters. We will discuss how to compare those runs and how to use SageMaker Experiments to select the best model.
  4. Running experiments from SageMaker jobs: We will also provide an example of how to automatically share your experiment’s context with a SageMaker processing, training or batch transform job. This allows you to automatically recover your run context with the load_run function inside your job.
  5. Integrating SageMaker Clarify reports: We will demonstrate how we can now integrate SageMaker Clarify bias and explainability reports to a single view with your trained model report.

Prerequisites

For this blog post, we will use Amazon SageMaker Studio to showcase how to log metrics from a Studio notebook using the updated SageMaker Experiments functionalities. To execute the commands presented in our example, you need the following prerequisites:

  • SageMaker Studio Domain
  • SageMaker Studio user profile with SageMaker full access
  • A SageMaker Studio notebook with at least an ml.t3.medium instance type

If you do not have a SageMaker Domain and user profile available, you can create one using this quick setup guide.

Logging parameters

For this exercise, we will use torchvision, a PyTorch package that provides popular datasets, model architectures, and common image transformations for computer vision. SageMaker Studio provides a set of Docker images for common data science use cases that are made available in Amazon ECR. For PyTorch, you have the option of selecting images optimized for CPU or GPU training. For this example, we will select the image PyTorch 1.12 Python 3.8 CPU Optimized and the Python 3 kernel. The examples described below will focus on the SageMaker Experiments functionalities and are not code complete.

Let’s download the data with the torchvision package and track the number of data samples for the train and test datasets as parameters with SageMaker Experiments. For this example, let’s assume train_set and test_set as already downloaded torchvision datasets.

from sagemaker.session import Session
from sagemaker.experiments.run import Run
import os

# create an experiment and start a new run
experiment_name = "local-experiment-example"
run_name = "experiment-run"

with Run(experiment_name=experiment_name, sagemaker_session=Session(), run_name=run_name) as run:
    run.log_parameters({
        "num_train_samples": len(train_set.data),
        "num_test_samples": len(test_set.data)
    })
    for f in os.listdir(train_set.raw_folder):
        print("Logging", train_set.raw_folder+"/"+f)
        run.log_file(train_set.raw_folder+"/"+f, name=f, is_output=False)

In this example, we use the run.log_parameters to log the number of train and test data samples and run.log_file to upload the raw datasets to Amazon S3 and log them as inputs to our experiment.

Training a model and logging model metrics

Now that we’ve downloaded our MNIST dataset, let’s train a CNN model to recognize the digits. While training the model, we want to load our existing experiment run, log new parameters to it, and track the model performance by logging model metrics.

We can use the load_run function to load our previous run and use it to log our model training

with load_run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:
    train_model(
        run=run,
        train_set=train_set,
        test_set=test_set,
        epochs=10,
        hidden_channels=5,
        optimizer="adam"
    )

We can then use run.log_parameter and run.log_parameters to log one or multiple model parameters to our run.

# log the parameters of your model
run.log_parameter("device", "cpu")
run.log_parameters({
    "data_dir": data_dir,
    "optimizer": optimizer,
    "epochs": epochs,
    "hidden_channels": hidden_channels
})

And we can use run.log_metric to log performance metrics to our experiment.

run.log_metric(name=metric_type+":loss", value=loss, step=epoch)
run.log_metric(name=metric_type+":accuracy", value=accuracy, step=epoch)

For classification models, you can also use run.log_confusion_matrix, run.log_precision_recall, and run.log_roc_curve,  to automatically plot the confusion matrix, precision recall graph, and the ROC curve of your model. Since our model solves a multiclass classification problem, let’s log only the confusion matrix for it.

# log confusion matrix
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        pred = output.max(1, keepdim=True)[1] 
        run.log_confusion_matrix(target, pred, "Confusion-Matrix-Test-Data")

When looking at our run details, we can now see the generated metrics as shown in the screenshot below:

The run details page provides further information about the metrics.

And the new model parameters are tracked on the parameters overview page.

You can also analyze your model performance by class using the automatically plotted confusion matrix, which can also be downloaded and used for different reports. And you can plot extra graphs to analyze the performance of your model based on the logged metrics.

Comparing multiple model parameters

As a data scientist, you want to find the best possible model. That includes training a model multiple times with different hyperparameters and comparing the performance of the model with those hyperparameters. To do so, SageMaker Experiments allows us to create multiple runs in the same experiment. Let’s explore this concept by training our model with different num_hidden_channels and optimizers.

# define the list of parameters to train the model with
num_hidden_channel_param = [5, 10, 20]
optimizer_param = ["adam", "sgd"]
run_id = 0
# train the model using SageMaker Experiments to track the model parameters, 
# metrics and performance
sm_session = Session()
for i, num_hidden_channel in enumerate(num_hidden_channel_param):
    for k, optimizer in enumerate(optimizer_param):
        run_id += 1
        run_name = "experiment-run-"+str(run_id)
        print(run_name)
        print(f"Training model with: {num_hidden_channel} hidden channels and {optimizer} as optimizer")
        # Defining an experiment run for each model training run
        with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sm_session) as run:
            train_model(
                run=run, 
                train_set=train_set,
                test_set=test_set,
                epochs=10, 
                hidden_channels=num_hidden_channel,
                optimizer=optimizer
            )

We are now creating six new runs for our experiment. Each one will log the model parameters, metrics, and confusion matrix. We can then compare the runs to select the best-performing model for the problem. When analyzing the runs, we can plot the metric graphs for the different runs as a single plot, comparing the performance of the runs across the different training steps (or epochs).

Using SageMaker Experiments with SageMaker training, processing and batch transform jobs

In the example above, we used SageMaker Experiments to log model performance from a SageMaker Studio notebook where the model was trained locally in the notebook. We can do the same to log model performance from SageMaker processing, training and batch transform jobs. With the new automatic context passing capabilities, we do not need to specifically share the experiment configuration with the SageMaker job, as it will be automatically captured.

The example below will focus on the SageMaker Experiments functionalities and is not code complete.

from sagemaker.pytorch import PyTorch
from sagemaker.experiments.run import Run
from sagemaker.session import Session
from sagemaker import get_execution_role
role = get_execution_role()

# set new experiment configuration
exp_name = "training-job-experiment-example"
run_name = "experiment-run-example"

# Start training job with experiment setting
with Run(experiment_name=exp_name, run_name=run_name, sagemaker_session=Session()) as run:
    est = PyTorch(
        entry_point="<MODEL_ENTRY_POINT>",
        dependencies=["<MODEL_DEPENDENCIES>"],
        role=role,
        model_dir=False,
        framework_version="1.12",
        py_version="py38",
        instance_type='ml.c5.xlarge',
        instance_count=1,
            hyperparameters={
            "epochs": 10,
            "hidden_channels":5,
            "optimizer": "adam",
        },
        keep_alive_period_in_seconds=3600
    )
    
    est.fit()

In our model script file, we can get the run context using load_run(). In SageMaker processing and training jobs, we do not need to provide the experiment configuration for loading the configuration. For batch transform jobs, we need to provide experiment_name and run_name to load the experiment’s configuration.

with load_run() as run:
    run.log_parameters({...})
    train_model(run, ...)

In addition to the information we get when running SageMaker Experiments from a notebook script, the run from a SageMaker job will automatically populate the job parameters and outputs.

The new SageMaker Experiments SDK also ensures backwards compatibility with the previous version using the concepts of trials and trial components. Any experiment triggered using the previous SageMaker Experiments version will be automatically made available in the new UI, for analyzing the experiments.

Integrating SageMaker Clarify and model training reports

SageMaker Clarify helps improve our ML models by detecting potential bias and helping explain how these models make predictions. Clarify provides pre-built containers that run as SageMaker processing jobs after your model has been trained, using information about your data (data configuration), model (model configuration), and the sensitive data columns that we want to analyze for possible bias (bias configuration). Up until now, SageMaker Experiments displayed our model training and Clarify reports as individual trial components that were connected via a trial.

With the new SageMaker Experiments, we can also integrate SageMaker Clarify reports with our model training having one source of truth that allows us to further understand our model. For an integrated report, all we need to do is to have the same run name for our training and Clarify jobs. The following example demonstrates how we can integrate the reports using an XGBoost model to predict the income of adults across the United States. The model uses the UCI Adult dataset. For this exercise, we assume that the model was already trained and that we already calculated the data, model, and bias configurations.

with Run(
    experiment_name='clarify-experiment',
    run_name="joint-run",
    sagemaker_session=sagemaker_session,
) as run:
    xgb.fit({"train": train_input}, logs=False)
    clarify_processor.run_bias(
        data_config=bias_data_config,
        bias_config=bias_config,
        model_config=model_config,
        model_predicted_label_config=predictions_config,
        pre_training_methods="all",
        post_training_methods="all",
    )
    clarify_processor.run_explainability(
        data_config=explainability_data_config,
        model_config=model_config,
        explainability_config=shap_config,
    )

With this setup, we get a combined view that includes the model metrics, joint inputs and outputs, and the Clarify reports for model statistical bias and explainability.

Conclusion

In this post, we explored the new generation of SageMaker Experiments, an integrated part of SageMaker SDK. We demonstrated how to log your ML workflows from anywhere with the new Run class. We presented the new Experiments UI that allows you to track your experiments and plot graphs for a single run metric as well as to compare multiple runs with the new analysis capability. We provided examples of logging experiments from a SageMaker Studio notebook and from a SageMaker Studio training job. Finally, we showed how to integrate model training and SageMaker Clarify reports in a unified view, allowing you to further understand your model.

We encourage you to try out the new Experiments functionalities and connect with the Machine Learning & AI community if you have any questions or feedback!


About the Authors

Maira Ladeira Tanke is a Machine Learning Specialist at AWS. With a background in Data Science, she has 9 years of experience architecting and building ML applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through emerging technologies and innovative solutions. In her free time, Maira enjoys traveling and spending time with her family someplace warm.

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers using machine learning to solve their business challenges using the AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at edge, therefore, she has created her own lab with self-driving kit and prototype manufacturing production line, where she spends lot of her free time.

Dewen Qi is a Software Development Engineer at AWS. She currently participating in building a collection of platform services and tools in AWS SageMaker to help customer in making their ML projects successful. She is also passionate about bringing the concept of MLOps to broader audience. Outside of work, Dewen enjoys practicing Cello.

Abhishek Agarwal is a Senior Product Manager for Amazon SageMaker. He is passionate about working with customers and making machine learning more accessible. In his spare time, Abhishek enjoys painting, biking and learning about innovative technologies.

Dana Benson is a Software Engineer working in the Amazon SageMaker Experiments, Lineage, and Search team. Prior to joining AWS, Dana spent time enabling smart home functionality in Alexa and mobile ordering at Starbucks.

Read More

Introducing Fortuna: A library for uncertainty quantification

Introducing Fortuna: A library for uncertainty quantification

Proper estimation of predictive uncertainty is fundamental in applications that involve critical decisions. Uncertainty can be used to assess the reliability of model predictions, trigger human intervention, or decide whether a model can be safely deployed in the wild.

We introduce Fortuna, an open-source library for uncertainty quantification. Fortuna provides calibration methods, such as conformal prediction, that can be applied to any trained neural network to obtain calibrated uncertainty estimates. The library further supports a number of Bayesian inference methods that can be applied to deep neural networks written in Flax. The library makes it easy to run benchmarks and will enable practitioners to build robust and reliable AI solutions by taking advantage of advanced uncertainty quantification techniques.

The problem of overconfidence in deep learning

If you have ever looked at class probabilities returned by a trained deep neural network classifier, you might have observed that the probability of one class was much larger than the others. Something like this, for example:

p = [0.0001, 0.0002, …, 0.9991, 0.0003, …, 0.0001]

If this is the case for the majority of the predictions, your model might be overconfident. In order to evaluate the validity of the probabilities returned by the classifier, we may compare them with the actual accuracy achieved over a holdout data set. Indeed, it is natural to assume that the proportion of correctly classified data points should approximately match the estimated probability of the predicted class. This concept is known as calibration [Guo C. et al., 2017].

Unfortunately, many trained deep neural networks are miscalibrated, meaning that the estimated probability of the predicted class is much higher than the proportion of correctly classified input data points. In other words, the classifier is overconfident.

Being overconfident might be problematic in practice. A doctor may not order relevant additional tests, as a result of an overconfident healthy diagnosis produced by an AI. A self-driving car may decide not to brake because it confidently assessed that the object in front was not a person. A governor may decide to evacuate a town because the probability of an eminent natural disaster estimated by an AI is too high. In these and many other applications, calibrated uncertainty estimates are critical to assess the reliability of model predictions, fall back to a human decision-maker, or decide whether a model can be safely deployed.

Fortuna: A library for uncertainty quantification

There are many published techniques to either estimate or calibrate the uncertainty of predictions, e.g., Bayesian inference [Wilson A.G., 2020], temperature scaling [Guo C. et al., 2017], and conformal prediction [Angelopoulos A.N. et al., 2022] methods. However, existing tools and libraries for uncertainty quantification have a narrow scope and do not offer a breadth of techniques in a single place. This results in a significant overhead, hindering the adoption of uncertainty into production systems.

In order to fill this gap, we launch Fortuna, a library for uncertainty quantification that brings together prominent methods across the literature and makes them available to users with a standardized and intuitive interface.

As an example, suppose you have training, calibration, and test data loaders in tensorflow.Tensor format, namely train_data_loader, calib_data_loader and test_data_loader. Furthermore, you have a deep learning model written in Flax, namely model. Then you can use Fortuna to:

  1. fit a posterior distribution;
  2. calibrate the model outputs;
  3. make calibrated predictions;
  4. estimate uncertainty estimates;
  5. compute evaluation metrics.

The following code does all of this for you.

from fortuna.data import DataLoader
from fortuna.prob_model.classification import ProbClassifier
from fortuna.metric.classification import expected_calibration_error

# convert data loaders
train_data_loader = DataLoader.from_tensorflow_data_loader(train_data_loader)
calib_data_loader = DataLoader.from_tensorflow_data_loader(calib_data_loader)
test_data_loader = DataLoader.from_tensorflow_data_loader(test_data_loader)

# define and train a probabilistic model
prob_model = ProbClassifier(model=model)
train_status = prob_model.train(train_data_loader=train_data_loader, calib_data_loader=calib_data_loader)

# make predictions and estimate uncertainty
test_inputs_loader = test_data_loader.to_inputs_loader()
test_means = prob_model.predictive.mean(inputs_loader=test_inputs_loader)
test_modes = prob_model.predictive.mode(inputs_loader=test_inputs_loader, means=test_means)

# compute the expected calibration error and plot a reliability diagram
test_targets = test_data_loader.to_array_targets()
ece = expected_calibration_error(preds=test_modes, probs=test_means, targets=test_targets)

The code above makes use of several default choices, including SWAG [Maddox W.J. et al., 2019] as a posterior inference method, temperature scaling [Guo C. et al., 2017] to calibrate the model outputs, and a standard Gaussian prior distribution, as well as the configuration of the posterior fitting and calibration processes. You can easily configure all of these components, and you are highly encouraged to do so if you are looking for a specific configuration or if you want to compare several ones.

Usage modes

Fortuna offers three usage modes: 1/ Starting from Flax models, 2/ Starting from model outputs, and 3/ Starting from uncertainty estimates. Their pipelines are depicted in the following figure, each starting from one of the green panels. The code snippet above is an example of using Fortuna starting from Flax models, which allows training a model using Bayesian inference procedures. Alternatively, you can start either by model outputs or directly from your own uncertainty estimates. Both these latter modes are framework independent and help you obtain calibrated uncertainty estimates starting from a trained model.

1/ Starting from uncertainty estimates

Starting from uncertainty estimates has minimal compatibility requirements, and it is the quickest level of interaction with the library. This usage mode offers conformal prediction methods for both classification and regression. These take uncertainty estimates in numpy.ndarray format and return rigorous sets of predictions that retain a user-given level of probability. In one-dimensional regression tasks, conformal sets may be thought of as calibrated versions of confidence or credible intervals.

Mind that if the uncertainty estimates that you provide in inputs are inaccurate, conformal sets might be large and unusable. For this reason, if your application allows it, please consider the Starting from model outputs and Starting from Flax models usage modes detailed below.

2/ Starting from model outputs

This mode assumes you have already trained a model in some framework and arrive at Fortuna with model outputs in numpy.ndarray format for each input data point. This usage mode allows you to calibrate your model outputs, estimate uncertainty, compute metrics, and obtain conformal sets.

Compared to the Starting from uncertainty estimates usage mode, Starting from model outputs provides better control, as it can make sure uncertainty estimates have been appropriately calibrated. However, if the model had been trained with classical methods, the resulting quantification of model (aka epistemic) uncertainty may be poor. To mitigate this problem, please consider the Starting from Flax models usage mode.

3/ Starting from Flax models

Starting from Flax models has higher compatibility requirements than the Starting from uncertainty estimates and Starting from model outputs usage modes, as it requires deep learning models written in Flax. However, it enables you to replace standard model training with scalable Bayesian inference procedures, which may significantly improve the quantification of predictive uncertainty.

Bayesian methods work by representing uncertainty over which solution is correct, given limited information, through uncertainty over model parameters. This type of uncertainty is called “epistemic” uncertainty. Because neural networks can represent many different solutions, corresponding to different settings of their parameters, Bayesian methods can be especially impactful in deep learning. We provide many scalable Bayesian inference procedures, which can often be used to provide uncertainty estimates, as well as improved accuracy and calibration, with essentially no training-time overhead.

Conclusion

We announced the general availability of Fortuna, a library for uncertainty quantification in deep learning. Fortuna brings together prominent methods across the literature, e.g., conformal methods, temperature scaling, and Bayesian inference, and makes them available to users with a standardized and intuitive interface. To get started with Fortuna, you can consult the following resources:

Try Fortuna out, and let us know what you think! You are encouraged to contribute to the library or leave your suggestions and contributions—just create an issue or open a pull request. On our side, we will keep on improving Fortuna, increase its coverage of uncertainty quantification methods and add further examples that showcase its usefulness in several scenarios.


About the authors

 

Gianluca Detommaso is an Applied Scientist at AWS. He currently works on uncertainty quantification in deep learning. In his spare time, Gianluca likes to practice sports, eating great food and learning new skills.

Alberto Gasparin is an Applied Scientist within Amazon Community Shopping since July 2021. His interests include natural language processing, information retrieval and uncertainty quantification. He is a food and wine enthusiast.

Michele Donini is a Sr Applied Scientist at AWS. He leads a team of scientists working on Responsible AI and his research interests are Algorithmic Fairness and Explainable Machine Learning.

Matthias Seeger is a Principal Applied Scientist at AWS.

Cedric Archambeau is a Principal Applied Scientist at AWS and Fellow of the European Lab for Learning and Intelligent Systems.

Andrew Gordon Wilson is an Associate Professor at the Courant Institute of Mathematical Sciences and Center for Data Science at New York University, and an Amazon Visiting Academic at AWS. He is particularly engaged in building methods for Bayesian and probabilistic deep learning, scalable Gaussian processes, Bayesian optimization, and physics-inspired machine learning.

Read More

Best practices for Amazon SageMaker Training Managed Warm Pools

Best practices for Amazon SageMaker Training Managed Warm Pools

Amazon SageMaker Training Managed Warm Pools gives you the flexibility to opt in to reuse and hold on to the underlying infrastructure for a user-defined period of time. This is done while also maintaining the benefit of passing the undifferentiated heavy lifting of managing compute instances in to Amazon SageMaker Model Training. In this post, we outline the key benefits and pain points addressed by SageMaker Training Managed Warm Pools, as well as benchmarks and best practices.

Overview of SageMaker Training Managed Warm Pools

SageMaker Model Training is a fully managed capability that spins up instances for every job, trains a model, runs and then spins down instances after the job. You’re only billed for the duration of the job down to the second. This fully managed capability gives you the freedom to focus on your machine learning (ML) algorithm and not worry about undifferentiated heavy lifting like infrastructure management while training your models.

This mechanism necessitates a finite startup time for a training job. Although this startup time, also known as cold-start startup time, is fairly low, some of our most demanding customer use cases require even lower startup times, such as under 20 seconds. There are two prominent use cases that have these requirements:

  • The first is active ML experimentation by data scientists using the Amazon SageMaker training platform, especially while training large models, like GPT3, that require multiple iterations to get to a production-ready state.
  • The second is the programmatic launch of a large number (in the order of several hundred or thousands) of consecutive jobs on the same kind of instances on a scheduled cadence. For example, parameter search or incremental training.

For such use cases, every second spent on overhead, like the startup time for a training job, has a cumulative effect on all these jobs.

With SageMaker Training Managed Warm Pools, data scientists and ML engineers have the ability to opt in to keep SageMaker training instances or multi-instance clusters warm for a prespecified and reconfigurable time (keep_alive_period_in_seconds) after each training job completes. So even though you incur a cold-start penalty for the first training job run on an instance or cluster, for all the subsequent training jobs, the instances are already up and running. As a result, these subsequent training jobs that start on an instance before the keep_alive_period_in_seconds expires don’t incur the cold-start startup time overhead. This can reduce training job startup times to roughly less than 20 seconds (P90).

Data scientists and ML engineers can use SageMaker Training Managed Warm Pools to keep single or multiple instances warm in between training runs for experimentation or run multiple jobs consecutively on the same single or multi-instance cluster. You only pay for the duration of training jobs and the reconfigurable keep_alive_period_in_seconds like everywhere else you specify for every single instance.

In essence, with SageMaker Training Managed Warm Pools, you get a combination of SageMaker managed instance utilization with the ability to opt in and provision capacity and self-manage utilization for short intervals of time. These intervals are configurable before a job, but if during the keep_alive_period_in_seconds interval, you need to reduce or increase it, you can do so. Increases to keep_alive_period_in_seconds can be done in intervals of up to 60 minutes, with a max period for an instance or cluster being 7 days.

To get started with warm pools, first request a warm pool quota limit increase, then specify the keep_alive_period_in_seconds parameter when starting a training job.

Benchmarks

We performed benchmarking tests to measure job startup latency using a 1.34 GB TensorFlow image, 2 GB of data, and different training data input modes (Amazon FSx, Fast File Mode, File Mode). The tests were run across a variety of instance types from the m4, c4, m5, and c5 families in the us-east-2 Region. The startup latency was measured as the time of job creation to the start of the actual training job on the instances. The first jobs that started the cluster and created the warm pool had a startup latency of 2–3 minutes. This higher latency is due to the time taken to provision the infrastructure, download the image, and download the data. The consequent jobs that utilized the warm pool cluster had a startup latency of approximately 20 seconds for Fast File Mode (FFM) or Amazon FSx, and 70 seconds for File Mode (FM). This delta is a result of FM requiring the entire dataset to be downloaded from Amazon S3 prior to the start of the job.

Your choice of training data input mode affects the startup time, even with Warm Pools. Guidance on what input mode to select is in the best practices section later in this post.

The following table summarizes the job startup latency P90 for different training data input modes.

Data Input Mode Startup Latency P90 (seconds)
First Job Warm Pool Jobs (second job onwards)
FSx 136 19
Fast File Mode 143 21
File Mode 176 70

Best practices for using warm pools

In the following section, we share some best practices when using warm pools.

When should you use warm pools?

Warm pools are recommended in the following scenarios:

  • You are interactively experimenting and tuning your script over a series of short jobs.
  • You are running your own custom-made, large-scale hyperparameter optimization (for example, Syne Tune).
  • You have a batch process that runs a large number (in the order of several hundreds or thousands) of consecutive jobs on the same kind of instances on a daily or weekly cadence. For example, training an ML model per city.

Warm pools are not recommended when it’s unlikely that someone will reuse the warm pool before it expires. For example, a single lengthy job that runs via an automated ML pipeline.

Minimize warm pool training job startup latency

Training jobs that reuse a warm pool start faster than the first job that created the warm pool. This is due to keeping the ML instances running between jobs with a cached training container Docker image to skip pulling the container from Amazon Elastic Container Registry (Amazon ECR). However, even when reusing a warm pool, certain initialization steps occur for all jobs. Optimizing these steps can reduce your job startup time (both first and subsequent jobs). Consider the following:

  • Training data input mode can affect startup time – Managed training data input channels are recreated for each training job, contributing to job startup latency. So doing initial experiments over a smaller dataset will allow for faster startup time (and faster training time). For later stages of experimentation, when a large dataset is needed, consider using an input mode type that has minimal or fixed initialization time. For example, FILE input mode copies the entire dataset from Amazon Simple Storage Service (Amazon S3) to the training instance, which is time-consuming for large datasets (even with warm pools). Fast File Mode is better suited for lower startup latency because only S3 object metadata needs to be read from Amazon S3 before the workload can start. The Amazon FSx for Lustre, or Amazon Elastic File System (Amazon EFS) file system input mode, has a fixed initialization time regardless of the number of files in the file system, which is beneficial when working with a large dataset.
    For more information on how to choose an input channel, see Choose the best data source for your Amazon SageMaker training job.
  • Reduce runtime installation of packages – Any software installation that takes place during container startup, for example, Python’s pip or operating system apt-get, will increase training job latency. Minimizing this startup latency requires making a trade-off between the flexibility and simplicity of runtime installations vs. installation at container build time. If you use your own Docker container with SageMaker, refer to Adapting Your Own Docker Container to Work with SageMaker. If you rely on prebuilt SageMaker container images, you’ll need to extend a prebuilt container and explicitly manage these containers. Consider this if your runtime installs significantly increase startup latency.
  • Avoid updating your Docker image frequently – If you use your own Docker container with SageMaker, try to avoid updating it every job run. If the Docker image changes between the job submissions, the warm pool will be reused, but the startup process will need to re-pull the container image from Amazon ECR instead of reusing a cached container image. If the Docker image must be updated, confine the updates to the last Docker layer to take advantage of Docker layer caching. Ideally, you should remove the Dockerfile content that’s likely to change over iterations, like hyperparameter, dataset definitions, and the ML code itself. To iterate on ML code without having to rebuild Docker images with each change, you can adopt the framework container paradigm advocated in the SageMaker Training Toolkit. If you’d like to develop a framework container with your own code, refer to this Amazon SageMaker tutorial.

Share warm pools between multiple users

When working with a large team of data scientists, you can share warm pools that have matching job criteria, such as the same AWS Identity and Access Management (IAM) role or container image.

Let’s look at an example timeline. User-1 starts a training job that completes and results in a new warm pool created. When user-2 starts a training job, the job will reuse the existing warm pool, resulting in a fast job startup. While user-2’s job is running with the warm pool in use, if another user starts a training job, then a second warm pool will be created.

This reuse behavior helps reduce costs by sharing warm pools between users that start similar jobs. If you want to avoid sharing warm pools between users, then users’ jobs must not have matching job criteria (for example, they must use a different IAM role).

Notify users on job completion

When using warm pools for experimentation, we recommend notifying users when their job is complete. This allows users to resume experimentation before the warm pool expires or stop the warm pool if it’s no longer needed. You can also automatically trigger notifications through Amazon EventBridge.

Further tools for fast experimentation and troubleshooting training jobs

With warm pools, you can start a job in less than 20 seconds. Some scenarios require real-time, hands-on interactive experimentation and troubleshooting. The open-source SageMaker SSH Helper library allows you to shell into a SageMaker training container and conduct remote development and debugging.

Conclusion

With SageMaker Training Managed Warm Pools, you can keep your model training hardware instances warm after every job for a specified period. This can reduce the startup latency for a model training job by up to 8x. SageMaker Training Managed Warm Pools are available in all public AWS Regions where SageMaker Model Training is available.

To get started, see Train Using SageMaker Managed Warm Pools.


About the authors

Romi DattaDr. Romi Datta  is a Senior Manager of Product Management in the Amazon SageMaker team responsible for training, processing and feature store. He has been in AWS for over 4 years, holding several product management leadership roles in SageMaker, S3 and IoT. Prior to AWS he worked in various product management, engineering and operational leadership roles at IBM, Texas Instruments and Nvidia. He has an M.S. and Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, and an MBA from the University of Chicago Booth School of Business.

Arun Nagarajan is a Principal Engineer with the Amazon SageMaker team focussing on the Training and MLOps areas. He has been with the SageMaker team from the launch year, enjoyed contributing to different areas in SageMaker including the realtime inference and Model Monitor products. He likes to explore the outdoors in the Pacific Northwest area and climb mountains.

Amy You is a Software Development Manager at AWS SageMaker. She focuses on bringing together a team of software engineers to build, maintain and develop new capabilities of the SageMaker Training platform that helps customers train their ML models more efficiently and easily. She has a passion for ML and AI technology, especially related to image and vision from her graduate studies. In her spare time, she loves working on music and art with her family.

Sifei Li is a Software Engineer in Amazon AI where she’s working on building Amazon Machine Learning Platforms and was part of the launch team for Amazon SageMaker. In her spare time, she likes playing music and reading.

Jenna Zhao is a Software Development Engineer at AWS SageMaker. She is passionate about ML/AI technology and has been focusing on building SageMaker Training platform that enables customers to quickly and easily train machine learning models. Outside of work, she enjoys traveling and spending time with her family.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon SageMaker Training and Processing. In his spare time, Paras enjoys spending time with his family and road biking around the Bay Area. You can find him on LinkedIn.

Gili Nachum is a senior AI/ML Specialist Solutions Architect who works as part of the EMEA Amazon Machine Learning team. Gili is passionate about the challenges of training deep learning models, and how machine learning is changing the world as we know it. In his spare time, Gili enjoy playing table tennis.

Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.

Read More

How to evaluate the quality of the synthetic data – measuring from the perspective of fidelity, utility, and privacy

How to evaluate the quality of the synthetic data – measuring from the perspective of fidelity, utility, and privacy

In an increasingly data-centric world, enterprises must focus on gathering both valuable physical information and generating the information that they need but can’t easily capture. Data access, regulation, and compliance are an increasing source of friction for innovation in analytics and artificial intelligence (AI).

For highly regulated sectors such as Financial Services, Healthcare, Life Sciences, Automotive, Robotics, and Manufacturing, the problem is even greater. It causes barriers to system design, data sharing (Internal and external), monetization, analytics, and machine learning (ML).

Synthetic data is a tool that addresses many data challenges, particularly AI and analytics issues like privacy protection, regulatory compliance, accessibility, data scarcity, and bias. This also includes data sharing and time to data (and therefore time to market).

Synthetic data is algorithmically generated. It mirrors statistical properties and patterns from the source data. But importantly it contains no sensitive, private, or personal data points.

You ask questions of the synthetic data and get the same answers that you would from the real data.

In our earlier post, we demonstrated how to use adversarial networks like Generative Adversarial Networks (GANS) to generate tabular datasets to enhance credit fraud model training.

For business stakeholders to adopt synthetic data for their ML and analytics projects, it’s imperative to not only make sure that the generated synthetic data will fit the purpose and the expected downstream applications, but also for them to be able to measure and demonstrate the quality of the generated data.

With increasing legal and ethical obligations in preserving privacy, one of synthetic data’s strengths is the ability to remove sensitive and original information during its synthesis. Therefore, in addition to quality, we need metrics to evaluate the risk of private information leaks, if any, and assess that the process of generation isn’t “memorizing” or copying any of the original data.

To achieve all of this, we can map the quality of synthetic data into dimensions, which help the users, stakeholders, and us to better understand the generated data.

The three dimensions of synthetic data quality evaluation

The synthetic data generated is measured against three key dimensions:

  1. Fidelity
  2. Utility
  3. Privacy

These are some of the questions about any generated synthetic data that should be answered by a synthetic data quality report:

  • How similar is this synthetic data as compared to the original training set?
  • How useful is this synthetic data for our downstream applications?
  • Has any information been leaked from the original training data into the synthetic data?
  • Has any data which is considered sensitive in the real world (from other data sets not used for training the model) been inadvertently synthesized by our model?

The metrics that translate each one of these dimensions for the end-users are somewhat flexible. After all, the data to be generated can vary in terms of distributions, size, and behaviors. They should also be easy to grasp and interpret.

Ultimately, the metrics must be completely data-driven, and not requiring any prior knowledge or domain-specific information. However, if the user wants to apply specific rules and constraints applicable to a specific business domain, then they should be able to define them during the synthesis process to make sure that the domain-specific fidelity is met.

We look at each of these metrics in more detail in the following sections.

Metrics to understand fidelity

In any data science project, we must understand whether a certain sample population is relevant to the problem that we’re solving. Similarly, for the process of assessing the relevance of the synthetic data generated, we must evaluate it in terms of fidelity as compared to the original.

Visual representations of these metrics make them easier to comprehend. We could illustrate whether the cardinality and ratio of categories were respected, the correlations between the different variables were kept, and so on.

Visualizing the data not only helps to evaluate the quality of the synthetic data, but also fits in as one of the initial steps in the data science lifecycle for a better understanding of the data.

Let’s dive into some fidelity metrics in more detail.

Exploratory statistical comparisons

Within the exploratory statistical comparisons, the features of the original and synthetic datasets are explored using key statistical measures, such as the mean, median, standard deviation, distinct values, missing values, minima, maxima, quartile ranges for continuous features, and the number of records per category, missing values per category, and most occurring characters for categorical attributes.

This comparison should be conducted between the original hold-out dataset and the synthetic data. This evaluation would reveal if the datasets compared are statistically similar. If they aren’t, then we’ll have an understanding of which features and measures are different. You should consider retraining and regenerating the synthetic data with different parameters if a significant difference is noted.

This test acts as an initial screening to make sure that the synthetic data has reasonable fidelity to the original dataset and can therefore usefully undergo more rigorous testing.

Histogram similarity score

The histogram similarity score measures each feature’s marginal distributions of the synthetic and original datasets.

The similarity score is bounded between zero and one, with a score of one indicating that the synthetic data distributions perfectly overlap the distributions of the original data.

A score close to one would give the users the confidence that the holdout dataset and the synthetic dataset are statistically similar.

Mutual information score

The mutual information score measures the mutual dependence of two features, numerical or categorical, indicating how much information can be obtained from one feature by observing another.

Mutual information can measure non-linear relationships, providing a more comprehensive understanding of the synthetic data quality as it lets us understand the extent of the variable’s relations preservation.

A score of one indicates that the mutual dependence between features has been perfectly captured in the synthetic data.

Correlation score

The correlation score measures how well the correlations in the original dataset have been captured in the synthetic data.

Correlations between two or more columns are extremely important for ML applications, which help uncover relationships between features and the target variable and help create a well-trained model.

The correlation score is bounded between zero and one, with a score of one indicating that the correlations have been perfectly matched.

Unlike structured tabular data, which we commonly encounter in data problems, some types of structured data have a particular behavior where past observations have a probability of influencing the following observation. These are known as time-series or sequential data – for example, a dataset with hourly measurements of room temperature.

This behavior means that there is a requirement to define certain metrics that can specifically measure the quality of these time-series datasets

Autocorrelation and partial autocorrelation score

Although similar to correlation, autocorrelation shows the relationship of a time series at its present value as it relates to its previous values. Removing the effects of the previous time lags yields partial autocorrelation. Therefore, the autocorrelation score measures how well the synthetic data has captured the significant autocorrelations, or partial correlations, from the original dataset.

Metrics to understand utility

Now we may have statistically realized that the synthetic data is similar to the original dataset. In addition, we must also assess how well the synthesized dataset fares on common data science problems when trained on several ML algorithms.

Using the following utility metrics, we aim to build confidence that we can actually achieve performance on downstream applications regarding how the original data has performed.

Prediction score

Measuring the performance of synthetic data as compared to the original real data can be done through ML models. The downstream model score captures the quality of the synthetic data by comparing the performance of ML models trained on both the synthetic and original datasets and validated on withheld testing data from the original dataset. This provides a Train Synthetic Test Real (TSTR) score and a Train Real Test Real (TRTR) score respectively.

TSTR scores and the feature importance score

TSTR, TRTR scores, and the Feature Importance Score (Image by author)

The score incorporates a wide range of the most trusted ML algorithms for either regression or classification tasks. Using several classifiers and regressors makes sure that the score is more generalizable across most algorithms, so that the synthetic data may be considered useful in the future.

In the end, if the TSTR score and TRTR score are comparable, this indicates that the synthetic data has the quality to be used to train effective ML models for real-world applications.

Feature importance score

Highly related to the prediction score, the feature importance (FI) score extends it by adding interpretability to the TSTR and TRTR scores.

The F1 score compares the changes and stability of the feature’s importance order obtained with the prediction score. A synthetic set of data is considered to be of high utility if it yields the same order of feature importance as the original real data.

QScore

To make sure that a model trained on our newly generated data is going to produce the same answers to the same questions as a model trained using the original data, we use the Qscore. This measures the downstream performance of the synthetic data by running many random aggregation-based queries on both the synthetic and original (and holdout) datasets.

The idea here is that both of these queries should return similar results.

A high QScore makes sure that downstream applications that utilize querying and aggregation operations can provide close to equal value as that of the original dataset.

Metrics to understand privacy

With privacy regulations already in place, it’s an ethical obligation and a legal requirement to make sure that sensitive information is protected.

Before this synthetic data can be shared freely and used for downstream applications, we must consider the privacy metrics that can help the stakeholder understand where the generated synthetic data stands as compared to the original data in terms of the extent of leaked information. Moreover, we must make critical decisions regarding how the synthetic data can be shared and used.

Exact match score

A direct and intuitive evaluation of privacy is to look for copies of the real data among the synthetic records. The exact match score counts the number of real records that can be found among the synthetic set.

The score should be zero, stating that no real information is present as-is in the synthetic data. This metric acts as a screening mechanism before we evaluate further privacy metrics.

Neighbors’ privacy score

Furthermore, the neighbors’ privacy score measures the ratio of synthetic records that might be too close in similarity to the real ones. This means that, although they aren’t direct copies, they are potential points of privacy leakage and a source of useful information for inference attacks.

The score is calculated by conducting a high-dimensional nearest-neighbors search on the synthetic data overlapped with the original data.

Membership inference score

In the data science lifecycle, once a model has been trained, it no longer needs access to the training samples and can make predictions on unseen data. Similarly, in our case, once the synthesizer model is trained, samples of synthetic data can be generated without the need for the original data.

Through a type of attack called “membership inference attack”, attackers can attempt to reveal the data that was used to create the synthetic data, without having the access to the original data. This results in a compromise of privacy.

The membership inference score measures the likelihood of a membership inference attack being successful.

membership inference score

A low score suggests the feasibility of inference that a particular record was a member of the training dataset that led to the creation of the synthetic data. In other words, the attacks can infer details of an individual record, thereby compromising privacy.

A high membership inference score indicates that an attacker is unlikely to determine if a particular record was part of the original dataset used to create the synthetic data. This also means that no individual’s information was compromised through the synthetic data.

The holdout concept

An important best practice that we must follow is to make sure that the synthetic data is general enough and doesn’t overfit the original data on which it was trained. In typical data science flow, while building ML models such as a Random Forest classifier, we set aside test data, train the models using the training data, and evaluate the metrics on unseen test data.

Similarly, for synthetic data, we keep aside a sample of the original data  –  generally referred to as a hold-out dataset or unseen withheld test data – and evaluate the generated synthetic data against the hold-out dataset.

The holdout dataset is expected to be a representation of the original data, yet not seen when the synthetic data was generated. Therefore, it’s vital to have similar scores for all of the metrics when comparing the original to the holdout and the synthetic datasets.

When similar scores are obtained, we can establish that the synthetic data points aren’t a result of memorization of the original data points, while preserving the same fidelity and utility.

Final thoughts

The world is starting to understand the strategic importance of synthetic data .  As data scientists and data generators, it’s our duty to build trust in the synthetic data that we generate and make sure that it’s for a purpose.

Synthetic data is evolving into a must-have in the data science development toolkit. MIT Technology Review has noted synthetic data as one of the breakthrough technologies of 2022. We can’t imagine building excellent value AI models without synthetic data, claims Gartner.

According to McKinsey, synthetic data minimizes costs and barriers that you would otherwise have when developing algorithms or getting access to data.

The generation of synthetic data is about knowing the downstream applications and understanding the trade-offs between the different dimensions for the quality of synthetic data.

Summary

As the user of the synthetic data, it’s essential to define the context of the use case for which every sample of synthetic will be used in the future. Just as with real data, the quality of the synthetic data is dependent on the use case intended, as well as the parameters chosen for synthetization.

For example, keeping outliers in the synthetic data as in the original data is useful for a fraud detection use case. However, it’s not useful in a healthcare use case with privacy concerns, as outliers generally could be information leakage.

Moreover, a tradeoff exists between fidelity, utility, and privacy. Data can’t be optimized for all three simultaneously. These metrics enable the stakeholders to prioritize what is essential for each use case and manage expectations from the generated synthetic data.

Ultimately, when we see the values of each metric and when they meet expectations, stakeholders can be confident in the solutions that they build using the synthetic data.

The use cases for structured synthetic data cover a wide gamut of application from test data for software development to creating Synthetic control arms in clinical trials.

Reach out to explore these opportunities or built a PoC to demonstrate the value.


Faris Haddad is the Data & Insights Lead in the AABG Strategic Pursuits team. He helps enterprises successfully become data-driven.

Read More

Augment fraud transactions using synthetic data in Amazon SageMaker

Augment fraud transactions using synthetic data in Amazon SageMaker

Developing and training successful machine learning (ML) fraud models requires access to large amounts of high-quality data. Sourcing this data is challenging because available datasets are sometimes not large enough or sufficiently unbiased to usefully train the ML model and may require significant cost and time. Regulation and privacy requirements further prevent data use or sharing even within an enterprise organization. The process of authorizing the use of, and access to, sensitive data often delays or derails ML projects. Alternatively, we can tackle these challenges by generating and using synthetic data.

Synthetic data describes artificially created datasets that mimic the content and patterns in the original dataset in order to address regulatory risk and compliance, time, and costs of sourcing. Synthetic data generators use the real data to learn relevant features, correlations, and patterns in order to generate required amounts of synthetic data matching the statistical qualities of the originally ingested dataset.

Synthetic Data has been in use in lab environments for over two decades; the market has evidence of utility that is accelerating adoption in commercial and public sectors. Gartner predicts that by 2024, 60 percent of the data used for the development of ML and analytics solutions will be synthetically generated and that the use of synthetic data will continue to increase substantially.

The Financial Conduct Authority, a UK regulatory body, acknowledges that “Access to data is the catalyst for innovation, and synthetic financial data could play a role in supporting innovation and enabling new entrants to develop, test, and demonstrate the value of new solutions.”

Amazon SageMaker GroundTruth currently supports synthetic data generation of labeled synthetic image data. This blog post explores tabular synthetic data generation. Structured data, such as single and relational tables, and time series data are the types most often encountered in enterprise analytics.

This is a two-part blog post; we create synthetic data in part one and evaluate its quality in part two.

In this blog post, you will learn how to use the open-source library ydata-synthetic and AWS SageMaker notebooks to synthesize tabular data for a fraud use case, where we do not have enough fraudulent transactions to train a high-accuracy fraud model. The general process of training a fraud model is covered in this post.

Overview of the solution

The aim of this tutorial is to synthesize the minority class of a highly imbalanced credit card fraud dataset using an optimized generative adversarial network (GAN) called WGAN-GP to learn patterns and statistical properties of original data and then create endless samples of synthetic data that resemble the original data. This process can also be used to enhance the original data by up-sampling rare events like fraud or to generate edge cases that are not present in the original.

We use a credit card fraud dataset published by ULB, which can be downloaded from Kaggle. Generating synthetic data for the minority class helps address problems related to imbalanced datasets, which can help in developing more accurate models.

We use AWS services, including Amazon SageMaker and Amazon S3, which incur costs to use cloud resources.

Set up the development environment

SageMaker provides a managed Jupyter notebook instance for model building, training, and deployment.

Prerequisites:

You must have an AWS account to run SageMaker. You can get started with SageMaker and try hands-on tutorials.

For instructions on setting up your Jupyter Notebook working environment, see Get Started with Amazon SageMaker Notebook Instances.

Step 1: Set up your Amazon SageMaker instance

  1. Sign in to the AWS console and search for “SageMaker.”
  2. Select Studio.
  3. Select Notebook instances on the left bar, and select Create notebook instance.
    SageMaker Landing page
  4. From the next page (as shown in the following image), select the configurations of the virtual machine (VM) according to your needs, and select Create notebook instance. Note that we used an ML optimized VM with no GPU and 5 GB of data, ml.t3.medium running an Amazon Linux 2, and Jupyter Lab 3 kernel.
    Create notebook instance
  5. A notebook instance will be ready for you to use within a few minutes.
  6. Select Open JupyterLab to launch.
  7. Now that we have a JupyterLab with our required specifications, we will install the synthetic library.
pip install ydata-synthetic

Step 2: Download or extract the real dataset to create synthetic data

Download the reference data from Kaggle either manually, as we do here, or programmatically through Kaggle API if you have a Kaggle account. If you explore this dataset, you’ll notice that the “fraud” class contains much less data than the “not fraud” class.

If you use this data directly for machine learning predictions, the models might always learn to predict “not fraud.” A model will easily have a higher accuracy in nonfraud cases since fraud cases are rare. However, since detecting the fraud cases is our objective in this exercise, we will boost the fraud class numbers with synthetic data modeled on the real data.

Create a data folder in JupyterLab and upload the Kaggle data file into it. This will let you use the data within the notebook since SageMaker comes with storage that you would have specified when you instantiated the notebook.

This dataset is 144 MB

You can then read the data using standard code via the pandas library:

import pandas as pd
data = pd.read_csv('./data/creditcard.csv')

Fraud-detection data has certain characteristics, namely:

  • Large class imbalances (typically towards nonfraud data points).
  • Privacy-related concerns (owing to the presence of sensitive data).
  • A degree of dynamism, in that a malicious user is always trying to avoid detection by systems monitoring for fraudulent transactions.
  • The available data sets are very large and often unlabeled.

Now that you have inspected the dataset, let’s filter the minority class (the “fraud” class from the credit card dataset) and perform transformations as required. You can check out the data transformations from this notebook.

When this minority class dataset is synthesized and added back to the original dataset, it allows the generation of a larger synthesized dataset that addresses the imbalance in data. We can achieve greater prediction accuracy by training a fraud detection model using the new dataset.

Let’s synthesize the new fraud dataset.

Step 3: Train the synthesizers and create the model

Since you have the data readily available within SageMaker, it’s time to put our synthetic GAN models to work.

A generative adversarial network (GAN) has two parts:

The generator learns to generate plausible data. The generated instances become negative training examples for the discriminator.

The discriminator learns to distinguish the generator’s fake data from real data. The discriminator penalizes the generator for producing implausible results.

When training begins, the generator produces obviously fake data, and the discriminator quickly learns to tell that it’s fake. As training progresses, the generator gets closer to producing output that can fool the discriminator. Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and fake. It starts to classify fake data as real, and its accuracy decreases.

Both the generator and the discriminator are neural networks. The generator output is connected directly to the discriminator input. Through backpropagation, the discriminator’s classification provides a signal that the generator uses to update its weights.

Step 4: Sample synthetic data from the synthesizer

Now that you have built and trained your model, it’s time to sample the required data by feeding noise to the model. This enables you to generate as much synthetic data as you want.

In this case, you generate an equal quantity of synthetic data to the quantity of actual data because this it makes it easier to compare the similar sample sizes in Step 5.

We have the option to sample rows containing fraudulent transactions—which, when combined with the nonsynthetic fraud data, will lead to an equal distribution of “fraud” and “not-fraud” classes. The original Kaggle dataset contained 492 frauds out of 284,807 transactions, so we create a same sample from the synthesizer.

# use the same shape as the real data
synthetic_fraud = synthesizer.sample(492)

We have the option to up-sample rows containing fraudulent transactions in a process called data augmentation—which, when combined with the nonsynthetic fraud data, will lead to an equal distribution of “fraud” and “not-fraud” classes.

Step 5: Compare and evaluate the synthetic data against the real data

Though this step is optional, you can qualitatively visualize and assess the generated synthetic data against the actual data using a scatter plot.

This helps us iterate our model by tweaking parameters, changing sample size, and making other transformations to generate the most accurate synthetic data. This nature of accuracy is always depends on the purpose of the synthesis

The image below depicts how similar the actual fraud and the synthetic fraud data points are across the training steps. This gives a good qualitative inspection of the similarity between the synthetic and the actual data and how that improves as we run it through more epochs (transit of entire training dataset through algorithm). Note that as we run more epochs, the synthetic data pattern set gets closer to the original data.

Step 6: Clean up

Finally, stop your notebook instance when you’re done with the synthesis to avoid unexpected costs.

Conclusion

As machine learning algorithms and coding frameworks evolve rapidly, high-quality data at scale is the scarcest resource in ML. Good-quality synthetic datasets can be used in a variety of tasks.

In this blog post, you learned the importance of synthesizing the dataset by using an open-source library that uses WGAN-GP. This is an active research area with thousands of papers on GANs published and many hundreds of named GANs available for you to experiment with. There are variants that are optimized for specific use cases like relational tables and time series data.

You can find all the code used for this article in this notebook, and of course, more tutorials like this are available from the SageMaker official documentation page.

In the second part of this two-part blog post series, we will do a deep dive into how to evaluate the quality of the synthetic data from a perspective of fidelity, utility, and privacy.


About the Author

Faris Haddad is the Data & Insights Lead in the AABG Strategic Pursuits team. He helps enterprises successfully become data-driven.

Read More

LightOn Lyra-fr model is now available on Amazon SageMaker

LightOn Lyra-fr model is now available on Amazon SageMaker

We are thrilled to announce the availability of the LightOn Lyra-fr foundation model for customers using Amazon SageMaker. LightOn is a leader in building foundation models specializing in European languages. Lyra-fr is a state-of-the-art French language model that can be used to build conversational AI, copywriting tools, text classifiers, semantic search, and more. You can easily try out this model and use it with Amazon SageMaker JumpStart. JumpStart is the machine learning (ML) hub of SageMaker that provides access to foundation models in addition to built-in algorithms and end-to-end solution templates to help you quickly get started with ML.

In this blog, we will demonstrate how to use the Lyra-fr model in SageMaker.

Foundation models

Foundation models are typically trained on billions of parameters and are adaptable to a wide category of use cases. The most well-known foundation models today are used to summarize articles, create digital art, and generate code from simple text instructions. These models are expensive to train, so customers want to use existing pre-trained foundation models and fine-tune them as needed rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console. You can test these models directly on the web interface. When you want to use a foundation model at scale, you can do so easily without leaving SageMaker by using pre-built notebooks from model providers. Because the models are hosted and deployed on AWS, you can rest assured that your data, whether used for evaluating or using the model at scale, is never shared with third parties.

Lyra-fr is the largest French language model available on the market today. It is a 10 billion parameter model, trained and made accessible by LightOn. Lyra-fr was trained on a large corpus of French curated data, and it is capable of writing human-like text and solving complex tasks such as classification, question answering, and summarization. All of this while maintaining reasonable inference speed, in the range of 1–2 seconds for the average request. You can simply describe the task you want to perform in natural language, and Lyra-fr will generate responses of the level of a native French speaker. Lyra-fr offers business-ready intelligence primitives, such as steerable generation and text classification, in just a few lines of code. For more challenging tasks, performance can be improved in a “few shot” learning mode, providing in the prompt a couple of input-output examples.

Using Lyra-fr on SageMaker

We’ll take you on a walkthrough of how to use the Lyra-fr model in 3 simple steps:

  • Discover – Find the Lyra-fr model on the AWS Management Console for SageMaker.
  • Test – Test the model using the web interface.
  • Deploy – Use a notebook to deploy and test the advanced capabilities of the model.

Discover

To make it easy to discover foundation models like the Lyra-fr, we have consolidated all the foundation models in one place. To find the Lyra-fr model:

  1. Sign in to the AWS Management Console for SageMaker.
  2. On the left navigation panel, you should see a section called JumpStart with Foundation models under it. Request access to this feature if you don’t have access yet.
  3. Once your account is allowlisted, you will see a list of models on the right. This is where you will find the Lyra-fr 10B model.
  4. Clicking on View model will show the full model card with additional options.

Test

A common use case is to run ad hoc tests to make sure the model meets your needs. You can test the Lyra-fr model directly from the SageMaker console. In this example, we’re going to use a simple text prompt by asking the model to generate a list of article ideas for the topic of “watercolor” or “l’aquarelle” in French.

  1. From the model card shown in the previous section, select Try out model. This will open a new tab with the test interface.
  2. On this interface, provide the text input you would like to pass to the model. You can also tune any parameters you would like using the sliders on the right. Once you’re satisfied, select Generate text.

Note that foundation models and their output are from the model provider, and AWS is not responsible for the content or accuracy therein.

Deploy

Text generation models work best when you provide examples of information you want the model to provide. This is called few-shot learning. We will demo this capability using the Lyra-fr sample notebook. The sample notebook goes through how to deploy the Lyra-fr model on SageMaker, how to summarize and generate text, and few-shot learning.

It also includes examples of making the inference requests directly using JSON or with the Lyra Python SDK. The Lyra Python SDK takes care of formatting the input, calling the endpoint, and unpacking the output. There is one class per endpoint: Create, Analyze, Select, Embed, Compare, and Tokenize. Note that this example uses an ml.p4d.24xlarge instance. If your default limit for your AWS account is 0, you need to request a limit increase for this GPU instance.

SageMaker offers a managed notebook experience through SageMaker Studio. For details on how to set up SageMaker Studio, see the Amazon SageMaker Developer Guide. We’re going to clone this GitHub repo into the SageMaker Studio in this demo, but the notebook will work in other environments as well.

Let’s take a look at how to run the notebook:

  1. Go to the model card from the Discover section in this blog post, and select View notebook. You should see a new tab open in GitHub with the Lyra-fr notebook.
  2. In GitHub, select lightonmuse-sagemaker-sdk; this will bring you to the repo. Select the Code button and copy the HTTPS URL.
  3. Open SageMaker Studio. Select Clone a Repository and then paste in the URL copied from above.
  4. Navigate to the Lyra-fr notebook using the file browser on the left.
  5. This notebook runs end to end without additional input needed and also cleans up the resources it creates. We can take a look at the “using Create for sentiment analysis” example. This example uses the Lyra Python SDK and demonstrates few-shot learning by teaching the model with a few examples of what text should be categorized as positive (positifs), negative (négatifs), or mixed (mitigés).
  6. You can see that, with the Lyra Python SDK, all you have to do is provide the name of the SageMaker endpoint and the input. The SDK handles all the parsing, formatting, and setup for you.
  7. Running this prompt returns that the last statement is a positive one.

Clean up

After you have tested the endpoint, make sure you delete the SageMaker inference endpoint and delete the model to avoid incurring charges.

Conclusion

In this post, we showed you how to discover, test, and deploy the Lyra-fr model using Amazon SageMaker. Request access to try out the foundation model in SageMaker today, and let us know your feedback!


About the authors

Iacopo Poli is the CTO of LightOn, responsible for strategic technical choices for the company in building very large language models and offering them to the public. He is passionate about democratization of Machine Learning through intuitive interfaces. In his spare time, he enjoys the quest for the best restaurants in Paris.

Alan TanAlan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Read More