Introducing Fortuna: A library for uncertainty quantification


Proper estimation of predictive uncertainty is fundamental in applications that involve critical decisions. Uncertainty can be used to assess the reliability of model predictions, trigger human intervention, or decide whether a model can be safely deployed in the wild.

We introduce Fortuna, an open-source library for uncertainty quantification. Fortuna provides calibration methods, such as conformal prediction, that can be applied to any trained neural network to obtain calibrated uncertainty estimates. The library further supports a number of Bayesian inference methods that can be applied to deep neural networks written in Flax. The library makes it easy to run benchmarks and will enable practitioners to build robust and reliable AI solutions by taking advantage of advanced uncertainty quantification techniques.

The problem of overconfidence in deep learning

If you have ever looked at class probabilities returned by a trained deep neural network classifier, you might have observed that the probability of one class was much larger than the others. Something like this, for example:

p = [0.0001, 0.0002, …, 0.9991, 0.0003, …, 0.0001]

If this is the case for the majority of the predictions, your model might be overconfident. In order to evaluate the validity of the probabilities returned by the classifier, we may compare them with the actual accuracy achieved over a holdout data set. Indeed, it is natural to assume that the proportion of correctly classified data points should approximately match the estimated probability of the predicted class. This concept is known as calibration [Guo C. et al., 2017].

Unfortunately, many trained deep neural networks are miscalibrated, meaning that the estimated probability of the predicted class is much higher than the proportion of correctly classified input data points. In other words, the classifier is overconfident.

Being overconfident might be problematic in practice. A doctor may not order relevant additional tests as a result of an overconfident healthy diagnosis produced by an AI. A self-driving car may decide not to brake because it confidently assessed that the object in front was not a person. A governor may decide not to evacuate a town because the probability of an imminent natural disaster estimated by an AI is deemed too low. In these and many other applications, calibrated uncertainty estimates are critical to assess the reliability of model predictions, fall back to a human decision-maker, or decide whether a model can be safely deployed.

Fortuna: A library for uncertainty quantification

There are many published techniques to either estimate or calibrate the uncertainty of predictions, e.g., Bayesian inference [Wilson A.G., 2020], temperature scaling [Guo C. et al., 2017], and conformal prediction [Angelopoulos A.N. et al., 2022] methods. However, existing tools and libraries for uncertainty quantification have a narrow scope and do not offer a breadth of techniques in a single place. This results in a significant overhead, hindering the adoption of uncertainty into production systems.

In order to fill this gap, we launch Fortuna, a library for uncertainty quantification that brings together prominent methods across the literature and makes them available to users with a standardized and intuitive interface.

As an example, suppose you have training, calibration, and test data loaders in tensorflow.Tensor format, namely train_data_loader, calib_data_loader and test_data_loader. Furthermore, you have a deep learning model written in Flax, namely model. Then you can use Fortuna to:

  1. fit a posterior distribution;
  2. calibrate the model outputs;
  3. make calibrated predictions;
  4. compute uncertainty estimates;
  5. compute evaluation metrics.

The following code does all of this for you.

from fortuna.data import DataLoader
from fortuna.prob_model.classification import ProbClassifier
from fortuna.metric.classification import expected_calibration_error

# convert data loaders
train_data_loader = DataLoader.from_tensorflow_data_loader(train_data_loader)
calib_data_loader = DataLoader.from_tensorflow_data_loader(calib_data_loader)
test_data_loader = DataLoader.from_tensorflow_data_loader(test_data_loader)

# define and train a probabilistic model
prob_model = ProbClassifier(model=model)
train_status = prob_model.train(train_data_loader=train_data_loader, calib_data_loader=calib_data_loader)

# make predictions and estimate uncertainty
test_inputs_loader = test_data_loader.to_inputs_loader()
test_means = prob_model.predictive.mean(inputs_loader=test_inputs_loader)
test_modes = prob_model.predictive.mode(inputs_loader=test_inputs_loader, means=test_means)

# compute the expected calibration error
test_targets = test_data_loader.to_array_targets()
ece = expected_calibration_error(preds=test_modes, probs=test_means, targets=test_targets)

The code above makes use of several default choices, including SWAG [Maddox W.J. et al., 2019] as a posterior inference method, temperature scaling [Guo C. et al., 2017] to calibrate the model outputs, and a standard Gaussian prior distribution, as well as the configuration of the posterior fitting and calibration processes. You can easily configure all of these components, and you are highly encouraged to do so if you are looking for a specific configuration or if you want to compare several ones.

Usage modes

Fortuna offers three usage modes: 1/ Starting from uncertainty estimates, 2/ Starting from model outputs, and 3/ Starting from Flax models. Their pipelines are depicted in the following figure, each starting from one of the green panels. The code snippet above is an example of using Fortuna starting from Flax models, which allows training a model using Bayesian inference procedures. Alternatively, you can start either from model outputs or directly from your own uncertainty estimates. Both of these latter modes are framework independent and help you obtain calibrated uncertainty estimates starting from a trained model.

1/ Starting from uncertainty estimates

Starting from uncertainty estimates has minimal compatibility requirements, and it is the quickest level of interaction with the library. This usage mode offers conformal prediction methods for both classification and regression. These take uncertainty estimates in numpy.ndarray format and return rigorous sets of predictions that retain a user-given level of probability. In one-dimensional regression tasks, conformal sets may be thought of as calibrated versions of confidence or credible intervals.
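To make the idea concrete, the following is a minimal, framework-agnostic sketch of split conformal prediction for one-dimensional regression, assuming you already have point predictions for a calibration set and a test set. It is only an illustration of the technique; Fortuna provides its own conformal prediction classes behind its standardized interface.

import numpy as np

# Illustrative sketch of split conformal regression (not Fortuna's API):
# absolute residuals on the calibration set act as conformity scores.
def split_conformal_interval(calib_preds, calib_targets, test_preds, alpha=0.05):
    scores = np.abs(np.asarray(calib_targets) - np.asarray(calib_preds))
    n = len(scores)
    # Index of the finite-sample-corrected (1 - alpha) quantile of the scores.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(scores)[min(k, n) - 1]
    # Symmetric prediction intervals around the test point predictions.
    return np.asarray(test_preds) - q, np.asarray(test_preds) + q

Intervals built this way contain the true target with probability at least 1 - alpha, provided the calibration data is exchangeable with the test data and the calibration set is large enough.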

Keep in mind that if the uncertainty estimates you provide as inputs are inaccurate, conformal sets might be large and unusable. For this reason, if your application allows it, please consider the Starting from model outputs and Starting from Flax models usage modes detailed below.

2/ Starting from model outputs

This mode assumes you have already trained a model in some framework and arrive at Fortuna with model outputs in numpy.ndarray format for each input data point. This usage mode allows you to calibrate your model outputs, estimate uncertainty, compute metrics, and obtain conformal sets.

Compared to the Starting from uncertainty estimates usage mode, Starting from model outputs provides better control, as it makes sure that uncertainty estimates have been appropriately calibrated. However, if the model was trained with classical methods, the resulting quantification of model (aka epistemic) uncertainty may be poor. To mitigate this problem, please consider the Starting from Flax models usage mode.

3/ Starting from Flax models

Starting from Flax models has higher compatibility requirements than the Starting from uncertainty estimates and Starting from model outputs usage modes, as it requires deep learning models written in Flax. However, it enables you to replace standard model training with scalable Bayesian inference procedures, which may significantly improve the quantification of predictive uncertainty.

Bayesian methods work by representing uncertainty over which solution is correct, given limited information, through uncertainty over model parameters. This type of uncertainty is called “epistemic” uncertainty. Because neural networks can represent many different solutions, corresponding to different settings of their parameters, Bayesian methods can be especially impactful in deep learning. We provide many scalable Bayesian inference procedures, which can often be used to provide uncertainty estimates, as well as improved accuracy and calibration, with essentially no training-time overhead.

Conclusion

We announced the general availability of Fortuna, a library for uncertainty quantification in deep learning. Fortuna brings together prominent methods across the literature, e.g., conformal methods, temperature scaling, and Bayesian inference, and makes them available to users with a standardized and intuitive interface. To get started with Fortuna, consult the documentation and the examples in the project's repository.

Try Fortuna out, and let us know what you think! You are encouraged to contribute to the library or leave your suggestions and contributions: just create an issue or open a pull request. On our side, we will keep improving Fortuna, increasing its coverage of uncertainty quantification methods, and adding further examples that showcase its usefulness in several scenarios.


About the authors

 

Gianluca Detommaso is an Applied Scientist at AWS. He currently works on uncertainty quantification in deep learning. In his spare time, Gianluca likes practicing sports, eating great food, and learning new skills.

Alberto Gasparin is an Applied Scientist within Amazon Community Shopping since July 2021. His interests include natural language processing, information retrieval and uncertainty quantification. He is a food and wine enthusiast.

Michele Donini is a Sr Applied Scientist at AWS. He leads a team of scientists working on Responsible AI and his research interests are Algorithmic Fairness and Explainable Machine Learning.

Matthias Seeger is a Principal Applied Scientist at AWS.

Cedric Archambeau is a Principal Applied Scientist at AWS and Fellow of the European Lab for Learning and Intelligent Systems.

Andrew Gordon Wilson is an Associate Professor at the Courant Institute of Mathematical Sciences and Center for Data Science at New York University, and an Amazon Visiting Academic at AWS. He is particularly engaged in building methods for Bayesian and probabilistic deep learning, scalable Gaussian processes, Bayesian optimization, and physics-inspired machine learning.


Best practices for Amazon SageMaker Training Managed Warm Pools


Amazon SageMaker Training Managed Warm Pools gives you the flexibility to opt in to reuse and hold on to the underlying infrastructure for a user-defined period of time, while still passing the undifferentiated heavy lifting of managing compute instances to Amazon SageMaker Model Training. In this post, we outline the key benefits and pain points addressed by SageMaker Training Managed Warm Pools, as well as benchmarks and best practices.

Overview of SageMaker Training Managed Warm Pools

SageMaker Model Training is a fully managed capability that spins up instances for every job, trains a model, and then spins down the instances when the job completes. You're billed only for the duration of the job, down to the second. This fully managed capability gives you the freedom to focus on your machine learning (ML) algorithm instead of undifferentiated heavy lifting like infrastructure management while training your models.

This mechanism necessitates a finite startup time for a training job. Although this startup time, also known as cold-start startup time, is fairly low, some of our most demanding customer use cases require even lower startup times, such as under 20 seconds. There are two prominent use cases that have these requirements:

  • The first is active ML experimentation by data scientists using the Amazon SageMaker training platform, especially while training large models, like GPT-3, that require multiple iterations to get to a production-ready state.
  • The second is the programmatic launch of a large number (on the order of several hundreds or thousands) of consecutive jobs on the same kind of instances on a scheduled cadence, for example, parameter search or incremental training.

For such use cases, every second spent on overhead, like the startup time for a training job, has a cumulative effect on all these jobs.

With SageMaker Training Managed Warm Pools, data scientists and ML engineers have the ability to opt in to keep SageMaker training instances or multi-instance clusters warm for a prespecified and reconfigurable time (keep_alive_period_in_seconds) after each training job completes. So even though you incur a cold-start penalty for the first training job run on an instance or cluster, all subsequent training jobs that start on the instance before keep_alive_period_in_seconds expires don't incur the cold-start startup time overhead, because the instances are already up and running. This can reduce training job startup times to under 20 seconds (P90).

Data scientists and ML engineers can use SageMaker Training Managed Warm Pools to keep single or multiple instances warm in between training runs for experimentation, or to run multiple jobs consecutively on the same single-instance or multi-instance cluster. You pay only for the duration of the training jobs and for the reconfigurable keep_alive_period_in_seconds that you specify for each instance.

In essence, SageMaker Training Managed Warm Pools combine SageMaker managed instance utilization with the ability to opt in, provision capacity, and self-manage utilization for short intervals of time. You configure the interval before a job, but if you need to reduce or increase it while it is active, you can do so. Increases to keep_alive_period_in_seconds can be made in increments of up to 60 minutes, with a maximum period of 7 days for an instance or cluster.

To get started with warm pools, first request a warm pool quota limit increase, then specify the keep_alive_period_in_seconds parameter when starting a training job.
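As a rough sketch of what this looks like with the SageMaker Python SDK (the image URI, role, and S3 paths below are placeholders), you request the warm pool by setting the keep-alive period on the estimator:

from sagemaker.estimator import Estimator

# Sketch only: replace the image URI, role, and S3 paths with your own values.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="<your-sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    keep_alive_period_in_seconds=1800,  # keep the instance warm for 30 minutes after the job
)
estimator.fit({"training": "s3://<your-bucket>/<training-data-prefix>/"})

Subsequent jobs submitted with matching criteria (such as the same instance type and IAM role) before the keep-alive period expires reuse the warm pool.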

Benchmarks

We performed benchmarking tests to measure job startup latency using a 1.34 GB TensorFlow image, 2 GB of data, and different training data input modes (Amazon FSx, Fast File Mode, File Mode). The tests were run across a variety of instance types from the m4, c4, m5, and c5 families in the us-east-2 Region. The startup latency was measured as the time from job creation to the start of the actual training job on the instances. The first jobs, which started the cluster and created the warm pool, had a startup latency of 2–3 minutes. This higher latency is due to the time taken to provision the infrastructure, download the image, and download the data. Subsequent jobs that utilized the warm pool cluster had a startup latency of approximately 20 seconds for Fast File Mode (FFM) or Amazon FSx, and 70 seconds for File Mode (FM). This delta is a result of FM requiring the entire dataset to be downloaded from Amazon S3 prior to the start of the job.

Your choice of training data input mode affects the startup time, even with warm pools. Guidance on which input mode to select is in the best practices section later in this post.

The following table summarizes the job startup latency P90 for different training data input modes.

Data Input Mode | First Job (seconds) | Warm Pool Jobs, second job onwards (seconds)
FSx | 136 | 19
Fast File Mode | 143 | 21
File Mode | 176 | 70

Best practices for using warm pools

In the following section, we share some best practices when using warm pools.

When should you use warm pools?

Warm pools are recommended in the following scenarios:

  • You are interactively experimenting and tuning your script over a series of short jobs.
  • You are running your own custom-made, large-scale hyperparameter optimization (for example, Syne Tune).
  • You have a batch process that runs a large number (on the order of several hundreds or thousands) of consecutive jobs on the same kind of instances on a daily or weekly cadence. For example, training an ML model per city.

Warm pools are not recommended when it’s unlikely that someone will reuse the warm pool before it expires. For example, a single lengthy job that runs via an automated ML pipeline.

Minimize warm pool training job startup latency

Training jobs that reuse a warm pool start faster than the first job that created the warm pool. This is due to keeping the ML instances running between jobs with a cached training container Docker image to skip pulling the container from Amazon Elastic Container Registry (Amazon ECR). However, even when reusing a warm pool, certain initialization steps occur for all jobs. Optimizing these steps can reduce your job startup time (both first and subsequent jobs). Consider the following:

  • Training data input mode can affect startup time – Managed training data input channels are recreated for each training job, contributing to job startup latency. So doing initial experiments over a smaller dataset will allow for faster startup time (and faster training time). For later stages of experimentation, when a large dataset is needed, consider using an input mode type that has minimal or fixed initialization time. For example, FILE input mode copies the entire dataset from Amazon Simple Storage Service (Amazon S3) to the training instance, which is time-consuming for large datasets (even with warm pools). Fast File Mode is better suited for lower startup latency because only S3 object metadata needs to be read from Amazon S3 before the workload can start. The Amazon FSx for Lustre, or Amazon Elastic File System (Amazon EFS) file system input mode, has a fixed initialization time regardless of the number of files in the file system, which is beneficial when working with a large dataset.
    For more information on how to choose an input channel, see Choose the best data source for your Amazon SageMaker training job.
  • Reduce runtime installation of packages – Any software installation that takes place during container startup, for example, Python’s pip or operating system apt-get, will increase training job latency. Minimizing this startup latency requires making a trade-off between the flexibility and simplicity of runtime installations vs. installation at container build time. If you use your own Docker container with SageMaker, refer to Adapting Your Own Docker Container to Work with SageMaker. If you rely on prebuilt SageMaker container images, you’ll need to extend a prebuilt container and explicitly manage these containers. Consider this if your runtime installs significantly increase startup latency.
  • Avoid updating your Docker image frequently – If you use your own Docker container with SageMaker, try to avoid updating it every job run. If the Docker image changes between job submissions, the warm pool will be reused, but the startup process will need to re-pull the container image from Amazon ECR instead of reusing a cached container image. If the Docker image must be updated, confine the updates to the last Docker layer to take advantage of Docker layer caching. Ideally, you should remove the Dockerfile content that's likely to change over iterations, like hyperparameters, dataset definitions, and the ML code itself. To iterate on ML code without having to rebuild Docker images with each change, you can adopt the framework container paradigm advocated in the SageMaker Training Toolkit. If you'd like to develop a framework container with your own code, refer to this Amazon SageMaker tutorial.

Share warm pools between multiple users

When working with a large team of data scientists, you can share warm pools that have matching job criteria, such as the same AWS Identity and Access Management (IAM) role or container image.

Let’s look at an example timeline. User-1 starts a training job that completes and results in a new warm pool created. When user-2 starts a training job, the job will reuse the existing warm pool, resulting in a fast job startup. While user-2’s job is running with the warm pool in use, if another user starts a training job, then a second warm pool will be created.

This reuse behavior helps reduce costs by sharing warm pools between users that start similar jobs. If you want to avoid sharing warm pools between users, then users’ jobs must not have matching job criteria (for example, they must use a different IAM role).

Notify users on job completion

When using warm pools for experimentation, we recommend notifying users when their job is complete. This allows users to resume experimentation before the warm pool expires or stop the warm pool if it’s no longer needed. You can also automatically trigger notifications through Amazon EventBridge.
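As a sketch of one way to wire this up with boto3 (the rule name and SNS topic ARN are placeholders, and you could equally configure this in the EventBridge console), you can route SageMaker training job state changes to an SNS topic:

import json
import boto3

events = boto3.client("events")

# Sketch: forward SageMaker training job state changes to an SNS topic (placeholder ARN).
events.put_rule(
    Name="notify-training-job-completion",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": ["Completed", "Failed", "Stopped"]},
    }),
    State="ENABLED",
)
events.put_targets(
    Rule="notify-training-job-completion",
    Targets=[{"Id": "sns-notification", "Arn": "arn:aws:sns:<region>:<account-id>:<topic-name>"}],
)

Note that the SNS topic's access policy must allow EventBridge to publish to it.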

Further tools for fast experimentation and troubleshooting training jobs

With warm pools, you can start a job in less than 20 seconds. Some scenarios require real-time, hands-on interactive experimentation and troubleshooting. The open-source SageMaker SSH Helper library allows you to shell into a SageMaker training container and conduct remote development and debugging.

Conclusion

With SageMaker Training Managed Warm Pools, you can keep your model training hardware instances warm after every job for a specified period. This can reduce the startup latency for a model training job by up to 8x. SageMaker Training Managed Warm Pools are available in all public AWS Regions where SageMaker Model Training is available.

To get started, see Train Using SageMaker Managed Warm Pools.


About the authors

Dr. Romi Datta is a Senior Manager of Product Management in the Amazon SageMaker team, responsible for training, processing, and feature store. He has been at AWS for over 4 years, holding several product management leadership roles in SageMaker, S3, and IoT. Prior to AWS, he worked in various product management, engineering, and operational leadership roles at IBM, Texas Instruments, and Nvidia. He has an M.S. and Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, and an MBA from the University of Chicago Booth School of Business.

Arun Nagarajan is a Principal Engineer with the Amazon SageMaker team, focusing on the Training and MLOps areas. He has been with the SageMaker team since the launch year and has enjoyed contributing to different areas of SageMaker, including the real-time inference and Model Monitor products. He likes to explore the outdoors in the Pacific Northwest and climb mountains.

Amy You is a Software Development Manager at AWS SageMaker. She focuses on bringing together a team of software engineers to build, maintain and develop new capabilities of the SageMaker Training platform that helps customers train their ML models more efficiently and easily. She has a passion for ML and AI technology, especially related to image and vision from her graduate studies. In her spare time, she loves working on music and art with her family.

Sifei Li is a Software Engineer in Amazon AI where she’s working on building Amazon Machine Learning Platforms and was part of the launch team for Amazon SageMaker. In her spare time, she likes playing music and reading.

Jenna Zhao is a Software Development Engineer at AWS SageMaker. She is passionate about ML/AI technology and has been focusing on building the SageMaker Training platform, which enables customers to quickly and easily train machine learning models. Outside of work, she enjoys traveling and spending time with her family.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon SageMaker Training and Processing. In his spare time, Paras enjoys spending time with his family and road biking around the Bay Area. You can find him on LinkedIn.

Gili Nachum is a senior AI/ML Specialist Solutions Architect who works as part of the EMEA Amazon Machine Learning team. Gili is passionate about the challenges of training deep learning models and how machine learning is changing the world as we know it. In his spare time, Gili enjoys playing table tennis.

Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.


How to evaluate the quality of the synthetic data – measuring from the perspective of fidelity, utility, and privacy


In an increasingly data-centric world, enterprises must focus on gathering both valuable physical information and generating the information that they need but can’t easily capture. Data access, regulation, and compliance are an increasing source of friction for innovation in analytics and artificial intelligence (AI).

For highly regulated sectors such as financial services, healthcare, life sciences, automotive, robotics, and manufacturing, the problem is even greater. It creates barriers to system design, data sharing (internal and external), monetization, analytics, and machine learning (ML).

Synthetic data is a tool that addresses many data challenges, particularly AI and analytics issues like privacy protection, regulatory compliance, accessibility, data scarcity, and bias. This also includes data sharing and time to data (and therefore time to market).

Synthetic data is algorithmically generated. It mirrors the statistical properties and patterns of the source data. But, importantly, it contains no sensitive, private, or personal data points.

You ask questions of the synthetic data and get the same answers that you would from the real data.

In our earlier post, we demonstrated how to use adversarial networks like generative adversarial networks (GANs) to generate tabular datasets to enhance credit fraud model training.

For business stakeholders to adopt synthetic data for their ML and analytics projects, it’s imperative to not only make sure that the generated synthetic data will fit the purpose and the expected downstream applications, but also for them to be able to measure and demonstrate the quality of the generated data.

With increasing legal and ethical obligations in preserving privacy, one of synthetic data’s strengths is the ability to remove sensitive and original information during its synthesis. Therefore, in addition to quality, we need metrics to evaluate the risk of private information leaks, if any, and assess that the process of generation isn’t “memorizing” or copying any of the original data.

To achieve all of this, we can map the quality of synthetic data into dimensions, which help the users, stakeholders, and us to better understand the generated data.

The three dimensions of synthetic data quality evaluation

The synthetic data generated is measured against three key dimensions:

  1. Fidelity
  2. Utility
  3. Privacy

These are some of the questions about any generated synthetic data that should be answered by a synthetic data quality report:

  • How similar is this synthetic data to the original training set?
  • How useful is this synthetic data for our downstream applications?
  • Has any information been leaked from the original training data into the synthetic data?
  • Has any data which is considered sensitive in the real world (from other data sets not used for training the model) been inadvertently synthesized by our model?

The metrics that translate each one of these dimensions for the end-users are somewhat flexible. After all, the data to be generated can vary in terms of distributions, size, and behaviors. They should also be easy to grasp and interpret.

Ultimately, the metrics must be completely data-driven and must not require any prior knowledge or domain-specific information. However, if the user wants to apply specific rules and constraints applicable to a specific business domain, then they should be able to define them during the synthesis process to make sure that the domain-specific fidelity is met.

We look at each of these metrics in more detail in the following sections.

Metrics to understand fidelity

In any data science project, we must understand whether a certain sample population is relevant to the problem that we’re solving. Similarly, for the process of assessing the relevance of the synthetic data generated, we must evaluate it in terms of fidelity as compared to the original.

Visual representations of these metrics make them easier to comprehend. We could illustrate whether the cardinality and ratio of categories were respected, the correlations between the different variables were kept, and so on.

Visualizing the data not only helps to evaluate the quality of the synthetic data, but also fits in as one of the initial steps in the data science lifecycle for a better understanding of the data.

Let’s dive into some fidelity metrics in more detail.

Exploratory statistical comparisons

Within the exploratory statistical comparisons, the features of the original and synthetic datasets are explored using key statistical measures: for continuous features, the mean, median, standard deviation, number of distinct values, missing values, minima, maxima, and quartile ranges; for categorical attributes, the number of records per category, missing values per category, and the most frequently occurring characters.

This comparison should be conducted between the original hold-out dataset and the synthetic data. This evaluation would reveal if the datasets compared are statistically similar. If they aren’t, then we’ll have an understanding of which features and measures are different. You should consider retraining and regenerating the synthetic data with different parameters if a significant difference is noted.

This test acts as an initial screening to make sure that the synthetic data has reasonable fidelity to the original dataset and can therefore usefully undergo more rigorous testing.

Histogram similarity score

The histogram similarity score measures each feature’s marginal distributions of the synthetic and original datasets.

The similarity score is bounded between zero and one, with a score of one indicating that the synthetic data distributions perfectly overlap the distributions of the original data.

A score close to one would give the users the confidence that the holdout dataset and the synthetic dataset are statistically similar.
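As a rough illustration of one way such a score could be computed for a single numeric feature (not necessarily the exact formulation used here), consider the histogram intersection between the two datasets:

import numpy as np

# Sketch: histogram intersection for one numeric feature, bounded in [0, 1];
# 1 means the marginal distributions overlap perfectly.
def histogram_similarity(real_col, synth_col, bins=50):
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    real_hist, _ = np.histogram(real_col, bins=bins, range=(lo, hi))
    synth_hist, _ = np.histogram(synth_col, bins=bins, range=(lo, hi))
    real_p = real_hist / real_hist.sum()
    synth_p = synth_hist / synth_hist.sum()
    # Overlapping probability mass of the two normalized histograms.
    return float(np.minimum(real_p, synth_p).sum())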

Mutual information score

The mutual information score measures the mutual dependence of two features, numerical or categorical, indicating how much information can be obtained from one feature by observing another.

Mutual information can measure non-linear relationships, providing a more comprehensive understanding of the synthetic data quality as it lets us understand the extent of the variable’s relations preservation.

A score of one indicates that the mutual dependence between features has been perfectly captured in the synthetic data.
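A minimal sketch of checking how well pairwise dependence is preserved, using normalized mutual information from scikit-learn (one possible formulation, not necessarily the one used here):

from sklearn.metrics import normalized_mutual_info_score

# Sketch: compare the normalized mutual information between two categorical (or
# discretized) columns in the real and synthetic data; a ratio near 1 means the
# dependence between the features was preserved.
def mutual_info_preservation(real_df, synth_df, col_a, col_b):
    real_mi = normalized_mutual_info_score(real_df[col_a], real_df[col_b])
    synth_mi = normalized_mutual_info_score(synth_df[col_a], synth_df[col_b])
    hi = max(real_mi, synth_mi)
    return 1.0 if hi == 0 else min(real_mi, synth_mi) / hi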

Correlation score

The correlation score measures how well the correlations in the original dataset have been captured in the synthetic data.

Correlations between two or more columns are extremely important for ML applications, which help uncover relationships between features and the target variable and help create a well-trained model.

The correlation score is bounded between zero and one, with a score of one indicating that the correlations have been perfectly matched.
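A sketch of one way to summarize this for numeric columns, by comparing the Pearson correlation matrices of the two datasets (assuming both have the same numeric columns in the same order):

import numpy as np

# Sketch: 1 means the correlation structure is perfectly matched; correlations
# lie in [-1, 1], so the mean absolute difference is rescaled by 2.
def correlation_score(real_df, synth_df):
    real_corr = real_df.corr(numeric_only=True).to_numpy()
    synth_corr = synth_df.corr(numeric_only=True).to_numpy()
    return float(1.0 - np.abs(real_corr - synth_corr).mean() / 2.0)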

Unlike the structured tabular data we commonly encounter in data problems, some types of structured data have a particular behavior in which past observations influence the probability of the following observation. These are known as time-series or sequential data – for example, a dataset with hourly measurements of room temperature.

This behavior creates a requirement for metrics that can specifically measure the quality of these time-series datasets.

Autocorrelation and partial autocorrelation score

Although similar to correlation, autocorrelation shows the relationship of a time series at its present value to its previous values. Removing the effects of the previous time lags yields partial autocorrelation. Therefore, the autocorrelation score measures how well the synthetic data has captured the significant autocorrelations, or partial autocorrelations, from the original dataset.
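As an illustration for a single univariate series (assuming statsmodels is available), one could compare the first few autocorrelation lags of the real and synthetic series:

import numpy as np
from statsmodels.tsa.stattools import acf

# Sketch: compare the first `nlags` autocorrelation coefficients; autocorrelations
# lie in [-1, 1], so the mean absolute difference is rescaled by 2.
def autocorrelation_score(real_series, synth_series, nlags=20):
    real_acf = acf(real_series, nlags=nlags, fft=True)
    synth_acf = acf(synth_series, nlags=nlags, fft=True)
    return float(1.0 - np.abs(real_acf - synth_acf).mean() / 2.0)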

Metrics to understand utility

At this point, we may have established statistically that the synthetic data is similar to the original dataset. In addition, we must also assess how well the synthesized dataset fares on common data science problems when used to train several ML algorithms.

Using the following utility metrics, we aim to build confidence that downstream applications trained on the synthetic data can achieve performance comparable to that of the original data.

Prediction score

Measuring the performance of synthetic data as compared to the original real data can be done through ML models. The downstream model score captures the quality of the synthetic data by comparing the performance of ML models trained on both the synthetic and original datasets and validated on withheld testing data from the original dataset. This provides a Train Synthetic Test Real (TSTR) score and a Train Real Test Real (TRTR) score respectively.

Figure: TSTR and TRTR scores, and the feature importance score (image by author)

The score incorporates a wide range of the most trusted ML algorithms for either regression or classification tasks. Using several classifiers and regressors makes sure that the score is more generalizable across most algorithms, so that the synthetic data may be considered useful in the future.

In the end, if the TSTR score and TRTR score are comparable, this indicates that the synthetic data has the quality to be used to train effective ML models for real-world applications.
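A minimal sketch of the TSTR/TRTR comparison using a single example classifier (the target column name is a placeholder, and in practice a range of classifiers and regressors would be used):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Sketch: train on real (TRTR) and on synthetic (TSTR) data, evaluate both on the
# same withheld real test set; assumes a binary target column.
def tstr_trtr_scores(real_train, synth_train, real_holdout, target="target"):
    scores = {}
    for name, train_df in [("TRTR", real_train), ("TSTR", synth_train)]:
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(train_df.drop(columns=[target]), train_df[target])
        probs = model.predict_proba(real_holdout.drop(columns=[target]))[:, 1]
        scores[name] = roc_auc_score(real_holdout[target], probs)
    return scores  # comparable TSTR and TRTR scores indicate useful synthetic data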

Feature importance score

Highly related to the prediction score, the feature importance (FI) score extends it by adding interpretability to the TSTR and TRTR scores.

The FI score compares the changes and stability of the feature importance order obtained with the prediction score. A synthetic dataset is considered to be of high utility if it yields the same order of feature importance as the original real data.

QScore

To make sure that a model trained on our newly generated data is going to produce the same answers to the same questions as a model trained using the original data, we use the QScore. This measures the downstream performance of the synthetic data by running many random aggregation-based queries on both the synthetic and original (and holdout) datasets.

The idea here is that these queries should return similar results on both datasets.

A high QScore makes sure that downstream applications that utilize querying and aggregation operations can provide close to equal value as that of the original dataset.
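A sketch of a single such query comparison (in practice many random group-by and aggregation queries would be drawn):

# Sketch: compare one group-by aggregation on the real and synthetic datasets;
# a small relative difference means the two datasets answer the query similarly.
def query_difference(real_df, synth_df, group_col, agg_col):
    real_agg = real_df.groupby(group_col)[agg_col].mean()
    synth_agg = synth_df.groupby(group_col)[agg_col].mean()
    joined = real_agg.to_frame("real").join(synth_agg.to_frame("synth"), how="inner")
    return float((joined["real"] - joined["synth"]).abs().mean()
                 / (joined["real"].abs().mean() + 1e-9))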

Metrics to understand privacy

With privacy regulations already in place, it’s an ethical obligation and a legal requirement to make sure that sensitive information is protected.

Before this synthetic data can be shared freely and used for downstream applications, we must consider the privacy metrics that can help the stakeholder understand where the generated synthetic data stands as compared to the original data in terms of the extent of leaked information. Moreover, we must make critical decisions regarding how the synthetic data can be shared and used.

Exact match score

A direct and intuitive evaluation of privacy is to look for copies of the real data among the synthetic records. The exact match score counts the number of real records that can be found among the synthetic set.

The score should be zero, stating that no real information is present as-is in the synthetic data. This metric acts as a screening mechanism before we evaluate further privacy metrics.
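This check is straightforward to sketch with pandas, assuming both datasets share the same columns:

# Sketch: count synthetic rows that appear verbatim in the real dataset
# (the result should be zero).
def exact_match_count(real_df, synth_df):
    return synth_df.merge(real_df.drop_duplicates(), how="inner").shape[0]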

Neighbors’ privacy score

Furthermore, the neighbors’ privacy score measures the ratio of synthetic records that might be too close in similarity to the real ones. This means that, although they aren’t direct copies, they are potential points of privacy leakage and a source of useful information for inference attacks.

The score is calculated by conducting a high-dimensional nearest-neighbors search on the synthetic data overlapped with the original data.
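A sketch of this check using scikit-learn's nearest-neighbors search (features are assumed to be numeric and already scaled, and the distance threshold is a placeholder that depends on your data):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Sketch: fraction of synthetic records whose nearest real record is suspiciously
# close; a high ratio flags potential privacy leakage.
def too_close_ratio(real_X, synth_X, distance_threshold=0.1):
    nn = NearestNeighbors(n_neighbors=1).fit(real_X)
    distances, _ = nn.kneighbors(synth_X)
    return float(np.mean(distances[:, 0] < distance_threshold))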

Membership inference score

In the data science lifecycle, once a model has been trained, it no longer needs access to the training samples and can make predictions on unseen data. Similarly, in our case, once the synthesizer model is trained, samples of synthetic data can be generated without the need for the original data.

Through a type of attack called a “membership inference attack”, attackers can attempt to reveal the data that was used to create the synthetic data, without having access to the original data. This results in a compromise of privacy.

The membership inference score measures the likelihood of a membership inference attack being successful.


A low score suggests that an attacker could feasibly infer that a particular record was a member of the training dataset used to create the synthetic data. In other words, the attacker can infer details of an individual record, thereby compromising privacy.

A high membership inference score indicates that an attacker is unlikely to determine if a particular record was part of the original dataset used to create the synthetic data. This also means that no individual’s information was compromised through the synthetic data.

The holdout concept

An important best practice that we must follow is to make sure that the synthetic data is general enough and doesn’t overfit the original data on which it was trained. In typical data science flow, while building ML models such as a Random Forest classifier, we set aside test data, train the models using the training data, and evaluate the metrics on unseen test data.

Similarly, for synthetic data, we keep aside a sample of the original data – generally referred to as a hold-out dataset or unseen withheld test data – and evaluate the generated synthetic data against the hold-out dataset.

The holdout dataset is expected to be a representation of the original data, yet not seen when the synthetic data was generated. Therefore, it’s vital to have similar scores for all of the metrics when comparing the original to the holdout and the synthetic datasets.

When similar scores are obtained, we can establish that the synthetic data points aren’t a result of memorization of the original data points, while preserving the same fidelity and utility.

Final thoughts

The world is starting to understand the strategic importance of synthetic data. As data scientists and data generators, it's our duty to build trust in the synthetic data that we generate and to make sure that it is fit for purpose.

Synthetic data is evolving into a must-have in the data science development toolkit. MIT Technology Review has noted synthetic data as one of the breakthrough technologies of 2022, and, according to Gartner, we can't imagine building excellent-value AI models without synthetic data.

According to McKinsey, synthetic data minimizes costs and barriers that you would otherwise have when developing algorithms or getting access to data.

The generation of synthetic data is about knowing the downstream applications and understanding the trade-offs between the different dimensions for the quality of synthetic data.

Summary

As the user of the synthetic data, it's essential to define the context of the use case for which every sample of synthetic data will be used in the future. Just as with real data, the quality of the synthetic data depends on the intended use case, as well as the parameters chosen for synthesis.

For example, keeping outliers in the synthetic data as in the original data is useful for a fraud detection use case. However, it’s not useful in a healthcare use case with privacy concerns, as outliers generally could be information leakage.

Moreover, a tradeoff exists between fidelity, utility, and privacy. Data can’t be optimized for all three simultaneously. These metrics enable the stakeholders to prioritize what is essential for each use case and manage expectations from the generated synthetic data.

Ultimately, when we see the values of each metric and when they meet expectations, stakeholders can be confident in the solutions that they build using the synthetic data.

The use cases for structured synthetic data cover a wide gamut of applications, from test data for software development to creating synthetic control arms in clinical trials.

Reach out to explore these opportunities or build a PoC to demonstrate the value.


Faris Haddad is the Data & Insights Lead in the AABG Strategic Pursuits team. He helps enterprises successfully become data-driven.


Augment fraud transactions using synthetic data in Amazon SageMaker


Developing and training successful machine learning (ML) fraud models requires access to large amounts of high-quality data. Sourcing this data is challenging because available datasets are sometimes not large enough or sufficiently unbiased to usefully train the ML model and may require significant cost and time. Regulation and privacy requirements further prevent data use or sharing even within an enterprise organization. The process of authorizing the use of, and access to, sensitive data often delays or derails ML projects. Alternatively, we can tackle these challenges by generating and using synthetic data.

Synthetic data describes artificially created datasets that mimic the content and patterns in the original dataset in order to address regulatory risk and compliance, time, and costs of sourcing. Synthetic data generators use the real data to learn relevant features, correlations, and patterns in order to generate required amounts of synthetic data matching the statistical qualities of the originally ingested dataset.

Synthetic data has been in use in lab environments for over two decades; the market has evidence of utility that is accelerating adoption in commercial and public sectors. Gartner predicts that by 2024, 60 percent of the data used for the development of ML and analytics solutions will be synthetically generated and that the use of synthetic data will continue to increase substantially.

The Financial Conduct Authority, a UK regulatory body, acknowledges that “Access to data is the catalyst for innovation, and synthetic financial data could play a role in supporting innovation and enabling new entrants to develop, test, and demonstrate the value of new solutions.”

Amazon SageMaker Ground Truth currently supports synthetic data generation of labeled synthetic image data. This blog post explores tabular synthetic data generation. Structured data, such as single and relational tables, and time series data are the types most often encountered in enterprise analytics.

This is a two-part blog post; we create synthetic data in part one and evaluate its quality in part two.

In this blog post, you will learn how to use the open-source library ydata-synthetic and AWS SageMaker notebooks to synthesize tabular data for a fraud use case, where we do not have enough fraudulent transactions to train a high-accuracy fraud model. The general process of training a fraud model is covered in this post.

Overview of the solution

The aim of this tutorial is to synthesize the minority class of a highly imbalanced credit card fraud dataset using an optimized generative adversarial network (GAN) called WGAN-GP to learn patterns and statistical properties of original data and then create endless samples of synthetic data that resemble the original data. This process can also be used to enhance the original data by up-sampling rare events like fraud or to generate edge cases that are not present in the original.

We use a credit card fraud dataset published by ULB, which can be downloaded from Kaggle. Generating synthetic data for the minority class helps address problems related to imbalanced datasets, which can help in developing more accurate models.

We use AWS services, including Amazon SageMaker and Amazon S3, which incur costs to use cloud resources.

Set up the development environment

SageMaker provides a managed Jupyter notebook instance for model building, training, and deployment.

Prerequisites:

You must have an AWS account to run SageMaker. You can get started with SageMaker and try hands-on tutorials.

For instructions on setting up your Jupyter Notebook working environment, see Get Started with Amazon SageMaker Notebook Instances.

Step 1: Set up your Amazon SageMaker instance

  1. Sign in to the AWS console and search for “SageMaker.”
  2. Select Studio.
  3. Select Notebook instances on the left bar, and select Create notebook instance.
    SageMaker Landing page
  4. From the next page (as shown in the following image), select the configuration of the virtual machine (VM) according to your needs, and select Create notebook instance. Note that we used an ML-optimized VM with no GPU and 5 GB of data: an ml.t3.medium instance running an Amazon Linux 2, Jupyter Lab 3 kernel.
    Create notebook instance
  5. A notebook instance will be ready for you to use within a few minutes.
  6. Select Open JupyterLab to launch.
  7. Now that we have a JupyterLab with our required specifications, we will install the synthetic library.
pip install ydata-synthetic

Step 2: Download or extract the real dataset to create synthetic data

Download the reference data from Kaggle either manually, as we do here, or programmatically through the Kaggle API if you have a Kaggle account. If you explore this dataset, you'll notice that the “fraud” class contains much less data than the “not fraud” class.

If you use this data directly for machine learning predictions, the models might always learn to predict “not fraud.” A model will easily achieve high accuracy on nonfraud cases because fraud cases are rare. However, since detecting the fraud cases is our objective in this exercise, we will boost the fraud class numbers with synthetic data modeled on the real data.

Create a data folder in JupyterLab and upload the Kaggle data file into it. This will let you use the data within the notebook since SageMaker comes with storage that you would have specified when you instantiated the notebook.

This dataset is 144 MB.

You can then read the data using standard code via the pandas library:

import pandas as pd
data = pd.read_csv('./data/creditcard.csv')

Fraud-detection data has certain characteristics, namely:

  • Large class imbalances (typically towards nonfraud data points).
  • Privacy-related concerns (owing to the presence of sensitive data).
  • A degree of dynamism, in that a malicious user is always trying to avoid detection by systems monitoring for fraudulent transactions.
  • The available data sets are very large and often unlabeled.

Now that you have inspected the dataset, let’s filter the minority class (the “fraud” class from the credit card dataset) and perform transformations as required. You can check out the data transformations from this notebook.
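As a rough sketch of this step (the exact transformations in the post's notebook may differ), the minority class can be isolated and its unscaled columns standardized like this:

from sklearn.preprocessing import StandardScaler

# Sketch: keep only the fraudulent transactions (Class == 1 in the Kaggle dataset)
# and standardize the two columns that are not already PCA-transformed.
fraud = data[data["Class"] == 1].copy()   # 492 rows in the original dataset
scaler = StandardScaler()
fraud[["Time", "Amount"]] = scaler.fit_transform(fraud[["Time", "Amount"]])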

When this minority class dataset is synthesized and added back to the original dataset, it allows the generation of a larger synthesized dataset that addresses the imbalance in data. We can achieve greater prediction accuracy by training a fraud detection model using the new dataset.

Let’s synthesize the new fraud dataset.

Step 3: Train the synthesizers and create the model

Since you have the data readily available within SageMaker, it’s time to put our synthetic GAN models to work.

A generative adversarial network (GAN) has two parts:

The generator learns to generate plausible data. The generated instances become negative training examples for the discriminator.

The discriminator learns to distinguish the generator’s fake data from real data. The discriminator penalizes the generator for producing implausible results.

When training begins, the generator produces obviously fake data, and the discriminator quickly learns to tell that it’s fake. As training progresses, the generator gets closer to producing output that can fool the discriminator. Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and fake. It starts to classify fake data as real, and its accuracy decreases.

Both the generator and the discriminator are neural networks. The generator output is connected directly to the discriminator input. Through backpropagation, the discriminator’s classification provides a signal that the generator uses to update its weights.
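For intuition, here is a minimal TensorFlow sketch of the gradient penalty term that gives WGAN-GP its name. It is only an illustration of the technique for tabular (2D) batches, not the ydata-synthetic implementation.

import tensorflow as tf

# Sketch: WGAN-GP penalizes the critic (discriminator) when the norm of its gradient,
# evaluated on points interpolated between real and generated samples, deviates from 1.
def gradient_penalty(critic, real_batch, fake_batch, gp_weight=10.0):
    batch_size = tf.shape(real_batch)[0]
    eps = tf.random.uniform([batch_size, 1], 0.0, 1.0)
    interpolated = eps * real_batch + (1.0 - eps) * fake_batch
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        critic_scores = critic(interpolated, training=True)
    grads = tape.gradient(critic_scores, interpolated)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=1) + 1e-12)
    return gp_weight * tf.reduce_mean(tf.square(grad_norm - 1.0))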

Step 4: Sample synthetic data from the synthesizer

Now that you have built and trained your model, it’s time to sample the required data by feeding noise to the model. This enables you to generate as much synthetic data as you want.

In this case, you generate a quantity of synthetic data equal to the quantity of actual data, because it makes it easier to compare similar sample sizes in Step 5.

We have the option to sample rows containing fraudulent transactions at the same scale as the original data. The original Kaggle dataset contained 492 frauds out of 284,807 transactions, so we create a sample of the same size from the synthesizer.

# sample the same number of fraudulent records (492) as in the real data
synthetic_fraud = synthesizer.sample(492)

We also have the option to up-sample rows containing fraudulent transactions in a process called data augmentation: generating enough synthetic fraud records so that, when combined with the nonsynthetic fraud data, they lead to an equal distribution of “fraud” and “not-fraud” classes.

Step 5: Compare and evaluate the synthetic data against the real data

Though this step is optional, you can qualitatively visualize and assess the generated synthetic data against the actual data using a scatter plot.

This helps us iterate on our model by tweaking parameters, changing the sample size, and making other transformations to generate the most accurate synthetic data. What counts as accurate always depends on the purpose of the synthesis.

The image below depicts how similar the actual fraud and the synthetic fraud data points are across the training steps. This gives a good qualitative inspection of the similarity between the synthetic and the actual data and how it improves as we run more epochs (a full pass of the entire training dataset through the algorithm). Note that as we run more epochs, the synthetic data pattern set gets closer to the original data.

Step 6: Clean up

Finally, stop your notebook instance when you’re done with the synthesis to avoid unexpected costs.

Conclusion

As machine learning algorithms and coding frameworks evolve rapidly, high-quality data at scale is the scarcest resource in ML. Good-quality synthetic datasets can be used in a variety of tasks.

In this blog post, you learned the importance of synthesizing the dataset by using an open-source library that uses WGAN-GP. This is an active research area with thousands of papers on GANs published and many hundreds of named GANs available for you to experiment with. There are variants that are optimized for specific use cases like relational tables and time series data.

You can find all the code used for this article in this notebook, and of course, more tutorials like this are available from the SageMaker official documentation page.

In the second part of this two-part blog post series, we will do a deep dive into how to evaluate the quality of the synthetic data from a perspective of fidelity, utility, and privacy.


About the Author

Faris Haddad is the Data & Insights Lead in the AABG Strategic Pursuits team. He helps enterprises successfully become data-driven.


LightOn Lyra-fr model is now available on Amazon SageMaker


We are thrilled to announce the availability of the LightOn Lyra-fr foundation model for customers using Amazon SageMaker. LightOn is a leader in building foundation models specializing in European languages. Lyra-fr is a state-of-the-art French language model that can be used to build conversational AI, copywriting tools, text classifiers, semantic search, and more. You can easily try out this model and use it with Amazon SageMaker JumpStart. JumpStart is the machine learning (ML) hub of SageMaker that provides access to foundation models in addition to built-in algorithms and end-to-end solution templates to help you quickly get started with ML.

In this blog, we will demonstrate how to use the Lyra-fr model in SageMaker.

Foundation models

Foundation models are typically trained on billions of parameters and are adaptable to a wide category of use cases. The most well-known foundation models today are used to summarize articles, create digital art, and generate code from simple text instructions. These models are expensive to train, so customers want to use existing pre-trained foundation models and fine-tune them as needed rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console. You can test these models directly on the web interface. When you want to use a foundation model at scale, you can do so easily without leaving SageMaker by using pre-built notebooks from model providers. Because the models are hosted and deployed on AWS, you can rest assured that your data, whether used for evaluating or using the model at scale, is never shared with third parties.

Lyra-fr is the largest French language model available on the market today. It is a 10-billion-parameter model, trained and made accessible by LightOn. Lyra-fr was trained on a large corpus of curated French data, and it is capable of writing human-like text and solving complex tasks such as classification, question answering, and summarization, all while maintaining reasonable inference speed, in the range of 1–2 seconds for the average request. You can simply describe the task you want to perform in natural language, and Lyra-fr will generate responses at the level of a native French speaker. Lyra-fr offers business-ready intelligence primitives, such as steerable generation and text classification, in just a few lines of code. For more challenging tasks, performance can be improved in a “few-shot” learning mode, by providing a couple of input-output examples in the prompt.

Using Lyra-fr on SageMaker

We'll walk you through how to use the Lyra-fr model in three simple steps:

  • Discover – Find the Lyra-fr model on the AWS Management Console for SageMaker.
  • Test – Test the model using the web interface.
  • Deploy – Use a notebook to deploy and test the advanced capabilities of the model.

Discover

To make it easy to discover foundation models like the Lyra-fr, we have consolidated all the foundation models in one place. To find the Lyra-fr model:

  1. Sign in to the AWS Management Console for SageMaker.
  2. On the left navigation panel, you should see a section called JumpStart with Foundation models under it. Request access to this feature if you don’t have access yet.
  3. Once your account is allowlisted, you will see a list of models on the right. This is where you will find the Lyra-fr 10B model.
  4. Clicking on View model will show the full model card with additional options.

Test

A common use case is to run ad hoc tests to make sure the model meets your needs. You can test the Lyra-fr model directly from the SageMaker console. In this example, we’re going to use a simple text prompt by asking the model to generate a list of article ideas for the topic of “watercolor” or “l’aquarelle” in French.

  1. From the model card shown in the previous section, select Try out model. This will open a new tab with the test interface.
  2. On this interface, provide the text input you would like to pass to the model. You can also tune any parameters you would like using the sliders on the right. Once you’re satisfied, select Generate text.

Note that foundation models and their output are from the model provider, and AWS is not responsible for the content or accuracy therein.

Deploy

Text generation models work best when you provide examples of the kind of output you want the model to produce. This is called few-shot learning. We will demo this capability using the Lyra-fr sample notebook. The sample notebook goes through how to deploy the Lyra-fr model on SageMaker, how to summarize and generate text, and how to use few-shot learning.

It also includes examples of making the inference requests directly using JSON or with the Lyra Python SDK. The Lyra Python SDK takes care of formatting the input, calling the endpoint, and unpacking the output. There is one class per endpoint: Create, Analyze, Select, Embed, Compare, and Tokenize. Note that this example uses an ml.p4d.24xlarge instance. If your default limit for your AWS account is 0, you need to request a limit increase for this GPU instance.
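
If you prefer to call the deployed endpoint without the Lyra Python SDK, you can send a JSON request directly with Boto3. The following is a minimal sketch; the endpoint name and the payload fields ("text", "params") are illustrative assumptions, and the exact request schema is documented in the sample notebook.

import json
import boto3

# SageMaker runtime client used to invoke the deployed endpoint.
runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name and payload shape; check the Lyra-fr sample
# notebook for the exact JSON schema the model container expects.
payload = {
    "text": "Liste d'idées d'articles sur l'aquarelle :",
    "params": {"max_tokens": 64},
}

response = runtime.invoke_endpoint(
    EndpointName="lyra-fr-10b-endpoint",  # replace with your endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Read the response body (assuming the container returns JSON).
print(json.loads(response["Body"].read()))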

SageMaker offers a managed notebook experience through SageMaker Studio. For details on how to set up SageMaker Studio, see the Amazon SageMaker Developer Guide. We’re going to clone this GitHub repo into the SageMaker Studio in this demo, but the notebook will work in other environments as well.

Let’s take a look at how to run the notebook:

  1. Go to the model card from the Discover section in this blog post, and select View notebook. You should see a new tab open in GitHub with the Lyra-fr notebook.
  2. In GitHub, select lightonmuse-sagemaker-sdk; this will bring you to the repo. Select the Code button and copy the HTTPS URL.
  3. Open SageMaker Studio. Select Clone a Repository and then paste in the URL copied from above.
  4. Navigate to the Lyra-fr notebook using the file browser on the left.
  5. This notebook runs end to end without additional input needed and also cleans up the resources it creates. We can take a look at the “using Create for sentiment analysis” example. This example uses the Lyra Python SDK and demonstrates few-shot learning by teaching the model with a few examples of what text should be categorized as positive (positifs), negative (négatifs), or mixed (mitigés).
  6. You can see that, with the Lyra Python SDK, all you have to do is provide the name of the SageMaker endpoint and the input. The SDK handles all the parsing, formatting, and setup for you.
  7. Running this prompt returns that the last statement is a positive one.

Clean up

After you have tested the endpoint, make sure you delete the SageMaker inference endpoint and delete the model to avoid incurring charges.
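
If you deployed the model with the sample notebook, you can also remove the resources programmatically. The resource names below are placeholders; use the names the notebook created.

import boto3

sm = boto3.client("sagemaker")

# Placeholder resource names; replace with the ones created by the notebook.
sm.delete_endpoint(EndpointName="lyra-fr-10b-endpoint")
sm.delete_endpoint_config(EndpointConfigName="lyra-fr-10b-endpoint-config")
sm.delete_model(ModelName="lyra-fr-10b-model")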

Conclusion

In this post, we showed you how to discover, test, and deploy the Lyra-fr model using Amazon SageMaker. Request access to try out the foundation model in SageMaker today, and let us know your feedback!


About the authors

Iacopo Poli is the CTO of LightOn, responsible for strategic technical choices for the company in building very large language models and offering them to the public. He is passionate about democratization of Machine Learning through intuitive interfaces. In his spare time, he enjoys the quest for the best restaurants in Paris.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Read More

Automatically identify languages in multi-lingual audio using Amazon Transcribe

If you operate in a country with multiple official languages or across multiple regions, your audio files can contain different languages. Participants may be speaking entirely different languages or may switch between languages. Consider a customer service call to report a problem in an area with a substantial multi-lingual population. Although the conversation could begin in one language, it’s feasible that the customer might change to another language to describe the problem, depending on comfort level or usage preferences with other languages. In a similar vein, the customer care representative may transition between languages while conveying operating or troubleshooting instructions.

With a minimum of 3 seconds of audio, Amazon Transcribe can automatically identify and efficiently generate transcripts in the languages spoken in the audio without needing humans to specify the languages. This applies to various use cases such as transcribing customer calls, converting voicemails to text, capturing meeting interactions, tracking user forum communications, or monitoring media content production and localization workflows.

This post walks through the steps for transcribing a multi-language audio file using Amazon Transcribe. We discuss how to make audio files available to Amazon Transcribe and enable transcription of multi-lingual audio files when calling Amazon Transcribe APIs.

Solution overview

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for you to convert speech to text and add speech-to-text functionality to any application. You can ingest audio input using Amazon Transcribe, create clear transcripts that are easy to read and review, increase accuracy with customization, and filter information to protect client privacy.

The solution also uses Amazon Simple Storage Service (Amazon S3), an object storage service built to store and retrieve any amount of data from anywhere. It’s a simple storage service that offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost. When you store data in Amazon S3, you work with resources known as buckets and objects. A bucket is a container for objects. An object is a file and any metadata that describes the file.

In this post, we walk you through the following steps to implement a multi-lingual audio transcription solution:

  1. Create an S3 bucket.
  2. Upload your audio file to the bucket.
  3. Create the transcription job.
  4. Review the job output.

Prerequisites

For this walkthrough, you need an AWS account with access to the Amazon Transcribe and Amazon S3 consoles.

Amazon Transcribe provides the option to store transcribed output in either a service-managed or customer-managed S3 bucket. For this post, we have Amazon Transcribe write the results to a service-managed S3 bucket.

Note that Amazon Transcribe is a Regional service; the Amazon Transcribe API endpoint you call must be in the same Region as your S3 buckets.

Create an S3 bucket to store your audio input files

To create your S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. For Bucket name, enter a globally unique name for the bucket.
  3. For AWS Region, choose the same Region as your Amazon Transcribe API endpoints.
  4. Leave all defaults as is.
  5. Choose Create bucket.
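
If you prefer the AWS SDK, the same bucket can be created with a few lines of Boto3. The bucket name and Region below are placeholders.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket names must be globally unique; this one is a placeholder.
# For Regions other than us-east-1, pass a CreateBucketConfiguration
# with a LocationConstraint matching your Region.
s3.create_bucket(Bucket="my-transcribe-audio-input")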

Upload your audio file to the S3 bucket

Upload your multi-lingual audio file to the S3 bucket in your AWS account. For the purpose of this exercise, we use the following sample multi-lingual audio file. It captures a customer support call involving English and Spanish languages.

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you created previously for storing the input audio files.
  3. Choose Upload.
  4. Choose Add files.
  5. Choose the audio file you want to transcribe from your local computer.
  6. Choose Upload.

Your audio file will shortly be available in the S3 bucket.
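
The upload can also be scripted; the following sketch uses placeholder file, bucket, and key names.

import boto3

s3 = boto3.client("s3")

# Placeholder local path, bucket, and object key.
s3.upload_file(
    Filename="customer-support-call.wav",
    Bucket="my-transcribe-audio-input",
    Key="customer-support-call.wav",
)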

Create the transcription job

With the audio file uploaded, we now create a transcription job.

  1. On the Amazon Transcribe console, choose Transcription jobs in the navigation pane.
  2. Choose Create job.
  3. For Name, enter a unique name for the job.
    This will also be the name of the output transcript file.
  4. For Language settings, select Automatic multiple languages identification.
    This feature enables Amazon Transcribe to automatically identify and transcribe all languages spoken in the audio file.
  5. For Language options for automatic language identification, leave it unselected.
    Amazon Transcribe automatically identifies and transcribes all languages spoken in the audio. To improve transcription accuracy, you can optionally select two or more languages you know were spoken in the audio.
  6. For Model type, only the General model option is available at the time of writing this post.
  7. For Input data, choose Browse S3.
  8. Choose the audio source file we uploaded previously.
  9. For Output data, you can select either Service-managed S3 bucket or Customer specified S3 bucket. For this post, select Service-managed S3 bucket.
  10. Choose Next.
  11. Choose Create job.
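
The equivalent job can be started with the SDK. The sketch below uses placeholder names and writes the transcript to the service-managed bucket because no output bucket is specified; IdentifyMultipleLanguages turns on automatic multiple languages identification.

import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="multi-language-call-demo",  # placeholder job name
    Media={
        "MediaFileUri": "s3://my-transcribe-audio-input/customer-support-call.wav"
    },
    # Identify and transcribe all languages spoken in the audio.
    IdentifyMultipleLanguages=True,
    # Optionally constrain identification to languages you expect:
    # LanguageOptions=["en-US", "es-US"],
)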

Review the job output

When the transcription job is complete, open the job to review the results.

Scroll down to the Transcription preview section. The audio transcription is displayed on the Text tab. The transcription includes both the English and Spanish portions of the conversation.

You can optionally download a copy of the transcript as a JSON file, which you could use for further post-call analytics.
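
You can also check the job status and fetch the transcript location from the SDK; with service-managed output, the transcript URI is a time-limited presigned URL. The job name below is the placeholder used earlier.

import boto3

transcribe = boto3.client("transcribe")

job = transcribe.get_transcription_job(
    TranscriptionJobName="multi-language-call-demo"
)["TranscriptionJob"]

if job["TranscriptionJobStatus"] == "COMPLETED":
    # Presigned URL to the JSON transcript in the service-managed bucket.
    print(job["Transcript"]["TranscriptFileUri"])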

Clean up

To avoid incurring future charges, empty and delete the S3 bucket you created for storing the input audio source file. Make sure you have the files stored elsewhere because this will permanently remove all objects contained within the bucket. On the Amazon Transcribe console, select and delete the job previously created for the transcription.
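
A minimal clean-up sketch using the placeholder names from this walkthrough is shown below.

import boto3

# Delete the transcription job.
boto3.client("transcribe").delete_transcription_job(
    TranscriptionJobName="multi-language-call-demo"
)

# Empty and delete the input bucket; this permanently removes its objects.
bucket = boto3.resource("s3").Bucket("my-transcribe-audio-input")
bucket.objects.all().delete()
bucket.delete()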

Conclusion

In this post, we created an end-to-end workflow to automate identification and transcription of multi-lingual audio files, without writing any code. We used the new functionality in Amazon Transcribe to automatically identify different languages in an audio file and transcribe each language correctly.

For more information, refer to Language identification with batch transcription jobs.


About the Authors

Murtuza Bootwala is a Senior Solutions Architect at AWS with an interest in AI/ML technologies. He enjoys working with customers to help them achieve their business outcomes. Outside of work, he enjoys outdoor activities and spending time with family.

Victor Rojo is passionate about AI/ML and software development. He helped get Amazon Alexa up and running in the US and Mexico. He also brought Amazon Textract to AWS Partners and got AWS Contact Center Intelligence (CCI) off the ground. He’s currently the Global Tech Leader for Conversational AI Partners.

Babu Srinivasan is an AWS Sr. Specialist SA (Language AI Services) based out of Chicago. He focuses on Amazon Transcribe (speech to text), helping our customers use AI services to solve business problems. Outside of work, he enjoys woodworking and performing magic shows.

Read More

Translate multiple source language documents to multiple target languages using Amazon Translate

Enterprises need to translate business-critical content such as marketing materials, instruction manuals, and product catalogs across multiple languages to communicate with a global audience of customers, partners, and stakeholders. Identifying the source language in each document before calling a translation job creates complexities and adds another step to your workflow. For example, an international product company with its customer support operations located in their corporate office requires their agents to translate emails or documents to support customer requests. Previously, they had to set up workflows to identify the dominant language in each document, group the documents by language, and set up a batch translation job for each source language. Now, Amazon Translate’s automatic language detection feature for batch translation jobs allows you to translate a batch of documents written in various languages with a single translation job, removing the need to orchestrate a document translation workflow for dominant-language identification and grouping. Amazon Translate also supports up to 10 target languages in a single translation job, which eliminates the need to create separate batch jobs for individual target languages. Customers can now create documentation in multiple languages, all with a single API call.

In this post, we demonstrate how to translate documents into multiple target languages in a batch translation job.

Solution overview

Automatic detection of the source language for batch translation jobs allows you to translate documents written in various supported languages in a single operation. You can also provide up to 10 target languages. The job processes each document, identifies the dominant source language, and translates the document into each of the target languages. Amazon Translate uses Amazon Comprehend to determine the dominant language in each of your source documents and uses it as the source language.

In the following sections, we demonstrate how to create a batch translation job via the AWS Management Console or the AWS SDK.

Create a batch translation job via console

In this example, we configure Amazon Translate batch translation to automatically detect the source language and translate it to English and Hindi, using the input and output Amazon Simple Storage Service (Amazon S3) bucket locations provided.

[Screenshot: Create translation job]

Next, we create an AWS Identity and Access Management (IAM) role that gets provisioned as part of the configuration. The role is given access to the input and output S3 buckets.

After the job is created, you can monitor the progress of the batch translation job in the Translation jobs section.

[Screenshot: Translation jobs section]

When the translation job is complete, you can navigate to the output S3 bucket location and observe that the documents have been translated into the target languages. Our input consisted of two files, sample-doc.txt and sample-doc-2.txt, in two different languages. Each document was translated into two target languages, for a total of four output documents.

[Screenshot: Translated documents in the output S3 bucket]

Create a batch translation job via the AWS SDK

The following Python (Boto3) code calls the batch translation API, start_text_translation_job, to translate documents in your source S3 bucket. Specify the following parameters:

  • InputDataConfig – Provide the S3 bucket location of your input documents
  • OutputDataConfig – Provide the S3 bucket location of your output documents
  • DataAccessRoleArn – Provide the ARN of an IAM role that gives Amazon Translate permission to access your input and output S3 buckets
  • SourceLanguageCode – Use auto
  • TargetLanguageCodes – Choose up to 10 target languages

import boto3

client = boto3.client('translate')


def lambda_handler(event, context):
    # Start an asynchronous batch translation job. The source language is
    # detected automatically for each document, and each document is
    # translated into English ('en') and Hindi ('hi').
    response = client.start_text_translation_job(
        JobName='auto-translate-multi-language-sdk',
        InputDataConfig={
            'S3Uri': 's3://<<REPLACE-WITH-YOUR-INPUT-BUCKET>>/input-sdk',
            'ContentType': 'text/plain'
        },
        OutputDataConfig={
            'S3Uri': 's3://<<REPLACE-WITH-YOUR-OUTPUT-BUCKET>>/output-sdk',
        },
        DataAccessRoleArn='<<REPLACE-WITH-THE-IAM-ROLE-ARN>>',
        SourceLanguageCode='auto',
        TargetLanguageCodes=[
            'en', 'hi'
        ]
    )
    # The response contains the JobId and initial JobStatus of the job.
    return response
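
The start_text_translation_job call returns a JobId that you can use to track progress. A brief sketch, reusing the client and the response from the snippet above, follows.

# Poll the batch translation job; JobStatus moves through SUBMITTED and
# IN_PROGRESS to COMPLETED (or FAILED).
job = client.describe_text_translation_job(JobId=response['JobId'])
print(job['TextTranslationJobProperties']['JobStatus'])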

Clean up

To clean up after using this solution, complete the following steps:

  1. Delete the S3 buckets that you created.
  2. Delete IAM roles that you set up.
  3. Delete any other resources that you set up for this post.

Conclusion

With today’s need to reach a global audience with limited resources, Amazon Translate helps you simplify your multi-language processing workflows. By automatically detecting the dominant language in your source documents for batch translation jobs and translating them into up to 10 target languages, you can focus on your business logic rather than dealing with the operational burden of sorting documents and managing multiple batch translation jobs.

We strive to add features to our service that make it easier for our customers to innovate. Try this solution and let us know how it helps simplify your document processing workloads.


About the authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Sid Padgaonkar is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him playing squash and exploring the food scene in the Pacific NW.

Read More