Prepare time series data with Amazon SageMaker Data Wrangler
Time series data is widely present in our lives. Stock prices, house prices, weather information, and sales data captured over time are just a few examples. As businesses increasingly look for new ways to gain meaningful insights from time series data, visualizing the data and applying the desired transformations are fundamental steps. However, time series data possesses unique characteristics and nuances compared to other kinds of tabular data, and requires special considerations. For example, standard tabular or cross-sectional data is collected at a specific point in time. In contrast, time series data is captured repeatedly over time, with each successive data point dependent on its past values.
Because most time series analyses rely on the information gathered across a contiguous set of observations, missing data and inherent sparseness can reduce the accuracy of forecasts and introduce bias. Additionally, most time series analysis approaches rely on equal spacing between data points, in other words, periodicity. Therefore, the ability to fix data spacing irregularities is a critical prerequisite. Finally, time series analysis often requires the creation of additional features that can help explain the inherent relationship between input data and future predictions. All these factors differentiate time series projects from traditional machine learning (ML) scenarios and demand a distinct analytical approach.
This post walks through how to use Amazon SageMaker Data Wrangler to apply time series transformations and prepare your dataset for time series use cases.
Use cases for Data Wrangler
Data Wrangler provides a no-code/low-code solution to time series analysis with features to clean, transform, and prepare data faster. It also enables data scientists to prepare time series data in adherence to their forecasting model’s input format requirements. The following are a few ways you can use these capabilities:
- Descriptive analysis– Usually, step one of any data science project is understanding the data. When we plot time series data, we get a high-level overview of its patterns, such as trend, seasonality, cycles, and random variations. It helps us decide the correct forecasting methodology for accurately representing these patterns. Plotting can also help identify outliers, preventing unrealistic and inaccurate forecasts. Data Wrangler comes with a seasonality-trend decomposition visualization for representing components of a time series, and an outlier detection visualization to identify outliers.
- Explanatory analysis– For multi-variate time series, the ability to explore, identify, and model the relationship between two or more time series is essential for obtaining meaningful forecasts. The Group by transform in Data Wrangler creates multiple time series by grouping the data by the values of specified columns. Additionally, Data Wrangler time series transforms, where applicable, allow specification of additional ID columns to group on, enabling complex time series analysis.
- Data preparation and feature engineering– Time series data is rarely in the format expected by time series models. It often requires data preparation to convert raw data into time series-specific features. You may want to validate that time series data is regularly or equally spaced prior to analysis. For forecasting use cases, you may also want to incorporate additional time series characteristics, such as autocorrelation and statistical properties. With Data Wrangler, you can quickly create time series features such as lag columns for multiple lag periods, resample data to multiple time granularities, and automatically extract statistical properties of a time series, to name a few capabilities.
Solution overview
This post elaborates on how data scientists and analysts can use Data Wrangler to visualize and prepare time series data. We use the bitcoin cryptocurrency dataset from cryptodatadownload with bitcoin trading details to showcase these capabilities. We clean, validate, and transform the raw dataset with time series features and also generate bitcoin volume price forecasts using the transformed dataset as input.
The sample of bitcoin trading data covers January 1 – November 19, 2021, with 464,116 data points. The dataset attributes include the timestamp of the price record, the opening (first) price at which the coin was exchanged on a given day, the highest price at which the coin was exchanged on the day, the last price at which the coin was exchanged on the day, the volume exchanged on the day in BTC, and the corresponding value in USD.
Prerequisites
Download the Bitstamp_BTCUSD_2021_minute.csv file from cryptodatadownload and upload it to Amazon Simple Storage Service (Amazon S3).
Import bitcoin dataset in Data Wrangler
To start the ingestion process to Data Wrangler, complete the following steps:
- On the SageMaker Studio console, on the File menu, choose New, then choose Data Wrangler Flow.
- Rename the flow as desired.
- For Import data, choose Amazon S3.
- Upload the Bitstamp_BTCUSD_2021_minute.csv file from your S3 bucket.
You can now preview your data set.
- In the Details pane, choose Advanced configuration and deselect Enable sampling.
This is a relatively small data set, so we don’t need sampling.
- Choose Import.
You have successfully created the flow diagram and are ready to add transformation steps.
Add transformations
To add data transformations, choose the plus sign next to Data types and choose Edit data types.
Ensure that Data Wrangler automatically inferred the correct data types for the data columns.
In our case, the inferred data types are correct. However, if a data type were inferred incorrectly, you could easily modify it through the UI, as shown in the following screenshot.
Let’s kick off the analysis and start adding transformations.
Data cleaning
We first perform several data cleaning transformations.
Drop column
Let’s start by dropping the unix column, because we use the date column as the index.
- Choose Back to data flow.
- Choose the plus sign next to Data types and choose Add transform.
- Choose + Add step in the TRANSFORMS pane.
- Choose Manage columns.
- For Transform, choose Drop column.
- For Column to drop, choose unix.
- Choose Preview.
- Choose Add to save the step.
Handle missing
Missing data is a well-known problem in real-world datasets. Therefore, it’s a best practice to verify the presence of any missing or null values and handle them appropriately. Our dataset doesn’t contain missing values. But if there were, we would use the Handle missing time series transform to fix them. Commonly used strategies for handling missing data include dropping rows with missing values or filling the missing values with reasonable estimates. Because time series data relies on a sequence of data points across time, filling missing values is the preferred approach. The process of filling missing values is referred to as imputation. The Handle missing time series transform allows you to choose from multiple imputation strategies.
- Choose + Add step in the TRANSFORMS pane.
- Choose the Time Series transform.
- For Transform, choose Handle missing.
- For Time series input type, choose Along column.
- For Method for imputing values, choose Forward fill.
The Forward fill method replaces the missing values with the non-missing values preceding the missing values.
Backward fill, Constant Value, Most common value, and Interpolate are other imputation strategies available in Data Wrangler. Interpolation techniques rely on neighboring values for filling missing values. Time series data often exhibits correlation between neighboring values, making interpolation an effective filling strategy. For additional details on the functions you can use for applying interpolation, refer to pandas.DataFrame.interpolate.
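If you prefer to see what these strategies look like in code, the following is a minimal pandas sketch, not the code Data Wrangler generates. The column names are assumptions based on the dataset description above.

import pandas as pd

# Load the raw minute-level data and parse the timestamp column (assumed column names)
df = pd.read_csv("Bitstamp_BTCUSD_2021_minute.csv", parse_dates=["date"])
df = df.sort_values("date").set_index("date")

# Forward fill: replace a missing value with the last preceding non-missing value
df["Volume USD"] = df["Volume USD"].ffill()

# Alternatively, interpolate using neighboring values, weighted by time distance
# df["Volume USD"] = df["Volume USD"].interpolate(method="time")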
Validate timestamp
In time series analysis, the timestamp column acts as the index column, around which the analysis revolves. Therefore, it’s essential to make sure the timestamp column doesn’t contain invalid or incorrectly formatted timestamp values. Because we’re using the date column as the timestamp column and index, let’s confirm its values are correctly formatted.
- Choose + Add step in the TRANSFORMS pane.
- Choose the Time Series transform.
- For Transform, choose Validate timestamps.
The Validate timestamps transform allows you to check that the timestamp column in your dataset doesn’t have values with an incorrect timestamp or missing values.
- For Timestamp Column, choose date.
- For Policy dropdown, choose Indicate.
The Indicate policy option creates a Boolean column indicating if the value in the timestamp column is a valid date/time format. Other options for Policy include:
- Error – Throws an error if the timestamp column is missing or invalid
- Drop – Drops the row if the timestamp column is missing or invalid
- Choose Preview.
A new Boolean column named date_is_valid was created, with true values indicating correct format and non-null entries. Our dataset doesn’t contain invalid timestamp values in the date column. But if it did, you could use the new Boolean column to identify and fix those values.
- Choose Add to save this step.
Time series visualization
After we clean and validate the dataset, we can better visualize the data to understand its different components.
Resample
Because we’re interested in daily predictions, let’s transform the frequency of data to daily.
The Resample transformation changes the frequency of the time series observations to a specified granularity, and comes with both upsampling and downsampling options. Applying upsampling increases the frequency of the observations (for example from daily to hourly), whereas downsampling decreases the frequency of the observations (for example from hourly to daily).
Because our dataset is at minute granularity, let’s use the downsampling option.
- Choose + Add step.
- Choose the Time Series transform.
- For Transform, choose Resample.
- For Timestamp, choose date.
- For Frequency unit, choose Calendar day.
- For Frequency quantity, enter 1.
- For Method to aggregate numeric values, choose mean.
- Choose Preview.
The frequency of our dataset has changed from per minute to daily.
- Choose Add to save this step.
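For reference, a rough pandas equivalent of the downsampling we just applied, continuing the sketch from the Handle missing section, looks like the following. It is an approximation, not the code Data Wrangler runs.

# Downsample from minute to daily granularity, aggregating numeric columns with the mean
daily_df = df.resample("1D").mean(numeric_only=True)
print(daily_df.head())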
Seasonal-Trend decomposition
After resampling, we can visualize the transformed series and its associated STL (Seasonal and Trend decomposition using LOESS) components using the Seasonal-Trend-decomposition visualization. This breaks down the original time series into distinct trend, seasonality, and residual components, giving us a good understanding of how each pattern behaves. We can also use this information when modelling forecasting problems.
Data Wrangler uses LOESS, a robust and versatile statistical method for modelling trend and seasonal components. Its underlying implementation uses polynomial regression for estimating nonlinear relationships present in the time series components (seasonality, trend, and residual).
- Choose Back to data flow.
- Choose the plus sign next to the Steps on Data Flow.
- Choose Add analysis.
- In the Create analysis pane, for Analysis type, choose Time Series.
- For Visualization, choose Seasonal-Trend decomposition.
- For Analysis Name, enter a name.
- For Timestamp column, choose date.
- For Value column, choose Volume USD.
- Choose Preview.
The analysis allows us to visualize the input time series and decomposed seasonality, trend, and residual.
- Choose Save to save the analysis.
With the seasonal-trend decomposition visualization, we can generate four patterns, as shown in the preceding screenshot:
- Original – The original time series re-sampled to daily granularity.
- Trend – The polynomial trend with an overall negative trend pattern for the year 2021, indicating a decrease in Volume USD value.
- Season – The multiplicative seasonality represented by the varying oscillation patterns. We see a decrease in seasonal variation, characterized by decreasing amplitude of oscillations.
- Residual – The remaining residual or random noise. The residual series is the resulting series after trend and seasonal components have been removed. Looking closely, we observe spikes between January and March, and between April and June, suggesting room for modelling such particular events using historical data.
These visualizations give data scientists and analysts valuable leads into existing patterns and can help you choose a modelling strategy. However, it’s always a good practice to validate the output of STL decomposition with the information gathered through descriptive analysis and domain expertise.
To summarize, we observe a downward trend consistent with the original series visualization, which increases our confidence in incorporating the information conveyed by the trend visualization into downstream decision-making. In contrast, although the seasonality visualization helps confirm the presence of seasonality and the need to remove it by applying techniques such as differencing, it doesn’t provide the desired level of detailed insight into the various seasonal patterns present, thereby requiring deeper analysis.
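If you want to reproduce a comparable decomposition outside Data Wrangler, the STL implementation in statsmodels is one option. This is an assumption about an equivalent approach, not Data Wrangler’s internal code; statsmodels STL is additive, so its output will differ from the multiplicative seasonality shown above, but the idea is the same.

from statsmodels.tsa.seasonal import STL

# Decompose the daily Volume USD series into trend, seasonal, and residual components
result = STL(daily_df["Volume USD"].dropna(), period=7).fit()
trend, seasonal, residual = result.trend, result.seasonal, result.resid
result.plot()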
Feature engineering
After we understand the patterns present in our dataset, we can start to engineer new features aimed at increasing the accuracy of the forecasting models.
Featurize datetime
Let’s start the feature engineering process with more straightforward date/time features. Date/time features are created from the timestamp column and provide a natural starting point for data scientists. We begin with the Featurize datetime time series transformation to add the month, day of the month, day of the year, week of the year, and quarter features to our dataset. Because we’re providing the date/time components as separate features, we enable ML algorithms to detect signals and patterns for improving prediction accuracy.
- Choose + Add step.
- Choose the Time Series transform.
- For Transform, choose Featurize datetime.
- For Input Column, choose date.
- For Output Column, enter date (this step is optional).
- For Output mode, choose Ordinal.
- For Output format, choose Columns.
- For date/time features to extract, select Month, Day, Week of year, Day of year, and Quarter.
- Choose Preview.
The dataset now contains new columns named date_month, date_day, date_week_of_year, date_day_of_year, and date_quarter. The information retrieved from these new features could help data scientists derive additional insights from the data and a better understanding of the relationship between the input features and the target.
- Choose Add to save this step.
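Continuing the pandas sketches above, roughly equivalent date/time features could be derived as follows. The column names mirror the ones Data Wrangler created, and note that pandas quarters run 1–4 while the ordinal output described above runs 0–3, so this is only an approximation.

# Derive ordinal date/time features from the daily timestamp
daily_df = daily_df.reset_index()
daily_df["date_month"] = daily_df["date"].dt.month
daily_df["date_day"] = daily_df["date"].dt.day
daily_df["date_week_of_year"] = daily_df["date"].dt.isocalendar().week.astype(int)
daily_df["date_day_of_year"] = daily_df["date"].dt.dayofyear
daily_df["date_quarter"] = daily_df["date"].dt.quarter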
Encode categorical
Date/time features aren’t limited to integer values. You may also choose to treat certain extracted date/time features as categorical variables and represent them as one-hot encoded features, with each column containing binary values. The newly created date_quarter column contains values between 0 and 3 and can be one-hot encoded into four binary columns. Let’s create four new binary features, each representing the corresponding quarter of the year.
- Choose + Add step.
- Choose the Encode categorical transform.
- For Transform, choose One-hot encode.
- For Input column, choose date_quarter.
- For Output style, choose Columns.
- Choose Preview.
- Choose Add to add the step.
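A rough pandas equivalent of this one-hot encoding step, continuing the sketch above, is shown below. It is only an illustration of what the transform produces.

import pandas as pd

# One-hot encode the quarter into four binary indicator columns
quarter_dummies = pd.get_dummies(daily_df["date_quarter"], prefix="date_quarter")
daily_df = pd.concat([daily_df, quarter_dummies], axis=1)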
Lag feature
Next, let’s create lag features for the target column Volume USD. Lag features in time series analysis are values at prior timestamps that are considered helpful in inferring future values. They also help identify autocorrelation (also known as serial correlation) patterns in the residual series by quantifying the relationship of the observation with observations at previous time steps. Autocorrelation is similar to regular correlation, but between the values in a series and its past values. It forms the basis for the autoregressive forecasting models in the ARIMA family.
With the Data Wrangler Lag feature transform, you can easily create lag features n periods apart. Additionally, we often want to create multiple lag features at different lags and let the model decide the most meaningful features. For such a scenario, the Lag features transform helps create multiple lag columns over a specified window size.
- Choose Back to data flow.
- Choose the plus sign next to the Steps on Data Flow.
- Choose + Add step.
- Choose Time Series transform.
- For Transform, choose Lag features.
- For Generate lag features for this column, choose Volume USD.
- For Timestamp Column, choose date.
- For Lag, enter 7.
- To create a new column for each lag value, select Flatten the output.
- Choose Preview.
Seven new columns are added for the target column Volume USD, each suffixed with its lag number.
- Choose Add to save the step.
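Conceptually, the lag features resemble the following pandas sketch. The generated column names in Data Wrangler will differ; this is only an illustration.

# Create lag features 1 through 7 for the target column
for lag in range(1, 8):
    daily_df[f"Volume USD_lag_{lag}"] = daily_df["Volume USD"].shift(lag)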
Rolling window features
We can also calculate meaningful statistical summaries across a range of values and include them as input features. Let’s extract common statistical time series features.
Data Wrangler implements automatic time series feature extraction capabilities using the open source tsfresh package. With the time series feature extraction transforms, you can automate the feature extraction process. This eliminates the time and effort otherwise spent manually implementing signal processing libraries. For this post, we extract features using the Rolling window features transform. This method computes statistical properties across a set of observations defined by the window size.
- Choose + Add step.
- Choose the Time Series transform.
- For Transform, choose Rolling window features.
- For Generate rolling window features for this column, choose Volume USD.
- For Timestamp Column, choose date.
- For Window size, enter 7.
Specifying a window size of 7 computes features by combining the value at the current timestamp and values for the previous seven timestamps.
- Select Flatten to create a new column for each computed feature.
- Choose your strategy as Minimal subset.
This strategy extracts eight features that are useful in downstream analyses. Other strategies include Efficient Subset, Custom subset, and All features. For a full list of the features available for extraction, refer to Overview on extracted features.
- Choose Preview.
We can see eight new columns, with the specified window size of 7 in their names, appended to our dataset.
- Choose Add to save the step.
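As a point of reference, the following pandas sketch approximates the kind of statistics in the minimal subset. It is not the tsfresh code Data Wrangler runs, and the output column names will not match.

# Rolling statistics over the current value plus the previous 7 observations
window = daily_df["Volume USD"].rolling(window=8)
stats = {
    "sum": window.sum(),
    "mean": window.mean(),
    "median": window.median(),
    "std": window.std(),
    "var": window.var(),
    "min": window.min(),
    "max": window.max(),
    "count": window.count(),
}
for name, values in stats.items():
    daily_df[f"Volume USD_rolling_{name}"] = values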
Export the dataset
We have transformed the time series dataset and are ready to use it as input for a forecasting algorithm. The last step is to export the transformed dataset to Amazon S3. In Data Wrangler, you can choose Export step to automatically generate a Jupyter notebook with Amazon SageMaker Processing code for processing and exporting the transformed dataset to an S3 bucket. However, because our dataset contains just over 300 records, let’s take advantage of the Export data option in the Add Transform view to export the transformed dataset directly to Amazon S3 from Data Wrangler.
- Choose Export data.
- For S3 location, choose Browse and choose your S3 bucket.
- Choose Export data.
Now that we have successfully transformed the bitcoin dataset, we can use Amazon Forecast to generate bitcoin predictions.
Clean up
If you’re done with this use case, clean up the resources you created to avoid incurring additional charges. For Data Wrangler, you can shut down the underlying instance when finished. Refer to the Shut Down Data Wrangler documentation for details. Alternatively, you can continue to Part 2 of this series to use this dataset for forecasting.
Summary
This post demonstrated how to use Data Wrangler to simplify and accelerate time series analysis with its built-in time series capabilities. We explored how data scientists can easily and interactively clean, format, validate, and transform time series data into the desired format for meaningful analysis. We also explored how you can enrich your time series analysis by adding a comprehensive set of statistical features using Data Wrangler. To learn more about time series transformations in Data Wrangler, see Transform Data.
About the Author
Roop Bains is a Solutions Architect at AWS focusing on AI/ML. He is passionate about helping customers innovate and achieve their business objectives using Artificial Intelligence and Machine Learning. In his spare time, Roop enjoys reading and hiking.
Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.
Alexa Prize has a new home
Amazon Science is now the destination for information on the SocialBot, TaskBot, and SimBot challenges, including FAQs, team updates, publications, and other program information.
Automate a shared bikes and scooters classification model with Amazon SageMaker Autopilot
Amazon SageMaker Autopilot makes it possible for organizations to quickly build and deploy an end-to-end machine learning (ML) model and inference pipeline with just a few lines of code or even without any code at all with Amazon SageMaker Studio. Autopilot offloads the heavy lifting of configuring infrastructure and the time it takes to build an entire pipeline, including feature engineering, model selection, and hyperparameter tuning.
In this post, we show how to go from raw data to a robust and fully deployed inference pipeline with Autopilot.
Solution overview
We use Lyft’s public dataset on bikesharing for this simulation to predict whether or not a user participates in the Bike Share for All program. This is a simple binary classification problem.
We want to showcase how easy it is to build an automated and real-time inference pipeline to classify users based on their participation in the Bike Share for All program. To this end, we simulate an end-to-end data ingestion and inference pipeline for an imaginary bikeshare company operating in the San Francisco Bay Area.
The architecture is broken down into two parts: the ingestion pipeline and the inference pipeline.
We primarily focus on the ML pipeline in the first section of this post, and review the data ingestion pipeline in the second part.
Prerequisites
To follow along with this example, complete the following prerequisites:
- Create a new SageMaker notebook instance.
- Create an Amazon Kinesis Data Firehose delivery stream with an AWS Lambda transform function. For instructions, see Amazon Kinesis Firehose Data Transformation with AWS Lambda. This step is optional and only needed to simulate data streaming.
Data exploration
Let’s download and visualize the dataset, which is located in a public Amazon Simple Storage Service (Amazon S3) bucket and static website:
The following screenshot shows a subset of the data before transformation.
The last column of the data contains the target we want to predict, which is a binary variable taking either a Yes or No value, indicating whether the user participates in the Bike Share for All program.
Let’s take a look at the distribution of our target variable for any data imbalance.
As shown in the graph above, the data is imbalanced, with fewer people participating in the program.
We need to balance the data to prevent an over-representation bias. This step is optional because Autopilot also offers an internal approach to handle class imbalance automatically, which defaults to an F1 score validation metric. Additionally, if you choose to balance the data yourself, you can use more advanced techniques for handling class imbalance, such as SMOTE or GAN.
For this post, we downsample the majority class (No) as a data balancing technique.
The following code enriches the data and under-samples the overrepresented class:
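The original code isn’t reproduced here; the following is a rough sketch of how such under-sampling might look in pandas. The file name and the target column name (bike_share_for_all_trip) are assumptions.

import pandas as pd

df = pd.read_csv("bike_trips.csv")  # hypothetical local copy of the Lyft dataset

majority = df[df["bike_share_for_all_trip"] == "No"]
minority = df[df["bike_share_for_all_trip"] == "Yes"]

# Randomly down-sample the majority class to the size of the minority class, then shuffle
majority_downsampled = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_downsampled, minority]).sample(frac=1, random_state=42)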
We deliberately left our categorical features not encoded, including our binary target value. This is because Autopilot takes care of encoding and decoding the data for us as part of the automatic feature engineering and pipeline deployment, as we see in the next section.
The following screenshot shows a sample of our data.
The data in the following graphs looks otherwise normal, with a bimodal distribution representing the two peaks for the morning hours and the afternoon rush hours, as you would expect. We also observe low activities on weekends and at night.
In the next section, we feed the data to Autopilot so that it can run an experiment for us.
Build a binary classification model
Autopilot requires that we specify the input and output destination buckets. It uses the input bucket to load the data and the output bucket to save the artifacts, such as feature engineering and the generated Jupyter notebooks. We retain 5% of the dataset to evaluate and validate the model’s performance after the training is complete and upload 95% of the dataset to the S3 input bucket. See the following code:
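The original code isn’t shown here; a minimal sketch of the split and upload might look like the following, assuming the balanced DataFrame from the previous step and using the session’s default bucket as a placeholder.

from sklearn.model_selection import train_test_split
import sagemaker

session = sagemaker.Session()
input_bucket = session.default_bucket()  # placeholder; substitute your own bucket

# Hold out 5% for evaluation after training; upload the remaining 95% for Autopilot
train_df, test_df = train_test_split(balanced, test_size=0.05, random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

train_uri = session.upload_data("train.csv", bucket=input_bucket, key_prefix="autopilot/input")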
After we upload the data to the input destination, it’s time to start Autopilot:
All we need to start experimenting is to call the fit() method. Autopilot needs the input and output S3 location and the target attribute column as the required parameters. After feature processing, Autopilot calls SageMaker automatic model tuning to find the best version of a model by running many training jobs on your dataset. We added the optional max_candidates parameter to limit the number of candidates to 30, which is the number of training jobs that Autopilot launches with different combinations of algorithms and hyperparameters in order to find the best model. If you don’t specify this parameter, it defaults to 250.
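A minimal sketch of this call with the SageMaker Python SDK follows. The job name and target column name are assumptions.

from sagemaker.automl.automl import AutoML

automl = AutoML(
    role=sagemaker.get_execution_role(),
    target_attribute_name="bike_share_for_all_trip",  # assumed target column name
    sagemaker_session=session,
    max_candidates=30,
)
automl.fit(inputs=train_uri, job_name="bike-share-autopilot", wait=False, logs=False)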
We can observe the progress of Autopilot with the following code:
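The original snippet isn’t shown; a simple polling loop like the following sketch serves the same purpose.

import time

# Poll the job status until Autopilot finishes
while True:
    desc = automl.describe_auto_ml_job(job_name="bike-share-autopilot")
    status = desc["AutoMLJobStatus"]
    print(status, desc.get("AutoMLJobSecondaryStatus"))
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)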
The training takes some time to complete. While it’s running, let’s look at the Autopilot workflow.
To find the best candidate, use the following code:
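A minimal sketch, assuming the job name used above:

best_candidate = automl.best_candidate(job_name="bike-share-autopilot")
print(best_candidate["CandidateName"])
print(best_candidate["FinalAutoMLJobObjectiveMetric"])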
The following screenshot shows our output.
Our model achieved a validation accuracy of 96%, so we’re going to deploy it. We could add a condition such that we only use the model if the accuracy is above a certain level.
Inference pipeline
Before we deploy our model, let’s examine our best candidate and what’s happening in our pipeline. See the following code:
The following diagram shows our output.
Autopilot has built the model and has packaged it in three different containers, each sequentially running a specific task: transform, predict, and reverse-transform. This multi-step inference is possible with a SageMaker inference pipeline.
A multi-step inference can also chain multiple inference models. For instance, one container can perform principal component analysis before passing the data to the XGBoost container.
Deploy the inference pipeline to an endpoint
The deployment process involves just a few lines of code:
Let’s configure our endpoint for prediction with a predictor:
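The original deployment code isn’t shown; the following sketch deploys the best candidate and wires up a predictor that sends and receives CSV. The endpoint name and instance type are assumptions.

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

endpoint_name = "bike-share-autopilot-endpoint"  # hypothetical endpoint name
automl.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name=endpoint_name,
)

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session,
    serializer=CSVSerializer(),      # send CSV rows
    deserializer=CSVDeserializer(),  # parse the CSV response
)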
Now that we have our endpoint and predictor ready, it’s time to use the testing data we set aside and test the accuracy of our model. We start by defining a utility function that sends the data one line at a time to our inference endpoint and gets a prediction in return. Because we have an XGBoost model, we drop the target variable before sending the CSV line to the endpoint. Additionally, we remove the header from the testing CSV before looping through the file, which is another requirement for XGBoost on SageMaker. See the following code:
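The following sketch approximates that utility loop, assuming the test.csv file and target column name from the earlier sketches.

import csv

predictions = []
with open("test.csv") as f:
    reader = csv.reader(f)
    header = next(reader)  # skip the header, which XGBoost doesn't expect
    target_index = header.index("bike_share_for_all_trip")
    for row in reader:
        actual = row.pop(target_index)          # drop the target before sending
        response = predictor.predict(",".join(row))
        predicted = response[0][0]              # CSVDeserializer returns parsed CSV rows
        predictions.append((actual, predicted))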
The following screenshot shows our output.
Now let’s calculate the accuracy of our model.
See the following code:
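A minimal sketch, using the list of (actual, predicted) pairs collected above:

correct = sum(1 for actual, predicted in predictions if actual == predicted)
accuracy = correct / len(predictions)
print(f"Test accuracy: {accuracy:.2%}")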
We get an accuracy of 92%. This is slightly lower than the 96% obtained during the validation step, but it’s still high enough. We don’t expect the accuracy to be exactly the same because the test is performed with a new dataset.
Data ingestion
We downloaded the data directly and configured it for training. In real life, you may have to send the data directly from the edge device into the data lake and have SageMaker load it directly from the data lake into the notebook.
Kinesis Data Firehose is a good option and the most straightforward way to reliably load streaming data into data lakes, data stores, and analytics tools. It can capture, transform, and load streaming data into Amazon S3 and other AWS data stores.
For our use case, we create a Kinesis Data Firehose delivery stream with a Lambda transformation function to do some lightweight data cleaning as it traverses the stream. See the following code:
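The original function isn’t reproduced here; the following is a minimal sketch of a Kinesis Data Firehose transformation Lambda handler. The specific cleaning logic is an assumption.

import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")

        # Lightweight cleaning: strip whitespace around each CSV field
        cleaned = ",".join(field.strip() for field in payload.split(",")) + "\n"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(cleaned.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}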
This Lambda function performs light transformation of the data streamed from the devices onto the data lake. It expects a CSV formatted data file.
For the ingestion step, we download the data and simulate streaming it to Kinesis Data Firehose, through the Lambda transform function, and into our S3 data lake.
Let’s simulate streaming a few lines:
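The original snippet isn’t shown; a sketch of such a simulation with Boto3 follows. The file name and delivery stream name are placeholders.

import boto3

firehose = boto3.client("firehose")
with open("bike_trips.csv") as f:
    next(f)  # skip the header
    for i, line in enumerate(f):
        firehose.put_record(
            DeliveryStreamName="bike-share-stream",  # hypothetical delivery stream name
            Record={"Data": line},
        )
        if i >= 99:  # stream only a handful of lines for the simulation
            break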
Clean up
It’s important to delete all the resources used in this exercise to minimize cost. The following code deletes the SageMaker inference endpoint we created as well as the training and testing data we uploaded:
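A minimal cleanup sketch, assuming the predictor, bucket, and prefix from the earlier sketches:

import boto3

# Delete the inference endpoint
predictor.delete_endpoint()

# Remove the uploaded training and testing objects
s3 = boto3.resource("s3")
s3.Bucket(input_bucket).objects.filter(Prefix="autopilot/input").delete()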
Conclusion
ML engineers, data scientists, and software developers can use Autopilot to build and deploy an inference pipeline with little to no ML programming experience. Autopilot saves time and resources, using data science and ML best practices. Large organizations can now shift engineering resources away from infrastructure configuration towards improving models and solving business use cases. Startups and smaller organizations can get started on machine learning with little to no ML expertise.
We recommend learning more about other important features SageMaker has to offer, such as the Amazon SageMaker Feature Store, which integrates with Amazon SageMaker Pipelines to create, add feature search and discovery, and reuse automated ML workflows. You can run multiple Autopilot simulations with different feature or target variants in your dataset. You could also approach this as a dynamic vehicle allocation problem in which your model tries to predict vehicle demand based on time (such as time of day or day of the week) or location, or a combination of both.
About the Authors
Doug Mbaya is a Senior Solution architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solution in the cloud. Doug’s prior experience includes supporting AWS customers in the ride sharing and food delivery segment.
Valerio Perrone is an Applied Science Manager working on Amazon SageMaker Automatic Model Tuning and Autopilot.
Amazon Scholar Ranjit Jhala named ACM Fellow
Jhala received the ACM honor for lifetime contributions to software verification, developing innovative tools to help computer programmers test their code.
Apply profanity masking in Amazon Translate
Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. This post shows how you can mask profane words and phrases with a grawlix string (“?$#@$”).
Amazon Translate typically chooses clean words for your translation output. But in some situations, you want to prevent words that are commonly considered profane from appearing in the translated output. For example, when you’re translating video captions or subtitle content, or enabling in-game chat, you may want the translated content to be age appropriate and clear of any profanity. For these cases, Amazon Translate allows you to mask profane words and phrases using the profanity masking setting. You can apply profanity masking to both real-time translation and asynchronous batch processing in Amazon Translate. When using Amazon Translate with profanity masking enabled, the five-character sequence ?$#@$ is used to mask each profane word or phrase, regardless of the number of characters. Amazon Translate detects each profane word or phrase literally, not contextually.
Solution overview
To mask profane words and phrases in your translation output, you can enable the profanity option under the additional settings on the Amazon Translate console when you run the translations with Amazon Translate both through real-time and asynchronous batch processing requests. The following sections demonstrate using profanity masking for real-time translation requests via the Amazon Translate console, AWS Command Line Interface (AWS CLI), or with the Amazon Translate SDK (Python Boto3).
Amazon Translate console
To demonstrate handling profanity with real-time translation, we use the following sample text in French to be translated into English:
Complete the following steps on the Amazon Translate console:
- Choose French (fr) as the Source language.
- Choose English (en) as the Target Language.
- Enter the preceding example text in the Source Language text area.
The translated text appears under Target language. It contains a word that is considered profane in English.
- Expand Additional settings and enable Profanity.
The word is now replaced with the grawlix string ?$#@$.
AWS CLI
Calling the translate-text AWS CLI command with --settings Profanity=MASK masks profane words and phrases in your translated text.
The following AWS CLI commands are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).
You get a response like the following snippet:
Amazon Translate SDK (Python Boto3)
The following Python 3 code uses the real-time translation call with the profanity setting:
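The original snippet isn’t reproduced here; a minimal sketch of such a call with Boto3 is shown below. The source text variable stands in for the French sample shown earlier.

import boto3

translate = boto3.client("translate")

source_text = "..."  # the French sample text shown earlier (not reproduced here)

response = translate.translate_text(
    Text=source_text,
    SourceLanguageCode="fr",
    TargetLanguageCode="en",
    Settings={"Profanity": "MASK"},  # mask profane words with the ?$#@$ grawlix
)
print(response["TranslatedText"])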
Conclusion
You can use the profanity masking setting to mask words and phrases that are considered profane to keep your translated text clean and meet your business requirements. To learn more about all the ways you can customize your translations, refer to Customizing Your Translations using Amazon Translate.
About the Authors
Siva Rajamani is a Boston-based Enterprise Solutions Architect at AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoors activities and watching documentaries.
Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, machine learning, and security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.
Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.
How Süddeutsche Zeitung optimized their audio narration process with Amazon Polly
This is a guest post by Jakob Kohl, a Software Developer at the Süddeutsche Zeitung. Süddeutsche Zeitung is one of the leading quality dailies in Germany when it comes to paid subscriptions and unique users. Its website, SZ.de, reaches more than 15 million monthly unique users as of October 2021.
Thanks to smart speakers and podcasts, the audio industry has experienced a real boom in recent years. At Süddeutsche Zeitung, we’re constantly looking for new ways to make our diverse journalism even more accessible. As pioneers in digital journalism, we want to open up more opportunities for Süddeutsche Zeitung readers to consume articles. We started looking for solutions that could provide high-quality audio narration for our articles. Our ultimate goal was to launch a “listen to the article” feature.
In this post, we share how we optimized our audio narration process with Amazon Polly, a service that turns text into lifelike speech using advanced deep learning technologies.
Why Amazon Polly?
We believe that Vicki, the German neural Amazon Polly voice, is currently the best German voice on the market. Amazon Polly offers an impressive ability to switch between languages, correctly pronouncing, for example, English movie titles as well as personal names in different languages (for an example, listen to the article Schall und Wahn on our website).
A big part of our infrastructure already runs on AWS, so using Amazon Polly was a perfect fit. We can combine Amazon Polly with the following components:
- An Amazon Simple Notification Service (Amazon SNS) topic to which we can subscribe for articles. The articles are sent to this topic by the CMS whenever they’re saved by an editor.
- An Amazon CloudFront distribution with Lambda@Edge to paywall premium articles, which we can reuse for audio versions of articles.
The Amazon Polly API is easy to use and well documented. It took us less than a week to get our proof of concept to work.
The challenge
Hundreds of new articles are published every day on SZ.de. After initial publication, they might get updated several times for various reasons—new paragraphs are added in news-driven articles, typos are fixed, teasers are changed, or metadata is optimized for search engines.
Generating speech for the initial publication of an article is straightforward, because the whole text needs to be synthesized. But how can we quickly generate the audio for updated versions of articles without paying twice for the same content? Our biggest challenge was to prevent sending the whole text to Amazon Polly repeatedly for every single update.
Our technical solution
Every time an editor saves an article, the new version of the article is published to an SNS topic. An AWS Lambda function is subscribed to this topic and called for every new version of an article. This function runs the following steps (a simplified sketch of the core logic follows the list):
- Check if the new version of the article has already been completely synthesized. If so, the function stops immediately (this may happen when only metadata is changed that doesn’t affect the audio).
- Convert the article into multiple SSML documents, roughly one for each text paragraph.
- For each SSML document, the function checks if it has already been synthesized to audio using calculated hashes. For example:
- If an article is saved for the first time, all SSML documents must be synthesized.
- If a typo has been fixed in a single paragraph, only the SSML document for this paragraph must be re-synthesized.
- If a new paragraph is added to the article, only the SSML document for this new paragraph must be synthesized.
- Send all not-yet-synthesized SSML documents separately to Amazon Polly.
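The article’s actual implementation isn’t shown; the following sketch only illustrates the core idea of steps 3 and 4: hash each SSML fragment, skip fragments whose audio already exists, and synthesize the rest with the neural Vicki voice. The bucket layout, key naming, and lack of error handling are assumptions.

import hashlib
import boto3

polly = boto3.client("polly")
s3 = boto3.client("s3")

def synthesize_changed_fragments(ssml_documents, article_id, bucket):
    for index, ssml in enumerate(ssml_documents):
        fragment_hash = hashlib.sha256(ssml.encode("utf-8")).hexdigest()
        key_prefix = f"{article_id}/{index}-{fragment_hash}"

        # Skip fragments that were already synthesized for this exact content
        existing = s3.list_objects_v2(Bucket=bucket, Prefix=key_prefix)
        if existing.get("KeyCount", 0) > 0:
            continue

        polly.start_speech_synthesis_task(
            Engine="neural",
            VoiceId="Vicki",
            TextType="ssml",
            Text=ssml,
            OutputFormat="mp3",
            OutputS3BucketName=bucket,
            OutputS3KeyPrefix=key_prefix,
        )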
These checks help optimize performance and reduce cost by preventing the synthesis of an entire article multiple times. We avoid incurring additional charges due to minor changes such as a title edit or metadata adjustments for SEO reasons.
The following diagram illustrates the solution workflow.
After Amazon Polly synthesizes the SSML documents, the audio files are sent to an output bucket in Amazon Simple Storage Service (Amazon S3). A second Lambda function is listening for object creation on that bucket, waits for the completion of all audio fragments of an article, and merges them into a final audio file using FFmpeg from a Lambda layer. This final audio is sent to another S3 bucket, which is used as the origin in our CloudFront distribution. In CloudFront, we reuse an existing paywall for premium articles for the corresponding audio version.
Based on our freemium model, we provide a shortened audio version of premium articles. Non-subscribers are able to listen to the first paragraph for free, but are required to purchase a subscription to access the full article.
Conclusion
Integration of Amazon Polly into our existing infrastructure was very straightforward. Our content requires minimal customization because we only include paragraphs and some additional breaks. The most challenging part was performance and cost optimization, which we achieved by splitting the article up into multiple SSML documents corresponding to paragraphs, checking for changes in each SSML document, and building the whole audio file by merging the fragments. With these optimizations, we are able to achieve the following:
- Decrease the number of synthesized characters by at least 50% by only synthesizing real changes.
- Reduce the time it takes for a change in the article text to appear in the audio because there is less audio to synthesize.
- Add arbitrary audio files between paragraphs without re-synthesizing the whole article. For example, we can include a sound file in the shortened audio version of a premium article to separate the first paragraph from the ensuing note that a subscription is needed to listen to the full version.
In the first month after the launch of the “listen to the article” feature in our SZ.de articles, we received a lot of positive user feedback. We were able to reach almost 30,000 users during the first 2 months after launch. From these users, approximately 200 converted into a paid subscription only from listening to the teaser of an article behind our paywall. The “listen to the article” feature isn’t behind our paywall, but users can only listen to premium articles fully if they have a subscription. Our website also offers free articles without a paywall. In the future, we will expand the feature to other SZ platforms, especially our mobile news apps.
About the Author
Jakob Kohl is a Software Developer at the Süddeutsche Zeitung, where he enjoys working with modern technologies on an agile website team. He is one of the main developers of the “listen to an SZ article” feature. In his leisure time, he likes building wooden furniture, where technical and visual design is as important as in web development.
Alexa AI team discusses NeurIPS workshop best paper award
Paper deals with detecting and answering out-of-domain requests for task-oriented dialogue systems.
Reduce costs and complexity of ML preprocessing with Amazon S3 Object Lambda
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Often, customers have objects in S3 buckets that need further processing to be used effectively by consuming applications. Data engineers must support these application-specific data views with trade-offs between persisting derived copies or transforming data at the consumer level. Neither solution is ideal because it introduces operational complexity, causes data consistency challenges, and wastes more expensive computing resources.
These trade-offs broadly apply to many machine learning (ML) pipelines that train on unstructured data, such as audio, video, and free-form text, among other sources. In each example, the training job must download data from S3 buckets, prepare an application-specific view, and then use an AI algorithm. This post demonstrates a design pattern for reducing costs, complexity, and centrally managing this second step. It uses the concrete example of image processing, though the approach broadly applies to any workload. The economic benefits are also most pronounced when the transformation step doesn’t require GPU, but the AI algorithm does.
The proposed solution also centralizes data transformation code and enables just-in-time (JIT) transformation. Furthermore, the approach uses a serverless infrastructure to reduce operational overhead and undifferentiated heavy lifting.
Solution overview
When ML algorithms process unstructured data like images and video, they require various normalization tasks (such as grey-scaling and resizing). This step exists to accelerate model convergence, avoid overfitting, and improve prediction accuracy. You often perform these preprocessing steps on the instances that later run the AI training. That approach creates inefficiencies, because those resources typically have more expensive processors (for example, GPUs) than these tasks require. Instead, our solution externalizes those operations across economical, horizontally scalable Amazon S3 Object Lambda functions.
This design pattern has three critical benefits. First, it centralizes the shared data transformation steps, such as image normalization, and removes ML pipeline code duplication. Second, S3 Object Lambda functions avoid data consistency issues in derived data through JIT conversions. Third, the serverless infrastructure reduces operational overhead, improves access times, and limits costs to the per-millisecond time running your code.
An elegant solution exists in which you can centralize these data preprocessing and data conversion operations with S3 Object Lambda. S3 Object Lambda enables you to add code that modifies data from Amazon S3 before returning it to an application. The code runs within an AWS Lambda function, a serverless compute service. Lambda can instantly scale to tens of thousands of parallel runs while supporting dozens of programming languages and even custom containers. For more information, see Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3.
The following diagram illustrates the solution architecture.
In this solution, you have an S3 bucket that contains the raw images to be processed. Next, you create an S3 Access Point for these images. If you build multiple ML models, you can create separate S3 Access Points for each model. Alternatively, AWS Identity and Access Management (IAM) policies for access points support sharing reusable functions across ML pipelines. Then you attach a Lambda function that has your preprocessing business logic to the S3 Access Point. After you retrieve the data, you call the S3 Access Point to perform JIT data transformations. Finally, you update your ML model to use the new S3 Object Lambda Access Point to retrieve data from Amazon S3.
Create the normalization access point
This section walks through the steps to create the S3 Object Lambda access point.
Raw data is stored in an S3 bucket. To provide the user with the right set of permissions to access this data, while avoiding complex bucket policies that can cause unexpected impact to another application, you need to create S3 Access Points. S3 Access Points are unique host names that you can use to reach S3 buckets. With S3 Access Points, you can create individual access control policies for each access point to control access to shared datasets easily and securely.
- Create your access point.
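If you prefer to script this step, the equivalent Boto3 call looks roughly like the following. The account ID, access point name, and bucket name are placeholders.

import boto3

s3control = boto3.client("s3control")
s3control.create_access_point(
    AccountId="111122223333",        # your AWS account ID
    Name="raw-images-ap",            # placeholder access point name
    Bucket="my-raw-images-bucket",   # bucket holding the raw images
)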
- Create a Lambda function that performs the image resizing and conversion. See the following Python code:
import boto3
import cv2
import numpy as np
import requests
import io

def lambda_handler(event, context):
    print(event)
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]

    # Get object from S3
    response = requests.get(s3_url)
    nparr = np.frombuffer(response.content, np.uint8)  # np.fromstring is deprecated for binary data
    img = cv2.imdecode(nparr, flags=1)

    # Transform object: resize and convert to grayscale
    new_shape = (256, 256)
    resized = cv2.resize(img, new_shape, interpolation=cv2.INTER_AREA)
    gray_scaled = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)

    # Re-encode the transformed image as JPEG
    is_success, buffer = cv2.imencode(".jpg", gray_scaled)
    if not is_success:
        raise ValueError('Unable to imencode()')
    transformed_object = io.BytesIO(buffer).getvalue()

    # Write object back to S3 Object Lambda
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=transformed_object,
        RequestRoute=request_route,
        RequestToken=request_token)

    return {'status_code': 200}
- Create an Object Lambda access point using the supporting access point from Step 1.
The Lambda function uses the supporting access point to download the original objects.
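Scripted with Boto3, this step might look like the following sketch. The account ID and ARNs are placeholders.

import boto3

s3control = boto3.client("s3control")
s3control.create_access_point_for_object_lambda(
    AccountId="111122223333",
    Name="image-normalizer",
    Configuration={
        "SupportingAccessPoint": "arn:aws:s3:us-west-2:111122223333:accesspoint/raw-images-ap",
        "TransformationConfigurations": [{
            "Actions": ["GetObject"],
            "ContentTransformation": {
                "AwsLambda": {
                    "FunctionArn": "arn:aws:lambda:us-west-2:111122223333:function:image-normalizer"
                }
            },
        }],
    },
)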
- Update Amazon SageMaker to use the new S3 Object Lambda access point to retrieve data from Amazon S3. See the following bash code:
aws s3api get-object --bucket arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint/image-normalizer --key images/test.png test.png
Cost savings analysis
Traditionally, ML pipelines copy images and other files from Amazon S3 to SageMaker instances and then perform normalization. However, performing these transformations on training instances is inefficient. Lambda functions, by contrast, scale horizontally to handle the burst and then shrink elastically, charging per millisecond only while the code runs. Many preprocessing steps don’t require GPUs and can even use ARM64. That creates an incentive to move that processing to more economical compute such as Lambda functions powered by AWS Graviton2 processors.
Using an example from the Lambda pricing calculator, you can configure the function with 256 MB of memory and compare the costs for both x86 and Graviton (ARM64). We chose this size because it’s sufficient for many single-image data preparation tasks. Next, use the SageMaker pricing calculator to compute expenses for an ml.p2.xlarge instance. This is the smallest supported SageMaker training instance with GPU support. These results show up to 90% compute savings for operations that don’t use GPUs and can shift to Lambda. The following table summarizes these findings.
| | Lambda with x86 | Lambda with Graviton2 (ARM) | SageMaker ml.p2.xlarge |
| --- | --- | --- | --- |
| Memory (GB) | 0.25 | 0.25 | 61 |
| CPU | — | — | 4 |
| GPU | — | — | 1 |
| Cost/hour | $0.061 | $0.049 | $0.90 |
Conclusion
You can build modern applications to unlock insights into your data. These different applications have unique data view requirements, such as formatting and preprocessing actions. Addressing these other use cases can result in data duplication, increasing costs, and more complexity to maintain consistency. This post offers a solution for efficiently handling these situations using S3 Object Lambda functions.
Not only does this remove the need for duplication, but it also creates a path to scale these actions horizontally across less expensive compute! Even optimizing the transformation code for the ml.p2.xlarge instance would still be significantly more costly because of the idle GPUs.
For more ideas on using serverless and ML, see Machine learning inference at scale using AWS serverless and Deploying machine learning models with serverless templates.
About the Authors
Nate Bachmeier is an AWS Senior Solutions Architect who nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing customers’ workloads. Besides this, Nate is a full-time student and has two kids.
Marvin Fernandes is a Solutions Architect at AWS, based in the New York City area. He has over 20 years of experience building and running financial services applications. He is currently working with large enterprise customers to solve complex business problems by crafting scalable, flexible, and resilient cloud architectures.
Automated reasoning’s scientific frontiers
Distributing proof search, reasoning about distributed systems, and automating regulatory compliance are just three fruitful research areas.