Method using hyperboloid embeddings improves on methods that use vector embeddings by up to 33%.
Amazon at WSDM: The future of graph neural networks
Amazon’s George Karypis will give a keynote address on graph neural networks, a field in which “there is some fundamental theoretical stuff that we still need to understand.”
How SIGNAL IDUNA operationalizes machine learning projects on AWS
This post is co-authored with Jan Paul Assendorp, Thomas Lietzow, Christopher Masch, Alexander Meinert, Dr. Lars Palzer, Jan Schillemans of SIGNAL IDUNA.
At SIGNAL IDUNA, a large German insurer, we are currently reinventing ourselves with our transformation program VISION2023 to become even more customer oriented. Two aspects are central to this transformation: the reorganization of large parts of the workforce into cross-functional, agile teams, and becoming a truly data-driven company. Here, the motto “You build it, you run it” is an important requirement for a cross-functional team that builds a data or machine learning (ML) product. This places tight constraints on how much work a team can spend to productionize and run a product.
This post shows how SIGNAL IDUNA tackles this challenge and utilizes the AWS Cloud to enable cross-functional teams to build and operationalize their own ML products. To this end, we first introduce the organizational structure of agile teams, which sets the central requirements for the cloud infrastructure used to develop and run a product. Next, we show how three central teams at SIGNAL IDUNA enable cross-functional teams to build data products in the AWS Cloud with minimal assistance, by providing a suitable workflow and infrastructure solutions that can easily be used and adapted. Finally, we review our approach and compare it with a more classical approach where development and operation are separated more strictly.
Agile@SI – the Foundation of Organizational Change
Since the start of 2021, SIGNAL IDUNA has been putting its Agile@SI strategy into action and establishing agile methods for developing customer-oriented solutions across the entire company [1]. Previous tasks and goals are now undertaken by cross-functional teams, called squads. These squads employ agile methods (such as the Scrum framework), make their own decisions, and build customer-oriented products. Typically, the squads are located in business divisions, such as marketing, and many have a strong emphasis on building data-driven and ML-powered products. Typical use cases in insurance include customer churn prediction and product recommendation.
Due to the complexity of ML, creating an ML solution within a single squad is challenging, so it requires the collaboration of different squads.
SIGNAL IDUNA has three essential teams that support creating ML solutions. Surrounded by these three squads is the team that is responsible for the development and the long-term operation of the ML solution. This approach follows the AWS shared responsibility model [3].
In the image above, all of the squads are represented in an overview.
Cloud Enablement
The underlying cloud infrastructure for the entire organization is provided by the squad Cloud Enablement. Its task is to enable the teams to build products on cloud technologies on their own. This improves the time to market for new products, such as ML products, and it follows the principle of “You build it, you run it”.
Data Office/Data Lake
Moving data into the cloud, as well as finding the right dataset, is supported by the squad Data Office/Data Lake. They set up a data catalogue that can be used to search and select required datasets. Their aim is to establish data transparency and governance. Additionally, they are responsible for establishing and operating a Data Lake that helps teams to access and process relevant data.
Data Analytics Platform
Our squad Data Analytics Platform (DAP) is a cloud- and ML-focused team at SIGNAL IDUNA that is proficient in ML engineering, data engineering, and data science. We enable internal teams to use the public cloud for ML by providing infrastructure components and knowledge. Our products and services are presented in detail in the following section.
Enabling Cross-Functional Teams to Build ML Solutions
To enable cross-functional teams at SIGNAL IDUNA to build ML solutions, we need a fast and versatile way to provision reusable cloud infrastructure as well as an efficient workflow for onboarding teams to utilize the cloud capabilities.
To this end, we created a standardized onboarding and support process, and provided modular infrastructure templates as Infrastructure as Code (IaC). These templates contain infrastructure components designed for common ML use cases that can be easily tailored to the requirements of a specific use case.
The Workflow of Building ML Solutions
There are three main technical roles involved in building and operating ML solutions: the data scientist, the ML engineer, and the data engineer. Each role is part of the cross-functional squad and has different responsibilities. The data scientist has the required domain knowledge of the functional and technical requirements of the use case. The ML engineer specializes in building automated ML solutions and model deployment. The data engineer makes sure that data flows from on-premises systems into and within the cloud.
The process of providing the platform is as follows:
The infrastructure of the specific use case is defined in IaC and versioned in a central project repository. This also includes pipelines for model training and deployment, as well as other data science related code artifacts. Data scientists, ML engineers, and data engineers have access to the project repository and can configure and update all of the infrastructure code autonomously. This enables the team to rapidly alter the infrastructure if needed. However, the ML engineer can always support in developing and updating infrastructure or ML models.
Reusable and Modular Infrastructure Components
The hierarchical and modular IaC resources are implemented in Terraform and include infrastructure for common data science and ETL use cases. This lets us reuse infrastructure code and enforce required security and compliance policies, such as using AWS Key Management Service (KMS) encryption for data, as well as encapsulating infrastructure in Amazon Virtual Private Cloud (VPC) environments without direct internet access.
The hierarchical IaC structure is as follows:
- Modules encapsulate basic AWS services with the required configuration for security and access management. This includes best practice configurations such as the prevention of public access to Amazon Simple Storage Service (S3) buckets, or enforcing encryption for all files stored.
- In some cases, you need a variety of services to automate processes, such as to deploy ML models in different stages. Therefore, we defined Solutions as a bundle of different modules in a joint configuration for different types of tasks.
- In addition, we offer complete Blueprints that combine solutions in different environments to meet the many potential needs of a project. In our MLOps blueprint, we define a deployable infrastructure for training, provisioning, and monitoring ML models that are integrated and distributed in AWS accounts. We discuss further details in the next section.
These products are versioned in a central repository by the DAP squad. This lets us continuously improve our IaC and consider new features from AWS, such as Amazon SageMaker Model Registry. Each squad can reference these resources, parameterize them as needed, and finally deploy them in their own AWS accounts.
MLOps Architecture
We provide a ready-to-use blueprint with specific solutions to cover the entire MLOps process. The blueprint contains infrastructure distributed over four AWS accounts for building and deploying ML models. This lets us isolate resources and workflows for the different steps in the MLOps process. The following figure shows the multi-account architecture, and we describe how the responsibility over specific steps of the process is divided between the different technical roles.
The modeling account includes services for the development of ML models. First, the data engineer employs an ETL process to provide relevant data from the SIGNAL IDUNA data lake, the centralized gateway for data-driven workflows in the AWS Cloud. Subsequently, the dataset can be utilized by the data scientist to train and evaluate model candidates. Once ready for extensive experiments, a model candidate is integrated into an automated training pipeline by the ML engineer. We use Amazon SageMaker Pipelines to automate training, hyperparameter tuning, and model evaluation at scale. This also includes model lineage and a standardized approval mechanism for models to be staged for deployment into production. Automated unit tests and code analysis ensure quality and reliability of the code for each step of the pipeline, such as data preprocessing, model training, and evaluation. Once a model is evaluated and approved, we use Amazon SageMaker ModelPackages as an interface to the trained model and relevant metadata.
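As an illustration of this approval mechanism, the following minimal sketch registers a trained model as a SageMaker model package that waits for manual approval before deployment. The model package group name, image URI, and S3 path are hypothetical placeholders, not values from the actual SIGNAL IDUNA setup.

```python
import boto3

sm = boto3.client("sagemaker")

# Register a trained model in the SageMaker Model Registry with an approval gate.
# All names and URIs below are hypothetical placeholders.
response = sm.create_model_package(
    ModelPackageGroupName="churn-prediction-models",
    ModelPackageDescription="Candidate produced by the automated training pipeline",
    ModelApprovalStatus="PendingManualApproval",  # deployment waits for approval
    InferenceSpecification={
        "Containers": [
            {
                "Image": "<inference-image-uri>",
                "ModelDataUrl": "s3://<modeling-account-bucket>/models/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)
print(response["ModelPackageArn"])
```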
The tooling account contains automated CI/CD pipelines with different stages for testing and deployment of trained models. In the test stage, models are deployed into the serving-nonprod account. Although model quality is evaluated in the training pipeline prior to the model being staged for production, here we run performance and integration tests in an isolated testing environment. After passing the testing stage, models are deployed into the serving-prod account to be integrated into production workflows.
Separating the stages of the MLOps workflow into different AWS accounts lets us isolate development and testing from production. Therefore, we can enforce a strict access and security policy. Furthermore, tailored IAM roles ensure that specific services can only access the data and other services required for their scope, following the principle of least privilege. Services within the serving environments can additionally be made accessible to external business processes. For example, a business process can query an endpoint within the serving-prod environment for model predictions.
Benefits of our Approach
This process has many advantages as compared to a strict separation of development and operation for both the ML models, as well as the required infrastructure:
- Isolation: Every team receives their own set of AWS accounts that are completely isolated from other teams’ environments. This makes it easy to manage access rights and keep the data private to those who are entitled to work with it.
- Cloud enablement: Team members with little prior experience in cloud DevOps (such as many data scientists) can easily follow the whole process of designing and managing infrastructure, since (almost) nothing is hidden from them behind a central service. This creates a better understanding of the infrastructure, which in turn can help them create data science products more efficiently.
- Product ownership: The use of preconfigured infrastructure solutions and managed services keeps the barrier to managing an ML product in production very low. Therefore, a data scientist can easily take ownership of a model that is put into production. This minimizes the well-known risk of failing to put a model into production after development.
- Innovation: Since ML engineers are involved long before a model is ready to put into production, they can create infrastructure solutions suitable for new use cases while the data scientists develop an ML model.
- Adaptability: Since the IaC solutions developed by DAP are freely available, any team can easily adapt them to match the specific needs of their use case.
- Open source: All new infrastructure solutions can easily be made available via the central DAP code repo to be used by other teams. Over time, this will create a rich code base with infrastructure components tailored to different use cases.
Summary
In this post, we illustrated how cross-functional teams at SIGNAL IDUNA are being enabled to build and run ML products on AWS. Central to our approach is the usage of a dedicated set of AWS accounts for each team in combination with bespoke IaC blueprints and solutions. These two components enable a cross-functional team to create and operate production quality infrastructure. In turn, they can take full end-to-end ownership of their ML products.
Refer to Amazon SageMaker Model Building Pipelines – Amazon SageMaker to learn more.
Find more information on ML on AWS on our official page.
References
[1] https://www.handelsblatt.com/finanzen/versicherungsbranche-vorbild-spotify-signal-iduna-wird-von-einer-handwerker-versicherung-zum-agilen-konzern/27381902.html
[2] https://blog.crisp.se/wp-content/uploads/2012/11/SpotifyScaling.pdf
[3] https://aws.amazon.com/compliance/shared-responsibility-model/
About the Authors
Jan Paul Assendorp is an ML engineer with a strong data science focus. He builds ML models and automates model training and the deployment into production environments.
Thomas Lietzow is the Scrum Master of the squad Data Analytics Platform.
Christopher Masch is the Product Owner of the squad Data Analytics Platform with knowledge in data engineering, data science, and ML engineering.
Alexander Meinert is part of the Data Analytics Platform team and works as an ML engineer. He started with statistics, grew through data science projects, and found a passion for ML methods and architecture.
Dr. Lars Palzer is a data scientist and part of the Data Analytics Platform team. After helping to build the MLOps architecture components, he is now using them to build ML products.
Jan Schillemans is an ML engineer with a software engineering background. He focuses on applying software engineering best practices to ML environments (MLOps).
Bongo Learn provides real-time feedback to improve learning outcomes with Amazon Transcribe
Real-time feedback helps drive learning. This is especially important for designing presentations, learning new languages, and strengthening other essential skills that are critical to succeed in today’s workplace. However, many students and lifelong learners lack access to effective face-to-face instruction to hone these skills. In addition, with the rapid adoption of remote learning, educators are seeking more effective ways to engage their students and provide feedback and guidance in online learning environments. Bongo is filling that gap using video-based engagement and personalized feedback.
Bongo is a video assessment solution that enables experiential learning and soft skill development at scale. Their Auto Analysis is an automated reporting feature that provides deeper insight into an individual’s performance and progress. Organizations around the world—both corporate and higher education institutions—use Bongo’s Auto Analysis to facilitate automated feedback for a variety of use cases, including individual presentations, objection handling, and customer interaction training. The Auto Analysis platform, which runs on AWS and uses Amazon Transcribe, allows learners to demonstrate what they can do on video and helps evaluators get an authentic representation of a learner’s competency across a range of skills.
When users complete a video assignment, Bongo uses Amazon Transcribe, a deep learning-powered automatic speech recognition (ASR) service, to convert speech into text. Bongo analyzes the transcripts to identify the use of keywords and filler words, and to assess the clarity and effectiveness of the individual’s delivery. Bongo then auto-generates personalized feedback reports based on these performance insights, which learners can use as they practice iteratively. Learners can then submit their recording for feedback from evaluators and peers. Learners have reported a strong preference for receiving private and detailed feedback prior to submitting their work for evaluation or peer review.
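Bongo’s exact integration isn’t shown here, but a transcription job of this kind is typically started through the Amazon Transcribe API. The following sketch uses hypothetical bucket, key, and job names purely for illustration.

```python
import boto3

transcribe = boto3.client("transcribe")

# Start an asynchronous transcription job for a learner's video assignment.
# Bucket, key, and job names are hypothetical.
transcribe.start_transcription_job(
    TranscriptionJobName="learner-assignment-0001",
    Media={"MediaFileUri": "s3://<media-bucket>/assignments/assignment-0001.mp4"},
    MediaFormat="mp4",
    IdentifyLanguage=True,  # automatic language detection
    OutputBucketName="<transcripts-bucket>",
)
```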
Why Bongo chose Amazon Transcribe
During the technical evaluation process, Bongo looked at several speech-to-text vendors and machine learning services. Bruce Fischer, CTO at Bongo, says, “When choosing a vendor, AWS’ breadth and depth of services enabled us to build a complete solution through a single vendor. That saved us valuable development and deployment time. In addition, Amazon Transcribe produces high-quality transcripts with timestamps that allow Bongo Auto Analysis to provide accurate feedback to learners and improve learning outcomes. We are excited with how the service has evolved and how its new capabilities enable us to innovate faster.”
Since launch, Bongo has added the custom vocabulary feature of Amazon Transcribe, which, for example, can recognize business jargon that is common in sales presentations. Foreign language learning is another important use case for Bongo customers. The automatic language detection feature in Amazon Transcribe and its overall language support (37 different languages for batch processing) allow Bongo to deliver Auto Analysis in several languages, such as French, Spanish, German, and Portuguese.
Recently, Bongo launched auto-captioning for their on-demand videos. Powered by Amazon Transcribe, captions help address the accessibility needs of Bongo users with learning disabilities and impairments.
Amazon Transcribe enables Bongo’s Auto Analysis to quickly and accurately transcribe learner videos and provide feedback on the video that helps a learner employ a ‘practice, reflect, improve’ loop. This enables learners to increase content comprehension, retention, and learning outcomes, and reduces instructor assessment time since they are viewing a better work product. Teachers can focus on providing insightful feedback without spending time on the metrics the Auto Analysis produces automatically.
– Josh Kamrath, Bongo’s CEO.
Recently, Dr. Lynda Randall and Dr. Jessica Jaynes from California State University, Fullerton, conducted a research study to analyze the effectiveness of Bongo in an actual classroom setting on student engagement and learning outcomes.[1] The study results showed how the use of Bongo helped increase student comprehension and retention of concepts.
Conclusion
The Bongo team is now looking at how to incorporate other AWS AI services, such as Amazon Comprehend to do further language processing and Amazon Rekognition for visual analysis of videos. Bongo and their AWS team will continue working together to create the best experience for learners and instructors alike. To learn more about Amazon Transcribe and test it yourself, visit the Amazon Transcribe console.
[1] Randall, L.E., & Jaynes, J. A comparison of three assessment types of student engagement and content knowledge in online instruction. Online Learning Journal. (Status: Accepted. Publication date TBD)
About Bongo
Bongo is an embedded solution that drives meaningful assessment, experiential learning, and skill development at scale through video-based engagement and personalized feedback. Organizations use our video workflows to create opportunities for practice, demonstration, analysis, and collaboration. When individuals show what they can do within a real-world learning environment, evaluators get an authentic representation of their competency.
About the Author
Roshni Madaiah is an Account Manager on the AWS EdTech team, where she helps Education Technology customers build cutting edge solutions to transform learning and enrich student experience. Prior to AWS, she worked with enterprises and commercial customers to drive business outcomes via technical solutions. Outside of work, she enjoys traveling, reading and cooking without recipes.
Amazon Robotics, Hampton University team up to establish robotics program
Amazon funding will assist with a senior capstone course where students will receive mentorship from Amazon’s leading researchers, software developers, and engineers.
Prepare time series data with Amazon SageMaker Data Wrangler
Time series data is widely present in our lives. Stock prices, house prices, weather information, and sales data captured over time are just a few examples. As businesses increasingly look for new ways to gain meaningful insights from time series data, the ability to visualize data and apply desired transformations are fundamental steps. However, time series data possesses unique characteristics and nuances compared to other kinds of tabular data, and requires special considerations. For example, standard tabular or cross-sectional data is collected at a specific point in time. In contrast, time series data is captured repeatedly over time, with each successive data point dependent on its past values.
Because most time series analyses rely on the information gathered across a contiguous set of observations, missing data and inherent sparseness can reduce the accuracy of forecasts and introduce bias. Additionally, most time series analysis approaches rely on equal spacing between data points, in other words, periodicity. Therefore, the ability to fix data spacing irregularities is a critical prerequisite. Finally, time series analysis often requires the creation of additional features that can help explain the inherent relationship between input data and future predictions. All these factors differentiate time series projects from traditional machine learning (ML) scenarios and demand a distinct approach to its analysis.
This post walks through how to use Amazon SageMaker Data Wrangler to apply time series transformations and prepare your dataset for time series use cases.
Use cases for Data Wrangler
Data Wrangler provides a no-code/low-code solution to time series analysis with features to clean, transform, and prepare data faster. It also enables data scientists to prepare time series data in adherence to their forecasting model’s input format requirements. The following are a few ways you can use these capabilities:
- Descriptive analysis– Usually, step one of any data science project is understanding the data. When we plot time series data, we get a high-level overview of its patterns, such as trend, seasonality, cycles, and random variations. It helps us decide the correct forecasting methodology for accurately representing these patterns. Plotting can also help identify outliers, preventing unrealistic and inaccurate forecasts. Data Wrangler comes with a seasonality-trend decomposition visualization for representing components of a time series, and an outlier detection visualization to identify outliers.
- Explanatory analysis– For multi-variate time series, the ability to explore, identify, and model the relationship between two or more time series is essential for obtaining meaningful forecasts. The Group by transform in Data Wrangler creates multiple time series by grouping data for specified cells. Additionally, Data Wrangler time series transforms, where applicable, allow specification of additional ID columns to group on, enabling complex time series analysis.
- Data preparation and feature engineering– Time series data is rarely in the format expected by time series models. It often requires data preparation to convert raw data into time series-specific features. You may want to validate that time series data is regularly or equally spaced prior to analysis. For forecasting use cases, you may also want to incorporate additional time series characteristics, such as autocorrelation and statistical properties. With Data Wrangler, you can quickly create time series features such as lag columns for multiple lag periods, resample data to multiple time granularities, and automatically extract statistical properties of a time series, to name a few capabilities.
Solution overview
This post elaborates on how data scientists and analysts can use Data Wrangler to visualize and prepare time series data. We use the bitcoin cryptocurrency dataset from cryptodatadownload with bitcoin trading details to showcase these capabilities. We clean, validate, and transform the raw dataset with time series features and also generate bitcoin volume price forecasts using the transformed dataset as input.
The sample of bitcoin trading data is from January 1 – November 19, 2021, with 464,116 data points. The dataset attributes include a timestamp of the price record, the opening or first price at which the coin was exchanged for a particular day, the highest price at which the coin was exchanged on the day, the last price at which the coin was exchanged on the day, and the volume exchanged on the day in both BTC and the corresponding USD value.
Prerequisites
Download the Bitstamp_BTCUSD_2021_minute.csv file from cryptodatadownload and upload it to Amazon Simple Storage Service (Amazon S3).
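If you prefer to script the upload, a minimal Boto3 sketch looks like the following; the bucket name and key are placeholders for your own values.

```python
import boto3

s3 = boto3.client("s3")

# Upload the downloaded file to your own S3 bucket (bucket name and key are placeholders).
s3.upload_file(
    Filename="Bitstamp_BTCUSD_2021_minute.csv",
    Bucket="<your-bucket-name>",
    Key="bitcoin/Bitstamp_BTCUSD_2021_minute.csv",
)
```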
Import bitcoin dataset in Data Wrangler
To start the ingestion process to Data Wrangler, complete the following steps:
- On the SageMaker Studio console, on the File menu, choose New, then choose Data Wrangler Flow.
- Rename the flow as desired.
- For Import data, choose Amazon S3.
- Upload the Bitstamp_BTCUSD_2021_minute.csv file from your S3 bucket.
You can now preview your data set.
- In the Details pane, choose Advanced configuration and deselect Enable sampling.
This is a relatively small data set, so we don’t need sampling.
- Choose Import.
You have successfully created the flow diagram and are ready to add transformation steps.
Add transformations
To add data transformations, choose the plus sign next to Data types and choose Edit data types.
Ensure that Data Wrangler automatically inferred the correct data types for the data columns.
In our case, the inferred data types are correct. However, suppose one data type was incorrect. You can easily modify them through the UI, as shown in the following screenshot.
Let’s kick off the analysis and start adding transformations.
Data cleaning
We first perform several data cleaning transformations.
Drop column
Let’s start by dropping the unix column, because we use the date column as the index.
- Choose Back to data flow.
- Choose the plus sign next to Data types and choose Add transform.
- Choose + Add step in the TRANSFORMS pane.
- Choose Manage columns.
- For Transform, choose Drop column.
- For Column to drop, choose unix.
- Choose Preview.
- Choose Add to save the step.
Handle missing
Missing data is a well-known problem in real-world datasets. Therefore, it’s a best practice to verify the presence of any missing or null values and handle them appropriately. Our dataset doesn’t contain missing values. But if there were, we would use the Handle missing time series transform to fix them. Commonly used strategies for handling missing data include dropping rows with missing values or filling the missing values with reasonable estimates. Because time series data relies on a sequence of data points across time, filling missing values is the preferred approach. The process of filling missing values is referred to as imputation. The Handle missing time series transform allows you to choose from multiple imputation strategies.
- Choose + Add step in the TRANSFORMS pane.
- Choose the Time Series transform.
- For Transform, choose Handle missing.
- For Time series input type, choose Along column.
- For Method for imputing values, choose Forward fill.
The Forward fill method replaces the missing values with the non-missing values preceding the missing values.
Backward fill, Constant Value, Most common value and Interpolate are other imputation strategies available in Data Wrangler. Interpolation techniques rely on neighboring values for filling missing values. Time series data often exhibits correlation between neighboring values, making interpolation an effective filling strategy. For additional details on the functions you can use for applying interpolation, refer to pandas.DataFrame.interpolate.
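For reference, roughly equivalent operations outside Data Wrangler can be expressed with pandas as in the sketch below. The column names follow the dataset described above, and the exact behavior of the Data Wrangler transform may differ.

```python
import pandas as pd

df = pd.read_csv("Bitstamp_BTCUSD_2021_minute.csv", parse_dates=["date"])

# Forward fill: propagate the last observed value into subsequent gaps.
df["Volume USD"] = df["Volume USD"].ffill()

# Alternative: interpolate missing values from neighboring observations.
df["Volume USD"] = df["Volume USD"].interpolate(method="linear")
```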
Validate timestamp
In time series analysis, the timestamp column acts as the index column, around which the analysis revolves. Therefore, it’s essential to make sure the timestamp column doesn’t contain invalid or incorrectly formatted timestamp values. Because we’re using the date column as the timestamp column and index, let’s confirm its values are correctly formatted.
- Choose + Add step in the TRANSFORMS pane.
- Choose the Time Series transform.
- For Transform, choose Validate timestamps.
The Validate timestamps transform allows you to check that the timestamp column in your dataset doesn’t have values with an incorrect timestamp or missing values.
- For Timestamp Column, choose date.
- For Policy dropdown, choose Indicate.
The Indicate policy option creates a Boolean column indicating if the value in the timestamp column is a valid date/time format. Other options for Policy include:
- Error – Throws an error if the timestamp column is missing or invalid
- Drop – Drops the row if the timestamp column is missing or invalid
- Choose Preview.
A new Boolean column named date_is_valid was created, with true values indicating correct format and non-null entries. Our dataset doesn’t contain invalid timestamp values in the date column. But if it did, you could use the new Boolean column to identify and fix those values.
- Choose Add to save this step.
Time series visualization
After we clean and validate the dataset, we can better visualize the data to understand its different components.
Resample
Because we’re interested in daily predictions, let’s transform the frequency of data to daily.
The Resample transformation changes the frequency of the time series observations to a specified granularity, and comes with both upsampling and downsampling options. Applying upsampling increases the frequency of the observations (for example from daily to hourly), whereas downsampling decreases the frequency of the observations (for example from hourly to daily).
Because our dataset is at minute granularity, let’s use the downsampling option.
- Choose + Add step.
- Choose the Time Series transform.
- For Transform, choose Resample.
- For Timestamp, choose date.
- For Frequency unit, choose Calendar day.
- For Frequency quantity, enter 1.
- For Method to aggregate numeric values, choose mean.
- Choose Preview.
The frequency of our dataset has changed from per minute to daily.
- Choose Add to save this step.
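The resampling configured above corresponds roughly to the following pandas operation. This sketch continues from the earlier pandas example; Data Wrangler’s aggregation may differ in detail.

```python
# Downsample from minute to daily granularity, aggregating numeric columns with the mean.
daily_df = (
    df.set_index("date")
      .resample("1D")
      .mean(numeric_only=True)
      .reset_index()
)
```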
Seasonal-Trend decomposition
After resampling, we can visualize the transformed series and its associated STL (Seasonal and Trend decomposition using LOESS) components using the Seasonal-Trend decomposition visualization. This breaks down the original time series into distinct trend, seasonality, and residual components, giving us a good understanding of how each pattern behaves. We can also use this information when modelling forecasting problems.
Data Wrangler uses LOESS, a robust and versatile statistical method for modelling trend and seasonal components. Its underlying implementation uses polynomial regression for estimating nonlinear relationships present in the time series components (seasonality, trend, and residual).
- Choose Back to data flow.
- Choose the plus sign next to the Steps on Data Flow.
- Choose Add analysis.
- In the Create analysis pane, for Analysis type, choose Time Series.
- For Visualization, choose Seasonal-Trend decomposition.
- For Analysis Name, enter a name.
- For Timestamp column, choose date.
- For Value column, choose Volume USD.
- Choose Preview.
The analysis allows us to visualize the input time series and decomposed seasonality, trend, and residual.
- Choose Save to save the analysis.
With the seasonal-trend decomposition visualization, we can generate four patterns, as shown in the preceding screenshot:
- Original – The original time series re-sampled to daily granularity.
- Trend – The polynomial trend with an overall negative trend pattern for the year 2021, indicating a decrease in Volume USD value.
- Season – The multiplicative seasonality represented by the varying oscillation patterns. We see a decrease in seasonal variation, characterized by decreasing amplitude of oscillations.
- Residual – The remaining residual or random noise. The residual series is the resulting series after trend and seasonal components have been removed. Looking closely, we observe spikes between January and March, and between April and June, suggesting room for modelling such particular events using historical data.
These visualizations provide data scientists and analysts with valuable leads into existing patterns and can help you choose a modelling strategy. However, it’s always a good practice to validate the output of STL decomposition with the information gathered through descriptive analysis and domain expertise.
To summarize, we observe a downward trend consistent with the original series visualization, which increases our confidence in incorporating the information conveyed by the trend visualization into downstream decision-making. In contrast, while the seasonality visualization helps confirm the presence of seasonality and the need to remove it by applying techniques such as differencing, it doesn’t provide the desired level of detailed insight into the various seasonal patterns present, and therefore requires deeper analysis.
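Outside Data Wrangler, a comparable decomposition can be produced with the STL implementation in statsmodels. This is only a sketch: the period of 7 assumes weekly seasonality and is not taken from the original analysis, and it continues from the daily_df created in the earlier resampling sketch.

```python
from statsmodels.tsa.seasonal import STL

# Decompose the resampled daily series into trend, seasonal, and residual components.
series = daily_df.set_index("date")["Volume USD"]
result = STL(series, period=7, robust=True).fit()  # period=7 is an assumption
trend, seasonal, resid = result.trend, result.seasonal, result.resid
```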
Feature engineering
After we understand the patterns present in our dataset, we can start to engineer new features aimed at increasing the accuracy of the forecasting models.
Featurize datetime
Let’s start the feature engineering process with more straightforward date/time features. Date/time features are created from the timestamp column and provide an optimal avenue for data scientists to start the feature engineering process. We begin with the Featurize datetime time series transformation to add the month, day of the month, day of the year, week of the year, and quarter features to our dataset. Because we’re providing the date/time components as separate features, we enable ML algorithms to detect signals and patterns for improving prediction accuracy.
- Choose + Add step.
- Choose the Time Series transform.
- For Transform, choose Featurize datetime.
- For Input Column, choose date.
- For Output Column, enter date (this step is optional).
- For Output mode, choose Ordinal.
- For Output format, choose Columns.
- For date/time features to extract, select Month, Day, Week of year, Day of year, and Quarter.
- Choose Preview.
The dataset now contains new columns named date_month, date_day, date_week_of_year, date_day_of_year, and date_quarter. The information retrieved from these new features could help data scientists derive additional insights from the data and a better understanding of the relationship between input features and output features.
- Choose Add to save this step.
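The date/time features created above map roughly to the following pandas operations, continuing from the earlier sketches. The output column names (and the exact value ranges) are assumptions and may differ from what Data Wrangler generates.

```python
# Extract ordinal date/time components as separate feature columns.
daily_df["date_month"] = daily_df["date"].dt.month
daily_df["date_day"] = daily_df["date"].dt.day
daily_df["date_week_of_year"] = daily_df["date"].dt.isocalendar().week.astype(int)
daily_df["date_day_of_year"] = daily_df["date"].dt.dayofyear
daily_df["date_quarter"] = daily_df["date"].dt.quarter
```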
Encode categorical
Date/time features aren’t limited to integer values. You may also choose to consider certain extracted date/time features as categorical variables and represent them as one-hot encoded features, with each column containing binary values. The newly created date_quarter column contains values between 0 and 3, and can be one-hot encoded using four binary columns. Let’s create four new binary features, each representing the corresponding quarter of the year.
- Choose + Add step.
- Choose the Encode categorical transform.
- For Transform, choose One-hot encode.
- For Input column, choose date_quarter.
- For Output style, choose Columns.
- Choose Preview.
- Choose Add to add the step.
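For comparison, a quick pandas sketch of the same one-hot encoding, continuing from the earlier examples (column and prefix names are illustrative):

```python
import pandas as pd

# Expand date_quarter into one binary indicator column per quarter.
daily_df = pd.get_dummies(daily_df, columns=["date_quarter"], prefix="date_quarter")
```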
Lag feature
Next, let’s create lag features for the target column Volume USD. Lag features in time series analysis are values at prior timestamps that are considered helpful in inferring future values. They also help identify autocorrelation (also known as serial correlation) patterns in the residual series by quantifying the relationship of the observation with observations at previous time steps. Autocorrelation is similar to regular correlation, but between the values in a series and its past values. It forms the basis for the autoregressive forecasting models in the ARIMA family.
With the Data Wrangler Lag feature transform, you can easily create lag features n periods apart. Additionally, we often want to create multiple lag features at different lags and let the model decide the most meaningful features. For such a scenario, the Lag features transform helps create multiple lag columns over a specified window size.
- Choose Back to data flow.
- Choose the plus sign next to the Steps on Data Flow.
- Choose + Add step.
- Choose Time Series transform.
- For Transform, choose Lag features.
- For Generate lag features for this column, choose Volume USD.
- For Timestamp Column, choose date.
- For Lag, enter 7.
- Because we’re interested in observing up to the previous seven lag values, let’s select Include the entire lag window.
- To create a new column for each lag value, select Flatten the output.
- Choose Preview.
Seven new columns are added, suffixed with the lag_number keyword, for the target column Volume USD.
- Choose Add to save the step.
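As a rough pandas equivalent of the Lag features transform, each lag can be created with shift(), continuing from the earlier sketches. The column naming here is illustrative, not the exact naming Data Wrangler uses.

```python
# Create lag features 1..7 for the target column, one new column per lag.
for lag in range(1, 8):
    daily_df[f"Volume USD_lag_{lag}"] = daily_df["Volume USD"].shift(lag)
```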
Rolling window features
We can also calculate meaningful statistical summaries across a range of values and include them as input features. Let’s extract common statistical time series features.
Data Wrangler implements automatic time series feature extraction capabilities using the open source tsfresh package. With the time series feature extraction transforms, you can automate the feature extraction process. This eliminates the time and effort otherwise spent manually implementing signal processing libraries. For this post, we extract features using the Rolling window features transform. This method computes statistical properties across a set of observations defined by the window size.
- Choose + Add step.
- Choose the Time Series transform.
- For Transform, choose Rolling window features.
- For Generate rolling window features for this column, choose Volume USD.
- For Timestamp Column, choose date.
- For Window size, enter 7.
Specifying a window size of 7 computes features by combining the value at the current timestamp and values for the previous seven timestamps.
- Select Flatten to create a new column for each computed feature.
- Choose your strategy as Minimal subset.
This strategy extracts eight features that are useful in downstream analyses. Other strategies include Efficient Subset, Custom subset, and All features. For a full list of features available for extraction, refer to Overview on extracted features.
- Choose Preview.
We can see eight new columns, with the specified window size of 7 in their name, appended to our dataset.
- Choose Add to save the step.
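A hand-rolled pandas sketch of a few rolling statistics over the same 7-observation window is shown below, continuing from the earlier examples; the tsfresh-based transform in Data Wrangler extracts a richer, predefined feature set.

```python
# Simple rolling statistics over a 7-observation window.
window = daily_df["Volume USD"].rolling(window=7)
daily_df["volume_roll_mean"] = window.mean()
daily_df["volume_roll_std"] = window.std()
daily_df["volume_roll_min"] = window.min()
daily_df["volume_roll_max"] = window.max()
```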
Export the dataset
We have transformed the time series dataset and are ready to use the transformed dataset as input for a forecasting algorithm. The last step is to export the transformed dataset to Amazon S3. In Data Wrangler, you can choose Export step to automatically generate a Jupyter notebook with Amazon SageMaker Processing code for processing and exporting the transformed dataset to an S3 bucket. However, because our dataset contains just over 300 records, let’s take advantage of the Export data option in the Add Transform view to export the transformed dataset directly to Amazon S3 from Data Wrangler.
- Choose Export data.
- For S3 location, choose Browser and choose your S3 bucket.
- Choose Export data.
Now that we have successfully transformed the bitcoin dataset, we can use Amazon Forecast to generate bitcoin predictions.
Clean up
If you’re done with this use case, clean up the resources you created to avoid incurring additional charges. For Data Wrangler, you can shut down the underlying instance when finished. Refer to the Shut Down Data Wrangler documentation for details. Alternatively, you can continue to Part 2 of this series to use this dataset for forecasting.
Summary
This post demonstrated how to utilize Data Wrangler to simplify and accelerate time series analysis using its built-in time series capabilities. We explored how data scientists can easily and interactively clean, format, validate, and transform time series data into the desired format, for meaningful analysis. We also explored how you can enrich your time series analysis by adding a comprehensive set of statistical features using Data Wrangler. To learn more about time series transformations in Data Wrangler, see Transform Data.
About the Author
Roop Bains is a Solutions Architect at AWS focusing on AI/ML. He is passionate about helping customers innovate and achieve their business objectives using Artificial Intelligence and Machine Learning. In his spare time, Roop enjoys reading and hiking.
Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.
Alexa Prize has a new home
Amazon Science is now the destination for information on the SocialBot, TaskBot, and SimBot challenges, including FAQs, team updates, publications, and other program information.
Automate a shared bikes and scooters classification model with Amazon SageMaker Autopilot
Amazon SageMaker Autopilot makes it possible for organizations to quickly build and deploy an end-to-end machine learning (ML) model and inference pipeline with just a few lines of code or even without any code at all with Amazon SageMaker Studio. Autopilot offloads the heavy lifting of configuring infrastructure and the time it takes to build an entire pipeline, including feature engineering, model selection, and hyperparameter tuning.
In this post, we show how to go from raw data to a robust and fully deployed inference pipeline with Autopilot.
Solution overview
We use Lyft’s public dataset on bikesharing for this simulation to predict whether or not a user participates in the Bike Share for All program. This is a simple binary classification problem.
We want to showcase how easy it is to build an automated and real-time inference pipeline to classify users based on their participation in the Bike Share for All program. To this end, we simulate an end-to-end data ingestion and inference pipeline for an imaginary bikeshare company operating in the San Francisco Bay Area.
The architecture is broken down into two parts: the ingestion pipeline and the inference pipeline.
We primarily focus on the ML pipeline in the first section of this post, and review the data ingestion pipeline in the second part.
Prerequisites
To follow along with this example, complete the following prerequisites:
- Create a new SageMaker notebook instance.
- Create an Amazon Kinesis Data Firehose delivery stream with an AWS Lambda transform function. For instructions, see Amazon Kinesis Firehose Data Transformation with AWS Lambda. This step is optional and only needed to simulate data streaming.
Data exploration
Let’s download and visualize the dataset, which is located in a public Amazon Simple Storage Service (Amazon S3) bucket and static website:
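A minimal sketch of loading the data with pandas follows; the dataset URL is a placeholder for the public location used in this simulation.

```python
import pandas as pd

# Placeholder URL for the public bikeshare trip dataset.
data_url = "https://<public-dataset-location>/bikeshare-tripdata.csv"
df = pd.read_csv(data_url)
df.head()
```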
The following screenshot shows a subset of the data before transformation.
The last column of the data contains the target we want to predict, which is a binary variable taking either a Yes or No value, indicating whether the user participates in the Bike Share for All program.
Let’s take a look at the distribution of our target variable for any data imbalance.
As shown in the graph above, the data is imbalanced, with fewer people participating in the program.
We need to balance the data to prevent an over-representation bias. This step is optional because Autopilot also offers an internal approach to handle class imbalance automatically, which defaults to an F1 score validation metric. Additionally, if you choose to balance the data yourself, you can use more advanced techniques for handling class imbalance, such as SMOTE or GAN.
For this post, we downsample the majority class (No) as a data balancing technique:
The following code enriches the data and under-samples the overrepresented class:
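A minimal downsampling sketch is shown below; the target column name bike_share_for_all_trip is assumed from the Lyft bikeshare schema and may differ in your copy of the data.

```python
# Randomly downsample the majority ("No") class to the size of the minority ("Yes") class.
minority = df[df["bike_share_for_all_trip"] == "Yes"]
majority = df[df["bike_share_for_all_trip"] == "No"].sample(n=len(minority), random_state=42)
balanced_df = pd.concat([minority, majority]).sample(frac=1, random_state=42)
```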
We deliberately left our categorical features not encoded, including our binary target value. This is because Autopilot takes care of encoding and decoding the data for us as part of the automatic feature engineering and pipeline deployment, as we see in the next section.
The following screenshot shows a sample of our data.
The data in the following graphs looks otherwise normal, with a bimodal distribution representing the two peaks for the morning hours and the afternoon rush hours, as you would expect. We also observe low activities on weekends and at night.
In the next section, we feed the data to Autopilot so that it can run an experiment for us.
Build a binary classification model
Autopilot requires that we specify the input and output destination buckets. It uses the input bucket to load the data and the output bucket to save the artifacts, such as feature engineering and the generated Jupyter notebooks. We retain 5% of the dataset to evaluate and validate the model’s performance after the training is complete and upload 95% of the dataset to the S3 input bucket. See the following code:
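A sketch of the split and upload, assuming the default SageMaker bucket and a hypothetical key prefix, and continuing from the balanced DataFrame above:

```python
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()   # any bucket you own works here
prefix = "bikeshare-autopilot"      # hypothetical key prefix

# Hold out 5% for evaluation after training; upload the remaining 95% as input.
train_df = balanced_df.sample(frac=0.95, random_state=42)
test_df = balanced_df.drop(train_df.index)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

train_uri = session.upload_data("train.csv", bucket=bucket, key_prefix=f"{prefix}/input")
```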
After we upload the data to the input destination, it’s time to start Autopilot:
All we need to start experimenting is to call the fit() method. Autopilot needs the input and output S3 location and the target attribute column as the required parameters. After feature processing, Autopilot calls SageMaker automatic model tuning to find the best version of a model by running many training jobs on your dataset. We added the optional max_candidates parameter to limit the number of candidates to 30, which is the number of training jobs that Autopilot launches with different combinations of algorithms and hyperparameters in order to find the best model. If you don’t specify this parameter, it defaults to 250.
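A sketch of that call with the SageMaker Python SDK follows; apart from max_candidates=30, the job name, target column name, and other parameters are assumptions.

```python
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role=sagemaker.get_execution_role(),
    target_attribute_name="bike_share_for_all_trip",  # assumed target column name
    output_path=f"s3://{bucket}/{prefix}/output",
    max_candidates=30,
    sagemaker_session=session,
)
automl.fit(inputs=train_uri, job_name="bikeshare-autopilot-job", wait=False, logs=False)
```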
We can observe the progress of Autopilot with the following code:
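One way to poll the job status (a sketch, continuing from the objects above):

```python
import time

while True:
    desc = automl.describe_auto_ml_job(job_name="bikeshare-autopilot-job")
    print(desc["AutoMLJobStatus"], "-", desc["AutoMLJobSecondaryStatus"])
    if desc["AutoMLJobStatus"] in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
```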
The training takes some time to complete. While it’s running, let’s look at the Autopilot workflow.
To find the best candidate, use the following code:
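For example (a sketch; the field names follow the DescribeAutoMLJob response):

```python
best_candidate = automl.describe_auto_ml_job(job_name="bikeshare-autopilot-job")["BestCandidate"]
print(best_candidate["CandidateName"])
print(best_candidate["FinalAutoMLJobObjectiveMetric"])
```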
The following screenshot shows our output.
Our model achieved a validation accuracy of 96%, so we’re going to deploy it. We could add a condition such that we only use the model if the accuracy is above a certain level.
Inference pipeline
Before we deploy our model, let’s examine our best candidate and what’s happening in our pipeline. See the following code:
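A sketch of listing the inference containers of the best candidate, continuing from the sketch above:

```python
# The best candidate is served as a chain of containers (transform, predict, reverse-transform).
for container in best_candidate["InferenceContainers"]:
    print(container["Image"])
    print(container["ModelDataUrl"])
```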
The following diagram shows our output.
Autopilot has built the model and has packaged it in three different containers, each sequentially running a specific task: transform, predict, and reverse-transform. This multi-step inference is possible with a SageMaker inference pipeline.
A multi-step inference can also chain multiple inference models. For instance, one container can perform principal component analysis before passing the data to the XGBoost container.
Deploy the inference pipeline to an endpoint
The deployment process involves just a few lines of code:
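For example (the instance type and endpoint name are assumptions):

```python
# Deploy the best candidate as a real-time inference pipeline endpoint.
automl.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",                 # assumed instance type
    endpoint_name="bikeshare-autopilot-endpoint",
)
```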
Let’s configure our endpoint for prediction with a predictor:
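A sketch using the SageMaker SDK Predictor with CSV serialization, continuing from the deployment above:

```python
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

predictor = Predictor(
    endpoint_name="bikeshare-autopilot-endpoint",
    sagemaker_session=session,
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer(),
)
```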
Now that we have our endpoint and predictor ready, it’s time to use the testing data we set aside and test the accuracy of our model. We start by defining a utility function that sends the data one line at a time to our inference endpoint and gets a prediction in return. Because we have an XGBoost model, we drop the target variable before sending the CSV line to the endpoint. Additionally, we removed the header from the testing CSV before looping through the file, which is also another requirement for XGBoost on SageMaker. See the following code:
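A sketch of such a utility function, assuming the target is the last column of test.csv:

```python
def predict_csv_file(path, predictor):
    """Send each data row (header removed, target column dropped) to the endpoint."""
    predictions = []
    with open(path) as f:
        next(f)  # skip the header row, which XGBoost on SageMaker does not expect
        for line in f:
            fields = line.strip().split(",")
            payload = ",".join(fields[:-1])  # drop the target (assumed last column)
            result = predictor.predict(payload)
            predictions.append(result[0][0])
    return predictions

predictions = predict_csv_file("test.csv", predictor)
```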
The following screenshot shows our output.
Now let’s calculate the accuracy of our model.
See the following code:
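For example, comparing the predictions against the held-out labels from the earlier split:

```python
actuals = test_df["bike_share_for_all_trip"].tolist()
correct = sum(1 for pred, actual in zip(predictions, actuals) if pred == actual)
print(f"Accuracy: {correct / len(actuals):.2%}")
```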
We get an accuracy of 92%. This is slightly lower than the 96% obtained during the validation step, but it’s still high enough. We don’t expect the accuracy to be exactly the same because the test is performed with a new dataset.
Data ingestion
We downloaded the data directly and configured it for training. In real life, you may have to send the data directly from the edge device into the data lake and have SageMaker load it directly from the data lake into the notebook.
Kinesis Data Firehose is a good option and the most straightforward way to reliably load streaming data into data lakes, data stores, and analytics tools. It can capture, transform, and load streaming data into Amazon S3 and other AWS data stores.
For our use case, we create a Kinesis Data Firehose delivery stream with a Lambda transformation function to do some lightweight data cleaning as it traverses the stream. See the following code:
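A minimal sketch of such a Firehose transformation Lambda function; the cleaning logic here is only an example.

```python
import base64

def lambda_handler(event, context):
    """Lightly clean each CSV record passing through the Firehose delivery stream."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        # Example cleaning step: strip surrounding whitespace from each field.
        cleaned = ",".join(field.strip() for field in payload.split(",")) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(cleaned.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```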
This Lambda function performs light transformation of the data streamed from the devices onto the data lake. It expects a CSV formatted data file.
For the ingestion step, we download the data and simulate a data stream to Kinesis Data Firehose with a Lambda transform function and into our S3 data lake.
Let’s simulate streaming a few lines:
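For example, pushing a handful of records into the delivery stream with Boto3 (the stream name is a placeholder):

```python
import boto3

firehose = boto3.client("firehose")

with open("test.csv") as f:
    next(f)  # skip the header
    for _ in range(10):
        line = next(f)
        firehose.put_record(
            DeliveryStreamName="<your-delivery-stream-name>",  # placeholder
            Record={"Data": line.encode("utf-8")},
        )
```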
Clean up
It’s important to delete all the resources used in this exercise to minimize cost. The following code deletes the SageMaker inference endpoint we created as well as the training and testing data we uploaded:
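A sketch of the cleanup, continuing from the objects defined in the earlier sketches:

```python
import boto3

# Delete the real-time endpoint and remove the uploaded training/testing data.
predictor.delete_endpoint()

s3 = boto3.resource("s3")
s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
```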
Conclusion
ML engineers, data scientists, and software developers can use Autopilot to build and deploy an inference pipeline with little to no ML programming experience. Autopilot saves time and resources, using data science and ML best practices. Large organizations can now shift engineering resources away from infrastructure configuration towards improving models and solving business use cases. Startups and smaller organizations can get started on machine learning with little to no ML expertise.
We recommend learning more about other important features SageMaker has to offer, such as the Amazon SageMaker Feature Store, which integrates with Amazon SageMaker Pipelines to create and reuse automated ML workflows and add feature search and discovery. You can run multiple Autopilot simulations with different feature or target variants in your dataset. You could also approach this as a dynamic vehicle allocation problem in which your model tries to predict vehicle demand based on time (such as time of day or day of the week) or location, or a combination of both.
About the Authors
Doug Mbaya is a Senior Solutions Architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud. Doug’s prior experience includes supporting AWS customers in the ride sharing and food delivery segment.
Valerio Perrone is an Applied Science Manager working on Amazon SageMaker Automatic Model Tuning and Autopilot.
Amazon Scholar Ranjit Jhala named ACM Fellow
Jhala received the ACM honor for lifetime contributions to software verification, developing innovative tools to help computer programmers test their code.
Apply profanity masking in Amazon Translate
Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. This post shows how you can mask profane words and phrases with a grawlix string (“?$#@$”).
Amazon Translate typically chooses clean words for your translation output. But in some situations, you want to prevent words that are commonly considered profane from appearing in the translated output. For example, when you’re translating video captions or subtitle content, or enabling in-game chat, and you want the translated content to be age appropriate and clear of any profanity, Amazon Translate allows you to mask the profane words and phrases using the profanity masking setting. You can apply profanity masking to both real-time translation and asynchronous batch processing in Amazon Translate. When you use Amazon Translate with profanity masking enabled, the five-character sequence ?$#@$ is used to mask each profane word or phrase, regardless of the number of characters. Amazon Translate detects each profane word or phrase literally, not contextually.
Solution overview
To mask profane words and phrases in your translation output, you can enable the profanity option under the additional settings on the Amazon Translate console when you run translations through both real-time and asynchronous batch processing requests. The following sections demonstrate using profanity masking for real-time translation requests via the Amazon Translate console, the AWS Command Line Interface (AWS CLI), or the Amazon Translate SDK (Python Boto3).
Amazon Translate console
To demonstrate handling profanity with real-time translation, we use the following sample text in French to be translated into English:
Complete the following steps on the Amazon Translate console:
- Choose French (fr) as the Source language.
- Choose English (en) as the Target Language.
- Enter the preceding example text in the Source Language text area.
The translated text appears under Target language. It contains a word that is considered profane in English.
- Expand Additional settings and enable Profanity.
The word is now replaced with the grawlix string ?$#@$.
AWS CLI
Calling the translate-text AWS CLI command with --settings Profanity=MASK masks profane words and phrases in your translated text.
The following AWS CLI commands are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).
You get a response like the following snippet:
Amazon Translate SDK (Python Boto3)
The following Python 3 code uses the real-time translation call with the profanity setting:
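A minimal Boto3 sketch follows; the source text is a placeholder rather than the sample used above.

```python
import boto3

translate = boto3.client("translate")

source_text = "<French source text containing a profane term>"  # placeholder

response = translate.translate_text(
    Text=source_text,
    SourceLanguageCode="fr",
    TargetLanguageCode="en",
    Settings={"Profanity": "MASK"},  # profane words and phrases are replaced with ?$#@$
)
print(response["TranslatedText"])
```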
Conclusion
You can use the profanity masking setting to mask words and phrases that are considered profane to keep your translated text clean and meet your business requirements. To learn more about all the ways you can customize your translations, refer to Customizing Your Translations using Amazon Translate.
About the Authors
Siva Rajamani is a Boston-based Enterprise Solutions Architect at AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoors activities and watching documentaries.
Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, machine learning, and security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.
Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.