Amazon SageMaker Feature Store provides an end-to-end solution to automate feature engineering for machine learning (ML). For many ML use cases, raw data like log files, sensor readings, or transaction records need to be transformed into meaningful features that are optimized for model training.
Feature quality is critical to ensure a highly accurate ML model. Transforming raw data into features using aggregation, encoding, normalization, and other operations is often needed and can require significant effort. Engineers must manually write custom data preprocessing and aggregation logic in Python or Spark for each use case.
This undifferentiated heavy lifting is cumbersome, repetitive, and error-prone. The SageMaker Feature Store Feature Processor reduces this burden by automatically transforming raw data into aggregated features suitable for batch training ML models. It lets engineers provide simple data transformation functions, then handles running them at scale on Spark and managing the underlying infrastructure. This enables data scientists and data engineers to focus on the feature engineering logic rather than implementation details.
In this post, we demonstrate how a car sales company can use the Feature Processor to transform raw sales transaction data into features in three steps:
- Local runs of data transformations.
- Remote runs at scale using Spark.
- Operationalization via pipelines.
We show how SageMaker Feature Store ingests the raw data, runs feature transformations remotely using Spark, and loads the resulting aggregated features into a feature group. These engineered features can then be used to train ML models.
For this use case, we see how SageMaker Feature Store helps convert the raw car sales data into structured features. These features are subsequently used to gain insights like:
- Average and maximum price of red convertibles from 2010
- Models with best mileage vs. price
- Sales trends of new vs. used cars over the years
- Differences in average MSRP across locations
We also see how SageMaker Feature Store pipelines keep the features updated as new data comes in, enabling the company to continually gain insights over time.
Solution overview
We work with the dataset car_data.csv, which contains specifications such as model, year, status, mileage, price, and MSRP for used and new cars sold by the company. The following screenshot shows an example of the dataset.
The solution notebook feature_processor.ipynb contains the following main steps, which we explain in this post:
- Create two feature groups: one called car-data for raw car sales records and another called car-data-aggregated for aggregated car sales records.
- Use the @feature_processor decorator to load data into the car-data feature group from Amazon Simple Storage Service (Amazon S3).
- Run the @feature_processor code remotely as a Spark application to aggregate the data.
- Operationalize the feature processor via SageMaker pipelines and schedule runs.
- Explore the feature processing pipelines and lineage in Amazon SageMaker Studio.
- Use aggregated features to train an ML model.
Prerequisites
To follow this tutorial, you need the following:
- An AWS account.
- SageMaker Studio set up.
- AWS Identity and Access Management (IAM) permissions. When creating the IAM role, follow the best practice of granting least-privilege access.
For this post, we refer to the following notebook, which demonstrates how to get started with Feature Processor using the SageMaker Python SDK.
Create feature groups
To create the feature groups, complete the following steps:
- Create a feature group definition for car-data as follows:
The features correspond to each column in the car_data.csv dataset (Model, Year, Status, Mileage, Price, and MSRP).
- Add the record identifier id and event time ingest_time to the feature group:
- Create a feature group definition for car-data-aggregated as follows:
For the aggregated feature group, the features are model year status, average mileage, max mileage, average price, max price, average MSRP, max MSRP, and ingest time. We add the record identifier model_year_status and event time ingest_time to this feature group.
- Now, create the car-data feature group:
- Create the car-data-aggregated feature group:
You can navigate to the SageMaker Feature Store option under Data on the SageMaker Studio Home menu to see the feature groups.
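As a rough illustration of what the two definitions above contain, the following sketch assembles them as plain Python dictionaries. The key names mirror the FeatureDefinitions, RecordIdentifierFeatureName, and EventTimeFeatureName fields of the SageMaker CreateFeatureGroup request. The feature types and the build_feature_group_spec helper are assumptions made for this example, and the storage configuration and IAM role are left out.

```python
from typing import Dict, List, Tuple

# Feature types accepted by SageMaker Feature Store.
VALID_TYPES = {"Integral", "Fractional", "String"}

# Feature names follow the post; the types are assumptions for illustration.
CAR_DATA_FEATURES: List[Tuple[str, str]] = [
    ("id", "Integral"),
    ("model", "String"),
    ("year", "String"),
    ("status", "String"),
    ("mileage", "String"),
    ("price", "String"),
    ("msrp", "String"),
    ("ingest_time", "Fractional"),
]

CAR_DATA_AGGREGATED_FEATURES: List[Tuple[str, str]] = [
    ("model_year_status", "String"),
    ("avg_mileage", "Fractional"),
    ("max_mileage", "Fractional"),
    ("avg_price", "Fractional"),
    ("max_price", "Fractional"),
    ("avg_msrp", "Fractional"),
    ("max_msrp", "Fractional"),
    ("ingest_time", "Fractional"),
]


def build_feature_group_spec(
    name: str,
    features: List[Tuple[str, str]],
    record_id: str,
    event_time: str,
) -> Dict[str, object]:
    """Hypothetical helper: validate and assemble a feature group definition."""
    names = [feature_name for feature_name, _ in features]
    for required in (record_id, event_time):
        if required not in names:
            raise ValueError(f"{required!r} is not one of the defined features")
    for feature_name, feature_type in features:
        if feature_type not in VALID_TYPES:
            raise ValueError(f"{feature_name!r} has unsupported type {feature_type!r}")
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_id,
        "EventTimeFeatureName": event_time,
        "FeatureDefinitions": [
            {"FeatureName": feature_name, "FeatureType": feature_type}
            for feature_name, feature_type in features
        ],
    }


car_data_spec = build_feature_group_spec(
    "car-data", CAR_DATA_FEATURES, record_id="id", event_time="ingest_time"
)
aggregated_spec = build_feature_group_spec(
    "car-data-aggregated",
    CAR_DATA_AGGREGATED_FEATURES,
    record_id="model_year_status",
    event_time="ingest_time",
)
```

Validating that the record identifier and event time are among the defined features catches a common mistake before any request is sent to the service.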
Use the @feature_processor decorator to load data
In this section, we locally transform the raw input data (car_data.csv) from Amazon S3 into the car-data feature group using the Feature Store Feature Processor. This initial local run allows us to develop and iterate before running remotely, and could be done on a sample of the data if desired for faster iteration.
With the @feature_processor decorator, your transformation function runs in a Spark runtime environment where the input arguments provided to your function and its return value are Spark DataFrames.
- Install the Feature Processor SDK from the SageMaker Python SDK and its extras using the following command:
The number of input parameters in your transformation function must match the number of inputs configured in the @feature_processor decorator. In this case, the @feature_processor decorator has car_data.csv as input and the car-data feature group as output, indicating this is a batch operation with the target_store as OfflineStore:
- Define the transform() function to transform the data. This function performs the following actions:
  - Convert column names to lowercase.
  - Add the event time to the ingest_time column.
  - Remove punctuation and replace missing values with NA.
- Call the transform() function to store the data in the car-data feature group:
The output shows that the data is ingested successfully into the car-data feature group.
The output of the transform_df.show() function is as follows:
We have successfully transformed the input data and ingested it into the car-data feature group.
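The three actions listed for transform() can be sketched without Spark. The following is a minimal plain-Python stand-in that works on a list of dictionaries rather than a Spark DataFrame. The sample row, the use of an epoch timestamp for ingest_time, and the choice to strip every character in string.punctuation are assumptions for illustration, not the notebook's code.

```python
import string
import time
from typing import Dict, List, Optional

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def transform(
    rows: List[Dict[str, Optional[str]]],
    ingest_time: Optional[int] = None,
) -> List[Dict[str, object]]:
    """Lowercase column names, clean values, and stamp each row with ingest_time."""
    if ingest_time is None:
        ingest_time = int(time.time())  # assumed: event time as epoch seconds
    cleaned = []
    for row in rows:
        out: Dict[str, object] = {}
        for column, value in row.items():
            key = column.lower()
            if value is None or str(value).strip() == "":
                out[key] = "NA"  # replace missing values with NA
            else:
                out[key] = str(value).translate(PUNCTUATION_TABLE)
        out["ingest_time"] = ingest_time
        cleaned.append(out)
    return cleaned


# Hypothetical raw record shaped like a row of car_data.csv.
raw_rows = [
    {"Model": "ModelA", "Year": "2019", "Status": "Used",
     "Mileage": "30,000 mi.", "Price": "$25,000", "MSRP": None},
]
transformed = transform(raw_rows, ingest_time=1700000000)
```

Because every punctuation character is removed, a decimal value such as 3.5 would become 35 in this sketch; a real transformation would likely treat numeric columns more carefully.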
Run the @feature_processor code remotely
In this section, we demonstrate running the feature processing code remotely as a Spark application using the @remote decorator. We run the feature processing remotely using Spark to scale to large datasets. Spark provides distributed processing on clusters to handle data that is too big for a single machine. The @remote decorator runs the local Python code as a single or multi-node SageMaker training job.
- Use the @remote decorator along with the @feature_processor decorator as follows:
The spark_config parameter indicates this is run as a Spark application. The SparkConfig instance configures the Spark configuration and dependencies.
- Define the aggregate() function to aggregate the data using PySpark SQL and user-defined functions (UDFs). This function performs the following actions:
  - Concatenate model, year, and status to create model_year_status.
  - Take the average of price to create avg_price.
  - Take the max value of price to create max_price.
  - Take the average of mileage to create avg_mileage.
  - Take the max value of mileage to create max_mileage.
  - Take the average of msrp to create avg_msrp.
  - Take the max value of msrp to create max_msrp.
  - Group by model_year_status.
- Run the aggregate() function, which creates a SageMaker training job to run the Spark application:
As a result, SageMaker creates a training job to run the Spark application defined earlier. It creates a Spark runtime environment using the sagemaker-spark-processing image.
We use SageMaker Training jobs here to run our Spark feature processing application. With SageMaker Training, you can reduce startup times to 1 minute or less by using warm pooling, which is unavailable in SageMaker Processing. This makes SageMaker Training better optimized for short batch jobs like feature processing where startup time is important.
- To view the details, on the SageMaker console, choose Training jobs under Training in the navigation pane, then choose the job with the name aggregate-<timestamp>.
The output of the aggregate() function includes telemetry information. Within that output, you will see the aggregated data as follows:
When the training job is complete, you should see the following output:
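For reference, the aggregation actions listed for aggregate() in this section can be sketched in plain Python. This is an illustrative stand-in for the PySpark logic rather than the notebook's code. The underscore separator in model_year_status, the sample rows, and the assumption that price, mileage, and msrp have already been cleaned to numeric strings are all choices made for this example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def aggregate(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group rows by model_year_status and compute avg/max of numeric columns."""
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        # Concatenate model, year, and status (separator is an assumption).
        key = f"{row['model']}_{row['year']}_{row['status']}"
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        record: Dict[str, object] = {"model_year_status": key}
        for column in ("price", "mileage", "msrp"):
            values = [float(member[column]) for member in members]
            record[f"avg_{column}"] = mean(values)
            record[f"max_{column}"] = max(values)
        result.append(record)
    return result


# Hypothetical cleaned records from the car-data feature group.
sample_rows = [
    {"model": "ModelA", "year": "2019", "status": "Used",
     "price": "20000", "mileage": "40000", "msrp": "28000"},
    {"model": "ModelA", "year": "2019", "status": "Used",
     "price": "30000", "mileage": "60000", "msrp": "32000"},
    {"model": "ModelB", "year": "2021", "status": "New",
     "price": "45000", "mileage": "10", "msrp": "47000"},
]
aggregated = aggregate(sample_rows)
by_key = {record["model_year_status"]: record for record in aggregated}
```

In Spark, the same result would typically come from a groupBy on the concatenated key followed by avg and max aggregations, which is what lets the job scale across a cluster.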
Operationalize the feature processor via SageMaker pipelines
In this section, we demonstrate how to operationalize the feature processor by promoting it to a SageMaker pipeline and scheduling runs.
- First, upload the transformation_code.py file containing the feature processing logic to Amazon S3:
- Next, create a Feature Processor pipeline car_data_pipeline using the .to_pipeline() function:
- To run the pipeline, use the following code:
- Similarly, you can create a pipeline for aggregated features called car_data_aggregated_pipeline and start a run.
- Schedule the car_data_aggregated_pipeline to run every 24 hours:
In the output section, you will see the ARN of the pipeline, the pipeline execution role, and the schedule details:
- To get all the Feature Processor pipelines in this account, use the list_pipelines() function on the Feature Processor:
The output will be as follows:
We have successfully created SageMaker Feature Processor pipelines.
Explore feature processing pipelines and ML lineage
In SageMaker Studio, complete the following steps:
- On the SageMaker Studio console, on the Home menu, choose Pipelines.
You should see two pipelines created: car-data-ingestion-pipeline and car-data-aggregated-ingestion-pipeline.
- Choose the car-data-ingestion-pipeline.
It shows the run details on the Executions tab.
- To view the feature group populated by the pipeline, choose Feature Store under Data and choose car-data.
You will see the two feature groups we created in the previous steps.
- Choose the car-data feature group.
You will see the feature details on the Features tab.
View pipeline runs
To view the pipeline runs, complete the following steps:
- On the Pipeline Executions tab, select car-data-ingestion-pipeline.
This will show all the runs.
- Choose one of the links to see the details of the run.
- To view lineage, choose Lineage.
The full lineage for car-data shows the input data source car_data.csv and upstream entities. The lineage for car-data-aggregated shows the input car-data feature group.
- Choose Load features and then choose Query upstream lineage on car-data and car-data-ingestion-pipeline to see all the upstream entities.
The full lineage for the car-data feature group should look like the following screenshot.
Similarly, the lineage for the car-data-aggregated feature group should look like the following screenshot.
SageMaker Studio provides a single environment to track scheduled pipelines, view runs, explore lineage, and view the feature processing code.
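To make the idea of querying upstream lineage concrete, the following sketch models the entities named in this section as a small directed graph and walks it breadth-first. It is a simplified mental model, not how SageMaker stores lineage: the exact edges, the UPSTREAM mapping, and the query_upstream helper are assumptions for illustration, and real lineage includes additional entities such as pipeline versions and transformation code.

```python
from collections import deque
from typing import Dict, List

# Assumed, simplified lineage: each entity maps to its direct upstream entities.
UPSTREAM: Dict[str, List[str]] = {
    "car-data-aggregated": ["car-data-aggregated-ingestion-pipeline"],
    "car-data-aggregated-ingestion-pipeline": ["car-data"],
    "car-data": ["car-data-ingestion-pipeline"],
    "car-data-ingestion-pipeline": ["car_data.csv"],
    "car_data.csv": [],
}


def query_upstream(entity: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return every entity upstream of `entity`, nearest first."""
    seen: List[str] = []
    queue = deque(graph.get(entity, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # skip entities already visited
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


car_data_lineage = query_upstream("car-data", UPSTREAM)
aggregated_lineage = query_upstream("car-data-aggregated", UPSTREAM)
```

Under these assumptions, the upstream walk from car-data-aggregated passes through the car-data feature group and ends at car_data.csv, matching the relationship described in this section.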
The aggregated features such as average price, max price, average mileage, and more in the car-data-aggregated
feature group provide insight into the nature of the data. You can also use these features as a dataset to train a model to predict car prices, or for other operations. However, training the model is out of scope for this post, which focuses on demonstrating the SageMaker Feature Store capabilities for feature engineering.
Clean up
Don’t forget to clean up the resources created as part of this post to avoid incurring ongoing charges.
- Disable the scheduled pipeline via the fp.schedule() method with the state parameter as Disabled:
- Delete both feature groups:
The data residing in the S3 bucket and offline feature store can incur costs, so you should delete them to avoid any charges.
- Delete the S3 objects.
- Delete the records from the feature store.
Conclusion
In this post, we demonstrated how a car sales company used SageMaker Feature Store Feature Processor to gain valuable insights from their raw sales data by:
- Ingesting and transforming batch data at scale using Spark
- Operationalizing feature engineering workflows via SageMaker pipelines
- Providing lineage tracking and a single environment to monitor pipelines and explore features
- Preparing aggregated features optimized for training ML models
By following these steps, the company was able to transform previously unusable data into structured features that could then be used to train a model to predict car prices. SageMaker Feature Store enabled them to focus on feature engineering rather than the underlying infrastructure.
We hope this post helps you unlock valuable ML insights from your own data using SageMaker Feature Store Feature Processor!
For more information on this, refer to Feature Processing and the SageMaker example on Amazon SageMaker Feature Store: Feature Processor Introduction.
About the Authors
Dhaval Shah is a Senior Solutions Architect at AWS, specializing in Machine Learning. With a strong focus on digital native businesses, he empowers customers to leverage AWS and drive their business growth. As an ML enthusiast, Dhaval is driven by his passion for creating impactful solutions that bring positive change. In his leisure time, he indulges in his love for travel and cherishes quality moments with his family.
Ninad Joshi is a Senior Solutions Architect at AWS, helping global AWS customers design secure, scalable, and cost effective solutions in cloud to solve their complex real-world business challenges. His work in Machine Learning (ML) covers a wide range of AI/ML use cases, with a primary focus on End-to-End ML, Natural Language Processing, and Computer Vision. Prior to joining AWS, Ninad worked as a software developer for 12+ years. Outside of his professional endeavors, Ninad enjoys playing chess and exploring different gambits.