Promote feature discovery and reuse across your organization using Amazon SageMaker Feature Store and its feature-level metadata capability

Amazon SageMaker Feature Store helps data scientists and machine learning (ML) engineers securely store, discover, and share curated data used in training and prediction workflows. Feature Store is a centralized store for features and associated metadata, allowing features to be easily discovered and reused by data scientist teams working on different projects or ML models.

With Feature Store, you have always been able to add metadata at the feature group level. Data scientists who want the ability to search and discover existing features for their models now have the ability to search for information at the feature level by adding custom metadata. For example, the information can include a description of the feature, the date it was last modified, its original data source, certain metrics, or the level of sensitivity.

The following diagram illustrates the architecture relationships between feature groups, features, and associated metadata. Note how data scientists can now specify descriptions and metadata at both the feature group level and the individual feature level.

In this post, we explain how data scientists and ML engineers can use feature-level metadata with the new search and discovery capabilities of Feature Store to promote better feature reuse across their organization. This capability can significantly help data scientists in the feature selection process and, as a result, help you identify features that lead to increased model accuracy.

Use case

For the purposes of this post, we use two feature groups, customer and loan.

The customer feature group has the following features:

  • age – Customer’s age (numeric)
  • job – Type of job (one-hot encoded, such as admin or services)
  • marital – Marital status (one-hot encoded, such as married or single)
  • education – Level of education (one-hot encoded, such as basic 4y or high school)

The loan feature group has the following features:

  • default – Has credit in default? (one-hot encoded: no or yes)
  • housing – Has housing loan? (one-hot encoded: no or yes)
  • loan – Has personal loan? (one-hot encoded: no or yes)
  • total_amount – Total amount of loans (numeric)

The following figure shows example feature groups and feature metadata.

The purpose of adding a description and assigning metadata to each feature is to increase the speed of discovery by enabling new search parameters along which a data scientist or ML engineer can explore features. These can reflect details about a feature such as its calculation, whether it’s an average over 6 months or 1 year, origin, creator or owner, what the feature means, and more.

In the following sections, we provide two approaches to search and discover features and configure feature-level metadata: the first using Amazon SageMaker Studio directly, and the second programmatically.

Feature discovery in Studio

You can easily search and query features using Studio. With the new enhanced search and discovery capabilities, you can immediately retrieve results using a simple type-ahead of a few characters.

The following screenshot demonstrates the following capabilities:

  • You can access the Feature Catalog tab and observe features across feature groups. The features are presented in a table that includes the feature name, type, description, parameters, date of creation, and associated feature group’s name.
  • You can directly use the type-ahead functionality to immediately return search results.
  • You have the flexibility to use different types of filter options: All, Feature name, Description, or Parameters. Note that All will return all features where either Feature name, Description, or Parameters match the search criteria.
  • You can narrow down the search further by specifying a date range using the Created from and Created to fields and specifying parameters using the Search parameter key and Search parameter value fields.

After you have selected a feature, you can choose the feature’s name to bring up its details. When you choose Edit Metadata, you can add a description and up to 25 key-value parameters, as shown in the following screenshot. Within this view, you can ultimately create, view, update, and delete the feature’s metadata. The following screenshot illustrates how to edit feature metadata for total_amount.

As previously stated, adding key-value pairs to a feature gives you more dimensions along which to search for their given features. For our example, the feature’s origin has been added to every feature’s metadata. When you choose the search icon and filter along the key-value pair origin: job, you can see all the features that were one-hot-encoded from this base attribute.

Feature discovery using code

You can also access and update feature information through the AWS Command Line Interface (AWS CLI) and SDK (Boto3) rather than directly through the AWS Management Console. This allows you to integrate the feature-level search functionality of Feature Store with your own custom data science platforms. In this section, we interact with the Boto3 API endpoints to update and search feature metadata.

To begin improving feature search and discovery, you can add metadata using the update_feature_metadata API. In addition to the description and created_date fields, you can add up to 25 parameters (key-value pairs) to a given feature.

The following code is an example of five possible key-value parameters that have been added to the job_admin feature. This feature was created, along with job_services and job_none, by one-hot-encoding job.

sagemaker_client.update_feature_metadata(
    FeatureGroupName="customer",
    FeatureName="job_admin",
    ParameterAdditions=[
        {"Key": "author", "Value": "arnaud"}, # Feature's author
        {"Key": "team", "Value": "mlops"}, # Team owning the feature
        {"Key": "origin", "Value": "job"}, # Raw input parameter
        {"Key": "sensitivity", "Value": "5"}, # 1-5 scale for data sensitivity
        {"Key": "env", "Value": "testing"} # Environment the feature is used in
    ]
)

After author, team, origin, sensitivity, and env have been added to the job_admin feature, data scientists or ML engineers can retrieve them by calling the describe_feature_metadata API. You can navigate to the Parameters object in the response for the metadata we previously added to our feature. The describe_feature_metadata API endpoint allows you to get greater insight into a given feature by getting its associated metadata.

response = sagemaker_client.describe_feature_metadata(
    FeatureGroupName="customer",
    FeatureName="job_admin",
)

# Navigate to 'Parameters' in response to get metadata
metadata = response['Parameters']

You can search for features by using the SageMaker search API using metadata as search parameters. The following code is an example function that takes a search_string parameter as an input and returns all features where the feature’s name, description, or parameters match the condition:

def search_features_using_string(search_string):
    response = sagemaker_client.search(
        Resource= "FeatureMetadata",
        SearchExpression={
            'Filters': [
               {
                   'Name': 'FeatureName',
                   'Operator': 'Contains',
                   'Value': search_string
               },
               {
                   'Name': 'Description',
                   'Operator': 'Contains',
                   'Value': search_string
               },
               {
                   'Name': 'AllParameters',
                   'Operator': 'Contains',
                   'Value': search_string
               }
           ],
           "Operator": "Or"
        },
    )

    # Displaying results in a pandas DataFrame
    df=pd.json_normalize(response['Results'], max_level=1)
    df.columns = df.columns.map(lambda col: col.split(".")[1])
    df=df.drop('FeatureGroupArn', axis=1)

    return df

The following code snippet uses our search_features function to retrieve all features for which either the feature name, description, or parameters contain the word job:

search_results = search_features_using_string('mlops')
search_results

The following screenshot contains the list of matching feature names as well as their corresponding metadata, including timestamps for each feature’s creation and last modification. You can use this information to improve discovery and visibility into your organization’s features.

Conclusion

SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across business units and data science teams. Improving feature reuse and feature consistency are primary benefits of a feature store. In this post, we explained how you can use feature-level metadata to improve search and discovery of features. This included creating metadata around a variety of use cases and using it as additional search parameters.

Give it a try, and let us know what you think in comments. If you want to learn more about collaborating and sharing features within Feature Store, refer to Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store.


About the authors

Arnaud Lauer is a Senior Partner Solutions Architect in the Public Sector team at AWS. He enables partners and customers to understand how best to use AWS technologies to translate business needs into solutions. He brings more than 16 years of experience in delivering and architecting digital transformation projects across a range of industries, including the public sector, energy, and consumer goods. Artificial intelligence and machine learning are some of his passions. Arnaud holds 12 AWS certifications, including the ML Specialty Certification.

Nicolas Bernier is an Associate Solutions Architect, part of the Canadian Public Sector team at AWS. He is currently conducting a master’s degree with a research area in Deep Learning and holds five AWS certifications, including the ML Specialty Certification. Nicolas is passionate about helping customers deepen their knowledge of AWS by working with them to translate their business challenges into technical solutions.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Khushboo Srivastava is a Senior Product Manager for Amazon SageMaker. She enjoys building products that simplify machine learning workflows for customers. In her spare time, she enjoys playing violin, practicing yoga, and traveling.

Read More