Access Snowflake data using OAuth-based authentication in Amazon SageMaker Data Wrangler

In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions for data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.

Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP) such as Okta, Azure AD, or Ping Federate to connect to Snowflake via Data Wrangler.

Solution overview

In the following sections, we provide steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.

Prerequisites

For this walkthrough, you should have the following prerequisites:

For admin:
- A Snowflake user with permissions to create storage integrations, and security integrations in Snowflake.
- An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles.
- Access and permissions to configure IDP to register Data Wrangler application and set up the authorization server or API.

For data scientist:
- An S3 bucket that Data Wrangler can use to output transformed data.
- Access to Amazon SageMaker, an instance of Amazon SageMaker Studio, and a user for Studio. For more information about prerequisites, see Get Started with Data Wrangler.
- An IAM role used for Studio with permissions to create and update secrets in AWS Secrets Manager.

Administrator setup

Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.

The following steps are involved to enable Data Wrangler OAuth access to Snowflake:

Configure the IdP.
Configure Snowflake.
Configure SageMaker Studio.

Configure the IdP

To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.

Register the Data Wrangler application within the IdP

Refer to the following documentation for the IdPs that Data Wrangler supports:

Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to properly use the documentation provided by your IdP.

Specific customizations in addition to the steps in the respective guides are called out in the subsections.

Select the configuration that starts the process of registering Data Wrangler as an application.
Provide the users within the IdP access to Data Wrangler.
Enable OAuth client authentication by storing the client credentials as a Secrets Manager secret.
Specify a redirect URL using the following format: https://domain-ID.studio.AWS Region.sagemaker.aws/jupyter/default/lab.

You’re specifying the SageMaker domain ID and AWS Region that you’re using to run Data Wrangler. You must register a URL for each domain and Region where you’re running Data Wrangler. Users from a domain and Region that don’t have redirect URLs set up for them won’t be able to authenticate with the IdP to access the Snowflake connection.

Make sure the authorization code and refresh token grant types are allowed for your Data Wrangler application.

Set up the authorization server or API within the IdP

Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.

Snowflake uses the concept of roles that are distinct from IAM roles used in AWS. You must configure the IdP to use ANY Role to use the default role associated with the Snowflake account. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.

Use the following procedure to set up the authorization server or API within your IdP:

From your IdP, begin the process of setting up the server or API.
Configure the authorization server to use the authorization code and refresh token grant types.
Specify the lifetime of the access token.
Set the refresh token idle timeout.

The idle timeout is the time that the refresh token expires if it’s not used. If you’re scheduling jobs in Data Wrangler, we recommend making the idle timeout time greater than the frequency of the processing job. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must re-authenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.

Note that Data Wrangler doesn’t support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.

If the refresh token expires, your users must reauthenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.

Specify session:role-any as the new scope.

For Azure AD, you must also specify a unique identifier for the scope.

After you’ve set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:

Token URL – The URL of the token that the IdP sends to Data Wrangler
Authorization URL – The URL of the authorization server of the IdP
Client ID – The ID of the IdP
Client secret – The secret that only the authorization server or API recognizes
OAuth scope – This is for Azure AD only

Configure Snowflake

To configure Snowflake, complete the instructions in Import data from Snowflake.

Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section Register the Data Wrangler application within the IdP for more information on how to set up an external OAuth integration.

When you’re setting up the security integration in Snowflake, make sure you activate external_oauth_any_role_mode.

Configure SageMaker Studio

You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you’re using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs into Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.

Create a secret for Snowflake credentials

To create your secret for Snowflake credentials, complete the following steps:

On the Secrets Manager console, choose Store a new secret.
For Secret type, select Other type of secret.
Specify the details of your secret as key-value pairs.

Key names require lowercase letters due to case sensitivity. Data Wrangler gives a warning if you enter any of these incorrectly. Input the secret values as key-value pairs Key/value if you’d like, or use the Plaintext option.

The following is the format of the secret used for Okta. If you are using Azure AD, you need to add the datasource_oauth_scope field.

{
    "token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id":"example-client-id",
    "client_secret":"example-client-secretsovfDSUoOKAiLe4V6DiZrCLpW44x",
    "identity_provider":"OKTA"|"AZURE_AD"|"PING_FEDERATE",
    "authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"
}

Update the preceding values with your choice of IdP and information gathered after application registration.
Choose Next.
For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
In the Tags section, add a tag with the key SageMaker and value true.
Choose Next.
The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.

After you store the secret, you’re returned to the Secrets Manager console.

Choose the secret you just created, then retrieve the secret ARN.
Store this in your preferred text editor for use later when you create the Data Wrangler data source.

Create a Studio Lifecycle Configuration

To create a Lifecycle Configuration in Studio, complete the following steps:

On the SageMaker console, choose Lifecycle configurations in the navigation pane.
Choose Create configuration.
Choose Jupyter server app.

Create a new lifecycle configuration or append an existing one with the following content:

#!/bin/bash
set -eux
## Script Body
cat > ~/.snowflake_identity_provider_oauth_config <<EOL
{
    "secret_arn": "<secret_arn>"
}
EOL

The configuration creates a file with the name ".snowflake_identity_provider_oauth_config", containing the secret in the user’s home folder.

Choose Create Configuration.

Set the default Lifecycle Configuration

Complete the following steps to set the Lifecycle Configuration you just created as the default:

On the SageMaker console, choose Domains in the navigation pane.
Choose the Studio domain you’ll be using for this example.
On the Environment tab, in the Lifecycle configurations for personal Studio apps section, choose Attach.
For Source, select Existing configuration.
Select the configuration you just made, then choose Attach to domain.
Select the new configuration and choose Set as default, then choose Set as default again in the pop-up message.

Your new settings should now be visible under Lifecycle configurations for personal Studio apps as default.

Shut down the Studio app and relaunch for the changes to take effect.

Data scientist experience

In this section, we cover how data scientists can connect to Snowflake as a data source in Data Wrangler and prepare data for ML.

Create a new data flow

To create your data flow, complete the following steps:

On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
Choose Open Studio.
On the Studio Home page, choose Import & prepare data visually. Alternatively, on the File drop-down, choose New, then choose SageMaker Data Wrangler Flow.

Creating a new flow can take a few minutes.

On the Import data page, choose Create connection.
Choose Snowflake from the list of data sources.
For Authentication method, choose OAuth.

If you don’t see OAuth, verify the preceding Lifecycle Configuration steps.

Enter details for Snowflake account name and Storage integration.
Ener a connection name and choose Connect.

You’re redirected to an IdP authentication page. For this example, we’re using Okta.

Enter your user name and password, then choose Sign in.

After the authentication is successful, you’re redirected to the Studio data flow page.

On the Import data from Snowflake page, browse the database objects, or run a query for the targeted data.
In the query editor, enter a query and preview the results.

In the following example, we load Loan Data and retrieve all columns from 5,000 rows.

Choose Import.
Enter a dataset name (for this post, we use snowflake_loan_dataset) and choose Add.

You’re redirected to the Prepare page, where you can add transformations and analyses to the data.

Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of the capabilities of Data Wrangler in this post on data preparation; you can use Data Wrangler for more advanced data analysis such as feature importance, target leakage, and model explainability using an easy and intuitive user interface.

Analyze data quality

Use the Data Quality and Insights Report to perform an analysis of the data that you’ve imported into Data Wrangler. Data Wrangler creates the report from the sampled data.

On the Data Wrangler flow page, choose the plus sign next to Data types, then choose Get data insights.
Choose Data Quality And Insights Report for Analysis type.
For Target column, choose your target column.
For Problem type, select Classification.
Choose Create.

The insights report has a brief summary of the data, which includes general information such as missing values, invalid values, feature types, outlier counts, and more. You can either download the report or view it online.

Add transformations to the data

Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.

On the Data Wrangler flow page, choose plus sign, then choose Add transform.

If you’re following the steps in the post, you’re directed here automatically after adding your dataset.

Verify and modify the data type of the columns.

Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type rather than string.

After applying the changes and adding the step, you can verify the column data type is changed to float.

Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we can drop them.

Choose Add Step, then choose Manage columns.
For Transform, choose Drop column.
For Column to drop, specify EMP_TITLE, URL, DESCRIPTION, and TITLE.
Choose Preview and Add.

Next, we want to look for categorical data in our dataset. Data Wrangler has a built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.

Add another step and choose Encode categorical.
For Transform, choose One-hot encode.
For Input column, choose TERM.
For Output style, choose Columns.
Leave all other settings as default, then choose Preview and Add.

The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.

Repeat the preceding steps to apply a one-hot encoding approach on these values.

Lastly, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector rather than columns.

For Transform, choose One-hot encode.
For Input column, choose PURPOSE.
For Output style, choose Vector.
For Output column, we call this column PURPOSE_VCTR.

This keeps the original PURPOSE column, if we decide to use it later.

Leave all other settings as default, then choose Preview and Add.

Export the data flow

Finally, we export this whole data flow to a feature store with a SageMaker Processing job, which creates a Jupyter notebook with the code pre-populated.

On the data flow page , choose the plus sign and Export to.
Choose where to export. For our use case, we choose SageMaker Feature Store.

The exported notebook is now ready to run.

Export data and train a model with Autopilot

Now we can train the model using Amazon SageMaker Autopilot.

On the data flow page, choose the Training tab.
For Amazon S3 location, enter a location for the data to be saved.
Choose Export and train.
Specify the settings in the Target and features, Training method, Deployment and advance settings, and Review and create sections.
Choose Create experiment to find the best model for your problem.

Clean up

If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.

Conclusion

In this post, we demonstrated connecting Data Wrangler to Snowflake using OAuth, transforming and analyzing a dataset, and finally exporting it to the data flow so that it could be used in a Jupyter notebook. Most notably, we created a pipeline for data preparation without having to write any code at all.

To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.

About the authors

Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience in working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.

Matt Marzillo is a Sr. Partner Sales Engineer at Snowflake. He has 10 years of experience in data science and machine learning roles both in consulting and with industry organizations. Matt has experience developing and deploying AI and ML models across many different organizations in areas such as marketing, sales, operations, clinical, and finance, as well as advising in consultative roles.

Huong Nguyen is a product leader for Amazon SageMaker Data Wrangler at AWS. She has 15 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys audio books, gardening, hiking, and spending time with her family and friends.

Vedere AI