Enable cross-account access for Amazon SageMaker Data Wrangler using AWS Lake Formation

Amazon SageMaker Data Wrangler is the fastest and easiest way for data scientists to prepare data for machine learning (ML) applications. With Data Wrangler, you can simplify the process of feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization through a single visual interface. Data Wrangler comes with 300 built-in data transformation recipes that you can use to quickly normalize, transform, and combine features. With the data selection tool in Data Wrangler, you can quickly select data from different data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift.

AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to Athena tables.

In this post, we demonstrate how to enable cross-account access for Data Wrangler using Athena as a source and Lake Formation as a central data governance capability. As shown in the following architecture diagram, Account A is the data lake account that holds all the ML-ready data derived from ETL pipelines. Account B is the data science account where a team of data scientists uses Data Wrangler to compile and run data transformations. We need to grant cross-account permissions through Lake Formation so that Data Wrangler in Account B can access the data tables located in Account A’s data lake.

With this architecture, data scientists and engineers outside the data lake account can access data from the lake and create data transformations via Data Wrangler.

Before you dive into the setup process, ensure that the data to be shared across accounts is crawled and cataloged, as detailed in this post. For this walkthrough, we presume this process has been completed and that the databases and tables already exist in Lake Formation.

The following are the high-level steps to implement this solution:

  1. In Account A, register your S3 bucket using Lake Formation and create the necessary databases and tables for the data if they don’t already exist.
  2. The Lake Formation administrator can now share datasets from Account A with other accounts. Lake Formation shares these resources using AWS Resource Access Manager (AWS RAM).
  3. In Account B, accept the resource share invitation using AWS RAM. Then, in Lake Formation, create a local database and a resource link for the shared table.
  4. Grant the SageMaker Studio execution role in Account B permissions to access the shared table and the resource link you created in the previous step.
  5. In Data Wrangler, use the local database and the resource link you created in Account B to query the dataset using the Athena connector and perform feature transformations.

Data lake setup using Lake Formation

To get started, create a central data lake in Account A. You can control the access to the data lake with policies and permissions, and define permissions at the database, table, or column level.

To kickstart the setup process, download the titanic dataset .csv file and upload it to your S3 bucket. After you upload the file, you need to register the bucket in Lake Formation. Lake Formation permissions enable fine-grained access control for data in your data lake.

Note: If the titanic dataset has already been cataloged, you can skip the registration step below.
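If you prefer to script the upload, the following is a minimal boto3 sketch; the bucket name and key shown here are placeholders for this post:

import boto3

# Placeholder bucket and key; replace with your own data lake bucket
s3 = boto3.client("s3")
s3.upload_file(
    Filename="titanic.csv",
    Bucket="my-datalake-bucket",
    Key="titanic/titanic.csv",
)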

Register your S3 data store in Lake Formation

To register your data store, complete the following steps:

  1. In Account A, sign in to the Lake Formation console.

If this is the first time you’re accessing Lake Formation, you need to add administrators to the account.

  2. In the navigation pane, under Permissions, choose Admins and database creators.
  3. Under Data lake administrators, choose Grant.

You now add AWS Identity and Access Management (IAM) users or roles specific to Account A as data lake administrators.

  4. Under Manage data lake administrators, for IAM users and roles, choose your user or role (for this post, we use user-a).

This can also be the IAM admin role of Account A.

  5. Choose Save.

  6. Make sure the IAMAllowedPrincipals group is not listed under either Data lake administrators or Database creators.

For more information about security settings, see Changing the Default Security Settings for Your Data Lake.
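If you manage these settings as code, the following boto3 sketch approximates the console steps above. It assumes user-a in Account A is the intended administrator, and note that PutDataLakeSettings replaces the existing settings wholesale, so merge with your current configuration in a real account:

import boto3

lakeformation = boto3.client("lakeformation")  # credentials for Account A

lakeformation.put_data_lake_settings(
    DataLakeSettings={
        # Assumption: user-a is the data lake administrator for this post
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_A_ID>:user/user-a"}
        ],
        # Empty defaults keep IAMAllowedPrincipals off newly created databases and tables
        "CreateDatabaseDefaultPermissions": [],
        "CreateTableDefaultPermissions": [],
    }
)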

Next, you need to register the S3 bucket as the data lake location.

  7. On the Lake Formation console, under Register and ingest, choose Data lake locations.

This page should display a list of S3 buckets that are marked as data lake storage resources for Lake Formation. A single S3 bucket may act as the repository for many datasets, or you could use separate buckets for separate data sources.

  8. Choose Register location.
  9. For Amazon S3 path, enter the path for your bucket.
  10. For IAM role, choose AWSServiceRoleForLakeFormationDataAccess.
  11. Choose Register location.

After this step, you should be able to see your S3 bucket under Data lake locations.
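As a rough programmatic equivalent of the registration steps above, you can register the bucket with boto3; the bucket ARN is a placeholder:

import boto3

lakeformation = boto3.client("lakeformation")  # credentials for Account A

# UseServiceLinkedRole corresponds to choosing AWSServiceRoleForLakeFormationDataAccess
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::my-datalake-bucket",
    UseServiceLinkedRole=True,
)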

Create a database

This step is optional. Skip this step if the titanic dataset has already been crawled and cataloged. The database and table for the dataset should pre-exist within the data lake.

Complete the following steps to register the database if it does not exist:

  1. On the Lake Formation console, under Data catalog, choose Databases.
  2. Choose Create database.
  3. For Database details, select Database.
  4. For Name, enter a name (for example, titanic).
  5. For Location, enter the S3 data lake bucket path.
  6. Deselect Use only IAM access controls for tables in this database.
  7. Choose Create database.

  8. Under Actions, choose Permissions.
  9. Choose View permissions.
  10. Make sure that the IAMAllowedPrincipals group isn’t listed.

If it’s listed, make sure you revoke access to this group.

You should now be able to view the created database listed under Databases.

You should also be able to see the table on the Lake Formation console, under Data catalog in the navigation pane, under Tables. For this demo, we presume the table is named titanic_datalake_bucket_as.
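If the database doesn’t exist yet, you can also create it and remove the default IAMAllowedPrincipals grant with the AWS SDK. This is a minimal sketch; the database name and location mirror the console steps, and ALL is the API name for the Super permission shown in the console:

import boto3

glue = boto3.client("glue")                    # credentials for Account A
lakeformation = boto3.client("lakeformation")

glue.create_database(
    DatabaseInput={
        "Name": "titanic",
        "LocationUri": "s3://my-datalake-bucket/",  # placeholder data lake path
    }
)

# Revoke the Super (ALL) grant from the IAMAllowedPrincipals group, if it was applied
lakeformation.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    Resource={"Database": {"Name": "titanic"}},
    Permissions=["ALL"],
)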

Grant table permissions to Account A

To grant table permissions to Account A, complete the following steps:

  1. Sign in to the Lake Formation console with Account A.
  2. Under Data catalog, choose Tables.
  3. Select the newly created table.
  4. On the Actions menu, under Permissions, choose Grant.
  5. Select My account.
  6. For IAM users and roles, choose the users or roles you want to grant access (for this post, we choose user-x, a different user within Account A).

You can also set a column filter.

  7. For Columns, choose Include columns.
  8. For Include columns, choose the first five columns from the titanic_datalake_bucket_as table.
  9. For Table permissions, select Select.
  10. Choose Grant.

  11. Still in Account A, switch to the Athena console.
  12. Run a table preview.

You should be able to see the first five columns of the titanic_datalake_bucket_as table as per the granted permissions in the previous steps.

We have validated local access to the data lake table within Account A via this Athena step. Next, let’s grant access to the same table to an external account, in our case Account B.
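The same column-level grant can be issued through the Lake Formation API. In this sketch, the account ID is a placeholder and the column names are assumptions based on the titanic schema; adjust them to match your crawled table:

import boto3

lakeformation = boto3.client("lakeformation")  # credentials for Account A

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_A_ID>:user/user-x"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "titanic",
            "Name": "titanic_datalake_bucket_as",
            # Assumed first five columns of the crawled titanic table
            "ColumnNames": ["passengerid", "survived", "pclass", "name", "sex"],
        }
    },
    Permissions=["SELECT"],
)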

Grant table permissions to Account B

This external account is the account running Data Wrangler. To grant table permissions, complete the following steps:

  1. Staying within Account A, on the Actions menu, under Permissions, choose Grant.
  2. Select External account.
  3. For AWS account ID, enter the account ID of Account B.
  4. Choose the same first five columns of the table.
  5. For Table permissions and Grantable permissions, select Select.
  6. Choose Grant.

You must revoke the Super permission from the IAMAllowedPrincipals group for this table before granting it external access. You can do this on the Actions menu under View permissions, then choose IAMAllowedPrincipals and choose Revoke.
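For reference, here is a comparable sketch of the external grant through the API; SELECT is granted both directly and as a grantable permission so that Account B’s Lake Formation administrator can re-grant it to local principals:

import boto3

lakeformation = boto3.client("lakeformation")  # credentials for Account A

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "<ACCOUNT_B_ID>"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "titanic",
            "Name": "titanic_datalake_bucket_as",
            # Same assumed first five columns as the earlier grant
            "ColumnNames": ["passengerid", "survived", "pclass", "name", "sex"],
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)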

  7. On the AWS RAM console, still in Account A, under Shared by me, choose Shared resources.

We can find a Lake Formation entry on this page.

  8. Switch to Account B.
  9. On the AWS RAM console, under Shared with me, you should see an invitation from Lake Formation in Account A.

  10. Accept the invitation by choosing Accept resource share.

After you accept it, on the Resource shares page, you should see the shared Lake Formation entry, which encapsulates the catalog, database, and table information.
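If you prefer to accept the invitation programmatically in Account B, a sketch like the following works; it accepts the first pending invitation it finds, so filter by sender account in a real setup:

import boto3

ram = boto3.client("ram")  # credentials for Account B

invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
pending = [inv for inv in invitations if inv["status"] == "PENDING"]

if pending:
    ram.accept_resource_share_invitation(
        resourceShareInvitationArn=pending[0]["resourceShareInvitationArn"]
    )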

On the Lake Formation console in Account B, you can find the shared table owned by Account A on the Tables page. If you don’t see it, you can refresh your screen and the resource should appear shortly.

To use this shared table inside Account B, you need to create a database local to Account B in Lake Formation.

  11. On the Lake Formation console, under Databases, choose Create database.
  12. Name the database local_db.

Next, for the shared titanic table in Lake Formation, you need to create a resource link. Resource links are Data Catalog objects that link to metadata databases and tables, typically to shared databases and tables from other AWS accounts. They help enable cross-account access to data in the data lake.

  13. On the table details page, on the Actions menu, choose Create resource link.

  14. For Resource link name, enter a name (for example, titanic_local).
  15. For Database, choose the local database you created previously.
  16. The values for Shared table and Shared table’s database should match the ones in Account A and be auto-populated.
  17. For Shared table’s owner ID, choose the account ID of Account A.
  18. Choose Create.
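The local database and resource link can also be created with the AWS Glue API. This sketch assumes the names used in this post and a placeholder account ID for Account A:

import boto3

glue = boto3.client("glue")  # credentials for Account B

# Local database that will hold the resource link
glue.create_database(DatabaseInput={"Name": "local_db"})

# Resource link pointing to the table shared from Account A
glue.create_table(
    DatabaseName="local_db",
    TableInput={
        "Name": "titanic_local",
        "TargetTable": {
            "CatalogId": "<ACCOUNT_A_ID>",
            "DatabaseName": "titanic",
            "Name": "titanic_datalake_bucket_as",
        },
    },
)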

  19. In the navigation pane, under Data catalog, choose Settings.
  20. Make sure Use only IAM access control is disabled for new databases and tables.

This is to make sure that Lake Formation manages the database and table permissions.

  21. Switch to the SageMaker console.
  22. In the Studio Control Panel, under Studio Summary, copy the ARN of the execution role.
  23. Grant this role permissions to access the local database, the shared table, and the resource link you created previously in Account B’s Lake Formation.
  24. You also need to attach the following custom policy to this role. This policy allows Studio to access data via Lake Formation and allows Account B to get data partitions for querying the titanic dataset from the created tables:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess",
        "glue:GetPartitions"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
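If you prefer to attach this policy with code rather than the IAM console, a minimal sketch looks like the following; the role name and policy name are placeholders:

import json
import boto3

iam = boto3.client("iam")  # credentials for Account B

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["lakeformation:GetDataAccess", "glue:GetPartitions"],
            "Resource": ["*"],
        }
    ],
}

iam.put_role_policy(
    RoleName="<StudioExecutionRoleName>",          # placeholder role name
    PolicyName="DataWranglerLakeFormationAccess",  # placeholder policy name
    PolicyDocument=json.dumps(policy),
)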
  25. Switch back to the Lake Formation console.
  26. Here, we need to grant permissions for the SageMaker execution role to access the shared titanic_datalake_bucket_as table.

This is the table that you shared to Account B from Account A via AWS RAM.

  27. In Account B, on the table details page, on the Actions menu, under Permissions, choose Grant.
  28. Grant the role access to the table and five columns.

  29. Finally, grant the SageMaker execution role permissions to access the local titanic table in Account B.
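These grants can also be issued through the Lake Formation API. The following sketch uses placeholder ARNs and the same assumed column names as earlier; DESCRIBE on the resource link makes titanic_local visible to the role, and SELECT on the shared table’s columns authorizes the actual reads:

import boto3

lakeformation = boto3.client("lakeformation")  # credentials for Account B
role_arn = "arn:aws:iam::<ACCOUNT_B_ID>:role/<StudioExecutionRole>"  # placeholder

# DESCRIBE on the local resource link
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"Table": {"DatabaseName": "local_db", "Name": "titanic_local"}},
    Permissions=["DESCRIBE"],
)

# SELECT on the shared columns of the table owned by Account A
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={
        "TableWithColumns": {
            "CatalogId": "<ACCOUNT_A_ID>",
            "DatabaseName": "titanic",
            "Name": "titanic_datalake_bucket_as",
            "ColumnNames": ["passengerid", "survived", "pclass", "name", "sex"],
        }
    },
    Permissions=["SELECT"],
)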

Cross-account data access in Studio

In this final stage, you validate the setup deployed so far by testing it in the Data Wrangler interface.

  1. On the Import tab, for Import data, choose Amazon Athena as your data source.

  2. For Data catalog, choose AwsDataCatalog.
  3. For Database, choose the local database you created in Account B (local_db).

You should be able to see the local table (titanic_local) in the right pane.

  4. Run an Athena query (a sample query follows these steps) to see the selected columns of the titanic dataset that you granted to the SageMaker execution role in Lake Formation (Account B).
  5. Choose Import dataset.

  6. For Dataset Name, enter a name (for example, titanic-dataset).
  7. Choose Add.

This imports the titanic dataset, and you should be able to see the data flow page with the visual blocks on the Prepare tab.
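The query in the import step can be as simple as a SELECT against the resource link. Outside of Data Wrangler, you can run the same check with the Athena API; the query results bucket below is a placeholder:

import boto3

athena = boto3.client("athena")  # credentials for Account B

response = athena.start_query_execution(
    QueryString='SELECT * FROM "local_db"."titanic_local" LIMIT 10',
    QueryExecutionContext={"Catalog": "AwsDataCatalog", "Database": "local_db"},
    ResultConfiguration={"OutputLocation": "s3://<athena-query-results-bucket>/"},
)
print(response["QueryExecutionId"])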

Conclusion

In this post, we demonstrated how to enable cross-account access for Data Wrangler using Lake Formation and AWS RAM. Following this methodology, organizations can allow multiple data science and engineering teams to access data from a central data lake and build feature pipelines and transformation recipes consistently. For more information about Data Wrangler, see Introducing Amazon SageMaker Data Wrangler, a Visual Interface to Prepare Data for Machine Learning and Exploratory data analysis, feature engineering, and operationalizing your data flow into your ML pipeline with Amazon SageMaker Data Wrangler.

Give Data Wrangler a try and share your feedback and questions in the comments section.


About the Authors

Rizwan Gilani is a Software Development Engineer at Amazon SageMaker. His passion lies with making machine learning more interactive and accessible at scale. Before that, he worked on Amazon Alexa as part of the core team that launched Alexa Communications.

 

 

Phi Nguyen is a Solutions Architect at AWS, helping customers with their cloud journey with a special focus on data lakes, analytics, semantic technologies, and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team, or enjoying a nature walk with his family.

 

 

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

 
