Team and user management with Amazon SageMaker and AWS SSO

Amazon SageMaker Studio is a web-based integrated development environment (IDE) for machine learning (ML) that lets you build, train, debug, deploy, and monitor your ML models. Each onboarded user in Studio has their own dedicated set of resources, such as compute instances, a home directory on an Amazon Elastic File System (Amazon EFS) volume, and a dedicated AWS Identity and Access Management (IAM) execution role.

One of the most common real-world challenges in setting up user access for Studio is how to manage multiple users, groups, and data science teams for data access and resource isolation.

Many customers implement user management using federated identities with AWS Single Sign-On (AWS SSO) and an external identity provider (IdP), such as Active Directory (AD) or an AWS Managed Microsoft AD directory. This approach is aligned with the AWS recommended practice of using temporary credentials to access AWS accounts.

An Amazon SageMaker domain supports AWS SSO and can be configured in AWS SSO authentication mode. In this case, each entitled AWS SSO user has their own Studio user profile. Users given access to Studio have a unique sign-in URL that directly opens Studio, and they sign in with their AWS SSO credentials. Organizations manage their users in AWS SSO instead of the SageMaker domain. You can assign multiple users access to the domain at the same time. You can use Studio user profiles for each user to define their security permissions in Studio notebooks via an IAM role attached to the user profile, called an execution role. This role controls permissions for SageMaker operations according to its IAM permission policies.

In AWS SSO authentication mode, there is always a one-to-one mapping between users and user profiles. The SageMaker domain manages the creation of user profiles based on the AWS SSO user ID. You can’t create user profiles via the AWS Management Console. This works well when a user is a member of only one data science team, or when users have the same or very similar access requirements across their projects and teams. In the more common case, when a user participates in multiple ML projects and is a member of multiple teams with slightly different permission requirements, the user needs access to different Studio user profiles with different execution roles and permission policies. Because you can’t manage user profiles independently of AWS SSO in AWS SSO authentication mode, you can’t implement a one-to-many mapping between users and Studio user profiles.

If you need to establish a strong separation of security contexts, for example for different data categories, or need to entirely prevent the visibility of one group of users’ activity and resources to another, the recommended approach is to create multiple SageMaker domains. At the time of this writing, you can create only one domain per AWS account per Region. To implement the strong separation, you can use multiple AWS accounts with one domain per account as a workaround.

The second challenge is to restrict access to the Studio IDE to users inside a corporate network or a designated VPC only. You can achieve this by using IAM-based access control policies. In this case, the SageMaker domain must be configured with IAM authentication mode, because IAM identity-based policies aren’t supported by the sign-in mechanism in AWS SSO mode. The post Secure access to Amazon SageMaker Studio with AWS SSO and a SAML application solves this challenge and demonstrates how to control network access to a SageMaker domain.

This solution addresses these challenges of AWS SSO user management for Studio for a common use case of multiple user groups and a many-to-many mapping between users and teams. The solution outlines how to use a custom SAML 2.0 application as the mechanism to trigger user authentication for Studio and to support multiple Studio user profiles per AWS SSO user.

You can use this approach to implement a custom user portal with applications backed by the SAML 2.0 authorization process. Your custom user portal can have maximum flexibility on how to manage and display user applications. For example, the user portal can show some ML project metadata to facilitate identifying an application to access.

You can find the solution’s source code in our GitHub repository.

Solution overview

The solution implements the following architecture.

The main high-level architecture components are as follows:

  1. Identity provider – Users and groups are managed in an external identity source, for example in Azure AD. User assignments to AD groups define what permissions a particular user has and which Studio team they have access to. The identity source must be synchronized with AWS SSO.
  2. AWS SSO – AWS SSO manages SSO users, SSO permission sets, and applications. This solution uses a custom SAML 2.0 application to provide access to Studio for entitled AWS SSO users. The solution also uses SAML attribute mapping to populate the SAML assertion with specific access-relevant data, such as user ID and user team. Because the solution creates a SAML API, you can use any IdP supporting SAML assertions to create this architecture. For example, you can use Okta or even your own web application that provides a landing page with a user portal and applications. For this post, we use AWS SSO.
  3. Custom SAML 2.0 applications – The solution creates one application per Studio team and assigns one or multiple applications to a user or a user group based on entitlements. Users can access these applications from within their AWS SSO user portal based on assigned permissions. Each application is configured with the Amazon API Gateway endpoint URL as its SAML backend.
  4. SageMaker domain – The solution provisions a SageMaker domain in an AWS account and creates a dedicated user profile for each combination of AWS SSO user and Studio team the user is assigned to. The domain must be configured in IAM authentication mode.
  5. Studio user profiles – The solution automatically creates a dedicated user profile for each user-team combination. For example, if a user is a member of two Studio teams and has corresponding permissions, the solution provisions two separate user profiles for this user. Each profile always belongs to one and only one user. Because you have a Studio user profile for each possible combination of a user and a team, you must consider your account limits for user profiles before implementing this approach. For example, if your limit is 500 user profiles, and each user is a member of two teams, you consume that limit 2.5 times faster, and as a result you can onboard 250 users. With a high number of users, we recommend implementing multiple domains and accounts for security context separation. To demonstrate the proof of concept, we use two users, User 1 and User 2, and two Studio teams, Team 1 and Team 2. User 1 belongs to both teams, whereas User 2 belongs to Team 2 only. User 1 can access Studio environments for both teams, whereas User 2 can access only the Studio environment for Team 2.
  6. Studio execution roles – Each Studio user profile uses a dedicated execution role with permission policies that grant the required level of access for the specific team the user belongs to. Studio execution roles implement an effective permission isolation between individual users and their team roles. You manage data and resource access for each role and not at an individual user level.

The solution also implements attribute-based access control (ABAC) using SAML 2.0 attributes, tags on Studio user profiles, and tags on SageMaker execution roles.

In this particular configuration, we assume that AWS SSO users don’t have permissions to sign in to the AWS account and don’t have corresponding AWS SSO-controlled IAM roles in the account. Each user signs in to their Studio environment via a presigned URL from an AWS SSO portal without the need to go to the console in their AWS account. In a real-world environment, you might need to set up AWS SSO permission sets for users to allow the authorized users to assume an IAM role and to sign in to an AWS account. For example, you can provide data scientist role permissions for a user to be able to interact with account resources and have the level of access they need to fulfill their role.

Solution architecture and workflow

The following diagram presents the end-to-end sign-in flow for an AWS SSO user.

An AWS SSO user chooses the corresponding Studio application in their AWS SSO portal. AWS SSO prepares a SAML assertion (1) with the configured SAML attribute mappings. The custom SAML application is configured with the API Gateway endpoint URL as its Assertion Consumer Service (ACS) URL and requires mapping attributes that contain the AWS SSO user ID and team ID. We use the ssouserid and teamid custom attributes to send all the needed information to the SAML backend.

API Gateway passes the SAML response to the SAML backend API, which is implemented by an AWS Lambda function (2). The function parses the SAML response to extract the user ID and team ID, and uses them to retrieve a team-specific configuration, such as the execution role and SageMaker domain ID. The function checks whether the required user profile exists in the domain and creates a new one with the corresponding configuration settings if it doesn’t. Afterwards, the function generates a Studio presigned URL for the specific Studio user profile by calling the CreatePresignedDomainUrl API (3) via a SageMaker API VPC endpoint. Finally, the Lambda function returns the presigned URL in an HTTP 302 redirect response to sign the user in to Studio.
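
The following is a condensed sketch of this flow, not the solution’s actual implementation. It assumes hard-coded team settings, a hypothetical <AWS SSO user ID>-<team ID> profile naming convention, and no SAML response validation (the next paragraph explains why validation is mandatory in production):

import base64
import urllib.parse
import xml.etree.ElementTree as ET

import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical per-team settings; the deployed solution reads these from the
# GetUserProfileMetadata CloudFormation resource described in the next section.
TEAMS = {
    "Team1": {
        "DomainId": "d-xxxxxxxxxxxx",
        "SessionExpiration": 43200,
        "ExecutionRole": "arn:aws:iam::111122223333:role/SageMakerStudioExecutionRoleTeam1",
        "Tags": [{"Key": "Team", "Value": "Team1"}],
    },
}

def get_saml_attributes(saml_xml):
    """Pull the custom ssouserid and teamid attributes out of the assertion.

    Illustration only: a production backend must first validate the signature,
    audience, and timestamps."""
    ns = {"saml2": "urn:oasis:names:tc:SAML:2.0:assertion"}
    root = ET.fromstring(saml_xml)
    attributes = {}
    for attr in root.iter("{urn:oasis:names:tc:SAML:2.0:assertion}Attribute"):
        values = [v.text for v in attr.findall("saml2:AttributeValue", ns)]
        attributes[attr.get("Name")] = values[0] if values else None
    return attributes

def lambda_handler(event, context):
    # The custom SAML application POSTs a form-encoded SAMLResponse to the ACS URL
    form = urllib.parse.parse_qs(event["body"])
    saml_xml = base64.b64decode(form["SAMLResponse"][0])

    attrs = get_saml_attributes(saml_xml)
    user_id, team_id = attrs["ssouserid"], attrs["teamid"]
    team = TEAMS[team_id]
    user_profile_name = f"{user_id}-{team_id}"  # hypothetical naming convention

    # Create the user profile for this user-team combination on first sign-in
    try:
        sagemaker.describe_user_profile(
            DomainId=team["DomainId"], UserProfileName=user_profile_name
        )
    except sagemaker.exceptions.ResourceNotFound:
        sagemaker.create_user_profile(
            DomainId=team["DomainId"],
            UserProfileName=user_profile_name,
            Tags=team["Tags"],
            UserSettings={"ExecutionRole": team["ExecutionRole"]},
        )

    presigned = sagemaker.create_presigned_domain_url(
        DomainId=team["DomainId"],
        UserProfileName=user_profile_name,
        SessionExpirationDurationInSeconds=team["SessionExpiration"],
    )

    # HTTP 302 redirect sends the browser straight into the Studio environment
    return {"statusCode": 302, "headers": {"Location": presigned["AuthorizedUrl"]}}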

The solution implements a non-production sample version of a SAML backend. The Lambda function parses the SAML assertion and uses only the attributes in the <saml2:AttributeStatement> element to construct the CreatePresignedDomainUrl API call. In your production solution, you must use a proper SAML backend implementation, which includes validation of the SAML response, signature, and certificates, replay and redirect prevention, and any other features of a SAML authentication process. For example, you can use the python3-saml library (OneLogin’s open-source SAML toolkit) to implement a secure SAML backend.

Dynamic creation of Studio user profiles

The solution automatically creates a Studio user profile for each user-team combination, as soon as the AWS SSO sign-in process requests a presigned URL. For this proof of concept and simplicity, the solution creates user profiles based on the configured metadata in the AWS SAM template:

Metadata:
  Team1:
    DomainId: !GetAtt SageMakerDomain.Outputs.SageMakerDomainId
    SessionExpiration: 43200
    Tags:
      - Key: Team
        Value: Team1
    UserSettings:
      ExecutionRole: !GetAtt IAM.Outputs.SageMakerStudioExecutionRoleTeam1Arn
  Team2:
    DomainId: !GetAtt SageMakerDomain.Outputs.SageMakerDomainId
    SessionExpiration: 43200
    Tags:
      - Key: Team
        Value: Team2
    UserSettings:
      ExecutionRole: !GetAtt IAM.Outputs.SageMakerStudioExecutionRoleTeam2Arn

You can configure your own teams, custom settings, and tags by adding them to the metadata configuration for the AWS CloudFormation resource GetUserProfileMetadata.

For more information on configuration elements of UserSettings, refer to create_user_profile in boto3.
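
If you keep the team configuration in the template metadata as shown above, the SAML backend can read it at runtime through the CloudFormation API. The following is a minimal sketch, assuming a hypothetical stack name and the metadata layout shown above:

import json

import boto3

cfn = boto3.client("cloudformation")

def get_team_config(stack_name, team_id):
    """Read the per-team settings from the GetUserProfileMetadata resource metadata."""
    resource = cfn.describe_stack_resource(
        StackName=stack_name, LogicalResourceId="GetUserProfileMetadata"
    )
    metadata = json.loads(resource["StackResourceDetail"]["Metadata"])
    return metadata[team_id]

# Example: fetch the Team1 configuration used when creating the user profile
team1 = get_team_config("sso-sagemaker-stack", "Team1")  # illustrative stack name
print(team1["UserSettings"]["ExecutionRole"])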

IAM roles

The following diagram shows the IAM roles in this solution.

The roles are as follows:

  1. Studio execution role – A Studio user profile uses a dedicated Studio execution role with data and resource permissions specific for each team or user group. This role can also use tags to implement ABAC for data and resource access. For more information, refer to SageMaker Roles.
  2. SAML backend Lambda execution role – This execution role contains permission to call the CreatePresignedDomainUrl API. You can configure the permission policy to include additional conditional checks using Condition keys. For example, to allow access to Studio only from a designated range of IP addresses within your private corporate network, use the following code:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "sagemaker:CreatePresignedDomainUrl"
                ],
                "Resource": "arn:aws:sagemaker:<Region>:<Account_id>:user-profile/*/*",
                "Effect": "Allow"
            },
            {
                "Condition": {
                    "NotIpAddress": {
                        "aws:VpcSourceIp": "10.100.10.0/24"
                    }
                },
                "Action": [
                    "sagemaker:*"
                ],
                "Resource": "arn:aws:sagemaker:<Region>:<Account_id>:user-profile/*/*",
                "Effect": "Deny"
            }
        ]
    }

    For more examples on how to use conditions in IAM policies, refer to Control Access to the SageMaker API by Using Identity-based Policies.

  3. SageMaker – SageMaker assumes the Studio execution role on your behalf, as controlled by a corresponding trust policy on the execution role. This allows the service to access data and resources, and perform actions on your behalf. The Studio execution role must contain a trust policy allowing SageMaker to assume this role (see the trust policy sketch after this list).
  4. AWS SSO permission set IAM role – You can assign your AWS SSO users to AWS accounts in your AWS organization via AWS SSO permission sets. A permission set is a template that defines a collection of user role-specific IAM policies. You manage permission sets in AWS SSO, and AWS SSO controls the corresponding IAM roles in each account.
  5. AWS Organizations Service Control Policies – If you use AWS Organizations, you can implement Service Control Policies (SCPs) to centrally control the maximum available permissions for all accounts and all IAM roles in your organization. For example, to centrally prevent access to Studio via the console, you can implement the following SCP and attach it to the accounts with the SageMaker domain:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "sagemaker:*"
          ],
          "Resource": "*",
          "Effect": "Allow"
        },
        {
          "Condition": {
            "NotIpAddress": {
              "aws:VpcSourceIp": "<AuthorizedPrivateSubnet>"
            }
          },
          "Action": [
            "sagemaker:CreatePresignedDomainUrl"
          ],
          "Resource": "*",
          "Effect": "Deny"
        }
      ]
    }
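
As a reference for the trust relationship mentioned in item 3, the following sketch creates an execution role that SageMaker can assume. The role name and description are illustrative; the solution’s CloudFormation templates create these roles for you.

import json

import boto3

iam = boto3.client("iam")

# Trust policy that lets the SageMaker service assume the Studio execution role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="SageMakerStudioExecutionRoleTeam1",  # illustrative name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Studio execution role for Team 1",
)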

Solution provisioned roles

The AWS CloudFormation stack for this solution creates three Studio execution roles used in the SageMaker domain:

  • SageMakerStudioExecutionRoleDefault
  • SageMakerStudioExecutionRoleTeam1
  • SageMakerStudioExecutionRoleTeam2

None of the roles have the AmazonSageMakerFullAccess policy attached, and each has only a limited set of permissions. In your real-world SageMaker environment, you need to amend the role’s permissions based on your specific requirements.

SageMakerStudioExecutionRoleDefault has only the custom policy SageMakerReadOnlyPolicy attached with a restrictive list of allowed actions.

Both team roles, SageMakerStudioExecutionRoleTeam1 and SageMakerStudioExecutionRoleTeam2, additionally have two custom policies, SageMakerAccessSupportingServicesPolicy and SageMakerStudioDeveloperAccessPolicy, which allow the use of particular supporting services, and one deny-only policy, SageMakerDeniedServicesPolicy, with an explicit deny on some SageMaker API calls.

The Studio developer access policy requires that any SageMaker Create* API call carries a Team tag with the same value as the Team tag on the user’s own execution role:

{
    "Condition": {
        "ForAnyValue:StringEquals": {
            "aws:TagKeys": [
                "Team"
            ]
        },
        "StringEqualsIfExists": {
            "aws:RequestTag/Team": "${aws:PrincipalTag/Team}"
        }
    },
    "Action": [
        "sagemaker:Create*"
    ],
    "Resource": [
        "arn:aws:sagemaker:*:<ACCOUNT_ID>:*"
    ],
    "Effect": "Allow",
    "Sid": "AmazonSageMakerCreate"
}

Furthermore, it allows using delete, stop, update, and start operations only on resources tagged with the same Team tag as the user’s execution role:

{
    "Condition": {
        "StringEquals": {
            "aws:PrincipalTag/Team": "${sagemaker:ResourceTag/Team}"
        }
    },
    "Action": [
        "sagemaker:Delete*",
        "sagemaker:Stop*",
        "sagemaker:Update*",
        "sagemaker:Start*",
        "sagemaker:DisassociateTrialComponent",
        "sagemaker:AssociateTrialComponent",
        "sagemaker:BatchPutMetrics"
    ],
    "Resource": [
        "arn:aws:sagemaker:*:<ACCOUNT_ID>:*"
    ],
    "Effect": "Allow",
    "Sid": "AmazonSageMakerUpdateDeleteExecutePolicy"
}
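
For example, a user whose execution role carries the tag Team=Team1 can create SageMaker resources only when the request carries the same tag. The following is a minimal sketch with an illustrative experiment name:

import boto3

sagemaker = boto3.client("sagemaker")

# Allowed: the request carries Team=Team1, matching the Team tag on the caller's
# execution role, so the Create* statement above permits the call
sagemaker.create_experiment(
    ExperimentName="team1-credit-experiment",  # illustrative name
    Tags=[{"Key": "Team", "Value": "Team1"}],
)

# Omitting the Team tag fails the aws:TagKeys condition, and sending Team=Team2
# from a Team1 role fails the aws:RequestTag/Team condition; either way the
# request is implicitly denied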

For more information on roles and polices, refer to Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation.

Network infrastructure

The solution implements a fully isolated SageMaker domain environment with all network traffic going through AWS PrivateLink connections. You may optionally enable internet access from the Studio notebooks. The solution also creates three VPC security groups to control traffic between all solution components such as the SAML backend Lambda function, VPC endpoints, and Studio notebooks.

For this proof of concept and simplicity, the solution creates a SageMaker subnet in a single Availability Zone. For your production setup, you must use multiple private subnets across multiple Availability Zones and ensure that each subnet is appropriately sized, assuming a minimum of five IP addresses per user. For example, a /24 subnet with about 250 usable IP addresses supports roughly 50 concurrent Studio users.

This solution provisions all required network infrastructure. The CloudFormation template ./cfn-templates/vpc.yaml contains the source code.

Deployment steps

To deploy and test the solution, you must complete the following steps:

  1. Deploy the solution’s stack via an AWS Serverless Application Model (AWS SAM) template.
  2. Create AWS SSO users, or use existing AWS SSO users.
  3. Create custom SAML 2.0 applications and assign AWS SSO users to the applications.

The full source code for the solution is provided in our GitHub repository.

Prerequisites

To use this solution, you must have the AWS Command Line Interface (AWS CLI), the AWS SAM CLI, and Python 3.8 or later installed.

The deployment procedure assumes that you have enabled AWS SSO and configured it for your AWS organization in the account where the solution is deployed.

To set up AWS SSO, refer to the instructions in GitHub.

Solution deployment options

You can choose from several solution deployment options to have the best fit for your existing AWS environment. You can also select the network and SageMaker domain provisioning options. For detailed information about the different deployment choices, refer to the README file.

Deploy the AWS SAM template

To deploy the AWS SAM template, complete the following steps:

  1. Clone the source code repository to your local environment:
    git clone https://github.com/aws-samples/users-and-team-management-with-amazon-sagemaker-and-aws-sso.git

  2. Build the AWS SAM application:
    sam build

  3. Deploy the application:
    sam deploy --guided

  4. Provide stack parameters according to your existing environment and desired deployment options, such as existing VPC, existing private and public subnets, and existing SageMaker domain, as discussed in the Solution deployment options chapter of the README file.

You can leave all parameters at their default values to provision new network resources and a new SageMaker domain. Refer to detailed parameter usage in the README file if you need to change any default settings.

Wait until the stack deployment is complete. The end-to-end deployment including provisioning all network resources and a SageMaker domain takes about 20 minutes.

To see the stack output, run the following command in the terminal:

export STACK_NAME=<SAM stack name>

aws cloudformation describe-stacks \
    --stack-name $STACK_NAME \
    --output table \
    --query "Stacks[0].Outputs[*].[OutputKey, OutputValue]"

Create SSO users

Follow the instructions to add AWS SSO users to create two users with names User1 and User2 or use any two of your existing AWS SSO users to test the solution. Make sure you use AWS SSO in the same AWS Region in which you deployed the solution.

Create custom SAML 2.0 applications

To create the required custom SAML 2.0 applications for Team 1 and for Team 2, complete the following steps:

  1. Open the AWS SSO console in the AWS management account of your AWS organization, in the same Region where you deployed the solution stack.
  2. Choose Applications in the navigation pane.
  3. Choose Add a new application.
  4. Choose Add a custom SAML 2.0 application.
  5. For Display name, enter an application name, for example SageMaker Studio Team 1.
  6. Leave Application start URL and Relay state empty.
  7. Choose If you don’t have a metadata file, you can manually enter your metadata values.
  8. For Application ACS URL, enter the URL provided in the SAMLBackendEndpoint key of the AWS SAM stack output.
  9. For Application SAML audience, enter the URL provided in the SAMLAudience key of the AWS SAM stack output.
  10. Choose Save changes.
  11. Navigate to the Attribute mappings tab.
  12. Set the Subject to email and Format to emailAddress.
  13. Add the following new attributes:
    1. ssouserid set to ${user:AD_GUID}
    2. teamid set to Team1 or Team2, respectively, for each application
  14. Choose Save changes.
  15. On the Assigned users tab, choose Assign users.
  16. Choose User 1 for the Team 1 application and both User 1 and User 2 for the Team 2 application.
  17. Choose Assign users.

Test the solution

To test the solution, complete the following steps:

  1. Go to the AWS SSO user portal https://<Identity Store ID>.awsapps.com/start and sign in as User 1.
    Two SageMaker applications are shown in the portal.
  2. Choose SageMaker Studio Team 1.
    You’re redirected to the Studio instance for Team 1 in a new browser window.
    The first time you start Studio, SageMaker creates a JupyterServer application. This process takes a few minutes.
  3. In Studio, on the File menu, choose New and Terminal to start a new terminal.
  4. In the terminal command line, enter the following command:
    aws sts get-caller-identity

    The command returns the Studio execution role.

    In our setup, this role must be different for each team. You can also check that each user in each instance of Studio has their own home directory on a mounted Amazon EFS volume.

  5. Return to the AWS SSO portal, still signed in as User 1, and choose SageMaker Studio Team 2.
    You’re redirected to a Team 2 Studio instance.
    The start process can again take several minutes, because SageMaker starts a new JupyterServer application for this new user profile.
  6. Sign in as User 2 in the AWS SSO portal.
    User 2 has only one application assigned: SageMaker Studio Team 2.

If you start an instance of Studio via this user application, you can verify that it uses the same SageMaker execution role as User 1’s Team 2 instance. However, each Studio instance is completely isolated. User 2 has their own home directory on an Amazon EFS volume and their own instance of the JupyterServer application. You can verify this by creating a folder and some files for each user and confirming that each user’s home directory is isolated.

Now you can sign in to the SageMaker console and see that there are three user profiles created.

You just implemented a proof of concept solution to manage multiple users and teams with Studio.

Clean up

To avoid charges, you must remove all project-provisioned and generated resources from your AWS account. Use the following SAM CLI command to delete the solution CloudFormation stack:

sam delete --stack-name <stack name of SAM stack>

For security reasons and to prevent data loss, the Amazon EFS mount and the content associated with the Studio domain deployed in this solution are not deleted. The VPC and subnets associated with the SageMaker domain remain in your AWS account. For instructions to delete the file system and VPC, refer to Deleting an Amazon EFS file system and Work with VPCs, respectively.
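
If you no longer need the user home directories, you can locate and delete the domain’s EFS volume with a script such as the following sketch. The domain ID is illustrative, the lookup relies on the ManagedByAmazonSageMakerResource tag that SageMaker typically applies to the volume, and the deletion permanently removes all user data, so double-check the file system ID before running it.

import time

import boto3

efs = boto3.client("efs")
domain_id = "d-xxxxxxxxxxxx"  # ID of the deleted Studio domain (illustrative)

# Locate the home EFS volume by the tag SageMaker applies to it
file_system_id = None
for fs in efs.describe_file_systems()["FileSystems"]:
    tags = {t["Key"]: t["Value"] for t in fs.get("Tags", [])}
    if tags.get("ManagedByAmazonSageMakerResource", "").endswith(domain_id):
        file_system_id = fs["FileSystemId"]

# Mount targets must be removed before the file system itself can be deleted
for mt in efs.describe_mount_targets(FileSystemId=file_system_id)["MountTargets"]:
    efs.delete_mount_target(MountTargetId=mt["MountTargetId"])

while efs.describe_file_systems(FileSystemId=file_system_id)["FileSystems"][0]["NumberOfMountTargets"] > 0:
    time.sleep(10)

efs.delete_file_system(FileSystemId=file_system_id)  # removes all user home directories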

To delete the custom SAML application, complete the following steps:

  1. Open the AWS SSO console in the AWS SSO management account.
  2. Choose Applications.
  3. Select SageMaker Studio Team 1.
  4. On the Actions menu, choose Remove.
  5. Repeat these steps for SageMaker Studio Team 2.

Conclusion

This solution demonstrated how you can create a flexible and customizable environment using AWS SSO and Studio user profiles to support your own organization structure. The next possible improvement steps towards a production-ready solution could be:

  • Implement automated Studio user profile management as a dedicated microservice to support an automated profile provisioning workflow and to handle metadata and configuration for user profiles, for example in Amazon DynamoDB.
  • Use the same mechanism in a more general case of multiple SageMaker domains and multiple AWS accounts. The same SAML backend can vend a corresponding presigned URL redirecting to a user profile-domain-account combination according to your custom logic based on user entitlements and team setup.
  • Implement a synchronization mechanism between your IdP and AWS SSO and automate creation of custom SAML 2.0 applications.
  • Implement scalable data and resource access management with attribute-based access control (ABAC).

If you have any feedback or questions, please leave them in the comments.



About the Author

Yevgeniy Ilyin is a Solutions Architect at AWS. He has over 20 years of experience working at all levels of software development and solutions architecture and has used programming languages from COBOL and Assembler to .NET, Java, and Python. He develops and codes cloud native solutions with a focus on big data, analytics, and data engineering.


Build and train ML models using a data mesh architecture on AWS: Part 2

This is the second part of a series that showcases the machine learning (ML) lifecycle with a data mesh design pattern for a large enterprise with multiple lines of business (LOBs) and a Center of Excellence (CoE) for analytics and ML.

In part 1, we addressed the data steward persona and showcased a data mesh setup with multiple AWS data producer and consumer accounts. For an overview of the business context and the steps to set up a data mesh with AWS Lake Formation and register a data product, refer to part 1.

In this post, we address the analytics and ML platform team as a consumer in the data mesh. The platform team sets up the ML environment for the data scientists and helps them get access to the necessary data products in the data mesh. The data scientists in this team use Amazon SageMaker to build and train a credit risk prediction model using the shared credit risk data product from the consumer banking LoB.

Build and train ML models using a data mesh architecture on AWS

The code for this example is available on GitHub.

Analytics and ML consumer in a data mesh architecture

Let’s recap the high-level architecture that highlights the key components in the data mesh architecture.

In the data producer block 1 (left), there is a data processing stage to ensure that shared data is well-qualified and curated. The central data governance block 2 (center) acts as a centralized data catalog with metadata of various registered data products. The data consumer block 3 (right) requests access to datasets from the central catalog and queries and processes the data to build and train ML models.

With SageMaker, data scientists and developers in the ML CoE can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. SageMaker provides easy access to your data sources for exploration and analysis, and also provides common ML algorithms and frameworks that are optimized to run efficiently against extremely large data in a distributed environment. It’s easy to get started with Amazon SageMaker Studio, a web-based integrated development environment (IDE), by completing the SageMaker domain onboarding process. For more information, refer to the Amazon SageMaker Developer Guide.

Data product consumption by the analytics and ML CoE

The following architecture diagram describes the steps required by the analytics and ML CoE consumer to get access to the registered data product in the central data catalog and process the data to build and train an ML model.

The workflow consists of the following components:

  1. The producer data steward provides access in the central account to the database and table to the consumer account. The database is now reflected as a shared database in the consumer account.
  2. The consumer admin creates a resource link in the consumer account to the database shared by the central account. The following screenshot shows an example in the consumer account, with rl_credit-card being the resource link of the credit-card database.

  3. The consumer admin provides the Studio AWS Identity and Access Management (IAM) execution role with access to the resource-linked database and the tables identified by the Lake Formation tag. In the following example, the consumer admin grants the SageMaker execution role permission to access rl_credit-card and the table satisfying the Lake Formation tag expression.
  4. Once assigned an execution role, data scientists in SageMaker can use Amazon Athena to query the table via the resource link database in Lake Formation.
    1. For data exploration, they can use Studio notebooks to process the data with interactive querying via Athena.
    2. For data processing and feature engineering, they can run SageMaker processing jobs with an Athena data source and output results back to Amazon Simple Storage Service (Amazon S3).
    3. After the data is processed and available in Amazon S3 on the ML CoE account, data scientists can use SageMaker training jobs to train models and SageMaker Pipelines to automate model-building workflows.
    4. Data scientists can also use the SageMaker model registry to register the models.

Data exploration

The following diagram illustrates the data exploration workflow in the data consumer account.

The consumer starts by querying a sample of the data from the credit_risk table with Athena in a Studio notebook. When querying data via Athena, the intermediate results are also saved in Amazon S3. You can use the AWS Data Wrangler library to run a query on Athena in a Studio notebook for data exploration. The following code example shows how to query Athena to fetch the results as a dataframe for data exploration:

import awswrangler as wr

df = wr.athena.read_sql_query('SELECT * FROM credit_card LIMIT 10;', database="rl_credit-card", ctas_approach=False)

Now that you have a subset of the data as a dataframe, you can start exploring the data and see what feature engineering updates are needed for model training. An example of data exploration is shown in the following screenshot.

When you query the database, you can see the access logs from the Lake Formation console, as shown in the following screenshot. These logs give you information about who or which service has used Lake Formation, including the IAM role and time of access. The screenshot shows a log about SageMaker accessing the table credit_risk in AWS Glue via Athena. In the log, you can see the additional audit context that contains the query ID that matches the query ID in Athena.

The following screenshot shows the Athena query run ID that matches the query ID from the preceding log. This shows the data accessed with the SQL query. You can see what data has been queried by navigating to the Athena console, choosing the Recent queries tab, and then looking for the run ID that matches the query ID from the additional audit context.
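
You can also cross-check the query ID from the audit context programmatically instead of using the console. The following is a small sketch with an illustrative query execution ID:

import boto3

athena = boto3.client("athena")

# Query execution ID taken from the Lake Formation audit context (illustrative value)
query_id = "b1234567-89ab-cdef-0123-456789abcdef"

execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
print(execution["Query"])                                   # the SQL text that was run
print(execution["Status"]["State"])                         # for example, SUCCEEDED
print(execution["ResultConfiguration"]["OutputLocation"])   # S3 location of the results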

Data processing

After data exploration, you may want to preprocess the entire large dataset for feature engineering before training a model. The following diagram illustrates the data processing procedure.

In this example, we use a SageMaker processing job, in which we define an Athena dataset definition. The processing job queries the data via Athena and uses a script to split the data into training, testing, and validation datasets. The results of the processing job are saved to Amazon S3. To learn how to configure a processing job with Athena, refer to Use Amazon Athena in a processing job with Amazon SageMaker.

In this example, you can use the Python SDK to trigger a processing job with the Scikit-learn framework. Before triggering, you can configure the inputs parameter to get the input data via the Athena dataset definition, as shown in the following code. The dataset contains the location to download the results from Athena to the processing container and the configuration for the SQL query. When the processing job is finished, the results are saved in Amazon S3.

from sagemaker.dataset_definition.inputs import AthenaDatasetDefinition, DatasetDefinition
from sagemaker.processing import ProcessingInput, ProcessingOutput

AthenaDataset = AthenaDatasetDefinition(
  catalog = 'AwsDataCatalog',
  database = 'rl_credit-card',
  query_string = 'SELECT * FROM "rl_credit-card"."credit_card"',
  output_s3_uri = 's3://sagemaker-us-east-1-********7363/athenaqueries/',
  work_group = 'primary',
  output_format = 'PARQUET')

dataSet = DatasetDefinition(
  athena_dataset_definition = AthenaDataset,
  local_path='/opt/ml/processing/input/dataset.parquet')


# sklearn_processor is an SKLearnProcessor instance created earlier with the
# framework version, instance type, and execution role of your choice
sklearn_processor.run(
    code="processing/preprocessor.py",
    inputs=[ProcessingInput(
      input_name="dataset", 
      destination="/opt/ml/processing/input", 
      dataset_definition=dataSet)],
    outputs=[
        ProcessingOutput(
            output_name="train_data", source="/opt/ml/processing/train", destination=train_data_path
        ),
        ProcessingOutput(
            output_name="val_data", source="/opt/ml/processing/val", destination=val_data_path
        ),
        ProcessingOutput(
            output_name="model", source="/opt/ml/processing/model", destination=model_path
        ),
        ProcessingOutput(
            output_name="test_data", source="/opt/ml/processing/test", destination=test_data_path
        ),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
    logs=False,
)

Model training and model registration

After preprocessing the data, you can train the model with the preprocessed data saved in Amazon S3. The following diagram illustrates the model training and registration process.

For data exploration and SageMaker processing jobs, you can retrieve the data in the data mesh via Athena. Although the SageMaker Training API doesn’t include a parameter to configure an Athena data source, you can query data via Athena in the training script itself.
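
The following is a minimal sketch of such a training script, assuming the AWS Data Wrangler (awswrangler) library is available in the training container and the execution role has the necessary Athena and Lake Formation permissions:

# train.py (sketch): query the data product via Athena from inside the training script
import argparse

import awswrangler as wr

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--database", default="rl_credit-card")
    args = parser.parse_args()

    df = wr.athena.read_sql_query(
        'SELECT * FROM "credit_card"',
        database=args.database,
        ctas_approach=False,
    )
    # ... feature engineering and model fitting on df would follow here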

In this example, the preprocessed data is now available in Amazon S3 and can be used directly to train an XGBoost model with SageMaker Script Mode. You can provide the script, hyperparameters, instance type, and all the additional parameters needed to successfully train the model. You can trigger the SageMaker estimator with the training and validation data in Amazon S3. When the model training is complete, you can register the model in the SageMaker model registry for experiment tracking and deployment to a production account.

from sagemaker.xgboost.estimator import XGBoost

estimator = XGBoost(
    entry_point=entry_point,
    source_dir=source_dir,
    output_path=output_path,
    code_location=code_location,
    hyperparameters=hyperparameters,
    instance_type="ml.c5.xlarge",
    instance_count=1,
    framework_version="0.90-2",
    py_version="py3",
    role=role,
)

inputs = {"train": train_input_data, "validation": val_input_data}

estimator.fit(inputs, job_name=job_name)
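
Continuing from the estimator above, the following sketch registers the trained model in a model package group; the group name, content types, and instance types are illustrative assumptions:

# Register the trained model in the SageMaker model registry
estimator.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="credit-risk-models",  # illustrative group name
    approval_status="PendingManualApproval",
)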

Next steps

You can make incremental updates to the solution to address requirements around data updates and model retraining, automatic deletion of intermediate data in Amazon S3, and integrating a feature store. We discuss each of these in more detail in the following sections.

Data updates and model retraining triggers

The following diagram illustrates the process to update the training data and trigger model retraining.

The process includes the following steps:

  1. The data producer updates the data product with either a new schema or additional data at a regular frequency.
  2. After the data product is re-registered in the central data catalog, this generates an Amazon CloudWatch event from Lake Formation.
  3. The CloudWatch event triggers an AWS Lambda function to synchronize the updated data product with the consumer account. You can use this trigger to reflect the data changes by doing the following (see the Lambda sketch after this list):
    1. Rerun the AWS Glue crawler.
    2. Trigger model retraining if the data drifts beyond a given threshold.
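
The following is a minimal sketch of such a Lambda function; the crawler and pipeline names are illustrative, and the drift-based retraining decision is assumed to happen elsewhere:

import boto3

glue = boto3.client("glue")
sagemaker = boto3.client("sagemaker")

# Illustrative names; wire these to your own crawler and SageMaker pipeline
CRAWLER_NAME = "creditCrawler-consumer"
RETRAIN_PIPELINE = "credit-risk-training-pipeline"

def lambda_handler(event, context):
    # Refresh the consumer-side catalog so the new schema and partitions are visible
    glue.start_crawler(Name=CRAWLER_NAME)

    # Optionally kick off retraining, for example when a separate drift check
    # has flagged the update as significant
    sagemaker.start_pipeline_execution(PipelineName=RETRAIN_PIPELINE)
    return {"statusCode": 200}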

For more details about setting up a SageMaker MLOps deployment pipeline for drift detection, refer to the Amazon SageMaker Drift Detection GitHub repo.

Auto-deletion of intermediate data in Amazon S3

You can automatically delete intermediate data that is generated by Athena queries and stored in Amazon S3 in the consumer account at regular intervals with S3 object lifecycle rules. For more information, refer to Managing your storage lifecycle.
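
For example, the following sketch applies a lifecycle rule that expires objects under an Athena results prefix after 7 days; the bucket name and prefix are illustrative:

import boto3

s3 = boto3.client("s3")

# Expire intermediate Athena query results after 7 days
s3.put_bucket_lifecycle_configuration(
    Bucket="sagemaker-us-east-1-123456789012",  # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-intermediate-results",
                "Filter": {"Prefix": "athenaqueries/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)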

SageMaker Feature Store integration

SageMaker Feature Store is purpose-built for ML and can store, discover, and share curated features used in training and prediction workflows. A feature store can work as a centralized interface between different data producer teams and LoBs, enabling feature discoverability and reusability to multiple consumers. The feature store can act as an alternative to the central data catalog in the data mesh architecture described earlier. For more information about cross-account architecture patterns, refer to Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store.

Refer to the MLOps foundation roadmap for enterprises with Amazon SageMaker blog post to find out more about building an MLOps foundation based on the MLOps maturity model.

Conclusion

In this two-part series, we showcased how you can build and train ML models with a multi-account data mesh architecture on AWS. We described the requirements of a typical financial services organization with multiple LoBs and an ML CoE, and illustrated the solution architecture with Lake Formation and SageMaker. We used the example of a credit risk data product registered in Lake Formation by the consumer banking LoB and accessed by the ML CoE team to train a credit risk ML model with SageMaker.

Each data producer account defines data products that are curated by people who understand the data and its access control, use, and limitations. The data products and the application domains that consume them are interconnected to form the data mesh. The data mesh architecture allows the ML teams to discover and access these curated data products.

Lake Formation allows cross-account access to Data Catalog metadata and underlying data. You can use Lake Formation to create a multi-account data mesh architecture. SageMaker provides an ML platform with key capabilities around data management, data science experimentation, model training, model hosting, workflow automation, and CI/CD pipelines for productionization. You can set up one or more analytics and ML CoE environments to build and train models with data products registered across multiple accounts in a data mesh.

Try out the AWS CloudFormation templates and code from the example repository to get started.


About the authors

Karim Hammouda is a Specialist Solutions Architect for Analytics at AWS with a passion for data integration, data analysis, and BI. He works with AWS customers to design and build analytics solutions that contribute to their business growth. In his free time, he likes to watch TV documentaries and play video games with his son.

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Benoit de Patoul is an AI/ML Specialist Solutions Architect at AWS. He helps customers by providing guidance and technical assistance to build solutions related to AI/ML using AWS. In his free time, he likes to play piano and spend time with friends.


Build and train ML models using a data mesh architecture on AWS: Part 1

Organizations across various industries are using artificial intelligence (AI) and machine learning (ML) to solve business challenges specific to their industry. For example, in the financial services industry, you can use AI and ML to solve challenges around fraud detection, credit risk prediction, direct marketing, and many others.

Large enterprises sometimes set up a center of excellence (CoE) to tackle the needs of different lines of business (LoBs) with innovative analytics and ML projects.

To generate high-quality and performant ML models at scale, they need to do the following:

  • Provide an easy way to access relevant data to their analytics and ML CoE
  • Create accountability on data providers from individual LoBs to share curated data assets that are discoverable, understandable, interoperable, and trustworthy

This can reduce the long cycle time for converting ML use cases from experiment to production and generate business value across the organization.

A data mesh architecture strives to solve these technical and organizational challenges by introducing a decentralized socio-technical approach to share, access, and manage data in complex and large-scale environments—within or across organizations. The data mesh design pattern creates a responsible data-sharing model that aligns with the organizational growth to achieve the ultimate goal of increasing the return of business investments in the data teams, process, and technology.

In this two-part series, we provide guidance on how organizations can build a modern data architecture using a data mesh design pattern on AWS and enable an analytics and ML CoE to build and train ML models with data across multiple LoBs. We use an example of a financial service organization to set the context and the use case for this series.

Build and train ML models using a data mesh architecture on AWS:

In this first post, we show the procedures of setting up a data mesh architecture with multiple AWS data producer and consumer accounts. Then we focus on one data product, which is owned by one LoB within the financial organization, and how it can be shared into a data mesh environment to allow other LoBs to consume and use this data product. This is mainly targeting the data steward persona, who is responsible for streamlining and standardizing the process of sharing data between data producers and consumers and ensuring compliance with data governance rules.

In the second post, we show one example of how an analytics and ML CoE can consume the data product for a risk prediction use case. This is mainly targeting the data scientist persona, who is responsible for utilizing both organizational-wide and third-party data assets to build and train ML models that extract business insights to enhance the experience of financial services customers.

Data mesh overview

The founder of the data mesh pattern, Zhamak Dehghani, in her book Data Mesh: Delivering Data-Driven Value at Scale, defined four principles towards the objective of the data mesh:

  • Distributed domain ownership – To pursue an organizational shift from centralized ownership of data by specialists who run the data platform technologies to a decentralized data ownership model, pushing ownership and accountability of the data back to the LoBs where data is produced (source-aligned domains) or consumed (consumption-aligned domains).
  • Data as a product – To push upstream the accountability of sharing curated, high-quality, interoperable, and secure data assets. Therefore, data producers from different LoBs are responsible for making data in a consumable form right at the source.
  • Self-service analytics – To streamline the experience of data users of analytics and ML so they can discover, access, and use data products with their preferred tools. Additionally, to streamline the experience of LoB data providers to build, deploy, and maintain data products via recipes and reusable components and templates.
  • Federated computational governance – To federate and automate the decision-making involved in managing and controlling data access to be on the level of data owners from the different LoBs, which is still in line with the wider organization’s legal, compliance, and security policies that are ultimately enforced through the mesh.

AWS introduced its vision for building a data mesh on top of AWS in various posts:

  • First, we focused on the organizational part associated with distributed domain ownership and data as a product principles. The authors described the vision of aligning multiple LOBs across the organization towards a data product strategy that provides the consumption-aligned domains with tools to find and obtain the data they need, while guaranteeing the necessary control around the use of that data by introducing accountability for the source-aligned domains to provide data products ready to be used right at the source. For more information, refer to How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform.
  • Then we focused on the technical part associated with building data products, self-service analytics, and federated computational governance principles. The authors described the core AWS services that empower the source-aligned domains to build and share data products, a wide variety of services that can enable consumer-aligned domains to consume data products in different ways based on their preferred tools and the use cases they are working towards, and finally the AWS services that govern the data sharing procedure by enforcing data access polices. For more information, refer to Design a data mesh architecture using AWS Lake Formation and AWS Glue.
  • We also showed a solution to automate data discovery and access control through a centralized data mesh UI. For more details, refer to Build a data sharing workflow with AWS Lake Formation for your data mesh.

Financial services use case

Typically, large financial services organizations have multiple LoBs, such as consumer banking, investment banking, and asset management, and also one or more analytics and ML CoE teams. Each LoB provides different services:

  • The consumer banking LoB provides a variety of services to consumers and businesses, including credit and mortgage, cash management, payment solutions, deposit and investment products, and more
  • The commercial or investment banking LoB offers comprehensive financial solutions, such as lending, bankruptcy risk, and wholesale payments to clients, including small businesses, mid-sized companies, and large corporations
  • The asset management LoB provides retirement products and investment services across all asset classes

Each LoB defines their own data products, which are curated by people who understand the data and are best suited to specify who is authorized to use it, and how it can be used. In contrast, other LoBs and application domains such as the analytics and ML CoE are interested in discovering and consuming qualified data products, blending it together to generate insights, and making data-driven decisions.

The following illustration depicts some LoBs and examples of data products that they can share. It also shows the consumers of data products such as the analytics and ML CoE, who build ML models that can be deployed to customer-facing applications to further enhance the end-customer’s experience.

Following the data mesh socio-technical concept, we start with the social aspect with a set of organizational steps, such as the following:

  • Utilizing domain experts to define boundaries for each domain, so each data product can be mapped to a specific domain
  • Identifying owners for data products provided from each domain, so each data product has a strategy defined by their owner
  • Identifying governance polices from global and local or federated incentives, so when data consumers access a specific data product, the access policy associated with the product can be automatically enforced through a central data governance layer

Then we move to the technical aspect, which includes the following end-to-end scenario defined in the previous diagram:

  1. Empower the consumer banking LoB with tools to build a ready-to-use consumer credit profile data product.
  2. Allow the consumer banking LoB to share data products into the central governance layer.
  3. Embed global and federated definitions of data access policies that should be enforced while accessing the consumer credit profile data product through the central data governance.
  4. Allow the analytics and ML CoE to discover and access the data product through the central governance layer.
  5. Empower the analytics and ML CoE with tools to utilize the data product for building and training a credit risk prediction model.

We don’t cover the final steps (6 and 7 in the preceding diagram) in this series. However, to show the business value such an ML model can bring to the organization in an end-to-end scenario, we illustrate the following:
  6. This model could later be deployed back to customer-facing systems such as a consumer banking web portal or mobile application.
  7. It can be specifically used within the loan application to assess the risk profile of credit and mortgage requests.

Next, we describe the technical needs of each of the components.

Deep dive into technical needs

To make data products available for everyone, organizations need to make it easy to share data between different entities across the organization while maintaining appropriate control over it, or in other words, to balance agility with proper governance.

Data consumer: Analytics and ML CoE

The data consumers such as data scientists from the analytics and ML CoE need to be able to do the following:

  • Discover and access relevant datasets for a given use case
  • Be confident that datasets they want to access are already curated, up to date, and have robust descriptions
  • Request access to datasets of interest to their business cases
  • Use their preferred tools to query and process such datasets within their environment for ML without the need for replicating data from the original remote location or for worrying about engineering or infrastructure complexities associated with processing data physically stored in a remote site
  • Get notified of any data updates made by the data owners

Data producer: Domain ownership

The data producers, such as domain teams from different LoBs in the financial services org, need to register and share curated datasets that contain the following:

  • Technical and operational metadata, such as database and table names and sizes, column schemas, and keys
  • Business metadata such as data description, classification, and sensitivity
  • Tracking metadata such as schema evolution from the source to the target form and any intermediate forms
  • Data quality metadata such as correctness and completeness ratios and data bias
  • Access policies and procedures

These are needed to allow data consumers to discover and access data without relying on manual procedures or having to contact the data product’s domain experts to gain more knowledge about the meaning of the data and how it can be accessed.

Data governance: Discoverability, accessibility, and auditability

Organizations need to balance the agilities illustrated earlier with proper mitigation of the risks associated with data leaks. Particularly in regulated industries like financial services, there is a need to maintain central data governance to provide overall data access and audit control while reducing the storage footprint by avoiding multiple copies of the same data across different locations.

In traditional centralized data lake architectures, the data producers often publish raw data and pass on the responsibility of data curation, data quality management, and access control to data and infrastructure engineers in a centralized data platform team. However, these data platform teams may be less familiar with the various data domains, and still rely on support from the data producers to be able to properly curate and govern access to data according to the policies enforced at each data domain. In contrast, the data producers themselves are best positioned to provide curated, qualified data assets and are aware of the domain-specific access polices that need to be enforced while accessing data assets.

Solution overview

The following diagram shows the high-level architecture of the proposed solution.

We address data consumption by the analytics and ML CoE with Amazon Athena and Amazon SageMaker in part 2 of this series.

In this post, we focus on the data onboarding process into the data mesh and describe how an individual LoB such as the consumer banking domain data team can use AWS tools such as AWS Glue and AWS Glue DataBrew to prepare, curate, and enhance the quality of their data products, and then register those data products into the central data governance account through AWS Lake Formation.

Consumer banking LoB (data producer)

One of the core principles of data mesh is the concept of data as a product. It’s very important that the consumer banking domain data team work on preparing data products that are ready for use by data consumers. This can be done by using AWS extract, transform, and load (ETL) tools like AWS Glue to process raw data collected on Amazon Simple Storage Service (Amazon S3), or alternatively connect to the operational data stores where the data is produced. You can also use DataBrew, which is a no-code visual data preparation tool that makes it easy to clean and normalize data.

For example, while preparing the consumer credit profile data product, the consumer banking domain data team can make a simple curation to translate from German to English the attribute names of the raw data retrieved from the open-source dataset Statlog German credit data, which consists of 20 attributes and 1,000 rows.
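
The following is a minimal sketch of this curation step with pandas, assuming a local copy of the raw file and showing only an illustrative subset of the attribute renames (the exact file format and full mapping depend on the dataset version you download):

import pandas as pd

# Illustrative local copy of the raw, space-separated file
df = pd.read_csv("SouthGermanCredit.asc", sep=" ")

# Illustrative subset of the German-to-English attribute mapping;
# extend to all 20 attributes for the full curation
column_map = {
    "laufkont": "status",
    "laufzeit": "duration",
    "moral": "credit_history",
    "hoehe": "amount",
    "kredit": "credit_risk",
}
df = df.rename(columns=column_map)

# Curated output ready to upload to Amazon S3
df.to_parquet("credit_card.parquet")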

Data governance

The core AWS service for enabling data mesh governance is Lake Formation. Lake Formation offers the ability to enforce data governance within each data domain and across domains to ensure data is easily discoverable and secure. It provides a federated security model that can be administered centrally, with best practices for data discovery, security, and compliance, while allowing high agility within each domain.

Lake Formation offers an API to simplify how data is ingested, stored, and managed, together with row-level security to protect your data. It also provides functionality like granular access control, governed tables, and storage optimization.

In addition, Lake Formation offers a Data Sharing API that you can use to share data across different accounts. This allows the analytics and ML CoE consumer to run Athena queries that query and join tables across multiple accounts. For more information, refer to the AWS Lake Formation Developer Guide.

AWS Resource Access Manager (AWS RAM) provides a secure way to share resources via AWS Identity and Access Management (IAM) roles and users across AWS accounts within an organization or organizational units (OUs) in AWS Organizations.

Lake Formation together with AWS RAM provides one way to manage data sharing and access across AWS accounts. We refer to this approach as RAM-based access control. For more details about this approach, refer to Build a data sharing workflow with AWS Lake Formation for your data mesh.

Lake Formation also offers another way to manage data sharing and access using Lake Formation tags. We refer to this approach as tag-based access control. For more details, refer to Build a modern data architecture and data mesh pattern at scale using AWS Lake Formation tag-based access control.

Throughout this post, we use the tag-based access control approach because it simplifies the creation of policies on a smaller number of logical tags that are commonly found in different LoBs instead of specifying policies on named resources at the infrastructure level.
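
As an illustration of the tag-based approach, the following sketch defines an LF-tag, attaches it to the shared database, and grants the consumer account permissions on resources carrying that tag. The tag key, tag values, and account ID are illustrative, and the solution’s CloudFormation templates and step-by-step guide perform the equivalent setup for you.

import boto3

lf = boto3.client("lakeformation")

# Define an LF-tag and attach it to the shared database
lf.create_lf_tag(TagKey="LOB", TagValues=["consumer-banking"])

lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "credit-card"}},
    LFTags=[{"TagKey": "LOB", "TagValues": ["consumer-banking"]}],
)

# Grant the consumer account SELECT/DESCRIBE on every table that carries the tag
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "999999999999"},  # illustrative account ID
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "LOB", "TagValues": ["consumer-banking"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)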

Prerequisites

To set up a data mesh architecture, you need at least three AWS accounts: a producer account, a central account, and a consumer account.

Deploy the data mesh environment

To deploy a data mesh environment, you can use the following GitHub repository. This repository contains three AWS CloudFormation templates that deploy a data mesh environment that includes each of the accounts (producer, central, and consumer). Within each account, you can run its corresponding CloudFormation template.
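If you prefer to launch the stacks programmatically rather than through the console, a minimal boto3 sketch looks like the following. The stack name and template file name are placeholders; you would repeat the call in each account with its corresponding template from the repository.

import boto3

cloudformation = boto3.client("cloudformation")

# Hypothetical stack and template names; use the template for the account you're deploying into
with open("central-account-template.yaml") as template:
    cloudformation.create_stack(
        StackName="data-mesh-central",
        TemplateBody=template.read(),
        Capabilities=["CAPABILITY_NAMED_IAM"],  # the templates create IAM users and roles
    )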

Central account

In the central account, complete the following steps:

  1. Launch the CloudFormation stack.
  2. Create two IAM users:
    1. DataMeshOwner
    2. ProducerSteward
  3. Grant DataMeshOwner as the Lake Formation admin.
  4. Create one IAM role:
    1. LFRegisterLocationServiceRole
  5. Create two IAM policies:
    1. ProducerStewardPolicy
    2. S3DataLakePolicy
  6. Create the database credit-card for ProducerSteward in the producer account.
  7. Share the data location permission to the producer account.

Producer account

In the producer account, complete the following steps:

  1. Launch the CloudFormation stack.
  2. Create the S3 bucket credit-card, which holds the table credit_card.
  3. Allow S3 bucket access for the central account Lake Formation service role.
  4. Create the AWS Glue crawler creditCrawler-<ProducerAccountID>.
  5. Create an AWS Glue crawler service role.
  6. Grant permissions on the S3 bucket location credit-card-<ProducerAccountID>-<aws-region> to the AWS Glue crawler role.
  7. Create a producer steward IAM user.

Consumer account

In the consumer account, complete the following steps:

  1. Launch the CloudFormation stack.
  2. Create the S3 bucket <AWS Account ID>-<aws-region>-athena-logs.
  3. Create the Athena workgroup consumer-workgroup.
  4. Create the IAM user ConsumerAdmin.

Add a database and subscribe the consumer account to it

After you run the templates, you can go through the step-by-step guide to add a product to the data catalog and have the consumer subscribe to it. The guide starts by setting up a database where the producer can place its products, and then explains how the consumer can subscribe to that database and access the data. All of this is done using LF-tags, the tag-based access control feature of Lake Formation.

Data product registration

The following architecture describes the detailed steps of how the consumer banking LoB team, acting as a data producer, can register its data products in the central data governance account (onboard data products to the organization's data mesh).

The general steps to register a data product are as follows:

  1. Create a target database for the data product in the central governance account. As an example, the CloudFormation template from the central account already creates the target database credit-card.
  2. Share the created target database with the origin in the producer account.
  3. Create a resource link of the shared database in the producer account. In the following screenshot, we see on the Lake Formation console in the producer account that rl_credit-card is the resource link of the credit-card database.
  4. Populate tables (with the data curated in the producer account) inside the resource link database (rl_credit-card) using an AWS Glue crawler in the producer account.

The created table automatically appears in the central governance account. The following screenshot shows an example of the table in Lake Formation in the central account. This is after performing the earlier steps to populate the resource link database rl_credit-card in the producer account.
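The CloudFormation templates and console steps above handle this for you, but for reference, creating the resource link and running the crawler can also be expressed with a few boto3 calls in the producer account. The account IDs and crawler name below are placeholders.

import boto3

glue = boto3.client("glue")

CENTRAL_ACCOUNT_ID = "123456789012"   # hypothetical central governance account ID
PRODUCER_ACCOUNT_ID = "444455556666"  # hypothetical producer account ID

# Create a resource link that points to the credit-card database shared from the central account
glue.create_database(
    DatabaseInput={
        "Name": "rl_credit-card",
        "TargetDatabase": {
            "CatalogId": CENTRAL_ACCOUNT_ID,
            "DatabaseName": "credit-card",
        },
    }
)

# Run the crawler that populates tables inside the resource link database
glue.start_crawler(Name=f"creditCrawler-{PRODUCER_ACCOUNT_ID}")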

Conclusion

In part 1 of this series, we discussed the goals of financial services organizations to achieve more agility for their analytics and ML teams and reduce the time from data to insights. We also focused on building a data mesh architecture on AWS, where we’ve introduced easy-to-use, scalable, and cost-effective AWS services such as AWS Glue, DataBrew, and Lake Formation. Data producing teams can use these services to build and share curated, high-quality, interoperable, and secure data products that are ready to use by different data consumers for analytical purposes.

In part 2, we focus on analytics and ML CoE teams who consume data products shared by the consumer banking LoB to build a credit risk prediction model using AWS services such as Athena and SageMaker.


About the authors

Karim Hammouda is a Specialist Solutions Architect for Analytics at AWS with a passion for data integration, data analysis, and BI. He works with AWS customers to design and build analytics solutions that contribute to their business growth. In his free time, he likes to watch TV documentaries and play video games with his son.

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Benoit de Patoul is an AI/ML Specialist Solutions Architect at AWS. He helps customers by providing guidance and technical assistance to build solutions related to AI/ML using AWS. In his free time, he likes to play piano and spend time with friends.

Read More

Integrate Amazon SageMaker Data Wrangler with MLOps workflows

As enterprises move from running ad hoc machine learning (ML) models to using AI/ML to transform their business at scale, the adoption of ML Operations (MLOps) becomes inevitable. As shown in the following figure, the ML lifecycle begins with framing a business problem as an ML use case followed by a series of phases, including data preparation, feature engineering, model building, deployment, continuous monitoring, and retraining. For many enterprises, a lot of these steps are still manual and loosely integrated with each other. Therefore, it’s important to automate the end-to-end ML lifecycle, which enables frequent experiments to drive better business outcomes. Data preparation is one of the crucial steps in this lifecycle, because the ML model’s accuracy depends on the quality of the training dataset.

Machine learning lifecycle

Data scientists and ML engineers spend between 70% and 80% of their time collecting, analyzing, cleaning, and transforming data required for model training. Amazon SageMaker Data Wrangler is a fully managed capability of Amazon SageMaker that makes it faster for data scientists and ML engineers to analyze and prepare data for their ML projects with little to no code. When it comes to operationalizing an end-to-end ML lifecycle, data preparation is almost always the first step in the process. Given that there are many ways to build an end-to-end ML pipeline, in this post we discuss how you can easily integrate Data Wrangler with some of the well-known workflow automation and orchestration technologies.

Solution overview

In this post, we demonstrate how users can integrate data preparation using Data Wrangler with Amazon SageMaker Pipelines, AWS Step Functions, and Apache Airflow with Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Pipelines is a SageMaker feature that provides a purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for ML. Step Functions is a serverless, low-code visual workflow service used to orchestrate AWS services and automate business processes. Amazon MWAA is a managed orchestration service for Apache Airflow that makes it easier to operate end-to-end data and ML pipelines.

For demonstration purposes, we consider a use case to prepare data to train an ML model with the SageMaker built-in XGBoost algorithm that will help us identify fraudulent vehicle insurance claims. We used a synthetically generated set of sample data to train the model and create a SageMaker model using the model artifacts from the training process. Our goal is to operationalize this process end to end by setting up an ML workflow. Although ML workflows can be more elaborate, we use a minimal workflow for demonstration purposes. The first step of the workflow is data preparation with Data Wrangler, followed by a model training step, and finally a model creation step. The following diagram illustrates our solution workflow.

MLOps workflows with SageMaker Data Wrangler

In the following sections, we walk you through how to set up a Data Wrangler flow and integrate Data Wrangler with Pipelines, Step Functions, and Apache Airflow.

Set up a Data Wrangler flow

We start by creating a Data Wrangler flow, also called a data flow, using the data flow UI via the Amazon SageMaker Studio IDE. Our sample dataset consists of two data files: claims.csv and customers.csv, which are stored in an Amazon Simple Storage Service (Amazon S3) bucket. We use the data flow UI to apply Data Wrangler’s built-in transformations such as categorical encoding, string formatting, and imputation to the feature columns in each of these files. We also apply custom transformations to a few feature columns using a few lines of custom Python code with Pandas DataFrame. The following screenshot shows the transforms applied to the claims.csv file in the data flow UI.

Transforms applied to the claims.csv data file

Finally, we join the results of the applied transforms of the two data files to generate a single training dataset for our model training. We use Data Wrangler’s built-in join datasets capability, which lets us perform SQL-like join operations on tabular data. The following screenshot shows the data flow in the data flow UI in Studio. For step-by-step instructions to create the data flow using Data Wrangler, refer to the GitHub repository.

SageMaker Data Wrangler data flow in the data flow UI in SageMaker Studio.

You can now use the data flow (.flow) file to perform data transformations on our raw data files. The data flow UI can automatically generate Python notebooks for us to use and integrate directly with Pipelines using the SageMaker SDK. For Step Functions, we use the AWS Step Functions Data Science Python SDK to integrate our Data Wrangler processing with a Step Functions pipeline. For Amazon MWAA, we use a custom Airflow operator and the Airflow SageMaker operator. We discuss each of these approaches in detail in the following sections.

Integrate Data Wrangler with Pipelines

SageMaker Pipelines is a native workflow orchestration tool for building ML pipelines that take advantage of direct SageMaker integration. Along with the SageMaker model registry, Pipelines improves the operational resilience and reproducibility of your ML workflows. These workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production; iterate faster; reduce errors due to manual orchestration; and build repeatable mechanisms. Each step in the pipeline can keep track of the lineage, and intermediate steps can be cached for quickly rerunning the pipeline. You can create pipelines using the SageMaker Python SDK.

A workflow built with SageMaker Pipelines consists of a sequence of steps forming a Directed Acyclic Graph (DAG). In this example, we begin with a processing step, which runs a SageMaker Processing job based on the Data Wrangler’s flow file to create a training dataset. We then continue with a training step, where we train an XGBoost model using SageMaker’s built-in XGBoost algorithm and the training dataset created in the previous step. After a model has been trained, we end this workflow with a RegisterModel step to register the trained model with the SageMaker model registry.

MLOps workflow built with SageMaker Pipelines

Installation and walkthrough

To run this sample, we use a Jupyter notebook running Python3 on a Data Science kernel image in a Studio environment. You can also run it on a Jupyter notebook instance locally on your machine by setting up the credentials to assume the SageMaker execution role. The notebook is lightweight and can be run on an ml.t3.medium instance. Detailed step-by-step instructions can be found in the GitHub repository.

You can either use the export feature in Data Wrangler to generate the Pipelines code, or build your own script from scratch. In our sample repository, we use a combination of both approaches for simplicity. At a high level, the following are the steps to build and run the Pipelines workflow:

  1. Generate a flow file from Data Wrangler or use the setup script to generate a flow file from a preconfigured template.
  2. Create an Amazon Simple Storage Service (Amazon S3) bucket and upload your flow file and input files to the bucket. In our sample notebook, we use the SageMaker default S3 bucket.
  3. Follow the instructions in the notebook to create a Processor object based on the Data Wrangler flow file, and an Estimator object with the parameters of the training job.
    1. In our example, because we only use SageMaker features and the default S3 bucket, we can use Studio’s default execution role. The same AWS Identity and Access Management (IAM) role is assumed by the pipeline run, the processing job, and the training job. You can further restrict the execution role to follow the principle of least privilege.
  4. Continue with the instructions to create a pipeline with steps referencing the Processor and Estimator objects, and then run the pipeline (a simplified sketch of these steps follows this list). The processing and training jobs run on SageMaker managed environments and take a few minutes to complete.
  5. In Studio, you can see the pipeline details and monitor the pipeline run. You can also monitor the underlying processing and training jobs from the SageMaker console or from Amazon CloudWatch.
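The following simplified sketch shows how steps 3 and 4 fit together with the SageMaker Python SDK. It is not the exact code the export feature generates: the Data Wrangler container URI, flow file location, and output node name are placeholders, and the exported notebook additionally wires in the raw data inputs, the training step, and the RegisterModel step.

import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholders: the Data Wrangler image URI is Region-specific, and the output name
# must match the output node ID in your .flow file (the export feature fills these in)
data_wrangler_image_uri = "<region-specific Data Wrangler container URI>"
flow_s3_uri = "s3://<bucket>/<prefix>/flow.flow"
output_name = "<flow output node ID>.default"

processor = Processor(
    role=role,
    image_uri=data_wrangler_image_uri,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    sagemaker_session=session,
)

data_wrangler_step = ProcessingStep(
    name="DataWranglerProcessingStep",
    processor=processor,
    inputs=[ProcessingInput(source=flow_s3_uri, destination="/opt/ml/processing/flow", input_name="flow")],
    outputs=[ProcessingOutput(output_name=output_name, source="/opt/ml/processing/output")],
)

# A TrainingStep and a RegisterModel step referencing the processing output would follow
pipeline = Pipeline(name="DataWranglerPipeline", steps=[data_wrangler_step])
pipeline.upsert(role_arn=role)
execution = pipeline.start()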

Integrate Data Wrangler with Step Functions

With Step Functions, you can express complex business logic as low-code, event-driven workflows that connect different AWS services. The Step Functions Data Science SDK is an open-source library that allows data scientists to create workflows that can preprocess datasets and build, deploy, and monitor ML models using SageMaker and Step Functions. Step Functions is based on state machines and tasks. Step Functions creates workflows out of steps called states, and expresses that workflow in the Amazon States Language. When you create a workflow using the Step Functions Data Science SDK, it creates a state machine representing your workflow and steps in Step Functions.

For this use case, we built a Step Functions workflow based on the common pattern used in this post that includes a processing step, training step, and RegisterModel step. In this case, we import these steps from the Step Functions Data Science Python SDK. We chain these steps in the same order to create a Step Functions workflow. The workflow uses the flow file that was generated from Data Wrangler, but you can also use your own Data Wrangler flow file. We reuse some code from the Data Wrangler export feature for simplicity. We run the data preprocessing logic generated by the Data Wrangler flow file to create a training dataset, train a model using the XGBoost algorithm, and save the trained model artifact as a SageMaker model. Additionally, in the GitHub repo, we also show how Step Functions allows us to catch errors and handle failures and retries with FailStateStep and CatchStateStep.

The resulting flow diagram, as shown in the following screenshot, is available on the Step Functions console after the workflow has started. This helps data scientists and engineers visualize the entire workflow and every step within it, and access the linked CloudWatch logs for every step.

MLOps workflow built with Step Functions

Installation and walkthrough

To run this sample, we use a Python notebook running with a Data Science kernel in a Studio environment. You can also run it on a Python notebook instance locally on your machine by setting up the credentials to assume the SageMaker execution role. The notebook is lightweight and can be run on an ml.t3.medium instance, for example. Detailed step-by-step instructions can be found in the GitHub repository.

You can either use the export feature in Data Wrangler to generate the Pipelines code and modify it for Step Functions or build your own script from scratch. In our sample repository, we use a combination of both approaches for simplicity. At a high level, the following are the steps to build and run the Step Functions workflow:

  1. Generate a flow file from Data Wrangler or use the setup script to generate a flow file from a preconfigured template.
  2. Create an S3 bucket and upload your flow file and input files to the bucket.
  3. Configure your SageMaker execution role with the required permissions as mentioned earlier. Refer to the GitHub repository for detailed instructions.
  4. Follow the instructions to run the notebook in the repository to start a workflow. The processing job runs on a SageMaker-managed Spark environment and can take a few minutes to complete.
  5. Go to the Step Functions console and track the workflow visually. You can also navigate to the linked CloudWatch logs to debug errors.

Let’s review some important sections of the code here. To define the workflow, we first define the steps in the workflow for the Step Functions state machine. The first step is the data_wrangler_step for data processing, which uses the Data Wrangler flow file as an input to transform the raw data files. We also define a model training step and a model creation step named training_step and model_step, respectively. Finally, we create a workflow by chaining all the steps we created, as shown in the following code:

from stepfunctions.steps import Chain
from stepfunctions.workflow import Workflow
import uuid

workflow_graph = Chain([data_wrangler_step, training_step, model_step])

branching_workflow = Workflow(
    name="Wrangler-SF-Run-{}".format(uuid.uuid1().hex),
    definition=workflow_graph,
    role=iam_role,
)
branching_workflow.create()
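For reference, the three steps referenced in the preceding code could be defined along the following lines with the Step Functions Data Science SDK. This is a hedged sketch: the processor, estimator, inputs, and outputs are placeholders that, in the repository, come from the code generated by the Data Wrangler export feature.

from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import ProcessingStep, TrainingStep, ModelStep

# Job names are supplied as parameters when the workflow runs
execution_input = ExecutionInput(
    schema={"ProcessingJobName": str, "TrainingJobName": str, "ModelName": str}
)

data_wrangler_step = ProcessingStep(
    "DataWranglerProcessingStep",
    processor=data_wrangler_processor,        # Processor built from the .flow file (placeholder)
    job_name=execution_input["ProcessingJobName"],
    inputs=processing_inputs,                 # flow file plus raw data inputs (placeholder)
    outputs=processing_outputs,               # transformed training dataset (placeholder)
)

training_step = TrainingStep(
    "TrainingStep",
    estimator=xgb_estimator,                  # SageMaker built-in XGBoost estimator (placeholder)
    data=training_data_inputs,                # S3 location of the processing output (placeholder)
    job_name=execution_input["TrainingJobName"],
)

model_step = ModelStep(
    "CreateModelStep",
    model=training_step.get_expected_model(),
    model_name=execution_input["ModelName"],
)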

In our example, we built the workflow to take job names as parameters because they’re unique and need to be randomly generated during every pipeline run. We pass these names when the workflow runs. You can also schedule the Step Functions workflow to run using Amazon CloudWatch (see Schedule a Serverless Workflow with AWS Step Functions and Amazon CloudWatch), invoke it using Amazon S3 events, or invoke it from Amazon EventBridge (see Create an EventBridge rule that triggers a Step Functions workflow). For demonstration purposes, we can invoke the Step Functions workflow from the Step Functions console UI or using the following code from the notebook.

# Execute workflow
execution = branching_workflow.execute(
    inputs={
        "ProcessingJobName": processing_job_name,  # Unique processing job name
        "TrainingJobName": training_job_name,      # Unique training job name
        "ModelName": model_name,                   # Unique model name
    }
)
execution_output = execution.get_output(wait=True)

Integrate Data Wrangler with Apache Airflow

Another popular way of creating ML workflows is using Apache Airflow. Apache Airflow is an open-source platform that allows you to programmatically author, schedule, and monitor workflows. Amazon MWAA makes it easy to set up and operate end-to-end ML pipelines with Apache Airflow in the cloud at scale. An Airflow pipeline consists of a sequence of tasks, also referred to as a workflow. A workflow is defined as a DAG that encapsulates the tasks and the dependencies between them, defining how they should run within the workflow.

We created an Airflow DAG within an Amazon MWAA environment to implement our MLOps workflow. Each task in the workflow is an executable unit, written in Python programming language, that performs some action. A task can either be an operator or a sensor. In our case, we use an Airflow Python operator along with SageMaker Python SDK to run the Data Wrangler Python script, and use Airflow’s natively supported SageMaker operators to train the SageMaker built-in XGBoost algorithm and create the model from the resulting artifacts. We also created a helpful custom Data Wrangler operator (SageMakerDataWranglerOperator) for Apache Airflow, which you can use to process Data Wrangler flow files for data processing without the need for any additional code.

The following screenshot shows the Airflow DAG with five steps to implement our MLOps workflow.

The Start step uses a Python operator to initialize configurations for the rest of the steps in the workflow. SageMaker_DataWrangler_Step uses SageMakerDataWranglerOperator and the data flow file we created earlier. SageMaker_training_step and SageMaker_create_model_step use the built-in SageMaker operators for model training and model creation, respectively. Our Amazon MWAA environment uses the smallest instance type (mw1.small), because the bulk of the processing is done via SageMaker Processing jobs, which use their own instance type that can be defined as configuration parameters within the workflow.

Installation and walkthrough

Detailed step-by-step installation instructions to deploy this solution can be found in our GitHub repository. We used a Jupyter notebook with Python code cells to set up the Airflow DAG. Assuming you have already generated the data flow file, the following is a high-level overview of the installation steps:

  1. Create an S3 bucket and subsequent folders required by Amazon MWAA.
  2. Create an Amazon MWAA environment. Note that we used Airflow version 2.0.2 for this solution.
  3. Create a requirements.txt file with all the Python dependencies required by the Airflow tasks and upload it to the /requirements directory within the Amazon MWAA primary S3 bucket. This is used by the managed Airflow environment to install the Python dependencies.
  4. Upload the SMDataWranglerOperator.py file to the /dags directory. This Python script contains code for the custom Airflow operator for Data Wrangler. This operator can be used for tasks to process any .flow file.
  5. Create and upload the config.py script to the /dags directory. This Python script is used for the first step of our DAG to create configuration objects required by the remaining steps of the workflow.
  6. Finally, create and upload the ml_pipelines.py file to the /dags directory. This script contains the DAG definition for the Airflow workflow. This is where we define each of the tasks, and set up dependencies between them. Amazon MWAA periodically polls the /dags directory to run this script to create the DAG or update the existing one with any latest changes.

The following is the code for SageMaker_DataWrangler_step, which uses the custom SageMakerDataWranglerOperator. With just a few lines of code in your DAG definition Python script, you can point the SageMakerDataWranglerOperator to the Data Wrangler flow file location (which is an S3 location). Behind the scenes, this operator uses SageMaker Processing jobs to process the .flow file in order to apply the defined transforms to your raw data files. You can also specify the type of instance and number of instances needed by the Data Wrangler processing job.

# Airflow Data Wrangler operator 
from SMDataWranglerOperator import SageMakerDataWranglerOperator 
preprocess_task = SageMakerDataWranglerOperator( task_id='DataWrangler_Processing_Step', 
                                                 dag=dag, 
                                                 flow_file_s3uri = flow_uri, 
                                                 processing_instance_count=2, 
                                                 instance_type='ml.m5.4xlarge', 
                                                 aws_conn_id="aws_default", 
                                                 config=config)

The config parameter accepts a dictionary (key-value pairs) of additional configurations required by the processing job, such as the output prefix of the final output file, type of output file (CSV or Parquet), and URI for the built-in Data Wrangler container image. The following code is what the config dictionary for SageMakerDataWranglerOperator looks like. These configurations are required for a SageMaker Processing processor. For details of each of these config parameters, refer to sagemaker.processing.Processor().

{
    "sagemaker_role": # required; SageMaker IAM role name or ARN
    "s3_data_type": # optional; defaults to "S3Prefix"
    "s3_input_mode": # optional; defaults to "File"
    "s3_data_distribution_type": # optional; defaults to "FullyReplicated"
    "kms_key": # optional; defaults to None
    "volume_size_in_gb": # optional; defaults to 30
    "enable_network_isolation": # optional; defaults to False
    "wait_for_processing": # optional; defaults to True
    "container_uri": # optional; defaults to the built-in container URI
    "container_uri_pinned": # optional; defaults to the built-in container URI
    "outputConfig": {
        "s3_output_upload_mode": # optional; defaults to "EndOfJob"
        "output_content_type": # optional; defaults to "CSV"
        "output_bucket": # optional; defaults to the SageMaker default bucket
        "output_prefix": # optional; defaults to None; prefix within the bucket where output is written
    }
}
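With the preprocessing task defined, the remaining tasks can be declared and chained in the DAG definition script (ml_pipelines.py) roughly as follows. This is a sketch, not the exact repository code: the import paths depend on your Airflow and Amazon provider versions, and config_wrangler, config_training, and config_model stand in for the configuration dictionaries built by config.py.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator
from airflow.providers.amazon.aws.operators.sagemaker_model import SageMakerModelOperator
from SMDataWranglerOperator import SageMakerDataWranglerOperator
from config import config_wrangler, config_training, config_model  # hypothetical helpers from config.py

with DAG(
    dag_id="sagemaker_data_wrangler_mlops",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Initialize configurations for the rest of the workflow (placeholder callable)
    start = PythonOperator(task_id="Start", python_callable=lambda: "configs initialized")

    preprocess = SageMakerDataWranglerOperator(
        task_id="SageMaker_DataWrangler_Step",
        flow_file_s3uri="s3://<bucket>/<prefix>/flow.flow",  # placeholder flow file location
        processing_instance_count=2,
        instance_type="ml.m5.4xlarge",
        aws_conn_id="aws_default",
        config=config_wrangler,                 # placeholder Data Wrangler config dict
    )

    train = SageMakerTrainingOperator(
        task_id="SageMaker_training_step",
        config=config_training,                 # placeholder training job config dict
        aws_conn_id="aws_default",
        wait_for_completion=True,
    )

    create_model = SageMakerModelOperator(
        task_id="SageMaker_create_model_step",
        config=config_model,                    # placeholder model creation config dict
        aws_conn_id="aws_default",
    )

    start >> preprocess >> train >> create_model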

Clean up

To avoid incurring future charges, delete the resources created for the solutions you implemented.

  1. Follow the instructions provided in the GitHub repository to clean up resources created by the SageMaker Pipelines solution.
  2. Follow the instructions provided in the GitHub repository to clean up resources created by the Step Functions solution.
  3. Follow the instructions provided in the GitHub repository to clean up resources created by the Amazon MWAA solution.

Conclusion

This post demonstrated how you can easily integrate Data Wrangler with some of the well-known workflow automation and orchestration technologies in AWS. We first reviewed a sample use case and architecture for the solution that uses Data Wrangler for data preprocessing. We then demonstrated how you can integrate Data Wrangler with Pipelines, Step Functions, and Amazon MWAA.

As a next step, you can find and try out the code samples and notebooks in our GitHub repository using the detailed instructions for each of the solutions discussed in this post. To learn more about how Data Wrangler can help your ML workloads, visit the Data Wrangler product page and Prepare ML Data with Amazon SageMaker Data Wrangler.


About the authors

Rodrigo Alarcon is a Senior ML Strategy Solutions Architect with AWS based out of Santiago, Chile. In his role, Rodrigo helps companies of different sizes generate business outcomes through cloud-based AI and ML solutions. His interests include machine learning and cybersecurity.

Ganapathi Krishnamoorthi is a Senior ML Solutions Architect at AWS. Ganapathi provides prescriptive guidance to startup and enterprise customers helping them to design and deploy cloud applications at scale. He is specialized in machine learning and is focused on helping customers leverage AI/ML for their business outcomes. When not at work, he enjoys exploring outdoors and listening to music.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and data analytics. Anjan is part of the worldwide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Read More

Tiny cars and big talent show Canadian policymakers the power of machine learning

In the end, it came down to 213 thousandths of a second! That was the difference between the two best times in the finale of the first AWS DeepRacer Student Wildcard event, hosted in Ottawa, Canada this May.

I watched in awe as 13 students competed in a live wildcard race for the AWS DeepRacer Student League, the first global autonomous racing league for students, which offers educational material and resources to get hands-on and get started with machine learning (ML).

Students hit the starting line to put their ML skills to the test in Canada’s capital, where members of parliament cheered them on, including Parliamentary Secretary for Innovation, Science and Economic Development, Andy Fillmore. Daphne Hong, a fourth-year engineering student at the University of Calgary, won the race with a lap time of 11.167 seconds. Not far behind were Nixon Chan from University of Waterloo and Vijayraj Kharod from Toronto Metropolitan University.

Daphne was victorious after battling nerves earlier in the day, when she struggled to turn the corners during her practice runs and had to quickly adjust her model. “After seeing how the physical track did compared to the virtual one throughout the day, I was able to make some adjustments and overcome those corners and round them as I intended, so I’m super, super happy about that,” said a beaming Daphne after being presented with her championship trophy.

Daphne also received a $1,000 Amazon Canada Gift Card, while the second and third place racers — Nixon Chan and Vijayraj Kharod — got trophies and $500 gift cards. The top two contestants now have a chance to race virtually in the AWS DeepRacer Student League finale in October. “The whole experience feels like a win for me,” said DeepRacer participant Connor Hunszinger from the University of Alberta.

The event not only highlighted the importance of machine learning education to Canadian policymakers, but also made clear that these young Canadians could be poised to do great things with their ML skills.

The road to the Ottawa Wildcard

This Ottawa race is one of several wildcard events taking place around the world this year as part of the AWS DeepRacer Student League to bring students together to compete live in person. The top two finalists in each Wildcard race will have the opportunity to compete in the AWS DeepRacer Student League finale, with a chance of winning up to $5,000 USD towards their tuition. The top three racers from the student league finale in October will advance to the global AWS DeepRacer League Championship held at AWS re:Invent in Las Vegas this December.

Students who raced in Ottawa began their journey this March when they competed in the global AWS DeepRacer Student League by submitting their model to the virtual 3D simulation environment and posting times to the leaderboard. From the student league, the top student racers across Canada were selected to compete in the wildcard event. Students trained their models in preparation for the event through the virtual environment and then applied their ML models for the first time on a physical track in Ottawa. Each student competitor was given one three-minute attempt to complete their fastest lap with only the speed of the car being controlled.

“Honestly, I don’t really consider my peers here my competitors. I loved being able to work with them. It seems more like a friendly, supportive and collaborative environment. We were always cheering each other on,” says Daphne Hong, AWS DeepRacer Student League Canada Wildcard winner. “This event is great because it allows people who don’t really have that much AI or ML experience to learn more about the industry and see it live with these cars. I want to share my findings and my knowledge with those around me, those in my community and spread the word about ML and AI.”

Building access to machine learning in Canada

Machine learning talent is in hot demand, making up a large portion of AI job postings in Canada. The Canadian economy needs people with the skills recently on display at the DeepRacer event, and Canadian policymakers are intent on building an AI talent pool.

According to the World Economic Forum, 58 million jobs will be created by the growth of machine learning in the next few years, but right now, there are only 300,000 engineers with the relevant training to build and deploy ML models.

That means organizations of all types must not only train their existing workers with ML skills, but also invest in training programs and solutions to develop those capabilities for future workers. AWS is doing its part with a multitude of products for learners of all levels.

  • AWS Artificial Intelligence and Machine Learning Scholarship, a $10 million education and scholarship program, aimed at preparing underserved and underrepresented students in tech globally for careers in the space.
  • AWS DeepRacer, the world’s first global autonomous racing league, open to developers globally to get started in ML with a 1/18th scale race car driven by reinforcement learning. Developers can compete in the global racing league for prizes and rewards.
  • AWS DeepRacer Student, a version of AWS DeepRacer open to students 16 years and older globally with free access to 20 hours of ML educational content and 10 hours of compute resources for model training monthly at no cost. Participants can compete in the global racing league exclusively for students to win scholarships and prizes.
  • Machine Learning University, self-service ML training courses with learn-at-your-own-pace educational content built by Amazon’s ML scientists.

Cloud computing makes access to machine learning technology a lot easier, faster — and fun, if the AWS DeepRacer Student League Wildcard event was any indication. The race was created by AWS, as an enjoyable, hands-on way to make ML more widely accessible to anyone interested in the technology.

Get started with your machine learning journey and take part in the AWS DeepRacer Student League today for your chance to win prizes and glory.


About the author

Nicole Foster is Director of AWS Global AI/ML and Canada Public Policy at Amazon, where she leads the direction and strategy of artificial intelligence public policy for Amazon Web Services (AWS) around the world as well as the company’s public policy efforts in support of the AWS business in Canada. In this role, she focuses on issues related to emerging technology, digital modernization, cloud computing, cyber security, data protection and privacy, government procurement, economic development, skilled immigration, workforce development, and renewable energy policy.

Read More

Predict shipment ETA with no-code machine learning using Amazon SageMaker Canvas

Logistics and transportation companies track ETA (estimated time of arrival), which is a key metric for their business. Their downstream supply chain activities are planned based on this metric. However, delays often occur, and the ETA might differ from the product’s or shipment’s actual time of arrival (ATA), for instance due to shipping distance or carrier-related or weather-related issues. This impacts the entire supply chain, in many instances reducing productivity and increasing waste and inefficiencies. Predicting the exact day a product arrives to a customer is challenging because it depends on various factors such as order type, carrier, origin, and distance.

Analysts working in the logistics and transportation industry have domain expertise and knowledge of shipping and logistics attributes. However, they need to be able to generate accurate shipment ETA forecasts for efficient business operations. They need an intuitive, easy-to-use, no-code capability to create machine learning (ML) models for predicting shipping ETA forecasts.

To help achieve the agility and effectiveness that business analysts seek, we launched Amazon SageMaker Canvas, a no-code ML solution that helps companies accelerate solutions to business problems quickly and easily. SageMaker Canvas provides business analysts with a visual point-and-click interface that allows you to build models and generate accurate ML predictions on your own—without requiring any ML experience or having to write a single line of code.

In this post, we show how to use SageMaker Canvas to predict shipment ETAs.

Solution overview

Although ML development is a complex and iterative process, we can generalize an ML workflow into business requirements analysis, data preparation, model development, and model deployment stages.

SageMaker Canvas abstracts the complexities of data preparation and model development, so you can focus on delivering value to your business by drawing insights from your data without a deep knowledge of the data science domain. The following architecture diagram highlights the components in a no-code or low-code solution.

The following are the steps as outlined in the architecture:

  1. Download the dataset to your local machine.
  2. Import the data into SageMaker Canvas.
  3. Join your datasets.
  4. Prepare the data.
  5. Build and train your model.
  6. Evaluate the model.
  7. Test the model.
  8. Share the model for deployment.

Let’s assume you’re a business analyst assigned to the product shipment tracking team of a large logistics and transportation organization. Your shipment tracking team has asked you to assist in predicting the shipment ETA. They have provided you with a historical dataset that contains characteristics tied to different products and their respective ETA, and want you to predict the ETA for products that will be shipped in the future.

We use SageMaker Canvas to perform the following steps:

  1. Import our sample datasets.
  2. Join the datasets.
  3. Train and build the shipment ETA prediction model.
  4. Analyze the model results.
  5. Test predictions against the model.

Dataset overview

We use two datasets (shipping logs and product description) in CSV format, which contain shipping log information and certain characteristics of a product, respectively.

The ShippingLogs dataset contains the complete shipping data for all products delivered, including estimated time, shipping priority, carrier, and origin. It has approximately 10,000 rows and 12 feature columns. The following list summarizes the data schema.

  • ActualShippingDays – Number of days it took to deliver the shipment
  • Carrier – Carrier used for shipment
  • YShippingDistance – Distance of shipment on the Y-axis
  • XShippingDistance – Distance of shipment on the X-axis
  • ExpectedShippingDays – Expected days for shipment
  • InBulkOrder – Whether it is a bulk order
  • ShippingOrigin – Origin of shipment
  • OrderDate – Date when the order was placed
  • OrderID – Order ID
  • ShippingPriority – Priority of shipping
  • OnTimeDelivery – Whether the shipment was delivered on time
  • ProductId – Product ID

The ProductDescription dataset contains metadata information of the product that is being shipped in the order. This dataset has approximately 10,000 rows and 5 feature columns. The following list summarizes the data schema.

  • ComputerBrand – Brand of the computer
  • ComputeModel – Model of the computer
  • ScreeenSize – Screen size of the computer
  • PackageWeight – Package weight
  • ProductID – Product ID

Prerequisites

An IT administrator with an AWS account with appropriate permissions must complete the following prerequisites:

  1. Deploy an Amazon SageMaker domain. For instructions, see Onboard to Amazon SageMaker Domain.
  2. Launch SageMaker Canvas. For instructions, see Setting up and managing Amazon SageMaker Canvas (for IT administrators).
  3. Configure cross-origin resource sharing (CORS) policies in Amazon Simple Storage Service (Amazon S3) for SageMaker Canvas to enable the upload option from local disk. For instructions, see Give your users the ability to upload local files.

Import the dataset

First, download the datasets (shipping logs and product description) and review the files to make sure all the data is there.

SageMaker Canvas provides several sample datasets in your application to help you get started. To learn more about the SageMaker-provided sample datasets you can experiment with, see Use sample datasets. If you use the sample datasets (canvas-sample-shipping-logs.csv and canvas-sample-product-descriptions.csv) available within SageMaker Canvas, you don’t have to import the shipping logs and product description datasets.

You can import data from different data sources into SageMaker Canvas. If you plan to use your own dataset, follow the steps in Importing data in Amazon SageMaker Canvas.

For this post, we use the full shipping logs and product description datasets that we downloaded.

  1. Sign in to the AWS Management Console, using an account with the appropriate permissions to access SageMaker Canvas.
  2. On the SageMaker Canvas console, choose Import.
  3. Choose Upload and select the files ShippingLogs.csv and ProductDescriptions.csv.
  4. Choose Import data to upload the files to SageMaker Canvas.

Create a consolidated dataset

Next, let’s join the two datasets.

  1. Choose Join data.
  2. Drag and drop ShippingLogs.csv and ProductDescriptions.csv from the left pane under Datasets to the right pane.
    The two datasets are joined using ProductID as the inner join reference.
  3. Choose Import and enter a name for the new joined dataset.
  4. Choose Import data.

You can choose the new dataset to preview its contents.

After you review the dataset, you can create your model.

Build and train model

To build and train your model, complete the following steps:

  1. For Model name, enter ShippingForecast.
  2. Choose Create.
    In the Model view, you can see four tabs, which correspond to the four steps to create a model and use it to generate predictions: Select, Build, Analyze, and Predict.
  3. On the Select tab, select the ConsolidatedShippingData you created earlier. You can see that this dataset comes from Amazon S3 and has 12 columns and 10,000 rows.
  4. Choose Select dataset.

    SageMaker Canvas automatically moves to the Build tab.
  5. On the Build tab, choose the target column, in our case ActualShippingDays.
    Because we’re interested in how many days it will take for the goods to reach the customer, SageMaker Canvas automatically detects that this is a numeric prediction problem (also known as regression). Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. Because we also have a column with time series data (OrderDate), SageMaker Canvas may interpret this as a time series forecast model type.
  6. Before advancing, make sure that the model type is indeed Numeric model type; if that’s not the case, you can select it with the Change type option.

Data preparation

In the bottom half of the page, you can look at some of the statistics of the dataset, including missing and mismatched values, unique values, and mean and median values.

Column view provides you with the listing of all columns, their data types, and their basic statistics, including missing and mismatched values, unique values, and mean and median values. This can help you devise a strategy to handle missing values in the datasets.

Grid view provides you with a graphical distribution of values for each column and the sample data. You can start inferring which columns are relevant for training the model.

Let’s preview the model to see the estimated RMSE (root mean squared error) for this numeric prediction.

You can also drop some of the columns if you don’t want to use them for the prediction, by simply deselecting them. For this post, we deselect the OrderID column. Because it’s a primary key, it doesn’t carry valuable information and doesn’t add value to the model training process.

You can choose Preview model to get insights on feature importance and iterate the model quickly. We also see the RMSE is now 1.223, which is improved from 1.225. The lower the RMSE, the better a given model is able to fit a dataset.

From our exploratory data analysis, we can see that the dataset doesn’t have a lot of missing values. Therefore, we don’t have to handle missing values. If you see a lot of missing values for your features, you can filter the missing values.

To extract more insights, you can proceed with a datetime extraction. With the datetime extraction transform, you can extract values from a datetime column to a separate column.

To perform a datetime extraction, complete the following steps:

  1. On the Build tab of the SageMaker Canvas application, choose Extract.
  2. Choose the column from which you want to extract values (for this post, OrderDate).
  3. For Value, choose one or more values to extract from the column. For this post, we choose Year and Month. The values you can extract from a timestamp column are Year, Month, Day, Hour, Week of year, Day of year, and Quarter.
  4. Choose Add to add the transform to the model recipe.

SageMaker Canvas creates a new column in the dataset for each of the values you extract.

Model training

It’s time to finally train the model! Before building a complete model, it’s a good practice to have a general idea about the performances that our model will have by training a quick model. A quick model trains fewer combinations of models and hyperparameters in order to prioritize speed over accuracy. This is helpful in cases like ours where we want to prove the value of training an ML model for our use case. Note that the quick build option isn’t available for models bigger than 50,000 rows.

Now we wait anywhere from 2–15 minutes for the quick build to finish training our model.

Evaluate model performance

When training is complete, SageMaker Canvas automatically moves to the Analyze tab to show us the results of our quick training, as shown in the following screenshot.

You may experience slightly different values. This is expected. Machine learning introduces some variation in the process of training models, which can lead to different results for different builds.

Let’s focus on the Overview tab. This tab shows you the column impact, or the estimated importance of each column in predicting the target column. In this example, the ExpectedShippingDays column has the most significant impact in our predictions.

On the Scoring tab, you can see a plot representing the best fit regression line for ActualShippingDays. On average, the model prediction has a difference of +/- 0.7 from the actual value of ActualShippingDays. The Scoring section for numeric prediction shows a line indicating the model’s predicted value in relation to the data used to make predictions. The predicted values typically fall within +/- the RMSE of the actual values, and the width of the purple band around the line indicates the RMSE range.

As the thickness of the RMSE band on a model increases, the accuracy of the prediction decreases. As you can see, the model predicts with high accuracy to begin with (narrow band), and as the value of ActualShippingDays increases (17–22), the band becomes thicker, indicating lower accuracy.

The Advanced metrics section contains information for users who want a deeper understanding of their model performance. The metrics for numeric prediction are as follows (a small numeric illustration follows the list):

  • R2 – The percentage of the variance in the target column that can be explained by the input columns.
  • MAE – Mean absolute error. On average, the prediction for the target column is +/- {MAE} from the actual value.
  • MAPE – Mean absolute percent error. On average, the prediction for the target column is +/- {MAPE} % from the actual value.
  • RMSE – Root mean square error. The standard deviation of the errors.
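To make these definitions concrete, the following small sketch computes each metric with NumPy on a tiny synthetic set of actual and predicted shipping days (illustrative numbers only, not the values from this model).

import numpy as np

# Tiny synthetic example for illustration only
actual = np.array([5.0, 7.0, 6.0, 9.0, 4.0])      # ActualShippingDays
predicted = np.array([5.5, 6.4, 6.2, 8.1, 4.3])   # model predictions

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))                                   # root mean square error
mae = np.mean(np.abs(errors))                                          # mean absolute error
mape = np.mean(np.abs(errors / actual)) * 100                          # mean absolute percent error
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)   # R-squared

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.1f}%  R2={r2:.3f}")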

The following screenshot shows a graph of the residuals or errors. The horizontal line indicates an error of 0 or a perfect prediction. The blue dots are the errors. Their distance from the horizontal line indicates the magnitude of the errors.

R-squared is a statistical measure of how close the data is to the fitted regression line. In this case, an R-squared of 87% indicates that the model explains 87% of the variability of the response data around its mean.

On average, the prediction for the target column is +/- 0.709 (MAE) from the actual value. This indicates that on average the model predicts the target within less than a day, which is useful for planning purposes.

The model has an RMSE (standard deviation of the errors) of 1.223.

The following image shows an error density plot.

You now have two options as next steps:

  • You can use this model to run some predictions by choosing Predict.
  • You can create a new version of this model to train with the Standard build option. This will take much longer—about 4–6 hours—but will produce more accurate results.

Because we feel confident about using this model given the performances we’ve seen, we opt to go ahead and use the model for predictions. If you weren’t confident, you could have a data scientist review the modeling SageMaker Canvas did and offer potential improvements.

Note that training a model with the Standard build option is necessary to share the model with a data scientist with the Amazon SageMaker Studio integration.

Generate predictions

Now that the model is trained, let’s generate some predictions.

  1. Choose Predict on the Analyze tab, or choose the Predict tab.
  2. Choose Batch prediction.
  3. Choose Select dataset, and choose the dataset ConsolidatedShipping.csv.

SageMaker Canvas uses this dataset to generate our predictions. Although it’s generally not a good idea to use the same dataset for both training and testing, we use the same dataset here for the sake of simplicity. You can also import another dataset if you desire.

After a few seconds, the prediction is done and you can choose the eye icon to see a preview of the predictions, or choose Download to download a CSV file containing the full output.

You can also choose to predict values one by one by selecting Single prediction instead of Batch prediction. SageMaker Canvas then shows you a view where you can provide the values for each feature manually and generate a prediction. This is ideal for situations like what-if scenarios—for example, how does ActualShippingDays change if the ShippingOrigin is Houston? What if we used a different carrier? What if the PackageWeight is different?

Standard build

Standard build chooses accuracy over speed. If you want to share the artifacts of the model with your data scientist and ML engineers, you may choose to create a standard build next.

First add a new version.

Then choose Standard build.

The Analyze tab shows your build progress.

When the model is complete, you can observe that the RMSE value of the standard build is 1.147, compared to 1.223 with the quick build.

After you create a standard build, you can share the model with data scientists and ML engineers for further evaluation and iteration.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

Conclusion

In this post, we showed how a business analyst can create a shipment ETA prediction model with SageMaker Canvas using sample data. SageMaker Canvas allows you to create accurate ML models and generate predictions using a no-code, visual, point-and-click interface. Analysts can take this to the next level by sharing their models with data scientist colleagues. The data scientists can view the SageMaker Canvas model in Studio, where they can explore the choices SageMaker Canvas made to generate ML models, validate model results, and even take the model to production with a few clicks. This can accelerate ML-based value creation and help scale improved outcomes faster.


About the authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about the cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Meenakshisundaram Thandavarayan is a Senior AI/ML Specialist with a passion for designing, creating, and promoting human-centered data and analytics experiences. He supports AWS strategic customers on their transformation toward becoming data-driven organizations.

Read More