Create and manage Amazon EMR Clusters from SageMaker Studio to run interactive Spark and ML workloads – Part 2

In Part 1 of this series, we offered step-by-step guidance for creating, connecting, stopping, and debugging Amazon EMR clusters from Amazon SageMaker Studio in a single-account setup.

In this post, we dive deep into how you can use the same functionality in certain enterprise-ready, multi-account setups. As described in the AWS Well-Architected Framework, separating workloads across accounts enables your organization to set common guardrails while isolating environments. This can be particularly useful for meeting certain security requirements, and it simplifies cost allocation between projects and teams.

Solution overview

In this post, we go through the process to achieve the following architectural setup. We present the same simple interface as we saw in Part 1 for our data workers, abstracting away multi-account details from their day-to-day workflow when not needed.

We first describe how to set up your cross-account networks in order to connect to Amazon EMR from Studio. To start, we need to make sure that some prerequisites are set correctly. For our example, a DevOps admin needs to configure an Amazon SageMaker domain with an elastic network interface to a private VPC and specify the security group ID to attach.

Set up the network

After we set up the Studio domain, we need to configure our network settings to allow communication between accounts.

VPC peering

We start with VPC peering between the accounts in order to facilitate traffic back and forth.

  1. From our Studio account, on the Amazon Virtual Private Cloud (Amazon VPC) console, choose Peering connections.
  2. Choose Create peering connection.
  3. Create your request to peer the Studio VPC with the Amazon EMR account’s VPC.

After you make the peering request, the admin can accept this request from the second account.

When peering private subnets, you should enable private IP DNS resolution at the VPC peering connection level.
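
If you prefer to script these steps, the following boto3 sketch shows the same flow. The VPC IDs, account IDs, and Region are placeholders, not values from this post, and the two clients need credentials for their respective accounts.

import boto3

REGION = "us-east-1"  # placeholder

# From the Studio account: request peering with the EMR account's VPC.
ec2_studio = boto3.client("ec2", region_name=REGION)
peering = ec2_studio.create_vpc_peering_connection(
    VpcId="vpc-0studio0000000000",       # Studio VPC (placeholder)
    PeerVpcId="vpc-0emr00000000000",     # EMR account VPC (placeholder)
    PeerOwnerId="222222222222",          # EMR account ID (placeholder)
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# From the EMR account (separate credentials): accept the request.
ec2_emr = boto3.client("ec2", region_name=REGION)
ec2_emr.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Enable private IP DNS resolution on both sides; each side's options
# can only be modified from its own account.
ec2_studio.modify_vpc_peering_connection_options(
    VpcPeeringConnectionId=pcx_id,
    RequesterPeeringConnectionOptions={"AllowDnsResolutionFromRemoteVpc": True},
)
ec2_emr.modify_vpc_peering_connection_options(
    VpcPeeringConnectionId=pcx_id,
    AccepterPeeringConnectionOptions={"AllowDnsResolutionFromRemoteVpc": True},
)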

Route tables

After you establish the peering connection, you must enable the flow of traffic by manually adding routes to the private subnet route tables in both accounts. We do this to enable creation and connection of EMR clusters from the Studio account to the remote account’s private subnet.

These routes point to the IP address range of the peered VPC’s private subnets and are set on the Route Tables tab found on the subnet page. Here, the admin in each account can edit the routes.

The following route table of a Studio subnet sends traffic outbound from the Studio account, destined for 2.0.1.0/24, through a peering connection.

The following route table of an Amazon EMR subnet sends traffic outbound from the Amazon EMR account to Studio, destined for 10.0.20.0/24, through a peering connection.
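
As a scripted equivalent, an admin in each account could add these routes with boto3. The route table IDs and peering connection ID below are placeholders; the CIDR blocks match the example above.

import boto3

# In the Studio account: send traffic bound for the EMR subnet through the peering connection.
boto3.client("ec2").create_route(
    RouteTableId="rtb-0studio0000000000",            # Studio private subnet route table (placeholder)
    DestinationCidrBlock="2.0.1.0/24",               # EMR account private subnet CIDR
    VpcPeeringConnectionId="pcx-00000000000000000",  # placeholder
)

# In the EMR account: send traffic bound for the Studio subnet back the same way.
boto3.client("ec2").create_route(
    RouteTableId="rtb-0emr0000000000000",            # EMR private subnet route table (placeholder)
    DestinationCidrBlock="10.0.20.0/24",             # Studio private subnet CIDR
    VpcPeeringConnectionId="pcx-00000000000000000",  # placeholder
)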

Security groups

Lastly, the security group that is attached to your Studio domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound TCP traffic from the Studio instance security group.

The following screenshot shows the inbound rules configuration in your SageMaker account.

The following screenshot shows the inbound rules configuration in your Amazon EMR account.
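
If you’re automating the security group setup, a sketch of the EMR-side ingress rule might look like the following. The group IDs and account ID are placeholders, and you should narrow the port range to what your workload actually needs.

import boto3

# In the EMR account: allow inbound TCP from the Studio domain's security group.
# Cross-account security group references work over the VPC peering connection.
boto3.client("ec2").authorize_security_group_ingress(
    GroupId="sg-0emrprimary0000000",            # EMR primary node security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,                          # narrow this range for production
        "ToPort": 65535,
        "UserIdGroupPairs": [{
            "GroupId": "sg-0studio0000000000",  # Studio security group (placeholder)
            "UserId": "111111111111",           # Studio account ID (placeholder)
        }],
    }],
)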

Set up permissions

We need to create an AWS Identity and Access Management (IAM) role in the secondary Amazon EMR account that has the same Amazon EMR visibility permission as we saw in Part 1.

The following code shows the specific permissions for the IAM role. It’s the same as in Part 1, but includes the additional statement AllowRoleAssumptionForCrossAccountDiscovery:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowRoleAssumptionForCrossAccountDiscovery",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": ["arn:aws:iam::<cross-account>:role/<studio-execution-role>"]
        },
        {
            "Sid": "AllowEMRTemplateDiscovery",
            "Effect": "Allow",
            "Action": [
              "servicecatalog:SearchProducts"
            ],
            "Resource": "*"
        }
    ]
}

This assumable role also needs a trust relationship with the Studio account (be sure to modify the account ID):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
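
To sanity-check the trust relationship, you can attempt the role assumption from the Studio side. The following is a minimal boto3 sketch; the role ARN is a placeholder.

import boto3

# Running as the Studio execution role: confirm the remote role can be assumed.
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::222222222222:role/ASSUMABLE-ROLE",  # remote role (placeholder)
    RoleSessionName="cross-account-emr-discovery-test",
)["Credentials"]

# Use the temporary credentials to list clusters in the remote account.
emr = boto3.client(
    "emr",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(emr.list_clusters(ClusterStates=["RUNNING", "WAITING"]))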

User journey

The following diagram illustrates the user journey for a unified notebook experience after you connect your various accounts. Just as in the previous post, the DevOps persona creates an AWS Service Catalog product and portfolio within the Studio account, from which data workers can provision templated EMR clusters.

Again, it’s worth noting that you can modify the full set of properties for Amazon EMR when creating AWS CloudFormation templates that can be deployed through Studio. This means that you can enable Spot, auto scaling, and other popular configurations through your Service Catalog product.

You can parameterize the preset CloudFormation template, which creates the EMR cluster, so that end-users can modify different aspects of the cluster to match their workloads. For example, the data scientist or data engineer may want to specify the number of core nodes on the cluster, and the creator of the template can specify AllowedValues to set guardrails.

Discover EMR clusters across accounts

To enable cluster discovery across accounts, we need to provide the ARN of the previously created remote IAM role to the Studio execution role. The Studio execution role assumes that remote role to discover and connect to EMR clusters in the remote account. The ARN of this assumable cross-account role is loaded by the Studio Jupyter server at launch and determines which role to use for cross-account cluster discoverability. To set and modify these user-specific ARNs, admins can create a Lifecycle Configuration (LCC) associated with the Jupyter server app (not the kernel gateway app) that writes the role ARN onto the Amazon Elastic File System (Amazon EFS) home directory of each user. You can apply this LCC to the entire set of users, or to specific individuals for granular control over which clusters each user can view through assumed roles.

When the Jupyter server starts, lifecycle configurations run before the server reads the role ARNs written in the config file. This enables administrators to overwrite and fully control which cross-account ARNs are used at runtime. After the LCC runs and the files are written, the server reads the file /home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE/emr-discovery-iam-role-arns-DO_NOT_DELETE.json and stores the cross-account ARN. The following is an example LCC bash script:

#!/bin/bash

# This script creates the file that informs SageMaker Studio that the role
# "arn:aws:iam::123456789012:role/ASSUMABLE-ROLE" in remote account "123456789012"
# must be assumed to list and describe EMR clusters in the remote account.

set -eux

FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p "$FILE_DIRECTORY"

cat > "$FILE" <<- "EOF"
{
  "123456789012": "arn:aws:iam::123456789012:role/ASSUMABLE-ROLE"
}
EOF

At this point, a user can log in to their account; although they can modify this file, doing so has no impact on the admin’s ARN designation. The value is already stored by then, and because the LCC runs every time the Jupyter server app is started, the file is overwritten whenever the server restarts.
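
For reference, here is a sketch of how an admin might register the preceding script and attach it at the user-profile level with boto3. The domain ID, user profile name, and script path are placeholders.

import base64
import boto3

sm = boto3.client("sagemaker")

# Register the bash script shown above as a Jupyter server lifecycle configuration.
with open("cross-account-arn-lcc.sh", "rb") as f:   # path to the script above (placeholder)
    content = base64.b64encode(f.read()).decode()

lcc = sm.create_studio_lifecycle_config(
    StudioLifecycleConfigName="cross-account-emr-discovery",
    StudioLifecycleConfigContent=content,
    StudioLifecycleConfigAppType="JupyterServer",   # target the Jupyter server, not the kernel gateway
)

# Attach the LCC to a single user profile for granular, per-user cluster visibility.
sm.update_user_profile(
    DomainId="d-xxxxxxxxxxxx",                      # placeholder
    UserProfileName="studio-user",                  # placeholder
    UserSettings={
        "JupyterServerAppSettings": {
            "LifecycleConfigArns": [lcc["StudioLifecycleConfigArn"]],
        }
    },
)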

This configuration process can be completely abstracted away from the data workers who discover and connect to clusters within Studio. The only noticeable difference for cross-account clusters is that the browsing tab includes a column showing the account ID in which each cluster is housed.

Use EMR clusters across accounts

After you establish cross-account visibility, the process for creating and stopping clusters remains the same as in Part 1. Refer to our GitHub repository for example cross-account CloudFormation stacks.

After you deploy the Service Catalog product, the process for end-users to spin up a cluster remains the same. Simply go to the Clusters page and choose Create cluster.

After cluster creation, we connect to our cluster using the Clusters graphical interface in Studio Notebooks. This creates an auto-populated magic cell that appears largely the same as with a single account, but with an appended parameter for the assumable cross-account role.
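
For reference, the generated cell looks roughly like the following; the cluster ID and role ARN here are illustrative, and the exact flags Studio emits may differ.

%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id j-3DD9ZR01DAU14 --auth-type None --assumable-role-arn arn:aws:iam::222222222222:role/ASSUMABLE-ROLE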

After the connection is made, we can proceed with the demo as before. You can clone our GitHub example repo and run through the notebook example just as in Part 1.

Conclusion

In this second and final part of our series, we showed how Studio users can create, connect, debug, and stop EMR clusters in cross-account setups. After you set up the networking and permissions, the end-user experience is just as we saw in Part 1. We encourage you to utilize this new functionality of Studio in your multi-account workloads today!


About the Authors

Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team to build it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using machine learning. In his free time, he likes photographing the amazing geology of the American Southwest.

Prateek Mehrotra is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions which simplify usability by abstracting away complexity. In his spare time, Prateek enjoys spending time with his family and likes to explore the world with them.

Sriharsha M S is an AI/ML specialist solutions architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.

Sean Morgan is a Senior ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open source contributor/maintainer and is the special interest group lead for TensorFlow Addons.

Ruchir Tewari is a Senior Solutions Architect specializing in security and is a member of the ML TFC. For several years he has helped customers build secure architectures for a variety of hybrid, big data and AI/ML applications. He enjoys spending time with family, music and hikes in nature.

Luna Wang is a UX designer at AWS who has a background in computer science and interaction design. She is passionate about building customer-obsessed products and solving complex technical and business problems by using design methods. She is now working with a cross-functional team to build a set of new capabilities for interactive ML in SageMaker Studio.

Read More

Create and manage Amazon EMR Clusters from SageMaker Studio to run interactive Spark and ML workloads – Part 1

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. We recently introduced the ability to visually browse and connect to Amazon EMR clusters right from the Studio notebook. Starting today, you can now monitor and debug your Spark jobs running on Amazon EMR from Studio notebooks with just a single click. Additionally, you can now discover, connect to, create, stop, and manage EMR clusters directly from Studio.

We demonstrate these newly introduced capabilities in this two-part post.

Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Data workers such as data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for fast data preparation. Until today, these data workers could easily discover and connect to EMR clusters running in the same account as Studio but were unable to do so across accounts—a configuration common among several customer setups. Furthermore, when data workers needed to create EMR clusters tailored to their specific interactive workloads on demand, they had to switch interfaces to either request their administrator to create one or use detailed technical knowledge of DevOps to create it by themselves. This process was not only difficult and disruptive to their workflow, but also distracted data workers from focusing on their data preparation tasks. Consequently, although uneconomical, many customers kept persistent clusters running in anticipation of incoming workload regardless of active usage. Finally, monitoring and debugging Spark jobs running on Amazon EMR required setting up complex security rules and web proxies, adding significant friction to the data workers’ workflow.

Starting today, data workers can easily discover and connect to EMR clusters in single-account and cross-account configurations directly from Studio. Furthermore, you now have one-click access to the Spark UI to monitor and debug Spark jobs running on Amazon EMR right from Studio notebooks, which greatly simplifies your Spark debugging workflow. Finally, you can use the AWS Service Catalog to define and roll out preconfigured templates to select data workers to enable them to create EMR clusters right from Studio. You can fully control the organizational, security, compute, and networking guardrails to be adhered to when data workers use these templates. Data workers can visually browse through a set of templates made available to them, customize them for their specific workloads, create EMR clusters on demand, and stop them with just a few clicks in Studio. This feature considerably simplifies the data preparation workflow and enables you to more optimally use EMR clusters for interactive workloads from Studio.

In Part 1 of our series, we dive into the details of how DevOps administrators can use the AWS Service Catalog to define parameterized templates that data workers can use to create EMR clusters directly from the Studio interface. We provide an AWS CloudFormation template to create an AWS Service Catalog product for creating EMR clusters within an existing Amazon SageMaker domain, as well as a new CloudFormation template to stand up a SageMaker domain, Studio user profile, and Service Catalog product shared with that user so you can get started from scratch. As part of the solution, we utilize a single-click Spark UI interface to debug and monitor our ETL jobs. We use the transformed data to train and deploy an ML model using SageMaker training and hosting services.

As a follow-up, Part 2 provides a deep dive into cross-account setups. These multi-account setups are common amongst customers and are a best practice for many enterprise account setups, as mentioned in our AWS Well-Architected Framework.

Solution overview

We first describe how to communicate with Amazon EMR from Studio, as shown in the post Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks. In our solution, we utilize a SageMaker domain that has been configured with an elastic network interface through private VPC mode. That connected VPC is where we spin up our EMR clusters for this demo. For more information about the prerequisites, see our documentation.

The following diagram shows the complete user journey. A DevOps persona creates the Service Catalog product within a portfolio that is accessible to the Studio execution roles.

It’s important to note that you can use the full set of CloudFormation properties for Amazon EMR when creating templates that can be deployed through Studio. This means that you can enable Spot, auto scaling, and other popular configurations through your Service Catalog product.

You can parameterize the preset CloudFormation template (which creates the EMR cluster) so that end users can modify different aspects of the cluster to match their workloads. For example, the data scientist or data engineer may want to specify the number of core nodes on the cluster, and the creator of the template can specify AllowedValues to set guardrails.

The following template parameters give some examples of commonly used parameters:

"Parameters": {
    "EmrClusterName": {
      "Type": "String",
      "Description": "EMR cluster Name."
    },
    "CoreInstanceType": {
      "Type": "String",
      "Description": "Instance type of the EMR core nodes.",
      "Default": "m5.xlarge",
      "AllowedValues": [
        "m5.xlarge",
        "m3.2xlarge"
      ]
    },
    "CoreInstanceCount": {
      "Type": "String",
      "Description": "Number of core instances in the EMR cluster.",
      "Default": "2",
      "AllowedValues": [
        "2",
        "5",
        "10"
      ]
    },
    "EmrReleaseVersion": {
      "Type": "String",
      "Description": "The release version of EMR to launch.",
      "Default": "emr-5.33.1",
      "AllowedValues": [
        "emr-5.33.1",
        "emr-6.4.0"
      ]
    }
  }

For the product to be visible within the Studio interface, we need to set the following tag on the Service Catalog product:

sagemaker:studio-visibility:emr true
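
If you manage the product programmatically rather than through CloudFormation, a hedged boto3 sketch for adding this tag to an existing product follows; the product ID is a placeholder.

import boto3

# Tag an existing Service Catalog product so Studio surfaces it as an EMR template.
boto3.client("servicecatalog").update_product(
    Id="prod-xxxxxxxxxxxxx",   # placeholder product ID
    AddTags=[{"Key": "sagemaker:studio-visibility:emr", "Value": "true"}],
)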

Lastly, the CloudFormation template in the Service Catalog product must have the following mandatory stack parameters:

```
SageMakerProjectName:
  Type: String
  Description: Name of the project

SageMakerProjectId:
  Type: String
  Description: Service generated Id of the project
```

Both values for these parameters are automatically injected when the stack is launched, so you don’t need to fill them in. They’re part of the template because SageMaker projects are utilized as part of the integration between the Service Catalog and Studio.

The second part of the single-account user journey (as shown in the architecture diagram) is from the data worker’s perspective within Studio. As shown in the post Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks, Studio users can browse existing EMR clusters and seamlessly connect to them using Kerberos, LDAP, HTTP, or no-auth mechanisms. Now, you can also create new EMR clusters through provisioning of templates, as shown in the following architecture diagram.

For Studio users to browse the available clusters, we need to attach an AWS Identity and Access Management (IAM) policy that permits Amazon EMR discoverability. For more information, see our existing documentation.

Deploy resources with AWS CloudFormation

For this post, we’ve provided two CloudFormation stacks, found in our GitHub repository, to demonstrate the Studio and EMR capabilities.

The first stack provides an end-to-end CloudFormation template that stands up a private VPC, a SageMaker domain attached to that VPC, and a SageMaker user with visibility to the pre-created Service Catalog product.

The second stack is intended for users with existing Studio private VPC setups who want to utilize a CloudFormation stack to deploy a Service Catalog product and make it visible to an existing SageMaker user.

You will be charged for Studio and Amazon EMR resources used when you launch the following stacks. For more information, see Amazon SageMaker Pricing and Amazon EMR pricing.

Follow the instructions in the cleanup sections at the end of this post to make sure that you don’t continue to be charged for these resources.

To launch the end-to-end stack, choose the stack for your desired Region.

ap-northeast-1
ap-northeast-2
ap-south-1
ap-southeast-1
ca-central-1
eu-central-1
eu-north-1
eu-west-1
eu-west-2
eu-west-3
sa-east-1
us-east-1
us-east-2
us-west-1
us-west-2

This stack is intended to be a from-scratch setup, so the admin doesn’t need to input account-specific parameters to launch it. However, because our subsequent Amazon EMR stack uses the outputs of this stack, we need to provide a deterministic stack name so that it can be referenced. The preceding link provides the stack name expected by this demo, and it should not be modified.

After we launch the stack, we can see that our Studio domain has been created, and studio-user is attached to an execution role that was created with visibility to our Service Catalog product.

If you choose to run the end-to-end stack, skip the following existing domain information.

If you have an existing domain stack, launch the following stack in your preferred Region.

ap-northeast-1
ap-northeast-2
ap-south-1
ap-southeast-1
ca-central-1
eu-central-1
eu-north-1
eu-west-1
eu-west-2
eu-west-3
sa-east-1
us-east-1
us-east-2
us-west-1
us-west-2

Because this stack is intended for accounts with existing domains that are attached to a private subnet, the admin fills in the required parameters during the stack launch. This is intended to simplify the experience for downstream data workers, and we abstract this networking information away from them.

Again, because the subsequent Amazon EMR stack utilizes the parameters the admin inputs here, we need to provide a deterministic stack name so that they can be referenced. The preceding stack link provides the stack name as expected by this demo.

If you’re using the second stack with an existing domain and users, you need to complete one additional step to make sure the Spark UI functionality is available and that your user can browse EMR clusters and spin them up and down. Simply attach the following policy to the SageMaker execution role that you input as a parameter, providing the Region and account ID as needed:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:<region>:<account-id>:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowSagemakerProjectManagement",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateProject",
                "sagemaker:DeleteProject"
            ],
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:project/*"
        },
        {
            "Sid": "AllowEMRTemplateDiscovery",
            "Effect": "Allow",
            "Action": [
              "servicecatalog:SearchProducts"
            ],
            "Resource": "*"
        }
    ]
}

Review the AWS Service Catalog product

After you launch your stack, you can see that an IAM role was created as a launch constraint, which provisions our EMR cluster. Both stacks also generated the AWS Service Catalog product and the association to our Studio execution role.

On the list of AWS Service Catalog products, we see the product name, which is later visible from the Studio interface.

This product has a launch constraint that governs the role that creates the cluster.

Note that our product has been tagged appropriately for visibility within the Studio interface.

If we look into the template that was provisioned, we can see the CloudFormation template that initializes our cluster, creates the Hive tables, and loads them with the demo data.

Create an EMR cluster from Studio

After the Service Catalog product has been created in your account through the stack that fits your setup, we can continue the demonstration from the data worker’s perspective.

  1. Launch a Studio notebook.
  2. Under SageMaker resources, choose Clusters on the drop-down menu.
  3. Choose Create cluster.
  4. From the available templates, choose the provisioned template SageMaker Studio Domain No Auth EMR.
  5. Enter your desired configurable parameters and choose Create cluster.

You can now monitor the deployment on the Clusters management tab. As part of the template, our cluster instantiates Hive tables with some data that we can use as part of our example.

Connect to an EMR cluster from Studio

After your cluster has entered the Running/Waiting status, you can connect to the cluster in the same way as was described in the post Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks.

First, we clone our GitHub repo.

As of this writing, only a subset of kernels support connecting to an existing EMR cluster. For the full list of supported kernels, and information on building your own Studio images with connectivity capabilities, see our documentation. For this post, we use the SparkMagic kernel from the PySpark image and run the smstudio-pyspark-hive-sentiment-analysis.ipynb notebook from the repository.

For simplicity, the template that we deploy uses a no-auth authentication mechanism, but as shown in our previous post, this works seamlessly with Kerberos, LDAP, and HTTP auth as well.

After a connection is made, there is a hyperlink for the Spark UI, which we use to debug and monitor our demonstration. We dive into the technical details later in the post, but you can open this in a new tab now.

Next, we show the functionality from our previous post where we can query the newly instantiated tables using PySpark, write transformed data to Amazon Simple Storage Service (Amazon S3), and launch SageMaker training and hosting jobs all from the same smstudio-pyspark-hive-sentiment-analysis.ipynb notebook.

The following screenshots demonstrate preprocessing the data.

The following screenshots show the process of training the model.

The following screenshots demonstrate deploying the model.

Monitor and debug with the Spark UI

As mentioned before, the process for viewing the Spark UI has been greatly simplified: a presigned URL is generated at the time of connection to your cluster. Each presigned URL has a time to live of 5 minutes.

You can use this UI to monitor your Spark jobs, shuffle behavior, and more. For more information, see the documentation.
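
Under the hood, this uses the EMR persistent application UI APIs granted in the IAM policy earlier in this post. The following boto3 sketch shows roughly how such a URL is minted; the cluster ARN is a placeholder and the response handling is illustrative.

import boto3

emr = boto3.client("emr")

# Create (or reuse) a persistent app UI for the cluster.
app_ui = emr.create_persistent_app_ui(
    TargetResourceArn="arn:aws:elasticmapreduce:us-east-1:111111111111:cluster/j-3DD9ZR01DAU14"  # placeholder
)

# Mint a short-lived presigned URL for the Spark History Server (SHS) UI.
resp = emr.get_persistent_app_ui_presigned_url(
    PersistentAppUIId=app_ui["PersistentAppUIId"],
    PersistentAppUIType="SHS",
)
print(resp["PresignedURL"])  # valid for about 5 minutes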

Stop an EMR cluster from Studio

After we’re done with our analysis and model building, we can use the Studio interface to stop our cluster. Because this runs a delete stack operation under the hood, users can only stop clusters that were launched using provisioned Service Catalog templates, and can’t stop existing clusters that were created outside of Studio.

Clean up the end-to-end stack

If you deployed the end-to-end stack, complete the following steps to clean up resources deployed for this solution:

  1. Stop your cluster, as shown in the previous section.

This also deletes the S3 bucket, so you should copy the contents in the bucket to a backup location if you want to retain the data for later use.

  2. On the Studio console, choose your user name (studio-user).
  3. Delete all the apps listed under Apps by choosing Delete app.
  4. Wait until the status shows as Completed.

Next, you delete your Amazon Elastic File System (Amazon EFS) volume.

  5. On the Amazon EFS console, delete the file system that SageMaker created.

You can confirm it’s the correct volume by choosing the file system ID and confirming the tag is ManagedByAmazonSageMakerResource.

Finally, you delete the CloudFormation template.

  6. On the AWS CloudFormation console, choose Stacks.
  7. Select the stack you deployed for this solution.
  8. Choose Delete.

Clean up the existing domain stack

The second stack has a simpler cleanup because we’re leaving the Studio resources in place as they were prior to starting this tutorial.

  1. Stop your cluster as shown in the previous cleanup instructions.
  2. Remove the attached policy you added to the SageMaker execution role that permitted Amazon EMR browsing and PresignedURL access.
  3. On the AWS CloudFormation console, choose Stacks.
  4. Select the stack you deployed for this solution.
  5. Choose Delete.

Conclusion

In this post, we demonstrated a unified notebook-centric experience to create and manage EMR clusters, run analytics on those clusters, and train and deploy SageMaker models, all from the Studio interface. We also showed a one-click interface for debugging and monitoring Amazon EMR jobs through the Spark UI. We encourage you to try out this new functionality in Studio yourself, and check out Part 2 of this post, which dives deep into how data workers can discover, connect to, create, and stop clusters in a multi-account setup.


About the Authors

Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team to build it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using machine learning. In his free time, he likes photographing the amazing geology of the American Southwest.

Prateek Mehrotra is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions which simplify usability by abstracting away complexity. In his spare time, Prateek enjoys spending time with his family and likes to explore the world with them.

Sriharsha M S is an AI/ML specialist solutions architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.

Sean Morgan is a Senior ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open source contributor/maintainer and is the special interest group lead for TensorFlow Addons.

Ruchir Tewari is a Senior Solutions Architect specializing in security and is a member of the ML TFC. For several years he has helped customers build secure architectures for a variety of hybrid, big data and AI/ML applications. He enjoys spending time with family, music and hikes in nature.

Luna Wang is a UX designer at AWS who has a background in computer science and interaction design. She is passionate about building customer-obsessed products and solving complex technical and business problems by using design methods. She is now working with a cross-functional team to build a set of new capabilities for interactive ML in SageMaker Studio.

Read More

Improve the return on your marketing investments with intelligent user segmentation in Amazon Personalize

Today, we’re excited to announce intelligent user segmentation powered by machine learning (ML) in Amazon Personalize, a new way to deliver personalized experiences to your users and run more effective campaigns through your marketing channels.

Traditionally, user segmentation depends on demographic or psychographic information to sort users into predefined audiences. More advanced techniques look to identify common behavioral patterns in the customer journey (such as frequent site visits, recent purchases, or cart abandonment) using business rules to derive users’ intent. These techniques rely on assumptions about the users’ preferences and intentions that limit their scalability, don’t automatically learn from changing user behaviors, and don’t offer user experiences personalized for each user. User segmentation in Amazon Personalize uses ML techniques, developed and perfected at Amazon, to learn what is relevant to users. Amazon Personalize automatically identifies high propensity users without the need to develop and maintain an extensive and brittle catalog of rules. This means you can create more effective user segments that scale with your catalog and learn from your users’ changing behavior to deliver what matters to them.

Amazon Personalize enables developers to build personalized user experiences with the same ML technology used by Amazon with no ML expertise required. We make it easy for developers to build applications capable of delivering a wide array of personalization experiences. You can start creating user segments quickly with the Amazon Personalize API or AWS Management Console and only pay for what you use, with no minimum fees or upfront commitments. All data is encrypted to be private and secure, and is only used to create your user segments.

This post walks you through how to use Amazon Personalize to segment your users based on preferences for grocery products using an Amazon Prime Pantry dataset.

Overview of solution

We’re introducing two new recipes that segment your users based on their interest in different product categories, brands, and more. Our item affinity recipe (aws-item-affinity) identifies users based on their interest in the individual items in your catalog, such as a movie, song, or product. The item attribute affinity recipe (aws-item-attribute) identifies users based on the attributes of items in your catalog, such as genre or brand. This allows you to better engage users with your marketing campaigns and improve retention through targeted messaging.

The notebook that accompanies this post demonstrates how to use the aws-item-affinity and aws-item-attribute recipes to create user segments based on users’ preferences for grocery products in an Amazon Prime Pantry dataset. We use one dataset group that contains user-item interaction data and item metadata. We use these datasets to train solutions using the two recipes and create user segments in batch.
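
For orientation, creating segments in batch is done through a batch segment job after a solution version is trained. The following boto3 sketch shows the general shape of the call; the ARNs, S3 paths, and numResults value are illustrative placeholders.

import boto3

personalize = boto3.client("personalize")

# Submit a batch segment job: the input is a JSON-lines file of item IDs
# (for aws-item-affinity); the output lists high-propensity users per item.
personalize.create_batch_segment_job(
    jobName="pantry-item-affinity-segments",
    solutionVersionArn="arn:aws:personalize:us-west-2:111111111111:solution/item-affinity/version-id",  # placeholder
    numResults=2262,  # users to return per query (placeholder)
    jobInput={"s3DataSource": {"path": "s3://my-bucket/segment-input/items.json"}},  # placeholder
    jobOutput={"s3DataDestination": {"path": "s3://my-bucket/segment-output/"}},     # placeholder
    roleArn="arn:aws:iam::111111111111:role/PersonalizeS3AccessRole",                # placeholder
)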

To test the performance of the solution, we split the interactions data into a training set and test set. The Amazon Prime Pantry dataset has approximately 18 years of interaction data from August 9, 2000, to October 5, 2018, with approximately 1.7 million interactions. We hold out 5% of the most recent interactions and train on the remaining 95%. This results in a split where we use interactions from August 9, 2000, through February 1, 2018, to train the solution and use the remaining 8 months of interactions to simulate future activity as ground truth.

Results

When reproducing these tests in the notebook, your results may vary slightly. This is because, when training the solution, the parameters of the underlying models are randomly initialized.

Let’s first review the results by looking at a few examples. We ran queries on three items and identified 10 users with a high propensity to engage with each item. We then examine the users’ shopping histories to assess whether they would likely be interested in the queried product.

The following table shows the results of a segmentation query on gingerbread coffee, an item we might want to promote for the holiday season. Each row in the table shows the last three purchases of the 10 users returned from the query. Most of the users we identified are clearly coffee drinkers, having recently purchased coffee and coffee creamers. Interestingly, the item we queried on is a whole bean coffee, not a ground coffee. We see in the item histories that, where the information is available, the users have recently purchased whole bean coffee.

Gingerbread Coffee, 1 lb Whole Bean FlavorSeal Vacuum Bag: Bite into a freshly baked Gingerbread Coffee
USER_ID Last Three Purchases
A1H3ATRIQ098I7 Brew La La Red Velvet Cupcake Coffee Ola’s Exotic Super Premium Coffee Organic Uganda B Coffee Masters Gourmet Coffee
ANEDXRFDZDL18 Pepperidge Farm Goldfish Crackers Boston Baked Beans (1) 5.3 Oz Theater Box Sizecont Boost Simply Complete Nutritional Drink
APHFL4MDJRGWB Dunkin’ Donuts Original Blend Ground Coffee Coffee-Mate Coffee Mix Folgers Gourmet Selections Coconut Cream Pie Flavo
ANX42D33MNOVP The Coffee Fool Fool’s House American Don Francisco’s Hawaiian Hazelnut Don Francisco’s French Roast Coffee
A2NLJJVA0IEK2S Coffee Masters Flavored Coffee Lays 15pk Hickory Sticks Original (47g / 1.6oz per Albanese Confectionery Sugar Free Gummy Bears
A1GDEQIGFPRBNO Christopher Bean Coffee Flavored Ground Coffee Cameron’s French Vanilla Almond Whole Bean Coffee Cameron’s Coffee Roasted Whole Bean Coffee
A1MDO8RZCZ40B0 Master Chef Ground Coffee New England Ground Coffee Maxwell House Wake Up Roast Medium Coffee
A2LK2DENORQI8S The Bean Coffee Company Organic Holiday Bean (Vani Lola Savannah Angel Dust Ground New England Coffee Blueberry Cobbler
AGW1F5N8HV3AS New England Coffee Colombian Kirkland Signature chicken breast Lola Savannah Banana Nut Whole Bean
A13YHYM6FA6VJO Lola Savannah Triple Vanilla Whole Bean Lola Savannah Vanilla Cinnamon Pecan Whole Bean Pecan Maple Nut

The next table shows a segmentation query on hickory liquid smoke, a seasoning used for barbecuing and curing bacon. We see a number of different grocery products that might accompany barbecue in the users’ recent purchases: barbecue sauces, seasonings, and hot sauce. Two of the users recently purchased Prague Powder No. 1 Pink Curing Salt, a product also used for curing bacon. We may have missed these two users if we had relied on rules to identify people interested in grilling.

Wright’s Natural Hickory Seasoning Liquid Smoke, 128 Ounce This seasoning is produced by burning fresh cut hickory chips, then condensing the smoke into a liquid form.
USER_ID Last Three Purchases
A1MHK19QSCV8SY Hoosier Hill Farm Prague Powder No.1 Pink Curing S APPLE CIDER VINEGAR Fleischmann’s Instant Dry Yeast 1lb bagDry Yeast.M
A3G5P0SU1AW2DO Wright’s Natural Hickory Seasoning Liquid Smoke Eight O’Clock Whole Bean Coffee Kitchen Bouquet Browning and Seasoning Sauce
A2WW9T8EEI8NU4 Hidden Valley Dips Mix Creamy Dill .9 oz Packets ( Frontier Garlic Powder Wolf Chili Without Beans
A2TEJ1S0SK7ZT Black Tai Salt Co’s – (Food Grade) Himalayan Cryst Marukan Genuine Brewed Rice Vinegar Unseasoned Cheddar Cheese Powder
A3MPY3AGRMPCZL Wright’s Natural Hickory Seasoning Liquid Smoke San Francisco Bay OneCup Fog Chaser (120 Count) Si Kikkoman Soy Sauce
A2U77Z3Z7DC9T9 Food to Live Yellow Mustard Seeds (Kosher) 5 Pound 100 Sheets (6.7oz) Dried Kelp Seaweed Nori Raw Uns SB Oriental Hot Mustard Powder
A2IPDJISO5T6AX Angel Brand Oyster Sauce Bullhead Barbecue Sauce ONE ORGANIC Sushi Nori Premium Roasted Organic Sea
A3NDGGX7CWV8RT Frontier Mustard Seed Da Bomb Ghost Pepper HOT SaucesWe infused our hot Starwest Botanicals Organic Rosemary Leaf Whole
A3F7NO1Q3RQ9Y0 Yankee Traders Brand Whole Allspice Aji No Moto Ajinomoto Monosodium Glutamate Umami S Hoosier Hill Farm Prague Powder No.1 Pink Curing S
A3JKI7AWYSTILO Lalah’s Heated Indian Curry Powder 3 Lb LargeCurry Ducal Beans Black Beans with Cheese Emerald Nuts Whole Cashews

Our third example shows a segmentation query on a decoration used to top cakes. We see that the users identified are not only bakers, but are also clearly interested in decorating their baked goods. We see recent purchases like other cake toppers, edible decorations, and fondant (an icing used to sculpt cakes).

Letter C – Swarovski Crystal Monogram Wedding Cake Topper Letter, Jazz up your cakes with a sparkling monogram from our Sparkling collection! These single letter monograms are silver plated covered in crystal rhinestones and come in several sizes for your convenience.
USER_ID Last Three Purchases
A3RLEN577P4E3M The Republic Of Tea Alyssa’s Gluten Free Oatmeal Cookies – Pack of 4. Double Honey Filled Candies
AOZ0D3AGVROT5 Sea Green Disco Glitter Dust Christmas Green Disco Glitter Dust Baby Green Disco Glitter Dust
AC7O52PQ4HPYR Rhinestone Cake Topper Number 7 by otherThis delic Rhinestone Cake Topper Number 5This delicate and h Rhinestone Cake Topper Number 8 by otherThis delic
ALXKY9T83C4Z6 Heart Language of Love Bride and Groom White Weddi Bliss Cake Topper by Lenox (836473)It’s a gift tha First Dance Bride and Groom Wedding Cake TopperRom
A2XERDJ6I2K38U Egyptian Gold Luster Dust Kellogg’s Rice Krispies Treats Wilton Decorator Preferred Green Fondant
A1474SH2RB49MP Assorted Snowflake Sugar Decorations Disney Movie Darice VL3L Mirror Acrylic Initial Letter Cake Top Edible Snowflakes Sugar Decorations (15 pc).
A24E9YGY3V94N8 TOOGOO(R) Double-Heart Cake Topper Decoration for Custom Personalized Mr Mrs Wedding Cake Topper Wit Jacobs Twiglets 6 Pack Jacobs Twiglets are one of
A385P0YAW6U5J3 Tinksky Wedding Cake Topper God Gave Me You Sparkl Sweet Sixteen Cake Topper 16th Birthday Cake Toppe Catching the Big One DecoSet Cake DecorationReel i
A3QW120I2BY1MU Golda’s Kitchen Acetate Cake Collars – 4. Twinings of London English Breakfast Tea K-Cups fo Chefmaster by US Cake Supply 9-Ounce Airbrush Clea
A3DCP979LU7CTE DecoPac Heading for The Green DecoSet Cake TopperL Rhinestne Cake Topper Number 90This delicate and h Rhinestone Cake Topper Letter KThis delicate and h

These three examples make sense based on our editorial judgement, but to truly assess the performance of the recipe, we need to analyze more of the results. To do this broader assessment, we run the aws-item-affinity solution on 500 randomly selected items that appear in the test set to query a list of 2,262 users (1% of the users in the dataset). We then use the test set to assess how frequently the 2,262 users purchased the items during the test period. For comparison, we also assess how frequently 2,262 of the most active users purchased the items during the test period. The following table shows that the aws-item-affinity solution is four times better at identifying users that would purchase a given item.

Test Metrics

                              Hits     Recall
Personalize – Item Affinity   0.2880   0.1297
Active User Baseline          0.0720   0.0320

Although these results are informative, they’re not a perfect reflection of the performance of the recipe, because the user segmentation wasn’t used to promote the items that users later interacted with. The best way to measure performance is an online A/B test: running a marketing campaign on a list of users derived from the aws-item-affinity solution alongside a set of the most active users to measure the difference in engagement.

Conclusion

Amazon Personalize now makes it easy to run more intelligent user segmentation at scale, without having to maintain complex sets of rules or relying on broad assumptions about the preferences of your users. This allows you to better engage users with your marketing campaigns and improve retention through targeted messaging.

To learn more about Amazon Personalize, visit the product page.


About the Authors

Daniel Foley is a Senior Product Manager for Amazon Personalize. He is focused on building applications that leverage artificial intelligence to solve our customers’ largest challenges. Outside of work, Dan is an avid skier and hiker.

Debarshi Raha is a Senior Software Engineer for Amazon Personalize. He is passionate about building AI-based personalization systems at scale. In his spare time, he enjoys traveling and photography.

Ge Liu is an Applied Scientist at AWS AI Labs working on developing next generation recommender system for Amazon Personalize. Her research interests include Recommender System, Deep Learning, and Reinforcement Learning.

Haizhou Fu is a senior software engineer on the Amazon Personalize team working on designing and building recommendation systems and solutions for different industries. Outside of his work, he loves playing soccer, basketball and watching movies, reading and learning about physics, especially theories related to time and space.

Read More

Amazon Personalize announces recommenders optimized for Retail and Media & Entertainment

Today, we’re excited to announce the launch of personalized recommenders in Amazon Personalize that are optimized for retail and media and entertainment, making it even easier to personalize your websites, apps, and marketing campaigns. With this launch, we have drawn on Amazon’s rich experience creating unique personalized user experiences using machine learning (ML) to build recommenders for common personalization use cases. Use case-optimized recommendation solutions deliver personalized experiences for your users that consider the metrics that matter most to your business, the preferences of your individual users, and where your users are being served a personalized experience within the user journey. You can quickly integrate recommenders into any application via easy-to-use APIs.

This post walks you through the process of creating a recommender and getting recommendations for your users.

New personalized recommenders

To realize the true potential of personalization, businesses need to tailor their content to the user journey. For instance, an ecommerce website can recommend products to an existing customer based on their past browsing history (for example, a “Recommended for you” carousel) to drive greater engagement by providing item recommendations that are relevant to that user’s individual interests. On a product detail page, you can upsell products through a “Customers who viewed X also viewed” widget that uses the context of the product your customer is already engaging with. Finally, on the checkout page, a retailer may want to cross-sell products with “Frequently bought together” recommendations to increase average order value.

Similarly, a video-on-demand business can place a widget on their home page that shows the most popular recommendations to highlight the most viewed content across the world in the past week or month. You may want to build a “Because you watched this” widget after videos are watched to provide similar content with a greater chance of driving an increase in the time spent on your platform.

Each touchpoint requires intelligent personalization that understands the user, their current context, and their real-time interests or in-session preferences when delivering recommendations. Businesses today understand the need for and benefits of personalization, but building recommendation systems from the ground up requires significant investments of time and resources, in addition to extensive ML expertise.

With the launch of recommenders, you simply select the use cases you need from a library of recommenders within Amazon Personalize. “Most Viewed,” “Best Sellers”, “Frequently Bought Together,” “Customers who Viewed X also Viewed,” and “Recommended for you” are available for retail, and “Most Popular,” “Because you Watched X,” “More Like X,” and “Top Picks” are available for media and entertainment, with more to come. You select the recommenders for your use cases and Amazon Personalize does the heavy lifting of using ML to generate recommendations that you access through an easy-to-use API.

Recommenders learn from your users’ historical activity as well as their real-time interactions with items in your catalog to adjust to changing user preferences and deliver immediate value to your end users and business. Recommenders fully manage the lifecycle of maintaining and hosting personalized recommendation solutions. This accelerates the time needed to bring a solution to market and ensures that the recommendation solutions you deliver to production stay relevant for your users.

Amazon Personalize enables developers to build personalized user experiences with the same ML technology used by Amazon with no ML expertise required. We make it easy for developers to build applications capable of delivering a wide array of personalization experiences. You can start getting recommendations with Amazon Personalize quickly using a few simple API calls or some clicks on the AWS Management Console. You only pay for what you use, with no minimum fees or upfront commitments. All data is encrypted to be private and secure, and is only used to create your recommendations and segments.

Create a recommender

This section walks through the process of creating a recommender. The first step is to create a domain dataset group, which you can create by loading historic data in Amazon Simple Storage Service (Amazon S3) or from data gathered from real-time events.

Each dataset group can contain up to three datasets: Users, Items, and Interactions, with the Interactions dataset being mandatory to create a recommender. Datasets must adhere to the domain-specific schema in order to be used to create the domain-related recommenders.

In this post, we use the Amazon Prime Pantry dataset, which consists of purchase-related data for grocery items, to set up a retail recommender. We have uploaded the interactions dataset under the dataset group Prime-Pantry. You can monitor the status of the data upload through the dashboard for the Prime-Pantry dataset group on the Amazon Personalize console. After the data is imported successfully, choose Create recommenders.

As of this writing, Amazon Personalize offers five recipes for retail customers and four for media and entertainment customers.

The retail recipes are as follows:

  • Customers who viewed X also viewed – Recommendations for items that customers also viewed when they viewed a given item
  • Frequently bought together – Recommendations for items that customers buy together based on a specific item
  • Popular Items by Purchases – Popular items based on the items purchased by your users
  • Popular Items by Views – Popular items based on items viewed by your users
  • Recommended for you – Personalized recommendations for a given user ensuring that any items previously purchased are filtered out

The recipes for media and entertainment are as follows:

  • Most Popular – Most popular videos
  • Because you watched X – Videos similar to a given video watched by a user
  • More like X – Videos similar to a given video
  • Top picks for you – Personalized content recommendations for a specified user

The following screenshot shows how you can select recommenders based on your business needs and define the names of the recommenders. You use each recommender’s ARN to get recommendations when using the REST APIs. In this example, we create two recommenders. The first recommender is for the use case “Items frequently bought together” and is called PP-ItemsFrequentlyBoughtTogether. We also create a recommender for the use case “Popular Items by Purchases” called PP-PopularItemsByPurchases.

You can toggle Use default recommender configurations and Amazon Personalize automatically chooses the best configuration for the models underlying the recommenders. Then choose Create recommenders to start the model building process.

The time taken to create a recommender depends on the data and use cases selected. During this time, Amazon Personalize selects the optimal algorithm for each of the selected use cases, processes the underlying data, and trains a custom private model for your users. You can access all your recommenders and their current status on the Recommenders page.
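
You can also create recommenders programmatically. The following boto3 sketch is illustrative: the dataset group ARN is a placeholder, and we assume the retail “Frequently bought together” recipe ARN shown below, which you should verify against the current documentation.

import boto3

personalize = boto3.client("personalize")

# Create a "Frequently bought together" recommender in the Prime-Pantry dataset group.
response = personalize.create_recommender(
    name="PP-ItemsFrequentlyBoughtTogether",
    datasetGroupArn="arn:aws:personalize:us-west-2:111111111111:dataset-group/Prime-Pantry",  # placeholder
    recipeArn="arn:aws:personalize:::recipe/aws-ecomm-frequently-bought-together",
)
print(response["recommenderArn"])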

When the recommender’s status changes to Active, you can choose it to review relevant details about the recommender and test it. Testing helps check the recommendations before you integrate the recommender into your website or application.

The following image shows the test output for a particular item ID for the recommender PP-ItemsFrequentlyBoughtTogether.

At this step, you can also apply any filters on the recommendations; for example, to remove items purchased in the past.

Amazon Personalize also provides a recommender ARN in the details section, which you can use to produce recommendations through the Amazon Personalize REST APIs. The following code is an example of calling your API from Python for PP-ItemsFrequentlyBoughtTogether:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

# Use recommenderArn (not campaignArn) when calling a recommender.
get_recommendations_response = personalize_runtime.get_recommendations(
    recommenderArn="arn:aws:personalize:us-west-2:261294318658:recommender/PP-ItemsFrequentlyBoughtTogether",
    itemId=str(item_id),
)

This API call produces the same results as if testing the recommender via the console.

Your recommender is now ready to feed into your website or app and personalize the journey of each of your customers.

Conclusion

Amazon Personalize packages our rich experience creating unique personalized user experiences with ML at Amazon and offers our expertise as a fully managed service to developers looking to personalize their websites and apps. With the launch of use case optimized recommenders, we’re going one step further to tailor our learnings to the unique marketing needs of each industry and each individual business. Recommenders allow you to easily and swiftly access recommendations that are optimized for your specific use case. By understanding the unique context of your customers and their touchpoints, Amazon Personalize allows you to harness the raw power of ML to derive more value for your business and your users.

To learn more about Amazon Personalize, visit the product page.


About the Authors

Anchit Gupta is a Senior Product Manager for Amazon Personalize. She focuses on delivering products that make it easier to build machine learning solutions. In her spare time, she enjoys cooking, playing board/card games, and reading.

Hao Ding is an Applied Scientist at AWS AI Labs and is working on developing next generation recommender system for Amazon Personalize. His research interests include Recommender System, Deep Learning, and Graph Mining.

Pranav Agarwal is a Sr. Software Development Engineer with Amazon Personalize and works on architecting software systems and building AI-powered recommender systems at scale. Outside of work, he enjoys reading, running and has started picking up ice-skating.

Nghia Hoang is a Senior Machine Learning Scientist at AWS AI Labs working on developing personalized learning methods with applications to recommender systems. His research interests include Probabilistic Inference, Deep Generative Learning, Personalized Federated Learning and Meta Learning.

Read More

Build MLOps workflows with Amazon SageMaker projects, GitLab, and GitLab pipelines

Machine learning operations (MLOps) are key to effectively transitioning from an experimentation phase to production. The practice provides you the ability to create a repeatable mechanism to build, train, deploy, and manage machine learning models. To quickly adopt MLOps, you often require capabilities that use your existing toolsets and expertise. Projects in Amazon SageMaker give organizations the ability to easily set up and standardize developer environments for data scientists and CI/CD (continuous integration, continuous delivery) systems for MLOps engineers. With SageMaker projects, MLOps engineers or organization administrators can define templates that bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases. With projects, dependency management, code repository management, build reproducibility, and artifact sharing and management become easy for organizations to set up. SageMaker projects are provisioned using AWS Service Catalog products. Your organization can use project templates to provision projects for each of your users.

In this post, you use a custom SageMaker project template to incorporate CI/CD practices with GitLab and GitLab pipelines. You automate building a model using Amazon SageMaker Pipelines for data preparation, model training, and model evaluation. SageMaker projects build on Pipelines by implementing the model deployment steps and using SageMaker Model Registry, along with your existing CI/CD tooling, to automatically provision a CI/CD pipeline. In our use case, after the trained model is approved in the model registry, the model deployment pipeline is triggered via a GitLab pipeline.

Prerequisites

For this walkthrough, you should have the following prerequisites:

This post provides a detailed explanation of the SageMaker projects, GitLab, and GitLab pipelines integration. We review the code and discuss the components of the solution. To deploy the solution, refer to the GitHub repo, which provides step-by-step instructions for implementing an MLOps workflow using a SageMaker project template with GitLab and GitLab pipelines.

Solution overview

The following diagram shows the architecture we build using a custom SageMaker project template.

Let’s review the components of this architecture to understand the end-to-end setup:

  • GitLab – Acts as our code repository and enables CI/CD using GitLab pipelines. The custom SageMaker project template creates two repositories (model build and model deploy) in your GitLab account.
    • The first repository (model build) provides code to create a multi-step model building pipeline. This includes steps for data processing, model training, model evaluation, and conditional model registration based on accuracy. It trains a regression model using the XGBoost algorithm on the well-known UCI Machine Learning Abalone dataset.
    • The second repository (model deploy) contains the code and configuration files for model deployment, as well as the test scripts required to pass the quality benchmark. These are code stubs that must be defined for your use case.
    • Each repository also has a GitLab CI pipeline. The model build pipeline is automatically triggered and runs end to end whenever a new commit is made to the model build repository. The model deploy pipeline is triggered whenever a new model version is added to the model registry and the model version status is marked as Approved.
  • SageMaker Pipelines – Contains the directed acyclic graph (DAG) that includes data preparation, model training, and model evaluation.
  • Amazon S3 – An Amazon Simple Storage Service (Amazon S3) bucket stores the output model artifacts that are generated from the pipeline.
  • AWS Lambda – Two AWS Lambda functions are created, which we review in more detail later in this post:
    • One function seeds the code into your two GitLab repositories.
    • One function triggers the model deployment pipeline after the new model is registered in the model registry.
  • SageMaker Model Registry – Tracks the model versions and respective artifacts, including the lineage and metadata. A model package group is created that contains the group of related model versions. The model registry also manages the approval status of the model version for downstream deployment.
  • Amazon EventBridge – Monitors all changes to the model registry. It also contains a rule that triggers the Lambda function for the model deploy pipeline when the model package version state changes from PendingManualApproval to Approved in the model registry.
  • AWS CloudFormation – Deploys the model and creates the SageMaker endpoints when the model deploy pipeline is triggered by the approval of the trained model.
  • SageMaker hosting – Creates two HTTPS real-time endpoints to perform inference. The hosting option is configurable, for example, for batch transform or asynchronous inference. The staging endpoint is created when the model deploy pipeline is triggered by the approval of the trained model. This endpoint is used to evaluate the deployed model by confirming it’s generating predictions that meet our target accuracy requirements. When the model is ready to be deployed in production, a production endpoint is provisioned by manually starting the job in the GitLab model deploy pipeline.

Use the new MLOps project template with GitLab and GitLab pipelines

In this section, we review the parameters required for the MLOps project template (see the following screenshot). This template allows you to use GitLab pipelines as your orchestrator.

The template has the following parameters:

  • GitLab Server URL – The URL of the GitLab server, in https:// format. GitLab accounts under your organization may use a different customized server URL (domain). The server URL is required to authorize access through the python-gitlab API. You use the personal access token you created to grant the Lambda functions permission to push the seed code into your GitLab repositories. We discuss the Lambda function code in more detail in the next section.
  • Base URL for your GitLab Repositories – The URL for your GitLab account where the model build and deploy repositories are created, in the format https://<gitlab server>/<username> or https://<gitlab server>/<group>/<project>. You must create a personal access token under your GitLab user account in order to authenticate with the GitLab API.
  • Model Build Repository Name – The name of the repository (mlops-gitlab-project-seedcode-model-build) that contains the model build and training seed code.
  • Model Deploy Repository Name – The name of the repository (mlops-gitlab-project-seedcode-model-deploy) that contains the model deploy seed code.
  • GitLab Group ID – GitLab groups are important for managing access and permissions for projects. Enter the ID of the group that repositories are created for. In this example, we enter None, because we’re using the root group.
  • GitLab Secret Name (Secrets Manager) – The secret in AWS Secrets Manager contains the value of the GitLab personal access token that is used by the Lambda function to populate the seed code in the repositories. Enter the name of the secret you created in Secrets Manager.
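
For reference, the same values can also be supplied programmatically when provisioning the project through the boto3 CreateProject API. The following is a minimal sketch; the product and artifact IDs are placeholders, and the parameter key names are assumptions that must match the names defined in your template:

import boto3

sm = boto3.client('sagemaker')

# Provision the custom MLOps project from its Service Catalog product
response = sm.create_project(
    ProjectName='gitlab-mlops-demo',  # hypothetical project name
    ServiceCatalogProvisioningDetails={
        'ProductId': 'prod-xxxxxxxxxxxxx',             # placeholder
        'ProvisioningArtifactId': 'pa-xxxxxxxxxxxxx',  # placeholder
        'ProvisioningParameters': [
            {'Key': 'GitLabServer', 'Value': 'https://gitlab.com'},        # assumed key name
            {'Key': 'BaseUrl', 'Value': 'https://gitlab.com/<username>'},  # assumed key name
            {'Key': 'SecretName', 'Value': '<your-gitlab-token-secret>'},  # assumed key name
        ],
    },
)
print(response['ProjectArn'])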

Lambda functions code overview

As discussed earlier, we create two Lambda functions. The first function seeds the code into your GitLab repositories. The second function triggers your model deployment. Let’s review these functions in more detail.

Seedcodecheckin Lambda function

This function helps create the GitLab projects and repositories and pushes the code files into these repositories. These files are needed to set up the ML CI/CD pipelines.

The Secrets Manager secret is created to allow the function to retrieve the stored GitLab personal access token. This token allows the function to communicate with GitLab to create repositories and push the seed code. It also allows the environment variables to be passed in through the project.yml file. See the following code:

import os

import boto3

def get_secret():
    '''Retrieve the GitLab personal access token from Secrets Manager.'''
    secret_name = os.environ['SecretName']
    region_name = os.environ['Region']

    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    # Return the stored personal access token
    return client.get_secret_value(SecretId=secret_name)['SecretString']

The Secrets Manager secret was created when you ran the init.sh file earlier as part of the code repo prerequisites.

The deployment package for the function contains several libraries, including python-gitlab and cfn-response. Because our function's source code is packaged as a .zip file and interacts with AWS CloudFormation, we use cfn-response. We use the python-gitlab API and the AWS SDK for Python (Boto3) to download the seed code files and upload them to Amazon S3 to be pushed to our GitLab repositories. See the following code:

    # Configure SDKs for GitLab and S3
    gl = gitlab.Gitlab(gitlab_server_uri, private_token=gitlab_private_token)
    s3 = boto3.client('s3')
 
    model_build_filename = f'/tmp/{str(uuid.uuid4())}-model-build-seed-code.zip'
    model_deploy_filename = f'/tmp/{str(uuid.uuid4())}-model-deploy-seed-code.zip'
    model_build_directory = f'/tmp/{str(uuid.uuid4())}-model-build'
    model_deploy_directory = f'/tmp/{str(uuid.uuid4())}-model-deploy'

    # Get Model Build Seed Code from S3 for Gitlab Repo
    with open(model_build_filename, 'wb') as f:
        s3.download_fileobj(sm_seed_code_bucket, model_build_sm_seed_code_object_name, f)

    # Get Model Deploy Seed Code from S3 for Gitlab Repo
    with open(model_deploy_filename, 'wb') as f:
        s3.download_fileobj(sm_seed_code_bucket, model_deploy_sm_seed_code_object_name, f)

Two projects (repositories) are created in GitLab, and the seed code files are pushed into the repositories (model build and model deploy) using the python-gitlab API:

# Create the GitLab Project
    try:
        if group_id is None:
            build_project = gl.projects.create({'name': gitlab_project_name_build})
        else:
            build_project = gl.projects.create({'name': gitlab_project_name_build, 'namespace_id': int(group_id)})
    ....
    try:
        if group_id is None:
            deploy_project = gl.projects.create({'name': gitlab_project_name_deploy})
        else:
            deploy_project = gl.projects.create({'name': gitlab_project_name_deploy, 'namespace_id': int(group_id)})
    ....
    
    # Commit to the above created Repo all the files that were in the seed code Zip
    try:
        build_project.commits.create(build_data)
    except Exception as e:
        logging.error("Code could not be pushed to the model build repo.")
        logging.error(e)
        cfnresponse.send(event, context, cfnresponse.FAILED, response_data)
        return { 
            'message' : "GitLab seedcode checkin failed."
        }

    try:
        deploy_project.commits.create(deploy_data)
    except Exception as e:
        logging.error("Code could not be pushed to the model deploy repo.")
        logging.error(e)
        cfnresponse.send(event, context, cfnresponse.FAILED, response_data)
        return { 
            'message' : "GitLab seedcode checkin failed."
        }

The following screenshot shows the successful run of the Lambda function pushing the required seed code files into both projects in your GitLab account.

gitlab-trigger Lambda function

This Lambda function is triggered by EventBridge. The project.yml CloudFormation template contains an EventBridge rule that triggers the function when the model package state changes in the SageMaker model registry. See the following code:

ModelDeploySageMakerEventRule:
    Type: AWS::Events::Rule
    Properties:
      # Max length allowed: 64
      Name: !Sub sagemaker-${SageMakerProjectName}-${SageMakerProjectId}-event-rule # max: 10+33+15+5=63 chars
      Description: "Rule to trigger a deployment when SageMaker Model registry is updated with a new model package. For example, a new model package is registered with Registry"
      EventPattern:
        source:
          - "aws.sagemaker"
        detail-type:
          - "SageMaker Model Package State Change"
        detail:
          ModelPackageGroupName:
            - !Sub ${SageMakerProjectName}-${SageMakerProjectId}
      State: "ENABLED"
      Targets:
        -
          Arn: !GetAtt GitLabPipelineTriggerLambda.Arn
          Id: !Sub sagemaker-${SageMakerProjectName}-trigger

The following screenshot contains a subset of the function code that triggers the GitLab pipeline in the .gitlab-ci.yml file. It deploys the SageMaker model endpoints using the CloudFormation template endpoint-config-template.yml in your model deploy repository.
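
The full source is in the repo; the following is a minimal sketch of the core logic, assuming illustrative environment variable names (GitLabServer, GitLabToken, DeployProjectId) and a main branch rather than the template's exact configuration:

import os

import gitlab

def lambda_handler(event, context):
    # Deploy only when the model package transitions to Approved
    if event.get('detail', {}).get('ModelApprovalStatus') != 'Approved':
        return {'message': 'Model package not approved; skipping deployment.'}

    # In the actual function, the token comes from Secrets Manager via get_secret();
    # an environment variable stands in for it here
    gl = gitlab.Gitlab(os.environ['GitLabServer'], private_token=os.environ['GitLabToken'])

    # Start the model deploy pipeline on the repository's default branch
    project = gl.projects.get(os.environ['DeployProjectId'])
    pipeline = project.pipelines.create({'ref': 'main'})

    return {'message': f'Started GitLab pipeline {pipeline.id}'}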

To better understand the solution, review the entire code for the functions as needed.

GitLab and GitLab pipelines overview

As described earlier, GitLab plays a key role in this solution as the source code repository and the enabler of CI/CD pipelines. Let's look at our GitLab account to understand the components.

After the project is successfully created using our custom template in SageMaker projects (per the steps in the code repo), navigate to your GitLab account to see two new repositories. Each repository has a GitLab CI pipeline associated with it that runs as soon as the project is created.

The first run of each pipeline fails because GitLab doesn't yet have AWS credentials. For each repository, navigate to Settings, CI/CD, Variables. Create two new variables, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, with the credentials of the IAM identity that GitLab uses to access your AWS account.

Model build pipeline in GitLab

Let’s review the GitLab pipelines, starting with the model build pipeline. We define the pipelines in GitLab by creating the .gitlab-ci.yml file, where we define the various stages and related jobs. As shown in the following screenshot, this pipeline has only one stage (training) and the related script shows how a SageMaker pipeline file is triggered. (You can learn more about the SageMaker pipeline by exploring the pipeline.py file on GitHub.)

When this GitLab pipeline is triggered, it starts the Abalone SageMaker pipeline to build your model.
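
Under the hood, the training stage runs a Python entry point that upserts and starts the SageMaker pipeline. The following is a minimal sketch, assuming the seed code's get_pipeline helper in pipelines/abalone/pipeline.py (your copy's signature may differ):

import sagemaker

from pipelines.abalone.pipeline import get_pipeline  # module path in the seed code

# Use your SageMaker execution role; get_execution_role() works inside SageMaker
role = sagemaker.get_execution_role()

# Assemble the pipeline definition (data prep, training, evaluation, registration)
pipeline = get_pipeline(region='us-east-1', role=role)  # adjust the Region

# Create or update the pipeline definition, then start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start()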

When the model build is complete, you can locate this model in the model registry in SageMaker Studio.

Use this template for your custom use case

The model build repository contains code for preprocessing, training, and evaluating the model for the UCI Abalone dataset. You need to modify the files to address your custom use case.

  1. Navigate to the pipelines folder in your model build repository.

  2. Upload your dataset to an S3 bucket and replace the bucket URL in this section of your pipeline.py file (see the sketch after this list).

  3. Navigate to .gitlab-ci.yml and modify this section with the folder and file of your use case.
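
In the standard Abalone seed code, the dataset location is exposed as a pipeline parameter; the following sketch shows the kind of line to change in pipeline.py (the parameter name and S3 URI are illustrative):

from sagemaker.workflow.parameters import ParameterString

# Point this default at your own dataset's S3 URI
input_data = ParameterString(
    name="InputDataUrl",
    default_value="s3://your-bucket/path/to/your-dataset.csv",  # placeholder bucket and key
)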

Model deployment pipeline in GitLab

When the SageMaker pipeline that trains the model is complete, a model is added to the SageMaker model registry. If that model is approved, the GitLab pipeline in the model deploy repository starts and the model deployment process begins.

To approve the model in the model registry, complete the following steps:

  1. Choose the Components and registries icon.
  2. Choose Model registry, and choose (right-click) the model version.
  3. Choose Update model version status.
  4. Change the status from Pending to Approved.

This triggers the deploy pipeline.
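
You can also approve the model programmatically, which is handy for automation. A minimal sketch using the boto3 UpdateModelPackage API (the package ARN is a placeholder):

import boto3

sm = boto3.client('sagemaker')

# Approving the package emits the EventBridge event that starts the deploy pipeline
sm.update_model_package(
    ModelPackageArn='arn:aws:sagemaker:<region>:<account-id>:model-package/<group>/<version>',  # placeholder
    ModelApprovalStatus='Approved',
)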

Now, let’s review the .gitlab-ci.yml file in the model deploy repository. As shown in the following screenshot, this model deploy pipeline has four stages: build, staging deploy, test staging, and production deploy. This pipeline uses AWS CloudFormation to deploy the model and create the SageMaker endpoints.

A manual job in the GitLab pipeline handles model promotion from staging to production and creates an endpoint with the suffix -prod. When you manually start this job, it runs and, upon completion, deploys the production SageMaker endpoint.

To verify that the endpoints were created, navigate to the Endpoints page on the SageMaker console. You should see two endpoints: <model_name>-staging and <model_name>-prod.
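
You can run the same check from code; a quick sketch with boto3 (substitute your model name):

import boto3

sm = boto3.client('sagemaker')

# List the staging and production endpoints created by the deploy pipeline
for endpoint in sm.list_endpoints(NameContains='<model_name>')['Endpoints']:
    print(endpoint['EndpointName'], endpoint['EndpointStatus'])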

GitLab implementation patterns

In this section, we discuss two patterns for implementing GitLab: hosting GitLab in an Amazon Virtual Private Cloud (Amazon VPC), and using two-factor authentication.

Hosting GitLab in an Amazon VPC

You may choose to deploy GitLab in an Amazon VPC to use a private network and provide access to AWS resources. In this scenario, the Lambda functions must also be deployed in a VPC to access the GitLab API. We accomplish this by updating the project.yml file and the AWS Identity and Access Management (IAM) role AmazonSageMakerServiceCatalogProductsUseRole.

The IAM user that you used to create the VPC requires the following user permissions for Lambda to verify network resources:

  • ec2:DescribeSecurityGroups
  • ec2:DescribeSubnets
  • ec2:DescribeVpcs

The Lambda functions’ execution role requires the following permissions to create and manage network interfaces:

  • ec2:CreateNetworkInterface
  • ec2:DescribeNetworkInterfaces
  • ec2:DeleteNetworkInterface

To grant these permissions, attach the AWSLambdaVPCAccessExecutionRole managed policy, which includes them, to the role:

  1. On the IAM console, search for AmazonSageMakerServiceCatalogProductsUseRole.
  2. Choose Attach policies.
  3. Search for the AWSLambdaVPCAccessExecutionRole managed policy.
  4. Choose Attach policy.

Next, we update project.yml to configure the functions to deploy in a VPC by providing the VPC security groups and subnets.

    1. Add the subnet IDs and security group IDs to the Parameters section, for example:

      SubnetId1:
        Type: AWS::EC2::Subnet::Id
        Description: Subnet Id for Lambda function

      SubnetId2:
        Type: AWS::EC2::Subnet::Id
        Description: Subnet Id for Lambda function

      SecurityGroupId:
        Type: AWS::EC2::SecurityGroup::Id
        Description: Security Group Id for Lambda function to Execute
      

    2. Add the VpcConfig information under Properties for the GitSeedCodeCheckinLambda and GitLabPipelineTriggerLambda functions, for example:

      GitSeedCodeCheckinLambda:
        Type: 'AWS::Lambda::Function'
        Properties:
          Description: To trigger the codebuild project for the seedcode checkin
          .....
          VpcConfig:
            SecurityGroupIds:
              - !Ref SecurityGroupId
            SubnetIds:
              - !Ref SubnetId1
              - !Ref SubnetId2

Two-factor authentication enabled

If you enabled two-factor authentication on your GitLab account, you need to use your personal access token to clone the repositories in SageMaker Studio. The token requires the read_repository and write_repository flags. To clone the model build and model deploy repositories, enter the following commands:

git clone https://oauth2:PERSONAL_ACCESS_TOKEN@gitlab.com/username/gitlab-project-seedcode-model-build-<project-id>
git clone https://oauth2:PERSONAL_ACCESS_TOKEN@gitlab.com/username/gitlab-project-seedcode-model-deploy-<project-id>

Because you previously created a secret for your personal access token, no changes are required to the code when two-factor authentication is enabled.

Summary

In this post, we walked through using a custom SageMaker MLOps project template to automatically build and configure a CI/CD pipeline. This pipeline incorporated your existing CI/CD tooling with SageMaker features for data preparation, model training, model evaluation, and model deployment. In our use case, we focused on using GitLab and GitLab pipelines with SageMaker projects and pipelines. For more detailed implementation information, review the GitHub repo. Try it out and let us know if you have any questions in the comments section!


About the Authors

Kirit Thadaka is an ML Solutions Architect working in the Amazon SageMaker Service SA team. Prior to joining AWS, Kirit worked at early-stage AI startups, followed by time consulting in various roles in AI research, MLOps, and technical leadership.

Lauren Mullennex is a Solutions Architect based in Denver, CO. She works with customers to help them architect solutions on AWS. In her spare time, she enjoys hiking and cooking Hawaiian cuisine.

Indrajit Ghosalkar is a Sr. Solutions Architect at Amazon Web Services based in Singapore. He loves helping customers achieve their business outcomes through cloud adoption and realize their data analytics and ML goals through adoption of DataOps / MLOps practices and solutions. In his spare time, he enjoys playing with his son, traveling and meeting new people.
