Secure Amazon SageMaker Studio presigned URLs Part 1: Foundational infrastructure

You can access Amazon SageMaker Studio notebooks from the Amazon SageMaker console via AWS Identity and Access Management (IAM) authenticated federation from your identity provider (IdP), such as Okta. When a Studio user opens the notebook link, Studio validates the federated user’s IAM policy to authorize access, and generates and resolves the presigned URL for the user. Because the SageMaker console runs on an internet domain, this generated presigned URL is visible in the browser session. This presents an undesired threat vector for exfiltration and gaining access to customer data when proper access controls are not enforced.

Studio supports a few methods for enforcing access controls against presigned URL data exfiltration:

  • Client IP validation using the IAM policy condition aws:sourceIp
  • Client VPC validation using the IAM policy condition aws:sourceVpc
  • Client VPC endpoint validation using the IAM policy condition aws:sourceVpce

When you access Studio notebooks from the SageMaker console, the only available option is client IP validation with the IAM policy condition aws:sourceIp. However, many enterprises use browser traffic routing products such as Zscaler to ensure scale and compliance for workforce internet access. These traffic routing products generate their own source IPs, and the enterprise customer does not control that IP range. This makes it impossible for these enterprise customers to use the aws:sourceIp condition.

To use client VPC endpoint validation with the IAM policy condition aws:sourceVpce, the creation of a presigned URL needs to originate in the same customer VPC where Studio is deployed, and resolution of the presigned URL needs to happen via a Studio VPC endpoint on the customer VPC. For corporate network users, this resolution at access time can be accomplished with DNS forwarding rules (both in Zscaler and corporate DNS) that forward into the customer VPC endpoint through an Amazon Route 53 inbound resolver.

In this part, we discuss the overarching architecture for securing Studio presigned URLs and demonstrate how to set up the foundational infrastructure to create and launch a Studio presigned URL through your VPC endpoint over a private network without traversing the internet. This serves as the foundational layer for preventing data exfiltration by external bad actors who gain access to a Studio presigned URL, and for preventing unauthorized or spoofed corporate user access within a corporate environment.

Solution overview

The following diagram illustrates the overarching solution architecture.

The process includes the following steps:

  1. A corporate user authenticates via their IdP, connects to their corporate portal, and opens the Studio link from the corporate portal.
  2. The corporate portal application makes a private API call using an API Gateway VPC endpoint to create a presigned URL.
  3. The API Gateway VPC endpoint “create presigned URL” call is forwarded to the Route 53 inbound resolver on the customer VPC as configured in the corporate DNS.
  4. The VPC DNS resolver resolves it to the API Gateway VPC endpoint IP. Optionally, it looks up a private hosted zone record if it exists.
  5. The API Gateway VPC endpoint routes the request via the Amazon private network to the “create presigned URL API” running in the API Gateway service account.
  6. API Gateway invokes the create-pre-signedURL private API and proxies the request to the create-pre-signedURL Lambda function.
  7. The create-pre-signedURL Lambda call is invoked via the Lambda VPC endpoint.
  8. The create-pre-signedURL function runs in the service account, retrieves authenticated user context (user ID, Region, and so on), looks up a mapping table to identify the SageMaker domain and user profile identifier, makes a SageMaker CreatePresignedDomainUrl API call, and generates a presigned URL. The Lambda service role has the source VPC endpoint conditions defined for the SageMaker API and Studio.
  9. The generated presigned URL is resolved over the Studio VPC endpoint.
  10. Studio validates that the presigned URL is being accessed via the customer’s VPC endpoint defined in the policy, and returns the result.
  11. The Studio notebook is returned to the user’s browser session over the corporate network without traversing the internet.
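
Step 8 of this flow can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the function that the solution deploys: the USER_MAPPING table, its contents, the handler name, and the event shape (event["user_id"]) are hypothetical. Only the create_presigned_domain_url call and its AuthorizedUrl response field come from the SageMaker API as exposed by boto3.

```python
"""Sketch of a Lambda function that creates a Studio presigned URL."""

# Hypothetical mapping from corporate user ID to SageMaker domain and user profile.
# A real deployment would likely keep this in a datastore rather than in code.
USER_MAPPING = {
    "jdoe": {"domain_id": "d-exampledomain", "user_profile": "jdoe-profile"},
}


def create_presigned_url(sagemaker_client, user_id, expires_in_seconds=60):
    """Look up the caller's Studio profile and request a presigned URL for it."""
    entry = USER_MAPPING.get(user_id)
    if entry is None:
        raise KeyError(f"no Studio user profile mapped for user {user_id!r}")
    response = sagemaker_client.create_presigned_domain_url(
        DomainId=entry["domain_id"],
        UserProfileName=entry["user_profile"],
        ExpiresInSeconds=expires_in_seconds,
    )
    return response["AuthorizedUrl"]


def lambda_handler(event, context):
    """Entry point; assumes the authenticated user ID arrives as event['user_id']."""
    import boto3  # imported lazily so the helper above can be exercised without AWS

    client = boto3.client("sagemaker")
    return {"statusCode": 200, "body": create_presigned_url(client, event["user_id"])}
```

Passing the client in as an argument keeps the lookup logic separate from the AWS call, so the mapping behavior can be checked with a stand-in client before the function is deployed behind the VPC endpoint.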

The following sections walk you through how to implement this architecture to resolve Studio presigned URLs from a corporate network using VPC endpoints. We demonstrate a complete implementation by showing the following steps:

  1. Set up the foundational architecture.
  2. Configure the corporate app server to access a SageMaker presigned URL via a VPC endpoint.
  3. Set up and launch Studio from the corporate network.

Set up the foundational architecture

In the post Access an Amazon SageMaker Studio notebook from a corporate network, we demonstrated how to resolve a presigned URL domain name for a Studio notebook from a corporate network without traversing the internet. You can follow the instructions in that post to set up the foundational architecture, and then return to this post and proceed to the next step.

Configure the corporate app server to access a SageMaker presigned URL via a VPC endpoint

To enable accessing Studio from your internet browser, we set up an on-premises app server on Windows Server on the on-premises VPC public subnet. However, the DNS queries for accessing Studio are routed through the corporate (private) network. Complete the following steps to configure routing Studio traffic through the corporate network:

  1. Connect to your on-premises Windows app server.

  2. Choose Get Password then browse and upload your private key to decrypt your password.
  3. Use an RDP client and connect to the Windows Server using your credentials.
    Resolving Studio DNS from the Windows Server command prompt results in using public DNS servers, as shown in the following screenshot.
    Now we update Windows Server to use the on-premises DNS server that we set up earlier.
  4. Navigate to Control Panel, Network and Internet, and choose Network Connections.
  5. Right-click Ethernet and choose the Properties tab.
  6. Update Windows Server to use the on-premises DNS server.
  7. Now you update your preferred DNS server with your DNS server IP.
  8. Navigate to VPC and Route Tables and choose your STUDIO-ONPREM-PUBLIC-RT route table.
  9. Add a route to 10.16.0.0/16 with the target as the peering connection that we created during the foundational architecture setup.

Set up and launch Studio from your corporate network

To set up and launch Studio, complete the following steps:

  1. Download Chrome and launch the browser on this Windows instance.
    You may need to turn off Internet Explorer Enhanced Security Configuration to allow file downloads and then enable file downloads.
  2. In your local device Chrome browser, navigate to the SageMaker console and open the Chrome developer tools Network tab.
  3. Launch the Studio app and observe the Network tab for the authtoken parameter value, which includes the generated presigned URL along with the remote server address that the URL is routed to for resolution. In this example, the remote address 100.21.12.108 is one of the public addresses that the SageMaker DNS domain name d-h4cy01pxticj.studio.us-west-2.sagemaker.aws resolves to.
  4. Repeat these steps from the Amazon Elastic Compute Cloud (Amazon EC2) Windows instance that you configured as part of the foundational architecture.

This time, the remote address is not a public IP address; instead, it's the Studio VPC endpoint address 10.16.42.74.
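
If you want to check the observed remote address programmatically rather than by eye, one option is to test whether it falls inside the customer VPC CIDR. The following is a minimal sketch that assumes the VPC range is 10.16.0.0/16, as used for the peering route earlier; the function name is hypothetical.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.16.0.0/16")  # VPC range used in this post


def resolves_inside_vpc(remote_address, vpc_cidr=VPC_CIDR):
    """Return True when the remote address belongs to the VPC CIDR."""
    return ipaddress.ip_address(remote_address) in vpc_cidr


print(resolves_inside_vpc("100.21.12.108"))  # address seen from the local device: False
print(resolves_inside_vpc("10.16.42.74"))    # address seen from the EC2 Windows instance: True
```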

Conclusion

In this post, we demonstrated how to resolve a Studio presigned URL from a corporate network using Amazon private VPC endpoints without exposing the presigned URL resolution to the internet. This further secures your enterprise security posture for accessing Studio from a corporate network for building highly secure machine learning workloads on SageMaker. In part 2 of this series, we further extend this solution to demonstrate how to build a private API for accessing Studio with aws:sourceVPCE IAM policy validation and token authentication. Try out this solution and leave your feedback in the comments!


About the Authors

Ram Vittal is a machine learning solutions architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and Big Data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis and photography.

Neelam Koshiya is an enterprise solution architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.


Secure Amazon SageMaker Studio presigned URLs Part 2: Private API with JWT authentication

In part 1 of this series, we demonstrated how to resolve an Amazon SageMaker Studio presigned URL from a corporate network using Amazon private VPC endpoints without traversing the internet. In this post, we build on that solution to demonstrate how to build a private API with Amazon API Gateway as a proxy interface to generate and access Amazon SageMaker presigned URLs. Furthermore, we add a guardrail to ensure presigned URLs are only generated and accessed for the authenticated end-user within the corporate network.

Solution overview

The following diagram illustrates the architecture of the solution.

The process includes the following steps:

  1. In the Amazon Cognito user pool, first set up a user with the name matching their Studio user profile and register Studio as the app client in the user pool.
  2. The user federates from their corporate identity provider (IdP) and authenticates with the Amazon Cognito user pool for accessing Studio.
  3. Amazon Cognito returns a token to the user authorizing access to the Studio application.
  4. The user invokes the createStudioPresignedUrl API on API Gateway along with a token in the header.
  5. API Gateway invokes a custom AWS Lambda authorizer and validates the token.
  6. When the token is valid, Amazon Cognito returns an access grant policy with the Studio user profile ID to API Gateway.
  7. API Gateway invokes the createStudioPresignedUrl Lambda function to create the Studio presigned URL.
  8. The createStudioPresignedUrl function creates a presigned URL using the SageMaker API VPC endpoint and returns it to the caller.
  9. The user accesses the presigned URL from their corporate network, which resolves over the Studio VPC endpoint.
  10. The function’s AWS Identity and Access Management (IAM) policy makes sure that the presigned URL creation and access are performed via VPC endpoints.

The following sections walk you through solution deployment, configuration, and validation for the API Gateway private API for creating and resolving a Studio presigned URL from a corporate network using VPC endpoints.

  1. Deploy the solution
  2. Configure the Amazon Cognito user
  3. Authenticating the private API for the presigned URL using a JSON Web Token
  4. Configure the corporate DNS server for accessing the private API
  5. Test the API Gateway private API for a presigned URL from the corporate network
  6. Pre-Signed URL Lambda Auth Policy
  7. Cleanup

Deploy the solution

You can deploy the solution through either the AWS Management Console or the AWS Serverless Application Model (AWS SAM).

To deploy the solution via the console, launch the following AWS CloudFormation template in your account by choosing Launch Stack. It takes approximately 10 minutes for the CloudFormation stack to complete.

To deploy the solution using AWS SAM, you can find the latest code in the aws-security GitHub repository, where you can also contribute to the sample code. The following commands show how to deploy the solution using the AWS SAM CLI. If not currently installed, install the AWS SAM CLI.

  1. Clone the repository at https://github.com/aws-samples/secure-sagemaker-studio-presigned-url.
  2. After you clone the repo, navigate to the source and run the following code:
    sam deploy --guided

Configure the Amazon Cognito user

To configure your Amazon Cognito user, complete the following steps:

  1. Create an Amazon Cognito user with the same name as a SageMaker user profile:
    aws cognito-idp admin-create-user --user-pool-id <user_pool_id> --username <sagemaker_username>

  2. Set the user password:
    aws cognito-idp admin-set-user-password --user-pool-id <user_pool_id> --username <sagemaker_username> --password <password> --permanent

  3. Get an access token:
    aws cognito-idp initiate-auth --auth-flow USER_PASSWORD_AUTH --client-id <cognito_app_client_id> --auth-parameters USERNAME=<sagemaker_username>,PASSWORD=<password>
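
The initiate-auth command prints a JSON document, and the access token is found under AuthenticationResult.AccessToken. The following minimal sketch pulls the token out of that output and decodes the JWT payload for inspection only. It does not verify the token signature, which is the authorizer's job. The demo token and the user name jdoe are made up for illustration.

```python
import base64
import json


def access_token_from_response(response_json):
    """Extract the access token from `aws cognito-idp initiate-auth` JSON output."""
    return json.loads(response_json)["AuthenticationResult"]["AccessToken"]


def decode_jwt_payload(token):
    """Decode the (unverified) payload segment of a JWT for inspection only."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def _b64(obj):
    """Helper for the demo: base64url-encode a JSON object without padding."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Made-up token and response, shaped like the CLI output, for demonstration only.
demo_token = ".".join([_b64({"alg": "none"}), _b64({"username": "jdoe"}), "signature"])
demo_response = json.dumps({"AuthenticationResult": {"AccessToken": demo_token}})

token = access_token_from_response(demo_response)
print(decode_jwt_payload(token)["username"])  # jdoe
```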

Authenticating the private API for the presigned URL using a JSON Web Token

When you deployed a private API for creating a SageMaker presigned URL, you added a guardrail that prevents anyone outside the corporate network and VPC endpoint from accessing the presigned URL. However, without another control on the private API within the corporate network, any internal user would be able to pass unauthenticated parameters for the SageMaker user profile and access any SageMaker app.

To mitigate this issue, we propose passing a JSON Web Token (JWT) for the authenticated caller to the API Gateway and validating that token with a JWT authorizer. There are multiple options for implementing an authorizer for the private API Gateway, using either a custom Lambda authorizer or Amazon Cognito.

With a custom Lambda authorizer, you can embed a SageMaker user profile name in the returned policy. This prevents any users within the corporate network from being able to send any SageMaker user profile name for creating a presigned URL that they’re not authorized to create. We use Amazon Cognito to generate our tokens and a custom Lambda authorizer to validate and return the appropriate policy. (For more information, refer to Building fine-grained authorization using Amazon Cognito, API Gateway, and IAM). The Lambda authorizer uses the Amazon Cognito user name as the user profile name.
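
To make the idea concrete, the following is a minimal sketch of the response a Lambda authorizer could return once the token has been verified. It is not the authorizer shipped with the solution: the function name, the context key userProfileName, and the assumption that the user profile name is the last path segment of the API resource are all illustrative. The response shape (principalId, policyDocument, context) is the standard Lambda authorizer output.

```python
def build_authorizer_response(username, method_arn):
    """Allow the caller to invoke only the API path for their own user profile.

    method_arn has the form
    arn:aws:execute-api:<region>:<account-id>:<api-id>/<stage>/<METHOD>/<path>.
    """
    arn_prefix, stage, method = method_arn.split("/")[:3]
    allowed_resource = f"{arn_prefix}/{stage}/{method}/{username}"
    return {
        "principalId": username,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow",
                    "Resource": allowed_resource,
                }
            ],
        },
        # Passed through to the backend integration as authorizer context.
        "context": {"userProfileName": username},
    }
```

Because the returned policy names only the caller's own path, a request for another user's profile does not match the allowed resource and API Gateway rejects it.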

If you’re unable to use Amazon Cognito, you can develop a custom application to authenticate and pass end-user tokens to the Lambda authorizer. For more information, refer to Use API Gateway Lambda authorizers.

Configure the corporate DNS server for accessing the private API

To configure your corporate DNS server, complete the following steps:

  1. On the Amazon Elastic Compute Cloud (Amazon EC2) console, choose your on-premises DNSA EC2 instance and connect via Systems Manager Session Manager.
  2. Add a zone record in the /etc/named.conf file for resolving to the API Gateway’s DNS name via your Amazon Route 53 inbound resolver, as shown in the following code:
    zone "zxgua515ef.execute-api.<region>.amazonaws.com" {
      type forward;
      forward only;
      forwarders { 10.16.43.122; 10.16.102.163; };
    };

  3. Restart the named service using the following command:
    sudo service named restart
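
If you manage several API Gateway endpoints, you may prefer to generate the zone block instead of editing it by hand. The following minimal sketch renders the same forward-only zone shown above; the function name is hypothetical, and the zone name and forwarder IPs are whatever your API Gateway DNS name and Route 53 inbound resolver addresses happen to be.

```python
def render_forward_zone(zone_name, forwarders):
    """Render a BIND forward-only zone block for /etc/named.conf."""
    forwarder_list = " ".join(f"{ip};" for ip in forwarders)
    return (
        f'zone "{zone_name}" {{\n'
        "  type forward;\n"
        "  forward only;\n"
        f"  forwarders {{ {forwarder_list} }};\n"
        "};\n"
    )


print(render_forward_zone(
    "zxgua515ef.execute-api.us-west-2.amazonaws.com",  # Region chosen for illustration
    ["10.16.43.122", "10.16.102.163"],
))
```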

Validate requesting a presigned URL from the API Gateway private API for authorized users

In a real-world scenario, you would implement a front-end interface that passes the appropriate Authorization headers for authenticated and authorized resources, using either a custom solution or AWS Amplify. For brevity, the following steps use Postman to quickly validate that the solution we deployed restricts requesting the presigned URL for an internal user unless that user is authorized to do so.

To validate the solution with Postman, complete the following steps:

  1. Install Postman on the WINAPP EC2 instance, following the Postman installation instructions.
  2. Open Postman and add the access token to your Authorization header:
    Authorization: Bearer <access token>

  3. Modify the API Gateway URL to access it from your internal EC2 instance:
    1. Add the VPC endpoint into your API Gateway URL:
      https://<API-G-ID>-<VPCE-ID>.execute-api.<region>.amazonaws.com/dev/EMPLOYEE_ID

    2. Add the Host header with a value of your API Gateway URL:
      <API-G-ID>.execute-api.<region>.amazonaws.com

    3. First, change the EMPLOYEE_ID to your Amazon Cognito user and SageMaker user profile name. Make sure you receive an authorized presigned URL.
    4. Then change the EMPLOYEE_ID to a user that is not yours and make sure you receive an access failure.
  4. On the Amazon EC2 console, choose your on-premises WINAPP instance and connect via your RDP client.
  5. Open a Chrome browser and navigate to your authorized presigned URL to launch Studio.

Studio launches over the VPC endpoint, with the remote address shown as the Studio VPC endpoint IP.

If the presigned URL is accessed outside of the corporate network, the resolution fails because the IAM policy condition for the presigned URL enforces creation and access from a VPC endpoint.
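
The Postman request described in the preceding steps can also be expressed in code. The following minimal sketch builds, but does not send, an equivalent request with the Python standard library. The function name, the GET method, and the sample identifiers are assumptions for illustration; the URL shape, the Host header, and the Bearer token header follow the steps above.

```python
import urllib.request


def build_presigned_url_request(api_id, vpce_id, region, stage, user_profile, access_token):
    """Build (but do not send) the request for the private API through its VPC endpoint."""
    api_host = f"{api_id}.execute-api.{region}.amazonaws.com"
    url = f"https://{api_id}-{vpce_id}.execute-api.{region}.amazonaws.com/{stage}/{user_profile}"
    headers = {
        "Host": api_host,  # the API Gateway hostname without the VPC endpoint ID
        "Authorization": f"Bearer {access_token}",
    }
    # GET is an assumption; use whichever method your API resource defines.
    return urllib.request.Request(url, headers=headers, method="GET")


request = build_presigned_url_request(
    "abc123", "vpce-0123456789abcdef0", "us-west-2", "dev", "jdoe", "<access token>"
)
print(request.full_url)
```

From a host inside the corporate network, passing the request to urllib.request.urlopen would send it; from outside, name resolution or the endpoint policy is expected to block it.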

Pre-Signed URL Lambda Auth Policy

The solution created the following auth policy for the Lambda function that generates the presigned URL for accessing SageMaker Studio.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Condition": {
                "IpAddress": {
                    "aws:VpcSourceIp": "10.16.0.0/16"
                }
            },
            "Action": "sagemaker:CreatePresignedDomainUrl",
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:user-profile/*/*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "IpAddress": {
                    "aws:SourceIp": "192.168.10.0/24"
                }
            },
            "Action": "sagemaker:CreatePresignedDomainUrl",
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:user-profile/*/*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:sourceVpce": [
                        "vpce-sm-api-xx",
                        "vpce-sm-api-yy"
                    ]
                }
            },
            "Action": "sagemaker:CreatePresignedDomainUrl",
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:user-profile/*/*",
            "Effect": "Allow"
        }
    ]
}

The preceding policy enforces that the Studio presigned URL is both generated and accessed via one of these three entry points:

  1. aws:VpcSourceIp as your AWS VPC CIDR
  2. aws:SourceIp as your corporate network CIDR
  3. aws:sourceVpce as your SageMaker API VPC endpoints
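
If you need to produce this policy for several accounts or networks, a small generator can help keep the three statements consistent. The following is a minimal sketch with a hypothetical function name; the condition keys, action, and resource pattern are taken from the policy shown above, and the sample account ID is a placeholder.

```python
import json


def build_presigned_url_policy(region, account_id, vpc_cidr, corporate_cidr, vpce_ids):
    """Build a policy with one Allow statement per permitted entry point."""
    resource = f"arn:aws:sagemaker:{region}:{account_id}:user-profile/*/*"
    conditions = [
        {"IpAddress": {"aws:VpcSourceIp": vpc_cidr}},         # AWS VPC CIDR
        {"IpAddress": {"aws:SourceIp": corporate_cidr}},      # corporate network CIDR
        {"StringEquals": {"aws:sourceVpce": list(vpce_ids)}}, # SageMaker API VPC endpoints
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:CreatePresignedDomainUrl",
                "Resource": resource,
                "Condition": condition,
            }
            for condition in conditions
        ],
    }


policy = build_presigned_url_policy(
    "us-west-2", "111122223333", "10.16.0.0/16", "192.168.10.0/24",
    ["vpce-sm-api-xx", "vpce-sm-api-yy"],
)
print(json.dumps(policy, indent=4))
```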

Cleanup

To avoid incurring ongoing charges, delete the CloudFormation stacks you created. Alternatively, if you deployed the solution using AWS SAM, authenticate to the AWS account where the solution was deployed and run sam delete.

Conclusion

In this post, we demonstrated how to access Studio using a private API Gateway from a corporate network using Amazon private VPC endpoints, preventing access to presigned URLs outside the corporate network, and securing the API Gateway with a JWT authorizer using Amazon Cognito and custom Lambda authorizers.

Try out this solution, experiment with integrating it into your corporate portal, and leave your feedback in the comments!


About the Authors

Ram Vittal is a machine learning solutions architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and Big Data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis, photography, and action movies.

Jonathan Nguyen is a Shared Delivery Team Senior Security Consultant at AWS. His background is in AWS Security with a focus on Threat Detection and Incident Response. Today, he helps enterprise customers develop a comprehensive AWS Security strategy, deploy security solutions at scale, and train customers on AWS Security best practices.

Chris Childers is a Cloud Infrastructure Architect in Professional Services at AWS. He works with AWS customers to design and automate their cloud infrastructure and improve their adoption of DevOps culture and processes.


Use a custom image to bring your own development environment to RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench in the cloud. You can quickly launch the familiar RStudio integrated development environment (IDE), and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. RStudio on SageMaker already comes with a built-in image preconfigured with R programming and data science tools; however, you often need to customize your IDE environment. Starting today, you can bring your own custom image with packages and tools of your choice, and make them available to all the users of RStudio on SageMaker in a few clicks.

Bringing your own custom image has several benefits. You can standardize and simplify the getting started experience for data scientists and developers by providing a starter image, preconfigure the drivers required for connecting to data stores, or pre-install specialized data science software for your business domain. Furthermore, organizations that have previously hosted their own RStudio Workbench may have existing containerized environments that they want to continue to use in RStudio on SageMaker.

In this post, we share step-by-step instructions to create a custom image and bring it to RStudio on SageMaker using the AWS Management Console or AWS Command Line Interface (AWS CLI). You can get your first custom IDE environment up and running in a few simple steps. For more information on the content discussed in this post, refer to Bring your own RStudio image.

Solution overview

When a data scientist starts a new session in RStudio on SageMaker, a new on-demand ML compute instance is provisioned and a container image that defines the runtime environment (operating system, libraries, R versions, and so on) is run on the ML instance. You can provide your data scientists multiple choices for the runtime environment by creating custom container images and making them available on the RStudio Workbench launcher, as shown in the following screenshot.

The following diagram describes the process to bring your custom image. First you build a custom container image from a Dockerfile and push it to a repository in Amazon Elastic Container Registry (Amazon ECR). Next, you create a SageMaker image that points to the container image in Amazon ECR, and attach that image to your SageMaker domain. This makes the custom image available for launching a new session in RStudio.

Prerequisites

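To implement this solution, you must have the following prerequisites:

  • An RStudio on SageMaker domain
  • IAM policies to interact with Amazon ECR
  • A supported AWS CLI version

We provide more details on each in this section.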

RStudio on SageMaker domain

If you have an existing SageMaker domain with RStudio enabled prior to April 7, 2022, you must delete and recreate the RStudioServerPro app under the user profile name domain-shared to get the latest updates for bring your own custom image capability. The AWS CLI commands are as follows. Note that this action interrupts RStudio users on SageMaker.

aws sagemaker delete-app \
    --domain-id <sagemaker-domain-id> \
    --app-type RStudioServerPro \
    --app-name default \
    --user-profile-name domain-shared
aws sagemaker create-app \
    --domain-id <sagemaker-domain-id> \
    --app-type RStudioServerPro \
    --app-name default \
    --user-profile-name domain-shared

If this is your first time using RStudio on SageMaker, follow the step-by-step setup process described in Get started with RStudio on Amazon SageMaker, or run the following AWS CloudFormation template to set up your first RStudio on SageMaker domain. If you already have a working RStudio on SageMaker domain, you can skip this step.

The following RStudio on SageMaker CloudFormation template requires an RStudio license approved through AWS License Manager. For more about licensing, refer to RStudio license. Also note that only one SageMaker domain is permitted per AWS Region, so you’ll need to use an AWS account and Region that doesn’t have an existing domain.

  1. Choose Launch Stack.
    The link takes you to the us-east-1 Region, but you can change to your preferred Region.
  2. In the Specify template section, choose Next.
  3. In the Specify stack details section, for Stack name, enter a name.
  4. For Parameters, enter a SageMaker user profile name.
  5. Choose Next.
  6. In the Configure stack options section, choose Next.
  7. In the Review section, select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.
  8. When the stack status changes to CREATE_COMPLETE, go to the Control Panel on the SageMaker console to find the domain and the new user.

IAM policies to interact with Amazon ECR

To interact with your private Amazon ECR repositories, you need the following IAM permissions in the IAM user or role you’ll use to build and push Docker images:

{ 
    "Version":"2012-10-17", 
    "Statement":[ 
        {
            "Sid": "VisualEditor0",
            "Effect":"Allow", 
            "Action":[ 
                "ecr:CreateRepository", 
                "ecr:BatchGetImage", 
                "ecr:CompleteLayerUpload", 
                "ecr:DescribeImages", 
                "ecr:DescribeRepositories", 
                "ecr:UploadLayerPart", 
                "ecr:ListImages", 
                "ecr:InitiateLayerUpload", 
                "ecr:BatchCheckLayerAvailability", 
                "ecr:PutImage" 
            ], 
            "Resource": "*" 
        }
    ]
}

To initially build from a public Amazon ECR image as shown in this post, you need to attach the AWS-managed AmazonElasticContainerRegistryPublicReadOnly policy to your IAM user or role as well.

To build a Docker container image, you can use either a local Docker client or the SageMaker Docker Build CLI tool from a terminal within RStudio on SageMaker. For the latter, follow the prerequisites in Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks to set up the IAM permissions and CLI tool.

AWS CLI versions

There are minimum version requirements for the AWS CLI tool to run the commands mentioned in this post. Make sure to upgrade AWS CLI on your terminal of choice:

  • AWS CLI v1 >= 1.23.6
  • AWS CLI v2 >= 2.6.2
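
To check an installed AWS CLI against these minimums, you can parse the output of aws --version, which begins with aws-cli/<version>. The following is a minimal sketch with hypothetical function names; the sample version strings are illustrative.

```python
import re

MINIMUM_VERSIONS = {1: (1, 23, 6), 2: (2, 6, 2)}  # minimums listed in this post


def parse_cli_version(version_output):
    """Extract (major, minor, patch) from `aws --version` output."""
    match = re.search(r"aws-cli/(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version output: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output):
    """Return True when the installed AWS CLI meets the minimum for its major version."""
    version = parse_cli_version(version_output)
    minimum = MINIMUM_VERSIONS.get(version[0])
    return minimum is not None and version >= minimum


print(meets_minimum("aws-cli/2.6.2 Python/3.9.11 Linux/5.10 exe/x86_64"))  # True
print(meets_minimum("aws-cli/1.22.0 Python/3.8.10 Linux/5.10"))            # False
```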

Prepare a Dockerfile

You can customize your runtime environment in RStudio in a Dockerfile. Because the customization depends on your use case and requirements, we show you the essentials and the most common customizations in this example. You can download the full sample Dockerfile.

Install RStudio Workbench session components

The most important software to install in your custom container image is RStudio Workbench. We download from the public S3 bucket hosted by RStudio PBC. There are many version releases and OS distributions for use. The version of the installation needs to be compatible with the RStudio Workbench version used in RStudio on SageMaker, which is 1.4.1717-3 at the time of writing. The OS (argument OS in the following snippet) needs to match the base OS used in the container image. In our sample Dockerfile, the base image we use is Amazon Linux 2 from an AWS-managed public Amazon ECR repository. The compatible RStudio Workbench OS is centos7.

FROM public.ecr.aws/amazonlinux/amazonlinux
...
ARG RSW_VERSION=1.4.1717-3
ARG RSW_NAME=rstudio-workbench-rhel
ARG OS=centos7
ARG RSW_DOWNLOAD_URL=https://s3.amazonaws.com/rstudio-ide-build/server/${OS}/x86_64
RUN RSW_VERSION_URL=`echo -n "${RSW_VERSION}" | sed 's/+/-/g'` \
    && curl -o rstudio-workbench.rpm ${RSW_DOWNLOAD_URL}/${RSW_NAME}-${RSW_VERSION_URL}-x86_64.rpm \
    && yum install -y rstudio-workbench.rpm

You can find all the OS release options with the following command:

aws s3 ls s3://rstudio-ide-build/server/
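
To see which file the Dockerfile snippet above downloads for a given version and OS, you can compose the URL the same way the build arguments do. The following is a minimal sketch with a hypothetical function name; the defaults mirror the ARG values shown in this post.

```python
def rsw_download_url(version="1.4.1717-3", os_name="centos7", package="rstudio-workbench-rhel"):
    """Compose the RPM URL the same way the Dockerfile snippet does."""
    base = f"https://s3.amazonaws.com/rstudio-ide-build/server/{os_name}/x86_64"
    version_in_url = version.replace("+", "-")  # mirrors sed 's/+/-/g'
    return f"{base}/{package}-{version_in_url}-x86_64.rpm"


print(rsw_download_url())
```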

Install R (and versions of R)

The runtime for your custom RStudio container image needs at least one version of R. We can first install a version of R and make it the default R by creating soft links to /usr/local/bin/:

# Install main R version
ARG R_VERSION=4.1.3
RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-${R_VERSION}-1-1.x86_64.rpm && \
    yum install -y R-${R_VERSION}-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-${R_VERSION}-1-1.x86_64.rpm

RUN ln -s /opt/R/${R_VERSION}/bin/R /usr/local/bin/R && \
    ln -s /opt/R/${R_VERSION}/bin/Rscript /usr/local/bin/Rscript

Data scientists often need multiple versions of R so that they can easily switch between projects and code base. RStudio on SageMaker supports easy switching between R versions, as shown in the following screenshot.

RStudio on SageMaker automatically scans and discovers versions of R in the following directories:

/usr/lib/R
/usr/lib64/R
/usr/local/lib/R
/usr/local/lib64/R
/opt/local/lib/R
/opt/local/lib64/R
/opt/R/*
/opt/local/R/*
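
When you test a custom image, it can be useful to list which of these locations actually exist in the container. The following minimal sketch globs the same locations under a configurable root. It only approximates the scan and is not RStudio's own discovery logic; the function names and the demo directory tree are hypothetical.

```python
import glob
import os
import tempfile

# Search locations listed in this post; entries ending in /* hold one directory per version.
R_SEARCH_PATTERNS = [
    "usr/lib/R",
    "usr/lib64/R",
    "usr/local/lib/R",
    "usr/local/lib64/R",
    "opt/local/lib/R",
    "opt/local/lib64/R",
    "opt/R/*",
    "opt/local/R/*",
]


def discover_r_homes(root="/"):
    """Return existing directories under root that match the search locations."""
    found = []
    for pattern in R_SEARCH_PATTERNS:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            if os.path.isdir(path):
                found.append(path)
    return found


def demo():
    """Build a throwaway tree with two R versions and list what is discovered."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "opt", "R", "4.1.3"))
        os.makedirs(os.path.join(root, "opt", "R", "4.0.5"))
        return [os.path.basename(path) for path in discover_r_homes(root)]


print(demo())  # ['4.0.5', '4.1.3']
```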

We can install more versions in the container image, as shown in the following snippet. They will be installed in /opt/R/.

RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-4.0.5-1-1.x86_64.rpm && \
    yum install -y R-4.0.5-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-4.0.5-1-1.x86_64.rpm

RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-3.6.3-1-1.x86_64.rpm && \
    yum install -y R-3.6.3-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-3.6.3-1-1.x86_64.rpm

RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-3.5.3-1-1.x86_64.rpm && \
    yum install -y R-3.5.3-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-3.5.3-1-1.x86_64.rpm

Install RStudio Professional Drivers

Data scientists often need to access data from sources such as Amazon Athena and Amazon Redshift within RStudio on SageMaker. You can do so using RStudio Professional Drivers and RStudio Connections. Make sure you install the relevant libraries and drivers as shown in the following snippet:

# Install RStudio Professional Drivers ----------------------------------------#
RUN yum update -y && \
    yum install -y unixODBC unixODBC-devel && \
    yum clean all

ARG DRIVERS_VERSION=2021.10.0-1
RUN curl -O https://drivers.rstudio.org/7C152C12/installer/rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && \
    yum install -y rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && \
    yum clean all && \
    rm -f rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && \
    cp /opt/rstudio-drivers/odbcinst.ini.sample /etc/odbcinst.ini

RUN /opt/R/${R_VERSION}/bin/R -e 'install.packages("odbc", repos="https://packagemanager.rstudio.com/cran/__linux__/centos7/latest")'

Install custom libraries

You can also install additional R and Python libraries so that data scientists don’t need to install them on the fly:

RUN /opt/R/${R_VERSION}/bin/R -e \
    "install.packages(c('reticulate', 'readr', 'curl', 'ggplot2', 'dplyr', 'stringr', 'fable', 'tsibble', 'feasts', 'remotes', 'urca', 'sodium', 'plumber', 'jsonlite'), repos='https://packagemanager.rstudio.com/cran/__linux__/centos7/latest')"

RUN /opt/python/${PYTHON_VERSION}/bin/pip install --upgrade \
        'boto3>1.0,<2.0' \
        'awscli>1.0,<2.0' \
        'sagemaker[local]<3' \
        'sagemaker-studio-image-build' \
        'numpy'

When you’ve finished your customization in a Dockerfile, it’s time to build a container image and push it to Amazon ECR.

Build and push to Amazon ECR

You can build a container image from the Dockerfile from a terminal where the Docker engine is installed, such as your local terminal or AWS Cloud9. If you’re building it from a terminal within RStudio on SageMaker, you can use SageMaker Studio Image Build. We demonstrate the steps for both approaches.

In a local terminal where the Docker engine is present, you can run the following commands from where the Dockerfile is. You can use the sample script create-and-update-image.sh.

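ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)  # your AWS account ID
REGION=$(aws configure get region)              # your configured Region (or set it manually)
IMAGE_NAME=r-4.1.3-rstudio-1.4.1717-3           # the name for SageMaker Image
REPO=rstudio-custom                             # ECR repository name
TAG=$IMAGE_NAME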
# login to your Amazon ECR
aws ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com

# create a repo
aws ecr create-repository --repository-name ${REPO}

# build a docker image and push it to the repo
docker build . -t ${REPO}:${TAG} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}

In a terminal on RStudio on SageMaker, run the following commands:

pip install sagemaker-studio-image-build
sm-docker build . --repository ${REPO}:${IMAGE_NAME}

After these commands, you have a repository and a Docker container image in Amazon ECR for our next step, in which we attach the container image for use in RStudio on SageMaker. Note the image URI in Amazon ECR <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/<REPO>:<TAG> for later use.
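
If you script the later steps in Python, the image URI can be assembled from the same four parts. The following is a minimal sketch; the helper name ecr_image_uri and the account ID and Region shown are placeholders for illustration, not values from this post.

```python
def ecr_image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Build an Amazon ECR image URI in the <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/<REPO>:<TAG> format."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Placeholder account ID and Region, for illustration only
uri = ecr_image_uri(
    account_id="123456789012",
    region="us-west-2",
    repo="rstudio-custom",
    tag="r-4.1.3-rstudio-1.4.1717-3",
)
print(uri)
```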

Update RStudio on SageMaker through the console

RStudio on SageMaker allows runtime customization through the use of a custom SageMaker image. A SageMaker image is a holder for a set of SageMaker image versions. Each image version represents a container image that is compatible with RStudio on SageMaker and stored in an Amazon ECR repository. To make a custom SageMaker image available to all RStudio users within a domain, you can attach the image to the domain following the steps in this section.

  1. On the SageMaker console, navigate to the Custom SageMaker Studio images attached to domain page, and choose Attach image.
  2. Select New image, and enter your Amazon ECR image URI.
  3. Choose Next.
  4. In the Image properties section, provide an Image name (required), Image display name (optional), Description (optional), IAM role, and tags.
    The image display name, if provided, is shown in the session launcher in RStudio on SageMaker. If the Image display name field is left empty, the image name is shown in RStudio on SageMaker instead.
  5. Leave EFS mount path and Advanced configuration (User ID and Group ID) as default because RStudio on SageMaker manages the configuration for us.
  6. In the Image type section, select RStudio image.
  7. Choose Submit.

You can now see a new entry in the list. It’s worth noting that, with the introduction of the support of custom RStudio images, you can see a new Usage type column in the table to denote whether an image is an RStudio image or an Amazon SageMaker Studio image.

It may take up to 5–10 minutes for the custom images to be available in the session launcher UI. You can then launch a new R session in RStudio on SageMaker with your custom images.

Over time, you may want to retire old and outdated images. To remove the custom images from the list of custom images in RStudio, select the images in the list and choose Detach.

Choose Detach again to confirm.

Update RStudio on SageMaker via the AWS CLI

The following sections describe the steps to create a SageMaker image and attach it for use in RStudio on SageMaker using the AWS CLI. You can use the sample script create-and-update-image.sh.

Create the SageMaker image and image version

The first step is to create a SageMaker image from the custom container image in Amazon ECR by running the following two commands:

ROLE_ARN=<execution-role-that-has-AmazonSageMakerFullAccess-policy>
DISPLAY_NAME=RSession-r-4.1.3-rstudio-1.4.1717-3
aws sagemaker create-image \
    --image-name ${IMAGE_NAME} \
    --display-name ${DISPLAY_NAME} \
    --role-arn ${ROLE_ARN}

aws sagemaker create-image-version \
    --image-name ${IMAGE_NAME} \
    --base-image "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}"

Note that the custom image displayed in the session launcher in RStudio on SageMaker is determined by the input of --display-name. If the optional display name is not provided, the input of --image-name is used instead. Also note that the IAM role allows SageMaker to attach an Amazon ECR image to RStudio on SageMaker.

Create an AppImageConfig

In addition to a SageMaker image, which captures the image URI from Amazon ECR, an app image configuration (AppImageConfig) is required for use in a SageMaker domain. We simplify the configuration for an RSessionApp image so we can just create a placeholder configuration with the following command:

IMAGE_CONFIG_NAME=r-4-1-3-rstudio-1-4-1717-3
aws sagemaker create-app-image-config \
    --app-image-config-name ${IMAGE_CONFIG_NAME}
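
The AppImageConfig names used in this post look like the image names with the dots replaced by hyphens. If you manage several images, you could derive one name from the other. The following sketch assumes that convention; the helper name is hypothetical, and you should check the naming rules in the SageMaker API reference before relying on it.

```python
import re


def app_image_config_name(image_name: str) -> str:
    """Replace every character other than letters, digits, and hyphens with a hyphen."""
    return re.sub(r"[^A-Za-z0-9-]", "-", image_name)


print(app_image_config_name("r-4.1.3-rstudio-1.4.1717-3"))
```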

Attach to a SageMaker domain

With the SageMaker image and the app image configuration created, we’re ready to attach the custom container image to the SageMaker domain. To make a custom SageMaker image available to all RStudio users within a domain, you attach the image to the domain as a default user setting. All existing users and any new users will be able to use the custom image.

For better readability, we place the following configuration into the JSON file default-user-settings.json:

{
    "DefaultUserSettings": {
        "RSessionAppSettings": {
           "CustomImages": [
                {
                     "ImageName": "r-4.1.3-rstudio-2022",
                     "AppImageConfigName": "r-4-1-3-rstudio-2022"
                },
                {
                     "ImageName": "r-4.1.3-rstudio-1.4.1717-3",
                     "AppImageConfigName": "r-4-1-3-rstudio-1-4-1717-3"
                }
            ]
        }
    }
}

In this file, we can specify the image and AppImageConfig name pairs in a list in DefaultUserSettings.RSessionAppSettings.CustomImages. The preceding snippet assumes two custom images are being created.

Then run the following command to update the SageMaker domain:

aws sagemaker update-domain \
    --domain-id <sagemaker-domain-id> \
    --cli-input-json file://default-user-settings.json

After you update the domain, it may take 5–10 minutes for the custom images to be available in the session launcher UI. You can then launch a new R session in RStudio on SageMaker with your custom images.
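
If you maintain many images, you might prefer to generate the contents of default-user-settings.json rather than edit it by hand. The following is a minimal sketch that builds the same structure from a list of name pairs; the function name build_default_user_settings is hypothetical.

```python
import json


def build_default_user_settings(pairs):
    """Build the update-domain settings from (image name, AppImageConfig name) pairs."""
    return {
        "DefaultUserSettings": {
            "RSessionAppSettings": {
                "CustomImages": [
                    {"ImageName": image, "AppImageConfigName": config}
                    for image, config in pairs
                ]
            }
        }
    }


settings = build_default_user_settings(
    [
        ("r-4.1.3-rstudio-2022", "r-4-1-3-rstudio-2022"),
        ("r-4.1.3-rstudio-1.4.1717-3", "r-4-1-3-rstudio-1-4-1717-3"),
    ]
)
settings_json = json.dumps(settings, indent=4)
print(settings_json)
```

You can write the printed string to default-user-settings.json and pass it to the update-domain command as before.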

Detach images from a SageMaker domain

You can detach images simply by removing the ImageName and AppImageConfigName pairs from default-user-settings.json and updating the domain.

For example, updating the domain with the following default-user-settings.json removes r-4.1.3-rstudio-2022 from the R session launching UI and leaves r-4.1.3-rstudio-1.4.1717-3 as the only custom image available to all users in a domain:

{
    "DefaultUserSettings": {
        "RSessionAppSettings": {
           "CustomImages": [
                {
                     "ImageName": "r-4.1.3-rstudio-1.4.1717-3",
                     "AppImageConfigName": "r-4-1-3-rstudio-1-4-1717-3"
                }
            ]
        }
    }
}
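
The same edit can be scripted. The following sketch removes one image from a settings dictionary of the shape shown above and leaves the input unchanged; the function name detach_image is hypothetical.

```python
import copy


def detach_image(settings: dict, image_name: str) -> dict:
    """Return a copy of the settings without the custom image named image_name."""
    updated = copy.deepcopy(settings)
    r_settings = updated["DefaultUserSettings"]["RSessionAppSettings"]
    r_settings["CustomImages"] = [
        entry for entry in r_settings["CustomImages"]
        if entry["ImageName"] != image_name
    ]
    return updated


current = {
    "DefaultUserSettings": {
        "RSessionAppSettings": {
            "CustomImages": [
                {"ImageName": "r-4.1.3-rstudio-2022",
                 "AppImageConfigName": "r-4-1-3-rstudio-2022"},
                {"ImageName": "r-4.1.3-rstudio-1.4.1717-3",
                 "AppImageConfigName": "r-4-1-3-rstudio-1-4-1717-3"},
            ]
        }
    }
}
remaining = detach_image(current, "r-4.1.3-rstudio-2022")
print(remaining["DefaultUserSettings"]["RSessionAppSettings"]["CustomImages"])
```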

Conclusion

RStudio on SageMaker makes it simple for data scientists to build ML and analytic solutions in R at scale, and for administrators to manage a robust data science environment for their developers. Data scientists want to customize the environment so that they can use the right libraries for the right job and achieve the desired reproducibility for each ML project. Administrators need to standardize the data science environment for regulatory and security reasons. You can now create custom container images that meet your organizational requirements and allow data scientists to use them in RStudio on SageMaker.

We encourage you to try it out. Happy developing!


About the Authors

Michael Hsieh is a Senior AI/ML Specialist Solutions Architect. He works with customers to advance their ML journey with a combination of AWS ML offerings and his ML domain knowledge. As a Seattle transplant, he loves exploring the great Mother Nature the city has to offer, such as the hiking trails, scenery kayaking in the SLU, and the sunset at Shilshole Bay.

Declan Kelly is a Software Engineer on the Amazon SageMaker Studio team. He has been working on Amazon SageMaker Studio since its launch at AWS re:Invent 2019. Outside of work, he enjoys hiking and climbing.

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Add-ons.


Text classification for online conversations with machine learning on AWS

Online conversations are ubiquitous in modern life, spanning industries from video games to telecommunications. This has led to an exponential growth in the amount of online conversation data, which has helped in the development of state-of-the-art natural language processing (NLP) systems like chatbots and natural language generation (NLG) models. Over time, various NLP techniques for text analysis have also evolved. This creates a need for a fully managed service that can be integrated into applications using API calls without the need for extensive machine learning (ML) expertise. AWS offers pre-trained AWS AI services like Amazon Comprehend, which can effectively handle NLP use cases involving classification, text summarization, entity recognition, and more to gather insights from text.

Additionally, online conversations have led to widespread non-traditional usage of language. Traditional NLP techniques often perform poorly on this text data due to the constantly evolving and domain-specific vocabularies that exist within different platforms, as well as the significant lexical deviations of words from proper English, either by accident or intentionally as a form of adversarial attack.

In this post, we describe multiple ML approaches for text classification of online conversations with tools and services available on AWS.

Prerequisites

Before diving deep into this use case, please complete the following prerequisites:

  1. Set up an AWS account and create an IAM user.
  2. Set up the AWS CLI and AWS SDKs.
  3. (Optional) Set up your Cloud9 IDE environment.

Dataset

For this post, we use the Jigsaw Unintended Bias in Toxicity Classification dataset, a benchmark for the specific problem of classification of toxicity in online conversations. The dataset provides toxicity labels as well as several subgroup attributes such as obscene, identity attack, insult, threat, and sexually explicit. Labels are provided as fractional values that represent the proportion of human annotators who believed the attribute applied to a given piece of text; annotators are rarely unanimous. To generate binary labels (for example, toxic or non-toxic), a threshold of 0.5 is applied to the fractional values, and comments with values greater than or equal to the threshold are treated as the positive class for that label.
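
As a small illustration of the thresholding step, the following sketch converts fractional annotator scores into binary labels. It uses the comparison stated in the evaluation section of this post (1 if label>=0.5, 0 otherwise); the function name and sample values are made up for illustration.

```python
def binarize_labels(fractions, threshold=0.5):
    """Map fractional annotator agreement to 1 (positive) or 0 (negative)."""
    return [1 if value >= threshold else 0 for value in fractions]


# Hypothetical fractional toxicity scores for four comments
toxicity = [0.0, 0.49, 0.5, 0.9]
print(binarize_labels(toxicity))
```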

Subword embedding and RNNs

For our first modeling approach, we use a combination of subword embedding and recurrent neural networks (RNNs) to train text classification models. Subword embeddings were introduced by Bojanowski et al. in 2017 as an improvement upon previous word-level embedding methods. Traditional Word2Vec skip-gram models are trained to learn a static vector representation of a target word that optimally predicts that word’s context. Subword models, on the other hand, represent each target word as a bag of the character n-grams that make up the word, where an n-gram is composed of a set of n consecutive characters. This method allows for the embedding model to better represent the underlying morphology of related words in the corpus as well as the computation of embeddings for novel, out-of-vocabulary (OOV) words. This is particularly important in the context of online conversations, a problem space in which users often misspell words (sometimes intentionally to evade detection) and also use a unique, constantly evolving vocabulary that might not be captured by a general training corpus.
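
To make the idea of character n-grams concrete, the following sketch lists the n-grams for a single word. It follows the convention described in the subword embedding paper of wrapping the word in < and > boundary markers and using n from 3 to 6. It is a simplification: real implementations also keep the whole word as a token and hash the n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list:
    """Return all character n-grams of the boundary-marked word, for n in [n_min, n_max]."""
    token = f"<{word}>"
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(token) - n + 1):
            ngrams.append(token[start:start + n])
    return ngrams


print(char_ngrams("where", n_min=3, n_max=3))
```

Because a misspelled or unseen word still shares most of its n-grams with known words, its embedding can be composed from n-gram vectors the model has already learned.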

Amazon SageMaker makes it easy to train and optimize an unsupervised subword embedding model on your own corpus of domain-specific text data with the built-in BlazingText algorithm. We can also download existing general-purpose models trained on large datasets of online text, such as the following English language models available directly from fastText. From your SageMaker notebook instance, simply run the following to download a pretrained fastText model:

!wget -O vectors.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip

Whether you’ve trained your own embeddings with BlazingText or downloaded a pretrained model, the result is a zipped model binary that you can use with the gensim library to embed a given target word as a vector based on its constituent subwords:

# Imports
import os
from zipfile import ZipFile
from gensim.models.fasttext import load_facebook_vectors

# Unzip the model binary into 'dir_path'
dir_path = '<dir_path_name>'  # replace with your target directory
with ZipFile('vectors.zip', 'r') as zipObj:
    zipObj.extractall(path=dir_path)

# Load embedding model into memory
# (adjust the file name if the .bin file in your archive is named differently)
embed_model = load_facebook_vectors(os.path.join(dir_path, 'vectors.bin'))

# Compute embedding vector for 'word'
word_embedding = embed_model[word]

After we preprocess a given segment of text, we can use this approach to generate a vector representation for each of the constituent words (as separated by spaces). We then use SageMaker and a deep learning framework such as PyTorch to train a customized RNN with a binary or multilabel classification objective to predict whether the text is toxic or not and the specific sub-type of toxicity based on labeled training examples.

To upload your preprocessed text to Amazon Simple Storage Service (Amazon S3), use the following code:

import os
import boto3
s3 = boto3.client('s3')

bucket = '<bucket_name>'  # replace with your bucket name
prefix = '<prefix_name>'  # replace with your key prefix

s3.upload_file('train.pkl', bucket, os.path.join(prefix, 'train/train.pkl'))
s3.upload_file('valid.pkl', bucket, os.path.join(prefix, 'valid/valid.pkl'))
s3.upload_file('test.pkl', bucket, os.path.join(prefix, 'test/test.pkl'))

To initiate scalable, multi-GPU model training with SageMaker, enter the following code:

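import boto3
import sagemaker
sess = sagemaker.Session()
iam = boto3.client('iam')
# Assumes an IAM role with this name exists in your account; replace with your own execution role name
role = iam.get_role(RoleName='AmazonSageMakerFullAccess')['Role']['Arn']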

from sagemaker.pytorch import PyTorch

# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 20, # Maximum number of epochs to train model
    'train-batch-size': 128, # Training batch size (No. sentences)
    'eval-batch-size': 1024, # Evaluation batch size (No. sentences)
    'embed-size': 300, # Vector dimension of word embeddings (Must match embedding model)
    'lstm-hidden-size': 200, # Number of neurons in LSTM hidden layer
    'lstm-num-layers': 2, # Number of stacked LSTM layers
    'proj-size': 100, # Number of neurons in intermediate projection layer
    'num-targets': len(<list_of_label_names>), # Number of targets for classification
    'class-weight': ' '.join([str(c) for c in <list_of_weights_per_class>]), # Weight to apply to each target during training
    'total-length': <max_number_of_words_per_sentence>, # Maximum number of words per sentence
    'metric-for-best-model': 'ap_score_weighted', # Metric on which to select the best model
}

# create the Estimator
pytorch_estimator = PyTorch(
    entry_point='train.py',
    source_dir=<source_dir_path>,
    instance_type=<train_instance_type>,
    volume_size=200,
    instance_count=1,
    role=role,
    framework_version='1.6.0',
    py_version='py36',
    hyperparameters=hyperparameters,
    metric_definitions=[
        {'Name': 'validation:accuracy', 'Regex': 'eval_accuracy = (.*?);'},
        {'Name': 'validation:f1-micro', 'Regex': 'eval_f1_score_micro = (.*?);'},
        {'Name': 'validation:f1-macro', 'Regex': 'eval_f1_score_macro = (.*?);'},
        {'Name': 'validation:f1-weighted', 'Regex': 'eval_f1_score_weighted = (.*?);'},
        {'Name': 'validation:ap-micro', 'Regex': 'eval_ap_score_micro = (.*?);'},
        {'Name': 'validation:ap-macro', 'Regex': 'eval_ap_score_macro = (.*?);'},
        {'Name': 'validation:ap-weighted', 'Regex': 'eval_ap_score_weighted = (.*?);'},
        {'Name': 'validation:auc-micro', 'Regex': 'eval_auc_score_micro = (.*?);'},
        {'Name': 'validation:auc-macro', 'Regex': 'eval_auc_score_macro = (.*?);'},
        {'Name': 'validation:auc-weighted', 'Regex': 'eval_auc_score_weighted = (.*?);'}
    ]
)

pytorch_estimator.fit(
    {
        'train': 's3://<bucket_name>/<prefix_name>/train',
        'valid': 's3://<bucket_name>/<prefix_name>/valid',
        'test': 's3://<bucket_name>/<prefix_name>/test'
    }
)
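
In script mode, SageMaker passes the hyperparameters dictionary to the entry point as command-line arguments, so a list such as class-weight has to be serialized as a string and parsed again inside train.py. The train.py used in this post is not shown, so the following is only a sketch of how such parsing might look with argparse; the function name and the subset of arguments are assumptions.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse a subset of the hyperparameters passed to the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--train-batch-size", type=int, default=128)
    # Space-separated weights, for example "1.0 2.5 4.0"
    parser.add_argument("--class-weight", type=str, default="")
    args = parser.parse_args(argv)
    # Convert the serialized string back into a list of floats
    args.class_weight = [float(w) for w in args.class_weight.split()]
    return args


args = parse_hyperparameters(
    ["--epochs", "20", "--train-batch-size", "128", "--class-weight", "1.0 2.5 4.0"]
)
print(args.epochs, args.train_batch_size, args.class_weight)
```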

Within <source_dir_path>, we define a PyTorch Dataset that is used by train.py to prepare the text data for training and evaluation of the model:

from typing import Tuple

import numpy as np
import torch


def pad_matrix(m: np.ndarray, max_len: int = 100) -> Tuple[np.ndarray, np.ndarray]:
    """Pads an embedding matrix to a specified maximum length."""
    if m.ndim == 1:
        m = m.reshape(1, -1)
    mask = np.ones_like(m)
    if m.shape[0] > max_len:
        m = m[:max_len, :]
        mask = mask[:max_len, :]
    else:
        m = np.pad(m, ((0, max_len - m.shape[0]), (0,0)))
        mask = np.pad(mask, ((0, max_len - mask.shape[0]), (0,0)))
    return m, mask


class EmbeddingDataset(torch.utils.data.Dataset):
    """PyTorch dataset representing pretrained sentence embeddings, masks, and labels."""
    def __init__(self, text: list, labels: list, max_len: int = 100):
        self.text = text
        self.labels = labels
        self.max_len = max_len

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:
        # embed_line (not shown) is assumed to return a (num_words, embed_size) NumPy array
        e = embed_line(self.text[idx])
        length = e.shape[0]
        m, mask = pad_matrix(e, max_len=self.max_len)
        
        item = {}
        item['embeddings'] = torch.from_numpy(m)
        item['mask'] = torch.from_numpy(mask)
        item['labels'] = torch.tensor(self.labels[idx])
        if length > self.max_len:
            item['lengths'] = torch.tensor(self.max_len)
        else:
            item['lengths'] = torch.tensor(length)
        
        return item

Note that this code anticipates that the vectors.zip file containing your fastText or BlazingText embeddings will be stored in <source_dir_path>.
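
To make the padding behavior of pad_matrix concrete, the following dependency-free sketch applies the same truncate-or-pad logic to plain Python lists. It is an illustration only, not the code used for training; the function name pad_rows and the sample values are made up.

```python
def pad_rows(rows, max_len, dim):
    """Truncate or zero-pad a list of dim-sized vectors to max_len rows, with a matching mask."""
    kept = [list(row) for row in rows[:max_len]]
    mask = [[1] * dim for _ in kept]  # 1 marks real word vectors
    missing = max_len - len(kept)
    kept.extend([[0.0] * dim for _ in range(missing)])
    mask.extend([[0] * dim for _ in range(missing)])  # 0 marks padding
    return kept, mask


# Two "word vectors" of size 2, padded to a sentence length of 3
padded, mask = pad_rows([[1.0, 2.0], [3.0, 4.0]], max_len=3, dim=2)
print(padded)
print(mask)
```

The mask lets the model ignore padded positions, and the original sentence length (capped at max_len) is returned separately by the dataset for the same reason.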

Additionally, you can easily deploy pretrained fastText models on their own to live SageMaker endpoints to compute embedding vectors on the fly for use in relevant word-level tasks. See the following GitHub example for more details.

Transformers with Hugging Face

For our second modeling approach, we transition to the usage of Transformers, introduced in the paper Attention Is All You Need. Transformers are deep learning models designed to deliberately avoid the pitfalls of RNNs by relying on a self-attention mechanism to draw global dependencies between input and output. The Transformer model architecture allows for significantly better parallelization and can achieve high performance in relatively short training time.

Built on the success of Transformers, BERT, introduced in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, added bidirectional pre-training for language representation. Inspired by the Cloze task, BERT is pre-trained with masked language modeling (MLM), in which the model learns to recover the original words for randomly masked tokens. The BERT model is also pretrained on the next sentence prediction (NSP) task to predict if two sentences are in correct reading order. Since its advent in 2018, BERT and its variations have been widely used in text classification tasks.
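
The masking step of MLM can be sketched in a few lines. The following example replaces each token with a mask token with probability 0.15 (the rate used by BERT) and records the original token as the prediction target. It is a simplification: BERT sometimes substitutes a random token or keeps the original instead of always inserting the mask token, and it operates on subword IDs rather than whole words. The function name and sample sentence are hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens; labels hold the original token at masked positions, None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position does not contribute to the loss
    return masked, labels


sentence = "online conversations are ubiquitous in modern life".split()
masked, labels = mask_tokens(sentence, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```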

Our solution uses a variant of BERT known as RoBERTa, which was introduced in the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach. RoBERTa further improves BERT performance on a variety of natural language tasks through optimized model training, including training models longer on a corpus 10 times larger, using optimized hyperparameters and dynamic random masking, removing the NSP task, and more.

Our RoBERTa-based models use the Hugging Face Transformers library, which is a popular open-source Python framework that provides high-quality implementations of all kinds of state-of-the-art Transformer models for a variety of NLP tasks. Hugging Face has partnered with AWS to enable you to easily train and deploy Transformer models on SageMaker. This functionality is available through Hugging Face AWS Deep Learning Container images, which include the Transformers, Tokenizers, and Datasets libraries, and optimized integration with SageMaker for model training and inference.

In our implementation, we inherit the RoBERTa architecture backbone from the Hugging Face Transformers framework and use SageMaker to train and deploy our own text classification model, which we call RoBERTox. RoBERTox uses byte pair encoding (BPE), introduced in Neural Machine Translation of Rare Words with Subword Units, to tokenize input text into subword representations. We can then train our models and tokenizers on the Jigsaw data or any large domain-specific corpus (such as the chat logs from a specific game) and use them for customized text classification. We define our custom classification model class in the following code:

class RoBERToxForSequenceClassification(CustomLossMixIn, RobertaPreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def __init__(self, config: PretrainedConfig, *inputs, **kwargs):
        """Initialize the RoBERToxForSequenceClassification instance

        Parameters
        ----------
        config : PretrainedConfig
        num_labels : Optional[int]
            if not None, overwrite the default classification head in pretrained model.
        mode : Optional[str]
            'MULTI_CLASS', 'MULTI_LABEL' or "REGRESSION". Used to determine loss
        class_weight : Optional[List[float]]
            If not None, add class weight to BCEWithLogitsLoss or CrossEntropyLoss
        """
        super().__init__(config, *inputs, **kwargs)
        # Define model architecture
        self.roberta = RobertaModel(self.config, add_pooling_layer=False)
        self.classifier = RobertaClassificationHead(self.config)
        self.init_weights()

    @modeling_roberta.add_start_docstrings_to_model_forward(
        modeling_roberta.ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")
    )
    @modeling_roberta.add_code_sample_docstrings(
        tokenizer_class=modeling_roberta._TOKENIZER_FOR_DOC,
        checkpoint=modeling_roberta._CHECKPOINT_FOR_DOC,
        output_type=SequenceClassifierOutput,
        config_class=modeling_roberta._CONFIG_FOR_DOC,
    )
    def forward(
            self,
            input_ids: torch.Tensor = None,
            attention_mask: torch.Tensor = None,
            token_type_ids: torch.Tensor = None,
            position_ids: torch.Tensor = None,
            head_mask: torch.Tensor = None,
            inputs_embeds: torch.Tensor = None,
            labels: torch.Tensor = None,
            output_attentions: bool = None,
            output_hidden_states: bool = None,
            return_dict: bool = None,
            sample_weights: torch.Tensor = None,
    ) -> SequenceClassifierOutput:
        """Forward pass to return loss, logits, ...

        Returns
        --------
        output : SequenceClassifierOutput
            has those keys: loss, logits, hidden states, attentions
        """
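        # Fall back to the config default only when return_dict is not given,
        # so that an explicit return_dict=False is respected
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict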

        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]  # last hidden states; the classification head reads the first (<s>) token
        logits = self.classifier(sequence_output)
        loss = self.compute_loss(logits, labels, sample_weights=sample_weights)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor, sample_weights: Optional[torch.Tensor] = None) -> torch.FloatTensor:
        return super().compute_loss(logits, labels, sample_weights)
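
As noted earlier, RoBERTox tokenizes text with byte pair encoding. To give a rough idea of how BPE learns its subword vocabulary, the following toy sketch repeatedly merges the most frequent adjacent symbol pair in a small word-frequency table, following the character-level procedure described in Neural Machine Translation of Rare Words with Subword Units. RoBERTa's tokenizer works on bytes and is far more optimized, so treat this as an illustration only; the function names and the toy corpus are not from the RoBERTox code.

```python
from collections import Counter


def most_frequent_pair(vocab):
    """vocab maps a tuple of symbols to its frequency; return the most frequent adjacent pair."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a word-frequency dictionary."""
    # Start from single characters plus an end-of-word marker
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(vocab)
        if pair is None:
            break
        vocab = merge_pair(vocab, pair)
        merges.append(pair)
    return merges, vocab


merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
```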

Before training, we prepare our text data and labels using Hugging Face’s datasets library and upload the result to Amazon S3:

from datasets import Dataset
import multiprocessing

data_train = Dataset.from_pandas(df_train)
…

tokenizer = <instantiated_huggingface_tokenizer>

# With batched=True, examples is a batch: a mapping of column name to a list of values
def preprocess_function(examples):
    result = tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True)
    return result

num_proc = multiprocessing.cpu_count()
print("Number of CPUs =", num_proc)

data_train = data_train.map(
    preprocess_function,
    batched=True,
    load_from_cache_file=False,
    num_proc=num_proc
)
…

import botocore
from datasets.filesystems import S3FileSystem

s3_session = botocore.session.Session()

# create S3FileSystem instance with s3_session
s3 = S3FileSystem(session=s3_session)  

# saves encoded_dataset to your s3 bucket
data_train.save_to_disk(f's3://<bucket_name>/<prefix_name>/train', fs=s3)
… 

We initiate training of the model in a similar fashion to the RNN:

import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model-name': <huggingface_base_model_name>,
    'epochs': 10,
    'train-batch-size': 32,
    'eval-batch-size': 64,
    'num-labels': len(<list_of_label_names>),
    'class-weight': ' '.join([str(c) for c in <list_of_class_weights>]),
    'metric-for-best-model': 'ap_score_weighted',
    'save-total-limit': 1,
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir=<source_dir_path>,
    instance_type=<train_instance_type>,
    instance_count=1,
    role=role,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    metric_definitions=[
        {'Name': 'validation:accuracy', 'Regex': 'eval_accuracy = (.*?);'},
        {'Name': 'validation:f1-micro', 'Regex': 'eval_f1_score_micro = (.*?);'},
        {'Name': 'validation:f1-macro', 'Regex': 'eval_f1_score_macro = (.*?);'},
        {'Name': 'validation:f1-weighted', 'Regex': 'eval_f1_score_weighted = (.*?);'},
        {'Name': 'validation:ap-micro', 'Regex': 'eval_ap_score_micro = (.*?);'},
        {'Name': 'validation:ap-macro', 'Regex': 'eval_ap_score_macro = (.*?);'},
        {'Name': 'validation:ap-weighted', 'Regex': 'eval_ap_score_weighted = (.*?);'},
        {'Name': 'validation:auc-micro', 'Regex': 'eval_auc_score_micro = (.*?);'},
        {'Name': 'validation:auc-macro', 'Regex': 'eval_auc_score_macro = (.*?);'},
        {'Name': 'validation:auc-weighted', 'Regex': 'eval_auc_score_weighted = (.*?);'}
    ]
)

huggingface_estimator.fit(
    {
        'train': 's3://<bucket_name>/<prefix_name>/train',
        'valid': 's3://<bucket_name>/<prefix_name>/valid',
        'test': 's3://<bucket_name>/<prefix_name>/test'
    }
)
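
The metric_definitions passed to both estimators tell SageMaker which regular expressions to apply to the training logs to capture metric values. The following sketch shows how two of those patterns would match a log line. The log line itself is hypothetical; the actual output format depends on what train.py prints.

```python
import re

METRIC_DEFINITIONS = [
    {"Name": "validation:accuracy", "Regex": "eval_accuracy = (.*?);"},
    {"Name": "validation:ap-weighted", "Regex": "eval_ap_score_weighted = (.*?);"},
]


def extract_metrics(log_line, definitions):
    """Return the metrics whose regex matches the log line, keyed by metric name."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log line printed by the training script
line = "eval_accuracy = 0.91; eval_ap_score_weighted = 0.74;"
print(extract_metrics(line, METRIC_DEFINITIONS))
```

Note that each pattern expects the value to be followed by a semicolon, so the training script has to log metrics in that form for them to be captured.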

Finally, the following Python code snippet illustrates the process of serving RoBERTox via a live SageMaker endpoint for real-time text classification for a JSON request:

from sagemaker.huggingface import HuggingFaceModel
from sagemaker import get_execution_role
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

class Classifier(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super().__init__(endpoint_name, sagemaker_session,
                         serializer=JSONSerializer(),
                         deserializer=JSONDeserializer())


hf_model = HuggingFaceModel(
    role=get_execution_role(),
    model_data=<s3_model_and_tokenizer.tar.gz>,
    entry_point="inference.py",
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    predictor_cls=Classifier
)

predictor = hf_model.deploy(instance_type=<deploy_instance_type>, initial_instance_count=1)

Evaluation of model performance: Jigsaw unintended bias dataset

The following table contains performance metrics for models trained and evaluated on data from the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition. We trained models for three different but interrelated tasks:

  • Binary case – The model was trained on the full training dataset to predict the toxicity label only
  • Fine-grained case – The subset of the training data for which toxicity>=0.5 was used to predict other toxicity sub-type labels (obscene, threat, insult, identity_attack, sexual_explicit)
  • Multitask case – The full training dataset was used to predict all six labels simultaneously

We trained RNN and RoBERTa models for each of these three tasks using the Jigsaw-provided fractional labels, which correspond to the proportion of annotators who thought the label was appropriate for the text, as well as with binary labels combined with class weights in the network loss function. In the binary labeling scheme, the proportions were thresholded at 0.5 for each available label (1 if label>=0.5, 0 otherwise), and the model loss functions were weighted based on the relative proportions of each binary label in the training dataset. In all cases, we found that using the fractional labels directly resulted in the best performance, indicating the added value of the information inherent in the degree of agreement between annotators.

We display two model metrics: the average precision (AP), which provides a summary of the precision-recall curve by computing the weighted mean of the precision values achieved at each classification threshold, and the area under the receiver operating characteristic curve (AUC), which aggregates model performance across classification thresholds with respect to the true positive rate and false positive rate. Note that the true class for a given text instance in the test set corresponds to whether the true proportion is greater than or equal to 0.5 (1 if label>=0.5, 0 otherwise).

Task | Subword Embedding + RNN, fractional labels | Subword Embedding + RNN, binary labels + class weighting | RoBERTa, fractional labels | RoBERTa, binary labels + class weighting
Binary | AP=0.746, AUC=0.966 | AP=0.730, AUC=0.963 | AP=0.758, AUC=0.966 | AP=0.747, AUC=0.963
Fine-grained | AP=0.906, AUC=0.909 | AP=0.850, AUC=0.851 | AP=0.913, AUC=0.913 | AP=0.911, AUC=0.912
Multitask | AP=0.721, AUC=0.972 | AP=0.535, AUC=0.907 | AP=0.740, AUC=0.972 | AP=0.711, AUC=0.961
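
For readers who want to see how the two reported metrics are defined, the following dependency-free sketch computes AP and AUC for a small set of scores. It assumes no tied scores and at least one positive and one negative example, and it is meant as an illustration only; the numbers in the table were not produced with this code, and the sample labels and scores are made up.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive example (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count as half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true classes (after thresholding at 0.5) and model scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(average_precision(y_true, scores), roc_auc(y_true, scores))
```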

Conclusion

In this post, we presented two text classification approaches for online conversations using AWS ML services. You can generalize these solutions across online communication platforms, with industries such as gaming particularly likely to benefit from improved ability to detect harmful content. In future posts, we plan to further discuss an end-to-end architecture for seamless deployment of models into your AWS account.

If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.


About the Authors

Ryan Brand is a Data Scientist in the Amazon Machine Learning Solutions Lab. He has specific experience in applying machine learning to problems in healthcare and the life sciences, and in his free time he enjoys reading history and science fiction.

Sourav Bhabesh is a Data Scientist at the Amazon ML Solutions Lab. He develops AI/ML solutions for AWS customers across various industries. His specialty is Natural Language Processing (NLP), and he is passionate about deep learning. Outside of work, he enjoys reading books and traveling.

Liutong Zhou is an Applied Scientist at the Amazon ML Solutions Lab. He builds bespoke AI/ML solutions for AWS customers across various industries. He specializes in Natural Language Processing (NLP) and is passionate about multi-modal deep learning. He is a lyric tenor and enjoys singing operas outside of work.

Sia Gholami is a Senior Data Scientist at the Amazon ML Solutions Lab, where he builds AI/ML solutions for customers across various industries. He is passionate about natural language processing (NLP) and deep learning. Outside of work, Sia enjoys spending time in nature and playing tennis.

Daniel Horowitz is an Applied AI Science Manager. He leads a team of scientists at the Amazon ML Solutions Lab working to solve customer problems and drive cloud adoption with ML.


Hyperparameter optimization for fine-tuning pre-trained transformer models from Hugging Face

Large attention-based transformer models have obtained massive gains on natural language processing (NLP). However, training these gigantic networks from scratch requires a tremendous amount of data and compute. For smaller NLP datasets, a simple yet effective strategy is to use a pre-trained transformer, usually trained in an unsupervised fashion on very large datasets, and fine-tune it on the dataset of interest. Hugging Face maintains a large model zoo of these pre-trained transformers and makes them easily accessible even for novice users.

However, fine-tuning these models still requires expert knowledge, because they’re quite sensitive to their hyperparameters, such as learning rate or batch size. In this post, we show how to optimize these hyperparameters with Syne Tune, an open-source framework for distributed hyperparameter optimization (HPO). Syne Tune allows us to find a hyperparameter configuration that achieves a relative improvement of 1-4% over the default hyperparameters on popular GLUE benchmark datasets. The choice of the pre-trained model itself can also be considered a hyperparameter and therefore be automatically selected by Syne Tune. On a text classification problem, this leads to an additional boost in accuracy of approximately 5% compared to the default model. We can also automate further decisions that a user needs to make; we demonstrate this by exposing the instance type that we later use to deploy the model as a hyperparameter. By selecting the right instance type, we can find configurations that optimally trade off cost and latency.

For an introduction to Syne Tune please refer to Run distributed hyperparameter and neural architecture tuning jobs with Syne Tune.

Hyperparameter optimization with Syne Tune

We use the GLUE benchmark suite, which consists of nine datasets for natural language understanding tasks, such as textual entailment recognition or sentiment analysis. For that, we adapt Hugging Face’s run_glue.py training script. GLUE datasets come with a predefined training and evaluation set with labels as well as a hold-out test set without labels. Therefore, we split the training set into training and validation sets (70%/30% split) and use the evaluation set as our hold-out test dataset. Furthermore, we add another callback function to Hugging Face’s Trainer API that reports the validation performance after each epoch back to Syne Tune. See the following code:

import transformers

from syne_tune.report import Reporter

class SyneTuneReporter(transformers.trainer_callback.TrainerCallback):

    def __init__(self):
        self.report = Reporter()

    def on_evaluate(self, args, state, control, **kwargs):
        results = kwargs['metrics'].copy()
        results['step'] = state.global_step
        results['epoch'] = int(state.epoch)
        self.report(**results)

We start with optimizing typical training hyperparameters: the learning rate, warmup ratio to increase the learning rate, and the batch size for fine-tuning a pretrained BERT (bert-base-cased) model, which is the default model in the Hugging Face example. See the following code:

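# In older Syne Tune releases, this module is named syne_tune.search_space.
from syne_tune.config_space import loguniform, randint, uniform

config_space = dict()
config_space['learning_rate'] = loguniform(1e-6, 1e-4)
config_space['per_device_train_batch_size'] = randint(16, 48)
config_space['warmup_ratio'] = uniform(0, 0.5)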

As our HPO method, we use ASHA, which samples hyperparameter configurations uniformly at random and iteratively stops the evaluation of poorly performing configurations. Although more sophisticated methods that use a probabilistic model of the objective function exist, such as BO or MoBster, we use ASHA for this post because it comes without any assumptions on the search space.
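The early-stopping idea can be illustrated with a small synchronous successive-halving loop. This is a simplified sketch, not Syne Tune's implementation: ASHA makes the same promotion decisions asynchronously across workers, and the objective toy_validation_error is a hypothetical stand-in for an actual fine-tuning run.

```python
import math
import random


def sample_config(rng):
    """Draw a configuration at random (log-uniform for the learning rate)."""
    return {
        "learning_rate": 10 ** rng.uniform(-6, -4),
        "warmup_ratio": rng.uniform(0.0, 0.5),
    }


def toy_validation_error(config, epochs):
    """Hypothetical objective: error shrinks with more epochs and is
    lowest near learning_rate = 3e-5."""
    distance = abs(math.log10(config["learning_rate"]) - math.log10(3e-5))
    return 0.1 + 0.2 * distance + 0.3 / epochs


def successive_halving(num_configs=9, min_epochs=1, max_epochs=9, eta=3, seed=0):
    rng = random.Random(seed)
    survivors = [sample_config(rng) for _ in range(num_configs)]
    epochs = min_epochs
    while True:
        # Evaluate every surviving configuration at the current budget.
        scored = sorted(survivors, key=lambda c: toy_validation_error(c, epochs))
        if epochs >= max_epochs or len(scored) == 1:
            return scored[0], epochs
        # Keep the best 1/eta of the configurations and stop the rest early.
        survivors = scored[: max(1, len(scored) // eta)]
        epochs = min(max_epochs, epochs * eta)


best, budget = successive_halving()
print(best, budget)
```

With these defaults, nine configurations are trained for one epoch, the best three for three epochs, and the best one for nine epochs, so most of the budget goes to promising configurations.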

In the following figure, we compare the relative improvement in test error over Hugging Face’s default hyperparameter configuration.

For simplicity, we limit the comparison to MRPC, COLA, and STSB, but we observe similar improvements for other GLUE datasets as well. For each dataset, we run ASHA on a single ml.g4dn.xlarge Amazon SageMaker instance with a runtime budget of 1,800 seconds, which corresponds to approximately 13, 7, and 9 full function evaluations on these datasets, respectively. To account for the intrinsic randomness of the training process, for example caused by the mini-batch sampling, we run both ASHA and the default configuration for five repetitions, each with an independent seed for the random number generator, and report the average and standard deviation of the relative improvement across the repetitions. We can see that, across all datasets, we can in fact improve predictive performance by 1-3% relative to the performance of the carefully selected default configuration.
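As a sketch of how such a summary could be computed, the snippet below pairs the test errors of the default and tuned configurations by repetition and reports the mean and standard deviation of the relative improvement. The pairing by seed and the formula (default - tuned) / default are assumptions, and the error values are made up for illustration; they are not results from this post.

```python
from statistics import mean, stdev


def relative_improvement(default_error, tuned_error):
    """Fraction by which the tuned error undercuts the default error."""
    return (default_error - tuned_error) / default_error


# Made-up test errors for five repetitions with independent seeds.
default_errors = [0.160, 0.150, 0.155, 0.165, 0.158]
tuned_errors = [0.156, 0.148, 0.150, 0.161, 0.155]

improvements = [
    relative_improvement(d, t) for d, t in zip(default_errors, tuned_errors)
]
average = mean(improvements)
spread = stdev(improvements)
print(f"{100 * average:.2f}% +/- {100 * spread:.2f}%")
```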

Automate selecting the pre-trained model

We can use HPO not only to find hyperparameters, but also to automatically select the right pre-trained model. Why do we want to do this? Because no single model performs best across all datasets, we have to select the right model for a specific dataset. To demonstrate this, we evaluate a range of popular transformer models from Hugging Face. For each dataset, we rank each model by its test performance. The ranking changes across datasets (see the following figure), and no single model scores the highest on every dataset. As a reference, we also show the absolute test performance of each model on each dataset in the following figure.

To automatically select the right model, we can cast the choice of the model as a categorical parameter and add it to our hyperparameter search space:

config_space['model_name_or_path'] = choice(['bert-base-cased', 'bert-base-uncased', 'distilbert-base-uncased', 'distilbert-base-cased', 'roberta-base', 'albert-base-v2', 'distilroberta-base', 'xlnet-base-cased', 'albert-base-v1'])

Although the search space is now larger, that doesn’t necessarily mean that it’s harder to optimize. The following figure shows the test error of the best configuration observed by ASHA (selected based on the validation error) on the MRPC dataset over time, when we search in the original space with a bert-base-cased pre-trained model (blue line) or in the new augmented search space (orange line). Given the same budget, ASHA is able to find a much better performing hyperparameter configuration in the extended search space than in the smaller space.
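The curve in such a plot tracks the test error of the incumbent, that is, the configuration with the lowest validation error seen so far. The following is a minimal sketch of that bookkeeping; the function name and the trial values are hypothetical and not taken from the experiments.

```python
def incumbent_test_error(trials):
    """Return (seconds, test error of the best-so-far configuration) pairs.

    trials: iterable of (wallclock_seconds, validation_error, test_error).
    The incumbent is chosen by validation error only; the test error is
    reported but never used for selection.
    """
    best_validation = float("inf")
    best_test = None
    trajectory = []
    for seconds, validation_error, test_error in sorted(trials):
        if validation_error < best_validation:
            best_validation, best_test = validation_error, test_error
        trajectory.append((seconds, best_test))
    return trajectory


# Made-up trial results for illustration.
trials = [
    (120, 0.20, 0.22),
    (300, 0.18, 0.21),
    (450, 0.19, 0.17),
    (700, 0.15, 0.18),
]
trajectory = incumbent_test_error(trials)
print(trajectory)
```

Note that the third trial has the lowest test error but is never selected, because its validation error does not improve on the incumbent.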

Automate selecting the instance type

In practice, we might not just care about optimizing predictive performance. We might also care about other objectives, such as training time, (dollar) cost, latency, or fairness metrics. We also need to make other choices beyond the hyperparameters of the model, for example selecting the instance type.

Although the instance type doesn’t influence predictive performance, it strongly impacts the (dollar) cost, training runtime, and latency. The latter becomes particularly important when the model is deployed. We can phrase HPO as a multi-objective optimization problem, where we aim to optimize multiple objectives simultaneously. However, no single solution optimizes all metrics at the same time. Instead, we aim to find a set of configurations that optimally trade off one objective vs. the other. This is called the Pareto set.
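A Pareto set can be computed with a simple pairwise dominance check. The sketch below assumes that all objectives are minimized and uses made-up (test error, latency in milliseconds) pairs. It illustrates the concept only; it is not the non-dominated sorting routine that a multi-objective optimizer would use internally.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and
    strictly better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def pareto_set(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (test_error, latency_ms) pairs for illustration.
points = [(0.10, 40.0), (0.12, 25.0), (0.15, 30.0), (0.20, 20.0), (0.11, 45.0)]
front = pareto_set(points)
print(front)
```

Here (0.15, 30.0) is dropped because (0.12, 25.0) is better in both objectives, and (0.11, 45.0) is dropped because (0.10, 40.0) is. The three remaining points each represent a different trade-off.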

To analyze this setting further, we add the choice of the instance type as an additional categorical hyperparameter to our search space:

config_space['st_instance_type'] = choice(['ml.g4dn.xlarge', 'ml.g4dn.2xlarge', 'ml.p2.xlarge', 'ml.g4dn.4xlarge', 'ml.g4dn.8xlarge', 'ml.p3.2xlarge'])

We use MO-ASHA, which adapts ASHA to the multi-objective scenario by using non-dominated sorting. In each iteration, MO-ASHA also selects, for each configuration, the instance type we want to evaluate it on. To run HPO on a heterogeneous set of instances, Syne Tune provides the SageMaker backend. With this backend, each trial is evaluated as an independent SageMaker training job on its own instance. The number of workers defines how many SageMaker jobs we run in parallel at a given time. The optimizer itself, MO-ASHA in our case, runs on the local machine, in a SageMaker notebook, or in a separate SageMaker training job. See the following code:

import os

from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace
from syne_tune.backend import SageMakerBackend

backend = SageMakerBackend(
    sm_estimator=HuggingFace(
        entry_point='run_glue.py',
        source_dir=os.getcwd(),
        base_job_name='glue-moasha',
        # The instance type given here is overridden by Syne Tune with values sampled from `st_instance_type`.
        instance_type='ml.m5.large',
        instance_count=1,
        py_version='py38',
        pytorch_version='1.9',
        transformers_version='4.12',
        max_run=3600,
        role=get_execution_role(),
    ),
)

The following figures show latency vs. test error on the left and latency vs. cost on the right for random configurations sampled by MO-ASHA (we limit the axes for visibility) on the MRPC dataset after running it for 10,800 seconds on four workers. Color indicates the instance type. The dashed black line represents the Pareto set, meaning the set of points that are not dominated by any other point: no other configuration is at least as good in both objectives and strictly better in one.

We can observe a trade-off between latency and test error, meaning the configuration with the lowest test error doesn’t achieve the lowest latency. Based on your preference, you can select a hyperparameter configuration that sacrifices some test performance but comes with a lower latency. We also see the trade-off between latency and cost. By using a smaller ml.g4dn.xlarge instance, for example, we only marginally increase latency, but pay a fourth of the cost of an ml.g4dn.8xlarge instance.

Conclusion

In this post, we discussed hyperparameter optimization for fine-tuning pre-trained transformer models from Hugging Face based on Syne Tune. We saw that by optimizing hyperparameters such as learning rate, batch size, and the warm-up ratio, we can improve upon the carefully chosen default configuration. We can also extend this by automatically selecting the pre-trained model via hyperparameter optimization.

With the help of Syne Tune’s SageMaker backend, we can treat the instance type as a hyperparameter. Although the instance type doesn’t affect predictive performance, it has a significant impact on latency and cost. Therefore, by casting HPO as a multi-objective optimization problem, we’re able to find a set of configurations that optimally trade off one objective vs. the other. If you want to try this out yourself, check out our example notebook.


About the Authors

Aaron Klein is an Applied Scientist at AWS.

Matthias Seeger is a Principal Applied Scientist at AWS.

David Salinas is a Sr Applied Scientist at AWS.

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.

Cedric Archambeau is a Principal Applied Scientist at AWS and Fellow of the European Lab for Learning and Intelligent Systems.
