Use the AWS CDK to deploy Amazon SageMaker Studio lifecycle configurations

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). Studio provides a single web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. Lifecycle configurations are shell scripts triggered by Studio lifecycle events, such as starting a new Studio notebook. You can use lifecycle configurations to automate customization for your Studio environment. This customization includes installing custom packages, configuring notebook extensions, preloading datasets, and setting up source code repositories. For example, as an administrator for a Studio domain, you may want to save costs by having notebook apps shut down automatically after long periods of inactivity.

The AWS Cloud Development Kit (AWS CDK) is a framework for defining cloud infrastructure through code and provisioning it through AWS CloudFormation stacks. A stack is a collection of AWS resources that can be programmatically updated, moved, or deleted. AWS CDK constructs are the building blocks of AWS CDK applications, representing the blueprint to define cloud architectures.

In this post, we show how to use the AWS CDK to set up Studio, use Studio lifecycle configurations, and enable its access for data scientists and developers in your organization.

Solution overview

The modularity of lifecycle configurations allows you to apply them to all users in a domain or to specific users. This way, you can set up lifecycle configurations and reference them in the Studio kernel gateway or Jupyter server quickly and consistently. The kernel gateway is the entry point to interact with a notebook instance, whereas the Jupyter server represents the Studio instance. This enables you to apply DevOps best practices and meet safety, compliance, and configuration standards across all AWS accounts and Regions. For this post, we use Python as the main language, but the code can be easily changed to other AWS CDK supported languages. For more information, refer to Working with the AWS CDK.

Prerequisites

To get started, make sure you have the following prerequisites:

Clone the GitHub repository

First, clone the GitHub repository.

As you clone the repository, you can observe that we have a classic AWS CDK project with the directory studio-lifecycle-config-construct, which contains the construct and resources required to create lifecycle configurations.

AWS CDK constructs

The file we want to inspect is aws_sagemaker_lifecycle.py. This file contains the SageMakerStudioLifeCycleConfig construct we use to set up and create lifecycle configurations.

The SageMakerStudioLifeCycleConfig construct provides the framework for building lifecycle configurations using a custom AWS Lambda function and shell code read in from a file. The construct contains the following parameters:

  • ID – The name of the current project.
  • studio_lifecycle_content – The base64 encoded content.
  • studio_lifecycle_tags – Labels you assign to organize AWS resources. They are entered as key-value pairs and are optional for this configuration.
  • studio_lifecycle_config_app_type – JupyterServer is for the Jupyter server itself, and the KernelGateway app corresponds to a running SageMaker image container.

For more information on the Studio notebook architecture, refer to Dive deep into Amazon SageMaker Studio Notebooks architecture.
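The lifecycle script itself must be supplied as base64 encoded content. The following is a minimal sketch of one way to prepare that value in Python before passing it to the construct's studio_lifecycle_config_content parameter; the script path and file name are placeholders.

import base64

# Read a lifecycle shell script (for example, an auto-shutdown script) and
# base64-encode it so it can be passed as studio_lifecycle_config_content.
with open("scripts/auto-shutdown.sh", "rb") as f:
    encoded_content = base64.b64encode(f.read()).decode("utf-8")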

The following is a code snippet of the Studio lifecycle config construct (aws_sagemaker_lifecycle.py):

from constructs import Construct
from aws_cdk import (
    CustomResource,
    Duration,
    aws_iam as iam,
    aws_lambda as lambda_,
    custom_resources,
)


class SageMakerStudioLifeCycleConfig(Construct):
    def __init__(
        self,
        scope: Construct,
        id: str,
        studio_lifecycle_config_content: str,
        studio_lifecycle_config_app_type: str,
        studio_lifecycle_config_name: str,
        **kwargs,
    ):
        super().__init__(scope, id)
        self.studio_lifecycle_content = studio_lifecycle_config_content
        self.studio_lifecycle_config_name = studio_lifecycle_config_name
        self.studio_lifecycle_config_app_type = studio_lifecycle_config_app_type

        # IAM role assumed by the custom resource Lambda function
        lifecycle_config_role = iam.Role(
            self,
            "SmStudioLifeCycleConfigRole",
            assumed_by=iam.ServicePrincipal("lambda.amazonaws.com"),
        )

        lifecycle_config_role.add_to_policy(
            iam.PolicyStatement(
                resources=[f"arn:aws:sagemaker:{scope.region}:{scope.account}:*"],
                actions=[
                    "sagemaker:CreateStudioLifecycleConfig",
                    "sagemaker:ListUserProfiles",
                    "sagemaker:UpdateUserProfile",
                    "sagemaker:DeleteStudioLifecycleConfig",
                    "sagemaker:AddTags",
                ],
            )
        )

        # Lambda function that creates the lifecycle configuration
        create_lifecycle_script_lambda = lambda_.Function(
            self,
            "CreateLifeCycleConfigLambda",
            runtime=lambda_.Runtime.PYTHON_3_8,
            timeout=Duration.minutes(3),
            code=lambda_.Code.from_asset(
                "../mlsl-cdk-constructs-lib/src/studiolifecycleconfigconstruct"
            ),
            handler="onEvent.handler",
            role=lifecycle_config_role,
            environment={
                "studio_lifecycle_content": self.studio_lifecycle_content,
                "studio_lifecycle_config_name": self.studio_lifecycle_config_name,
                "studio_lifecycle_config_app_type": self.studio_lifecycle_config_app_type,
            },
        )

        config_custom_resource_provider = custom_resources.Provider(
            self,
            "ConfigCustomResourceProvider",
            on_event_handler=create_lifecycle_script_lambda,
        )

        studio_lifecycle_config_custom_resource = CustomResource(
            self,
            "LifeCycleCustomResource",
            service_token=config_custom_resource_provider.service_token,
        )
        self.studio_lifecycle_config_arn = studio_lifecycle_config_custom_resource.get_att(
            "StudioLifecycleConfigArn"
        )

After you import and install the construct, you can use it. The following code snippet shows how to create a lifecycle config using the construct in a stack either in app.py or another construct:

my_studio_lifecycle_config = SageMakerStudioLifeCycleConfig(
    self,
    "MLSLBlogPost",
    studio_lifecycle_config_content="base64content",
    studio_lifecycle_config_name="BlogPostTest",
    studio_lifecycle_config_app_type="JupyterServer",
)

Deploy AWS CDK constructs

To deploy your AWS CDK stack, run the following commands in the location where you cloned the repository.

The command may be python instead of python3 depending on your path configurations.

  1. Create a virtual environment:
    1. For macOS/Linux, use python3 -m venv .cdk-venv.
    2. For Windows, use python -m venv .cdk-venv.
  2. Activate the virtual environment:
    1. For macOS/Linux, use source .cdk-venv/bin/activate.
    2. For Windows, use .cdk-venv\Scripts\activate.bat.
    3. For PowerShell, use .cdk-venv\Scripts\activate.ps1.
  3. Install the required dependencies:
    1. pip install -r requirements.txt
    2. pip install -r requirements-dev.txt
  4. At this point, you can optionally synthesize the CloudFormation template for this code:
    cdk synth

  5. Deploy the solution with the following commands:
    1. aws configure
    2. cdk bootstrap
    3. cdk deploy

When the stack is successfully deployed, you should be able to view the stack on the CloudFormation console.

You will also be able to view the lifecycle configuration on the SageMaker console.

Choose the lifecycle configuration to view the shell code that runs as well as any tags you assigned.
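You can also inspect the configuration programmatically. The following is a minimal sketch using boto3, assuming the example configuration name BlogPostTest used earlier in this post:

import base64
import boto3

sm_client = boto3.client("sagemaker")

# Retrieve the lifecycle configuration created by the stack
response = sm_client.describe_studio_lifecycle_config(
    StudioLifecycleConfigName="BlogPostTest"
)

# The content is returned base64 encoded; decode it to view the shell script
print(base64.b64decode(response["StudioLifecycleConfigContent"]).decode("utf-8"))
print(response["StudioLifecycleConfigAppType"])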

Attach the Studio lifecycle configuration

There are multiple ways to attach a lifecycle configuration. In this section, we present two methods: using the AWS Management Console, and programmatically using the infrastructure provided.

Attach the lifecycle configuration using the console

To use the console, complete the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain name you’re using and the current user profile, then choose Edit.
  3. Select the lifecycle configuration you want to use and choose Attach.

From here, you can also set it as default.

Attach the lifecycle configuration programmatically

You can also retrieve the ARN of the Studio lifecycle configuration created by the construct and attach it to the Studio construct programmatically. The following code shows the lifecycle configuration ARN being passed to a Studio construct:

default_user_settings=sagemaker.CfnDomain.UserSettingsProperty(
    execution_role=self.sagemaker_role.role_arn,
    jupyter_server_app_settings=sagemaker.CfnDomain.JupyterServerAppSettingsProperty(
        default_resource_spec=sagemaker.CfnDomain.ResourceSpecProperty(
            instance_type="system",
            lifecycle_config_arn=my_studio_lifecycle_config.studio_lifecycle_config_arn,
        )
    ),
)

Clean up

Complete the steps in this section to clean up your resources.

Delete the Studio lifecycle configuration

To delete your lifecycle configuration, complete the following steps:

  1. On the SageMaker console, choose Studio lifecycle configurations in the navigation pane.
  2. Select the lifecycle configuration, then choose Delete.

Delete the AWS CDK stack

When you’re done with the resources you created, you can destroy your AWS CDK stack by running the following command in the location where you cloned the repository:

cdk destroy

When asked to confirm the deletion of the stack, enter yes.

You can also delete the stack on the AWS CloudFormation console with the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack that you want to delete.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack when prompted.

If you run into any errors, you may have to manually delete some resources depending on your account configuration.

Conclusion

In this post, we discussed how Studio serves as an IDE for ML workloads. Studio offers lifecycle configuration support, which allows you to set up custom shell scripts to perform automated tasks, or set up development environments at launch. We used AWS CDK constructs to build the infrastructure for the custom resource and lifecycle configuration. Constructs are synthesized into CloudFormation stacks that are then deployed to create the custom resource and lifecycle script that is used in Studio and the notebook kernel.

For more information, visit Amazon SageMaker Studio.


About the Authors

Cory Hairston is a Software Engineer with the Amazon ML Solutions Lab. He currently works on providing reusable software solutions.

Alex Chirayath is a Senior Machine Learning Engineer at the Amazon ML Solutions Lab. He leads teams of data scientists and engineers to build AI applications to address business needs.

Gouri Pandeshwar is an Engineer Manager at the Amazon ML Solutions Lab. He and his team of engineers are working to build reusable solutions and frameworks that help accelerate adoption of AWS AI/ML services for customers’ business use cases.


Boost agent productivity with Salesforce integration for Live Call Analytics

As a contact center agent, would you rather focus on having productive customer conversations or get distracted by having to look up customer information and knowledge articles that could exist in various systems? We’ve all been there. Having a productive conversation while multitasking is challenging. A single negative experience may put a dent on a customer’s perception of your brand.

The Live Call Analytics with Agent Assist (LCA) open-source solution addresses these challenges by providing features such as AI-powered agent assistance, call transcription, call summarization, and much more. As part of our effort to meet the needs of your agents, we strive to add features based on your feedback and our own experience helping contact center operators.

One of the features we added is the ability to write your own AWS Lambda hooks for the start of call and post-call to apply custom processing to calls as they occur. This makes it easier to integrate custom logic with the LCA architecture without complex modifications to the original source code. It also lets you update LCA stack deployments more easily and quickly than if you were modifying the code directly.

Today, we are excited to announce a feature that lets you integrate LCA with your Customer Relationship Management (CRM) system, built on top of the pre- and post-call Lambda hooks.

In this post, we walk you through setting up the LCA/CRM integration with Salesforce.

Solution overview

LCA now has two additional Lambda hooks:

  • Start of call Lambda hook – The LCA Call Event/Transcript Processor invokes this hook at the beginning of each call. This function can implement custom logic that applies to the beginning of call processing, such as retrieving call summary details logged into a case in a CRM.
  • Post-call summary Lambda hook – The LCA Call Event/Transcript Processor invokes this hook after the call summary is processed. This function can implement custom logic that’s relevant to postprocessing, for example, updating the call summary to a CRM system.

The following diagram illustrates the start of call and post-call (summary) Lambda hooks that integrate with Salesforce to look up and update case records, respectively.

Start of call and Post call (summary) Lambda Hooks that integrate with Salesforce to look-up and update Case records respectively
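To illustrate what one of these hooks looks like, the following is a hypothetical skeleton of a start of call Lambda hook handler. The event fields and return value shown here are illustrative assumptions, not the actual LCA contract; refer to the LCA documentation and the Salesforce integration code in the repository for the real payload.

import json

def lambda_handler(event, context):
    # Hypothetical start of call hook: the field names below are assumptions
    # and may differ from the actual LCA event schema.
    call_id = event.get("CallId")
    caller_phone = event.get("CustomerPhoneNumber")
    print(json.dumps({"CallId": call_id, "CustomerPhoneNumber": caller_phone}))

    # A real implementation would look up CRM context here (for example, past
    # case summaries) and return it so LCA can surface it to the agent.
    return {"statusCode": 200, "body": "start of call processing complete"}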

Here are the steps we walk you through:

  1. Set up Salesforce to allow the custom Lambda hooks to look up or update the case records.
  2. Deploy the LCA and Salesforce integration stacks.
  3. Update the LCA stack with the Salesforce integration Lambda hooks and perform validations.

Prerequisites

You need the following prerequisites:

Create a Salesforce connected app

To set up your Salesforce app, complete the following steps:

  1. Log in to your Salesforce org and go to Setup.
  2. Search for App Manager and choose App Manager.
    Search for App Manager
  3. Choose New Connected App.
  4. For Connected App Name, enter a name.
  5. For Contact Email, enter a valid email.
  6. Select Enable OAuth Settings and enter a value for Callback URL.
  7. Under Available OAuth Scopes, choose Manage user data via APIs (api).
  8. Select Require Secret for Web Server Flow and Require Secret for Refresh Token Flow.
  9. Choose Save.
    New Connected App
  10. Under API (Enable OAuth Settings), choose Manage Consumer Details.
  11. Verify your identity if prompted.
  12. Copy the consumer key and consumer secret.

You need these when deploying the AWS Serverless Application Model (AWS SAM) application.

Get your Salesforce access token

If you don’t already have an access token, you need to obtain one. Before doing this, make sure that you’re prepared to update any applications that are using an access token because this step creates a new one and may invalidate the prior tokens.

  1. Find your personal information by choosing Settings from View profile on the top right.
  2. Choose Reset My Security Token followed by Reset Security Token.
    Reset Security Token
  3. Make note of the new access token that you receive via email.

Create a Salesforce customer contact record for each caller

The Lambda function that performs case look-up and update matches the caller’s phone number with a contact record in Salesforce. To create a new contact, complete the following steps:

  1. Log in to your Salesforce org.
  2. Under App Launcher, search for and choose Service Console.
    Service Console
  3. On the Service Console page, choose Contacts from the drop-down list, then choose New.
    Add new contact
  4. Enter a valid phone number under the Phone field of the New Contact page.
  5. Enter other contact details and choose Save.
  6. Repeat Steps 1–5 for any caller that makes a phone call and test the integration.
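For illustration only, the following sketch shows one way a Lambda function could match a caller's phone number to a Salesforce contact record, using the third-party simple-salesforce library. The credentials are placeholders, and the deployed integration in the repository implements its own lookup and credential handling.

from simple_salesforce import Salesforce

# Placeholder credentials; the deployed solution supplies these through the
# SAM parameters (user name, password, and security token).
sf = Salesforce(
    username="your-user@example.com",
    password="your-password",
    security_token="your-security-token",
)

def find_contact_by_phone(phone_number: str):
    # SOQL lookup of the contact record whose Phone field matches the caller
    result = sf.query(
        f"SELECT Id, Name FROM Contact WHERE Phone = '{phone_number}'"
    )
    return result["records"][0] if result["totalSize"] > 0 else None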

Deploy the LCA stack

Complete the following steps to deploy the LCA stack:

  1. Follow the instructions under the Deploy the CloudFormation stack section of Live call analytics and agent assist for your contact center with Amazon language AI services.
  2. Make sure that you choose ANTHROPIC, SAGEMAKER, or LAMBDA for the End of Call Transcript Summary parameter. See Transcript Summarization for more details.

The stacks take about 45 minutes to deploy.

  3. After the main stack shows CREATE_COMPLETE, on the Outputs tab, make a note of the Kinesis data stream ARN (CallDataStreamArn).
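You can also read this output programmatically instead of copying it from the console. The following is a minimal sketch with boto3, assuming a hypothetical main stack name of LCA:

import boto3

cfn = boto3.client("cloudformation")

# Read the CallDataStreamArn output from the main LCA stack
stack = cfn.describe_stacks(StackName="LCA")["Stacks"][0]
call_data_stream_arn = next(
    output["OutputValue"]
    for output in stack["Outputs"]
    if output["OutputKey"] == "CallDataStreamArn"
)
print(call_data_stream_arn)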

Deploy the Salesforce integration stack

To deploy the Salesforce integration stack, complete the following steps:

  1. Open a command-line terminal and run the following commands:
git clone https://github.com/aws-samples/amazon-transcribe-live-call-analytics.git
cd amazon-transcribe-live-call-analytics/plugins/salesforce-integration
sam build
sam deploy --guided

Use the following parameter reference when making your choices.

  • AWS Region – The Region where you have deployed the LCA solution
  • SalesforceUsername – The user name of your Salesforce organization that has permissions to read and create cases
  • SalesforcePassword – The password associated with your Salesforce user name
  • SalesforceAccessToken – The access token you obtained earlier
  • SalesforceConsumerKey – The consumer key you copied earlier
  • SalesforceConsumerSecret – The consumer secret you obtained earlier
  • SalesforceHostUrl – The login URL of your Salesforce organization
  • SalesforceAPIVersion – The Salesforce API version (choose default or v56.0)
  • LCACallDataStreamArn – The Kinesis data stream ARN (CallDataStreamArn) obtained earlier
  2. After the stack successfully deploys, make a note of StartOfCallLambdaHookFunctionArn and PostCallSummaryLambdaHookFunctionArn from the outputs displayed on your terminal.

Update the LCA stack

Complete the following steps to update the LCA stack:

  1. On the AWS CloudFormation console, update the main LCA stack.
  2. Choose Use current template.
  3. For Lambda Hook Function ARN for Custom Start of Call Processing (existing), provide the StartOfCallLambdaHookFunctionArn that you obtained earlier.
  4. For Lambda Hook Function ARN for Custom Post Processing, after the Call Transcript Summary is processed (existing), provide the PostCallSummaryLambdaHookFunctionArn that you obtained earlier.
  5. Make sure that End of Call Transcript Summary is not DISABLED.

Validate the integration

Make a test call and make sure you can see the beginning of call AGENT ASSIST and post-call AGENT ASSIST transcripts. Refer to the Explore live call analysis and agent assist features section of the Live call analytics and agent assist for your contact center with Amazon language AI services post for guidance.

Clean up

To avoid incurring charges, clean up your resources by following these instructions when you are finished experimenting with this solution:

  1. On the AWS CloudFormation console, delete the LCA stacks that you deployed. This deletes resources that were created by deploying the solution. The recording S3 buckets, DynamoDB table, and CloudWatch log groups are retained after the stack is deleted to avoid deleting your data.
  2. On your terminal, run sam delete to delete the Salesforce integration Lambda functions.
  3. Follow the instructions in Deactivate a Developer Edition Org to deactivate your Salesforce Developer org.

Conclusion

In this post, we demonstrated how the Live Call Analytics sample project can accelerate your adoption of real-time contact center analytics and integration. Rather than building from scratch, we showed how to use the existing code base and its pre-built integration points, the start of call and post-call Lambda hooks. This enhances agent productivity by integrating with Salesforce to look up and update case records. Explore our open-source project and enhance the CRM pre- and post-call Lambda hooks to accommodate your use case.


About the Authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Christopher Lott is a Senior Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California and enjoys gardening, aerospace, and traveling the world.

Babu Srinivasan is a Sr. Specialist SA – Language AI services in the World Wide Specialist organization at AWS, with over 24 years of experience in IT and the last 6 years focused on the AWS Cloud. He is passionate about AI/ML. Outside of work, he enjoys woodworking and entertains friends and family (sometimes strangers) with sleight of hand card magic.


Reduce energy consumption of your machine learning workloads by up to 90% with AWS purpose-built accelerators

Machine learning (ML) engineers have traditionally focused on striking a balance between model training and deployment cost vs. performance. Increasingly, sustainability (energy efficiency) is becoming an additional objective for customers. This is important because training ML models and then using the trained models to make predictions (inference) can be highly energy-intensive tasks. In addition, more and more applications around us have become infused with ML, and new ML-powered applications are conceived every day. A popular example is OpenAI’s ChatGPT, which is powered by a state-of-the-art large language model (LLM). For reference, GPT-3, an earlier-generation LLM, has 175 billion parameters and requires months of non-stop training on a cluster of thousands of accelerated processors. The Carbontracker study estimates that training GPT-3 from scratch may emit up to 85 metric tons of CO2 equivalent, using clusters of specialized hardware accelerators.

There are several ways AWS is enabling ML practitioners to lower the environmental impact of their workloads. One way is through providing prescriptive guidance around architecting your AI/ML workloads for sustainability. Another way is by offering managed ML training and orchestration services such as Amazon SageMaker Studio, which automatically tears down and scales up ML resources when not in use, and provides a host of out-of-the-box tooling that saves cost and resources. Another major enabler is the development of energy efficient, high-performance, purpose-built accelerators for training and deploying ML models.

The focus of this post is on hardware as a lever for sustainable ML. We present the results of recent performance and power draw experiments conducted by AWS that quantify the energy efficiency benefits you can expect when migrating your deep learning workloads from other inference- and training-optimized accelerated Amazon Elastic Compute Cloud (Amazon EC2) instances to AWS Inferentia and AWS Trainium. Inferentia and Trainium are AWS’s recent addition to its portfolio of purpose-built accelerators specifically designed by Amazon’s Annapurna Labs for ML inference and training workloads.

AWS Inferentia and AWS Trainium for sustainable ML

To provide you with realistic numbers of the energy savings potential of AWS Inferentia and AWS Trainium in a real-world application, we have conducted several power draw benchmark experiments. We have designed these benchmarks with the following key criteria in mind:

  • First, we wanted to make sure that we captured direct energy consumption attributable to the test workload, including not just the ML accelerator but also the compute, memory, and network. Therefore, in our test setup, we measured power draw at that level.
  • Second, when running the training and inference workloads, we ensured that all instances were operating at their respective physical hardware limits and took measurements only after that limit was reached to ensure comparability.
  • Finally, we wanted to be certain that the energy savings reported in this post could be achieved in a practical real-world application. Therefore, we used common customer-inspired ML use cases for benchmarking and testing.

The results are reported in the following sections.

Inference experiment: Real-time document understanding with LayoutLM

Inference, as opposed to training, is a continuous, unbounded workload that doesn’t have a defined completion point. It therefore makes up a large portion of the lifetime resource consumption of an ML workload. Getting inference right is key to achieving high performance, low cost, and sustainability (better energy efficiency) along the full ML lifecycle. With inference tasks, customers are usually interested in achieving a certain inference rate to keep up with the ingest demand.

The experiment presented in this post is inspired by a real-time document understanding use case, which is a common application in industries like banking or insurance (for example, for claims or application form processing). Specifically, we select LayoutLM, a pre-trained transformer model used for document image processing and information extraction. We set a target SLA of 1,000,000 inferences per hour, a value often considered as real time, and then specify two hardware configurations capable of meeting this requirement: one using Amazon EC2 Inf1 instances, featuring AWS Inferentia, and one using comparable accelerated EC2 instances optimized for inference tasks. Throughout the experiment, we track several indicators to measure inference performance, cost, and energy efficiency of both hardware configurations. The results are presented in the following figure.

Performance, Cost and Energy Efficiency Results of Inference Benchmarks

AWS Inferentia delivers 6.3 times higher inference throughput. As a result, with Inferentia, you can run the same real-time LayoutLM-based document understanding workload on fewer instances (6 AWS Inferentia instances vs. 33 other inference-optimized accelerated EC2 instances, equivalent to an 82% reduction), use less than a tenth (-92%) of the energy in the process, all while achieving significantly lower cost per inference (USD 2 vs. USD 25 per million inferences, equivalent to a 91% cost reduction).

Training experiment: Training BERT Large from scratch

Training, as opposed to inference, is a finite process that is repeated much less frequently. ML engineers are typically interested in high cluster performance to reduce training time while keeping cost under control. Energy efficiency is a secondary (yet growing) concern. With AWS Trainium, there is no trade-off decision: ML engineers can benefit from high training performance while also optimizing for cost and reducing environmental impact.

To illustrate this, we select BERT Large, a popular language model used for natural language understanding use cases such as chatbot-based question answering and conversational response prediction. Training a well-performing BERT Large model from scratch typically requires 450 million sequences to be processed. We compare two cluster configurations, each with a fixed size of 16 instances and capable of training BERT Large from scratch (450 million sequences processed) in less than a day. The first uses traditional accelerated EC2 instances. The second setup uses Amazon EC2 Trn1 instances featuring AWS Trainium. Again, we benchmark both configurations in terms of training performance, cost, and environmental impact (energy efficiency). The results are shown in the following figure.

Performance, Cost and Energy Efficiency Results of Training Benchmarks

In the experiments, AWS Trainium-based instances outperformed the comparable training-optimized accelerated EC2 instances by a factor of 1.7 in terms of sequences processed per hour, cutting the total training time by 43% (2.3h versus 4h on comparable accelerated EC2 instances). As a result, when using a Trainium-based instance cluster, the total energy consumption for training BERT Large from scratch is approximately 29% lower compared to a same-sized cluster of comparable accelerated EC2 instances. Again, these performance and energy efficiency benefits also come with significant cost improvements: cost to train for the BERT ML workload is approximately 62% lower on Trainium instances (USD 787 versus USD 2091 per full training run).

Getting started with AWS purpose-built accelerators for ML

Although the experiments conducted here all use standard models from the natural language processing (NLP) domain, AWS Inferentia and AWS Trainium excel with many other complex model architectures including LLMs and the most challenging generative AI architectures that users are building (such as GPT-3). These accelerators do particularly well with models with over 10 billion parameters, or computer vision models like stable diffusion (see Model Architecture Fit Guidelines for more details). Indeed, many of our customers are already using Inferentia and Trainium for a wide variety of ML use cases.

To run your end-to-end deep learning workloads on AWS Inferentia- and AWS Trainium-based instances, you can use AWS Neuron. Neuron is an end-to-end software development kit (SDK) that includes a deep learning compiler, runtime, and tools that are natively integrated into the most popular ML frameworks like TensorFlow and PyTorch. You can use the Neuron SDK to easily port your existing TensorFlow or PyTorch deep learning ML workloads to Inferentia and Trainium and start building new models using the same well-known ML frameworks. For easier setup, use one of our Amazon Machine Images (AMIs) for deep learning, which come with many of the required packages and dependencies. Even simpler: you can use Amazon SageMaker Studio, which natively supports TensorFlow and PyTorch on Inferentia and Trainium (see the aws-samples GitHub repo for an example).
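As a simple illustration of that porting workflow, the following sketch compiles a small PyTorch model with the Neuron SDK for Trainium- or Inferentia2-based instances; the model and input are placeholders, and the exact tracing API depends on your Neuron SDK version and target accelerator.

import torch
import torch_neuronx  # AWS Neuron SDK integration for PyTorch

# Placeholder model and example input; replace with your own workload
model = torch.nn.Linear(128, 10).eval()
example_input = torch.rand(1, 128)

# Compile the model for Neuron devices and save the compiled artifact
neuron_model = torch_neuronx.trace(model, example_input)
torch.jit.save(neuron_model, "model_neuron.pt")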

One final note: while Inferentia and Trainium are purpose built for deep learning workloads, many less complex ML algorithms can perform well on CPU-based instances (for example, XGBoost, LightGBM, and even some CNNs). In these cases, a migration to AWS Graviton3 may significantly reduce the environmental impact of your ML workloads. AWS Graviton-based instances use up to 60% less energy for the same performance than comparable EC2 instances.

Conclusion

There is a common misconception that running ML workloads in a sustainable and energy-efficient fashion means sacrificing on performance or cost. With AWS purpose-built accelerators for machine learning, ML engineers don’t have to make that trade-off. Instead, they can run their deep learning workloads on highly specialized purpose-built deep learning hardware, such as AWS Inferentia and AWS Trainium, that significantly outperforms comparable accelerated EC2 instance types, delivering lower cost, higher performance, and better energy efficiency—up to 90%—all at the same time. To start running your ML workloads on Inferentia and Trainium, check out the AWS Neuron documentation or spin up one of the sample notebooks. You can also watch the AWS re:Invent 2022 talk on Sustainability and AWS silicon (SUS206), which covers many of the topics discussed in this post.


About the Authors

Karsten Schroer is a Solutions Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.

Kamran Khan is a Sr. Technical Product Manager at AWS Annapurna Labs. He works closely with AI/ML customers to shape the roadmap for AWS purpose-built silicon innovations coming out of Amazon’s Annapurna Labs. His specific focus is on accelerated deep-learning chips including AWS Trainium and AWS Inferentia. Kamran has 18 years of experience in the semiconductor industry. Kamran has over a decade of experience helping developers achieve their ML goals.


Onboard users to Amazon SageMaker Studio with Active Directory group-specific IAM roles

Amazon SageMaker Studio is a web-based integrated development environment (IDE) for machine learning (ML) that lets you build, train, debug, deploy, and monitor your ML models. For provisioning Studio in your AWS account and Region, you first need to create an Amazon SageMaker domain—a construct that encapsulates your ML environment. More concretely, a SageMaker domain consists of an associated Amazon Elastic File System (Amazon EFS) volume, a list of authorized users, and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations.

When creating your SageMaker domain, you can choose to use either AWS IAM Identity Center (successor to AWS Single Sign-On) or AWS Identity and Access Management (IAM) for user authentication methods. Both authentication methods have their own set of use cases; in this post, we focus on SageMaker domains with IAM Identity Center, or single sign-on (SSO) mode, as the authentication method.

With SSO mode, you set up an SSO user and group in IAM Identity Center and then grant access to either the SSO group or user from the Studio console. Currently, all SSO users in a domain inherit the domain’s execution role. This may not work for all organizations. For instance, administrators may want to set up IAM permissions for a Studio SSO user based on their Active Directory (AD) group membership. Furthermore, because administrators are required to manually grant SSO users access to Studio, the process may not scale when onboarding hundreds of users.

In this post, we provide prescriptive guidance for the solution to provision SSO users to Studio with least privilege permissions based on AD group membership. This guidance enables you to quickly scale for onboarding hundreds of users to Studio and achieve your security and compliance posture.

Solution overview

The following diagram illustrates the solution architecture.

The workflow to provision AD users in Studio includes the following steps:

  1. Set up a Studio domain in SSO mode.
  2. For each AD group:
    1. Set up your Studio execution role with appropriate fine-grained IAM policies
    2. Record an entry in the AD group-role mapping Amazon DynamoDB table.

    Alternatively, you can adopt a naming standard for IAM role ARNs based on the AD group name and derive the IAM role ARN without needing to store the mapping in an external database.

  3. Sync your AD users and groups and memberships to AWS Identity Center:
    1. If you’re using an identity provider (IdP) that supports SCIM, use the SCIM API integration with IAM Identity Center.
    2. If you are using self-managed AD, you may use AD Connector.
  4. When the AD group is created in your corporate AD, complete the following steps:
    1. Create a corresponding SSO group in IAM Identity Center.
    2. Associate the SSO group to the Studio domain using the SageMaker console.
  5. When an AD user is created in your corporate AD, a corresponding SSO user is created in IAM Identity Center.
  6. When the AD user is assigned to an AD group, an IAM Identity Center API (CreateGroupMembership) is invoked, and SSO group membership is created.
  7. The preceding event is logged in AWS CloudTrail with the name AddMemberToGroup.
  8. An Amazon EventBridge rule listens to CloudTrail events and matches the AddMemberToGroup rule pattern.
  9. The EventBridge rule triggers the target AWS Lambda function.
  10. This Lambda function will call back IAM Identity Center APIs, get the SSO user and group information, and perform the following steps to create the Studio user profile (CreateUserProfile) for the SSO user:
    1. Look up the DynamoDB table to fetch the IAM role corresponding to the AD group.
    2. Create a user profile with the SSO user and the IAM role obtained from the lookup table.
    3. The SSO user is granted access to Studio.
  11. The SSO user is redirected to the Studio IDE via the Studio domain URL.

Note that, as of writing, Step 4b (associate the SSO group to the Studio domain) needs to be performed manually by an admin using the SageMaker console at the SageMaker domain level.

Set up a Lambda function to create the user profiles

The solution uses a Lambda function to create the Studio user profiles. We provide the following sample Lambda function that you can copy and modify to meet your needs for automating the creation of the Studio user profile. This function performs the following actions:

  1. Receive the CloudTrail AddMemberToGroup event from EventBridge.
  2. Retrieve the Studio DOMAIN_ID from the environment variable (you can alternatively hard-code the domain ID or use a DynamoDB table as well if you have multiple domains).
  3. Read from a dummy mapping table to match AD groups to execution roles. You can change this to fetch from the DynamoDB table if you’re using a table-driven approach. If you use DynamoDB, your Lambda function’s execution role needs permissions to read from the table as well.
  4. Retrieve the SSO user and AD group membership information from IAM Identity Center, based on the CloudTrail event data.
  5. Create a Studio user profile for the SSO user, with the SSO details and the matching execution role.
import os
import json
import boto3
DOMAIN_ID = os.environ.get('DOMAIN_ID', 'd-xxxx')


def lambda_handler(event, context):
    
    print({"Event": event})

    client = boto3.client('identitystore')
    sm_client = boto3.client('sagemaker')
    
    event_detail = event['detail']
    group_response = client.describe_group(
        IdentityStoreId=event_detail['requestParameters']['identityStoreId'],
        GroupId=event_detail['requestParameters']['groupId'],
    )
    group_name = group_response['DisplayName']
    
    user_response = client.describe_user(
        IdentityStoreId=event_detail['requestParameters']['identityStoreId'],
        UserId=event_detail['requestParameters']['member']['memberId']
    )
    user_name = user_response['UserName']
    print(f"Event details: {user_name} has been added to {group_name}")
    
    mapping_dict = {
        "ad-group-1": "<execution-role-arn>",
        "ad-group-2": "<execution-role-arn>"
    }
    
    user_role = mapping_dict.get(group_name)
    
    if user_role:
        response = sm_client.create_user_profile(
            DomainId=DOMAIN_ID,
            SingleSignOnUserIdentifier="UserName",
            SingleSignOnUserValue=user_name,
            # If the SSO user_name value is an email address, add logic to handle it,
            # because Studio user profile names don't accept the @ character.
            UserProfileName=user_name, 
            UserSettings={
                "ExecutionRole": user_role
            }
        )
        print(response)
    else:
        response = "Group is not authorized to use SageMaker. Doing nothing."
        print(response)
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }

Note that by default, the Lambda execution role doesn’t have access to create user profiles or list SSO users. After you create the Lambda function, access the function’s execution role on IAM and attach the following policy as an inline policy after scoping down as needed based on your organization requirements.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "identitystore:DescribeGroup",
                "identitystore:DescribeUser"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "sagemaker:CreateUserProfile",
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "iam:PassRole",
            "Effect": "Allow",
            "Resource": [
                "<list-of-studio-execution-roles>"
            ]
        }
    ]
}

Set up the EventBridge rule for the CloudTrail event

EventBridge is a serverless event bus service that you can use to connect your applications with data from a variety of sources. In this solution, we create a rule-based trigger: EventBridge listens to events and matches against the provided pattern and triggers a Lambda function if the pattern match is successful. As explained in the solution overview, we listen to the AddMemberToGroup event. To set it up, complete the following steps:

  1. On the EventBridge console, choose Rules in the navigation pane.
  2. Choose Create rule.
  3. Provide a rule name, for example, AddUserToADGroup.
  4. Optionally, enter a description.
  5. Select default for the event bus.
  6. Under Rule type, choose Rule with an event pattern, then choose Next.
  7. On the Build event pattern page, choose Event source as AWS events or EventBridge partner events.
  8. Under Event pattern, choose the Custom patterns (JSON editor) tab and enter the following pattern:
    {
      "source": ["aws.sso-directory"],
      "detail-type": ["AWS API Call via CloudTrail"],
      "detail": {
        "eventSource": ["sso-directory.amazonaws.com"],
        "eventName": ["AddMemberToGroup"]
      }
    }

  9. Choose Next.
  10. On the Select target(s) page, choose the AWS service for the target type, the Lambda function as the target, and the function you created earlier, then choose Next.
  11. Choose Next on the Configure tags page, then choose Create rule on the Review and create page.

After you’ve set the Lambda function and the EventBridge rule, you can test out this solution. To do so, open your IdP and add a user to one of the AD groups with the Studio execution role mapped. Once you add the user, you can verify the Lambda function logs to inspect the event and also see the Studio user provisioned automatically. Additionally, you can use the DescribeUserProfile API call to verify that the user is created with appropriate permissions.
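The following is a minimal sketch of that verification with boto3; the domain ID and user profile name are placeholders.

import boto3

sm_client = boto3.client("sagemaker")

# Confirm the user profile exists and uses the expected execution role
profile = sm_client.describe_user_profile(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="ssouserid",
)
print(profile["Status"])
print(profile["UserSettings"]["ExecutionRole"])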

Supporting multiple Studio accounts

To support multiple Studio accounts with the preceding architecture, we recommend the following changes:

  1. Set up an AD group mapped to each Studio account level.
  2. Set up a group-level IAM role in each Studio account.
  3. Set up or derive the group to IAM role mapping.
  4. Set up a Lambda function to perform cross-account role assumption, based on the IAM role mapping ARN and created user profile.

Deprovisioning users

When a user is removed from their AD group, you should remove their access from the Studio domain as well. With SSO, when a user is removed, the user is disabled in IAM Identity Center automatically if the AD to IAM Identity Center sync is in place, and their Studio application access is immediately revoked.

However, the user profile on Studio still persists. You can add a similar workflow with CloudTrail and a Lambda function to remove the user profile from Studio. The EventBridge trigger should now listen for the DeleteGroupMembership event. In the Lambda function, complete the following steps:

  1. Obtain the user profile name from the user and group ID.
  2. List all running apps for the user profile using the ListApps API call, filtering by the UserProfileNameEquals parameter. Make sure to check for the paginated response, to list all apps for the user.
  3. Delete all running apps for the user and wait until all apps are deleted. You can use the DescribeApp API to view the app’s status.
  4. When all apps are in a Deleted state (or Failed), delete the user profile.
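The following is a minimal sketch of these cleanup steps with boto3; it omits pagination and the wait loop for brevity, and the domain ID is a placeholder.

import boto3

sm_client = boto3.client("sagemaker")
DOMAIN_ID = "d-xxxxxxxxxxxx"

def deprovision_user(user_profile_name: str):
    # List the user's apps (add pagination if the user has many apps)
    apps = sm_client.list_apps(
        DomainIdEquals=DOMAIN_ID,
        UserProfileNameEquals=user_profile_name,
    )["Apps"]

    # Delete every app that isn't already deleted or being deleted
    for app in apps:
        if app["Status"] not in ("Deleted", "Deleting", "Failed"):
            sm_client.delete_app(
                DomainId=DOMAIN_ID,
                UserProfileName=user_profile_name,
                AppType=app["AppType"],
                AppName=app["AppName"],
            )

    # After all apps are in a Deleted (or Failed) state, delete the profile
    sm_client.delete_user_profile(
        DomainId=DOMAIN_ID, UserProfileName=user_profile_name
    )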

With this solution in place, ML platform administrators can maintain group memberships in one central location and automate the Studio user profile management through EventBridge and Lambda functions.

The following code shows a sample CloudTrail event:

"AddMemberToGroup": 
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "Unknown",
        "accountId": "<account-id>",
        "accessKeyId": "30997fec-b566-4b8b-810b-60934abddaa2"
    },
    "eventTime": "2022-09-26T22:24:18Z",
    "eventSource": "sso-directory.amazonaws.com",
    "eventName": "AddMemberToGroup",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "54.189.184.116",
    "userAgent": "Okta SCIM Client 1.0.0",
    "requestParameters": {
        "identityStoreId": "d-906716eb24",
        "groupId": "14f83478-a061-708f-8de4-a3a2b99e9d89",
        "member": {
            "memberId": "04c8e458-a021-702e-f9d1-7f430ff2c752"
        }
    },
    "responseElements": null,
    "requestID": "b24a123b-afb3-4fb6-8650-b0dc1f35ea3a",
    "eventID": "c2c0873b-5c49-404c-add7-f10d4a6bd40c",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "<account-id>",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader": "up.sso.us-east-1.amazonaws.com"
    }
}

The following code shows a sample Studio user profile API request:

aws sagemaker create-user-profile \
--domain-id d-xxxxxx \
--user-profile-name ssouserid \
--single-sign-on-user-identifier 'UserName' \
--single-sign-on-user-value 'ssouserid' \
--user-settings ExecutionRole=arn:aws:iam::<account id>:role/name

Conclusion

In this post, we discussed how administrators can scale Studio onboarding for hundreds of users based on their AD group membership. We demonstrated an end-to-end solution architecture that organizations can adopt to automate and scale their onboarding process to meet their agility, security, and compliance needs. If you’re looking for a scalable solution to automate your user onboarding, try this solution, and leave your feedback below! For more information about onboarding to Studio, see Onboard to Amazon SageMaker Domain.


About the authors

Ram Vittal is an ML Specialist Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 2-year-old sheep-a-doodle!

Durga Sury is an ML Solutions Architect in the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and hiking with her 5-year-old husky.


SambaSafety automates custom R workload, improving driver safety with Amazon SageMaker and AWS Step Functions

At SambaSafety, their mission is to promote safer communities by reducing risk through data insights. Since 1998, SambaSafety has been the leading North American provider of cloud-based mobility risk management software for organizations with commercial and non-commercial drivers. SambaSafety serves more than 15,000 global employers and insurance carriers with driver risk and compliance monitoring, online training and deep risk analytics, as well as risk pricing solutions. Through the collection, correlation and analysis of driver record, telematics, corporate and other sensor data, SambaSafety not only helps employers better enforce safety policies and reduce claims, but also helps insurers make informed underwriting decisions and background screeners perform accurate, efficient pre-hire checks.

Not all drivers present the same risk profile. The more time spent behind the wheel, the higher your risk profile. SambaSafety’s team of data scientists has developed complex and proprietary modeling solutions designed to accurately quantify this risk profile. However, they sought support to deploy this solution for batch and real-time inference in a consistent and reliable manner.

In this post, we discuss how SambaSafety used AWS machine learning (ML) and continuous integration and continuous delivery (CI/CD) tools to deploy their existing data science application for batch inference. SambaSafety worked with AWS Advanced Consulting Partner Firemind to deliver a solution that used AWS CodeStar, AWS Step Functions, and Amazon SageMaker for this workload. With AWS CI/CD and AI/ML products, SambaSafety’s data science team didn’t have to change their existing development workflow to take advantage of continuous model training and inference.

Customer use case

SambaSafety’s data science team had long been using the power of data to inform their business. They had several skilled engineers and scientists building insightful models that improved the quality of risk analysis on their platform. The challenges faced by this team were not related to data science. SambaSafety’s data science team needed help connecting their existing data science workflow to a continuous delivery solution.

SambaSafety’s data science team maintained several script-like artifacts as part of their development workflow. These scripts performed several tasks, including data preprocessing, feature engineering, model creation, model tuning, and model comparison and validation. These scripts were all run manually when new data arrived into their environment for training. Additionally, these scripts didn’t perform any model versioning or hosting for inference. SambaSafety’s data science team had developed manual workarounds to promote new models to production, but this process became time-consuming and labor-intensive.

To free up SambaSafety’s highly skilled data science team to innovate on new ML workloads, SambaSafety needed to automate the manual tasks associated with maintaining existing models. Furthermore, the solution needed to replicate the manual workflow used by SambaSafety’s data science team, and make decisions about proceeding based on the outcomes of these scripts. Finally, the solution had to integrate with their existing code base. The SambaSafety data science team used a code repository solution external to AWS; the final pipeline had to be intelligent enough to trigger based on updates to their code base, which was written primarily in R.

Solution overview

The following diagram illustrates the solution architecture, which was informed by one of the many open-source architectures maintained by SambaSafety’s delivery partner Firemind.

Architecture Diagram

The solution delivered by Firemind for SambaSafety’s data science team was built around two ML pipelines. The first ML pipeline trains a model using SambaSafety’s custom data preprocessing, training, and testing scripts. The resulting model artifact is deployed for batch and real-time inference to model endpoints managed by SageMaker. The second ML pipeline facilitates the inference request to the hosted model. In this way, the pipeline for training is decoupled from the pipeline for inference.

One of the complexities in this project is replicating the manual steps taken by the SambaSafety data scientists. The team at Firemind used Step Functions and SageMaker Processing to complete this task. Step Functions allows you to run discrete tasks in AWS using AWS Lambda functions, Amazon Elastic Kubernetes Service (Amazon EKS) workers, or in this case SageMaker. SageMaker Processing allows you to define jobs that run on managed ML instances within the SageMaker ecosystem. Each run of a Step Function job maintains its own logs, run history, and details on the success or failure of the job.
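For illustration, a Processing step like the ones orchestrated here could be defined with the SageMaker Python SDK as follows; the container image, role, and R script name are placeholders rather than SambaSafety's actual code.

from sagemaker.processing import ScriptProcessor

# Placeholder values; a custom container image with R installed is assumed
r_processor = ScriptProcessor(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/custom-r-image:latest",
    command=["Rscript"],
    role="<sagemaker-execution-role-arn>",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Run one of the existing R scripts (for example, preprocessing) as a managed
# Processing job; a Step Functions state can invoke a step like this one.
r_processor.run(code="preprocess.R")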

The team used Step Functions and SageMaker, together with Lambda, to handle the automation of training, tuning, deployment, and inference workloads. The only remaining piece was the continuous integration of code changes to this deployment pipeline. Firemind implemented a CodeStar project that maintained a connection to SambaSafety’s existing code repository. When the industrious data science team at SambaSafety posts an update to a specific branch of their code base, CodeStar picks up the changes and triggers the automation.

Conclusion

SambaSafety’s new serverless MLOps pipeline had a significant impact on their capability to deliver. The integration of data science and software development enables their teams to work together seamlessly. Their automated model deployment solution reduced time to delivery by up to 70%.

SambaSafety also had the following to say:

“By automating our data science models and integrating them into their software development lifecycle, we have been able to achieve a new level of efficiency and accuracy in our services. This has enabled us to stay ahead of the competition and deliver innovative solutions to clients. Our clients will greatly benefit from this with the faster turnaround times and improved accuracy of our solutions.”

SambaSafety connected with AWS account teams about their problem. AWS account and solutions architecture teams worked to identify this solution by sourcing from our robust partner network. Connect with your AWS account team to identify similar transformative opportunities for your business.


About the Authors

Dan Ferguson is an AI/ML Specialist Solutions Architect (SA) on the Private Equity Solutions Architecture team at Amazon Web Services. Dan helps Private Equity backed portfolio companies leverage AI/ML technologies to achieve their business objectives.

Khalil Adib is a Data Scientist at Firemind, driving the innovation Firemind can provide to their customers around the magical worlds of AI and ML. Khalil tinkers with the latest and greatest tech and models, ensuring that Firemind is always at the bleeding edge.

Jason Mathew is a Cloud Engineer at Firemind, leading the delivery of projects for customers end-to-end, from writing pipelines with IaC to building out data engineering with Python and pushing the boundaries of ML. Jason is also the key contributor to Firemind’s open source projects.


Build a multilingual automatic translation pipeline with Amazon Translate Active Custom Translation

Dive into Deep Learning (D2L.ai) is an open-source textbook that makes deep learning accessible to everyone. It features interactive Jupyter notebooks with self-contained code in PyTorch, JAX, TensorFlow, and MXNet, as well as real-world examples, exposition figures, and math. So far, D2L has been adopted by more than 400 universities around the world, such as the University of Cambridge, Stanford University, the Massachusetts Institute of Technology, Carnegie Mellon University, and Tsinghua University. This work is also made available in Chinese, Japanese, Korean, Portuguese, Turkish, and Vietnamese, with plans to launch Spanish and other languages.

It is a challenging endeavor to have an online book that is continuously kept up to date, written by multiple authors, and available in multiple languages. In this post, we present a solution that D2L.ai used to address this challenge by using the Active Custom Translation (ACT) feature of Amazon Translate and building a multilingual automatic translation pipeline.

We demonstrate how to use the AWS Management Console and Amazon Translate public API to deliver automatic machine batch translation, and analyze the translations between two language pairs: English and Chinese, and English and Spanish. We also recommend best practices when using Amazon Translate in this automatic translation pipeline to ensure translation quality and efficiency.

Solution overview

We built automatic translation pipelines for multiple languages using the ACT feature in Amazon Translate. ACT allows you to customize translation output on the fly by providing tailored translation examples in the form of parallel data. Parallel data consists of a collection of textual examples in a source language and the desired translations in one or more target languages. During translation, ACT automatically selects the most relevant segments from the parallel data and updates the translation model on the fly based on those segment pairs. This results in translations that better match the style and content of the parallel data.

The architecture contains multiple sub-pipelines; each sub-pipeline handles one language translation such as English to Chinese, English to Spanish, and so on. Multiple translation sub-pipelines can be processed in parallel. In each sub-pipeline, we first build the parallel data in Amazon Translate using the high-quality dataset of tailored translation examples from the human-translated D2L books. Then we generate the customized machine translation output on the fly at run time, which achieves better quality and accuracy.

solution architecture

In the following sections, we demonstrate how to build each translation pipeline using Amazon Translate with ACT, along with Amazon SageMaker and Amazon Simple Storage Service (Amazon S3).

First, we put the source documents, reference documents, and parallel data training set in an S3 bucket. Then we build Jupyter notebooks in SageMaker to run the translation process using Amazon Translate public APIs.

Prerequisites

To follow the steps in this post, make sure you have an AWS account with the following:

  • Access to AWS Identity and Access Management (IAM) for role and policy configuration
  • Access to Amazon Translate, SageMaker, and Amazon S3
  • An S3 bucket to store the source documents, reference documents, parallel data dataset, and output of translation

Create an IAM role and policies for Amazon Translate with ACT

Our IAM role needs to contain a custom trust policy for Amazon Translate:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Statement1",
        "Effect": "Allow",
        "Principal": {
            "Service": "translate.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}

This role must also have a permissions policy that grants Amazon Translate read access to the input folder and subfolders in Amazon S3 that contain the source documents, and read/write access to the output S3 bucket and folder that contains the translated documents:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
        ],
        "Resource": [
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME",
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME/*"
        ]
    }]
}

To run Jupyter notebooks in SageMaker for the translation jobs, we need to grant an inline permission policy to the SageMaker execution role. This role passes the Amazon Translate service role to SageMaker that allows the SageMaker notebooks to have access to the source and translated documents in the designated S3 buckets:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": ["iam:PassRole"],
        "Effect": "Allow",
        "Resource": [
            "arn:aws:iam::YOUR-AWS-ACCOUNT-ID:role/batch-translate-api-role"
        ]
    }]
}
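
If you prefer to script this setup, the following is a minimal sketch (assuming Boto3 and sufficient IAM permissions) that creates the role with the trust policy above and attaches the S3 permissions as an inline policy. The role and policy names are illustrative; batch-translate-api-role matches the ARN referenced in the PassRole policy.

import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Statement1",
        "Effect": "Allow",
        "Principal": {"Service": "translate.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Resource": [
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME",
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME/*"
        ]
    }]
}

# Create the service role that Amazon Translate assumes for batch jobs
role = iam.create_role(
    RoleName="batch-translate-api-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the S3 permissions as an inline policy on the role
iam.put_role_policy(
    RoleName="batch-translate-api-role",
    PolicyName="batch-translate-s3-access",   # illustrative policy name
    PolicyDocument=json.dumps(s3_policy)
)

print(role["Role"]["Arn"])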

Prepare parallel data training samples

The parallel data in ACT is built from an input file consisting of a list of textual example pairs, for instance, pairs of source language (English) and target language (Chinese) text. The input file can be in TMX, CSV, or TSV format. The following screenshot shows an example of a CSV input file. The first column is the source language data (in English), and the second column is the target language data (in Chinese). The following example is extracted from the D2L-en book and the D2L-zh book.

screenshot-1

Perform custom parallel data training in Amazon Translate

First, we set up the S3 bucket and folders as shown in the following screenshot. The source_data folder contains the source documents before the translation; the generated documents after the batch translation are put in the output folder. The ParallelData folder holds the parallel data input file prepared in the previous step.

screenshot-2

After uploading the input files to the source_data folder, we can use the CreateParallelData API to run a parallel data creation job in Amazon Translate:

import boto3

translate_client = boto3.client("translate")

S3_BUCKET = "YOUR-S3_BUCKET-NAME"
pd_name = "pd-d2l-short_test_sentence_enzh_all"
pd_description = "Parallel Data for English to Chinese"
pd_fn = "d2l_short_test_sentence_enzh_all.csv"
response_t = translate_client.create_parallel_data(
                Name=pd_name,                          # pd_name is the parallel data name
                Description=pd_description,            # pd_description is the parallel data description
                ParallelDataConfig={
                      'S3Uri': 's3://'+S3_BUCKET+'/Paralleldata/'+pd_fn,   # S3_BUCKET is the S3 bucket name defined in the previous step
                      'Format': 'CSV'
                },
)
print(pd_name, ": ", response_t['Status'], " created.")

To update existing parallel data with new training datasets, we can use the UpdateParallelData API:

S3_BUCKET = "YOUR-S3_BUCKET-NAME"
pd_name = "pd-d2l-short_test_sentence_enzh_all"
pd_description = "Parallel Data for English to Chinese"
pd_fn = "d2l_short_test_sentence_enzh_all.csv"
response_t = translate_client.update_parallel_data(
                Name=pd_name,                          # pd_name is the parallel data name
                Description=pd_description,            # pd_description is the parallel data description
                ParallelDataConfig={
                      'S3Uri': 's3://'+S3_BUCKET+'/Paralleldata/'+pd_fn,   # S3_BUCKET is the S3 bucket name defined in the previous step
                      'Format': 'CSV'
                },
)
print(pd_name, ": ", response_t['Status'], " updated.")

We can check the training job progress on the Amazon Translate console. When the job is complete, the parallel data status shows as Active and is ready to use.

screenshot-3
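
You can also poll the status programmatically instead of watching the console. The following is a minimal sketch that uses the GetParallelData API and waits until the parallel data becomes active; the polling interval is an arbitrary choice.

import time
import boto3

translate_client = boto3.client("translate")
pd_name = "pd-d2l-short_test_sentence_enzh_all"

while True:
    properties = translate_client.get_parallel_data(Name=pd_name)["ParallelDataProperties"]
    status = properties["Status"]
    print(pd_name, ":", status)
    if status in ("ACTIVE", "FAILED"):   # stop once creation or update finishes
        break
    time.sleep(30)                       # parallel data jobs can take several minutes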

Run asynchronous batch translation using parallel data

Batch translation automatically translates multiple source documents into documents in the target languages. The process involves uploading the source documents to the input folder of the S3 bucket, then applying the StartTextTranslationJob API of Amazon Translate to initiate an asynchronous translation job:

S3_BUCKET = "YOUR-S3_BUCKET-NAME"
ROLE_ARN = "THE_ROLE_DEFINED_IN_STEP_1"
src_fdr = "source_data"
output_fdr = "output"
src_lang = "en"
tgt_lang = "zh"
pd_name = "pd-d2l-short_test_sentence_enzh_all"
response = translate_client.start_text_translation_job(
              JobName='D2L_job',
              InputDataConfig={
                 'S3Uri': 's3://'+S3_BUCKET+'/'+src_fdr+'/',       # S3_BUCKET is the S3 bucket name defined in the previous step
                                                                   # src_fdr is the folder in the S3 bucket containing the source files
                 'ContentType': 'text/html'
              },
              OutputDataConfig={
                  'S3Uri': 's3://'+S3_BUCKET+'/'+output_fdr+'/',   # S3_BUCKET is the S3 bucket name defined in the previous step
                                                                   # output_fdr is the folder in the S3 bucket containing the translated files
              },
              DataAccessRoleArn=ROLE_ARN,                # ROLE_ARN is the role defined in the previous step
              SourceLanguageCode=src_lang,               # src_lang is the source language, such as 'en'
              TargetLanguageCodes=[tgt_lang],            # tgt_lang is the target language, such as 'zh'
              ParallelDataNames=[pd_name]                # pd_name is the parallel data name defined in the previous step
)

We selected five source documents in English from the D2L book (D2L-en) for the bulk translation. On the Amazon Translate console, we can monitor the translation job progress. When the job status changes to Completed, we can find the translated documents in Chinese (D2L-zh) in the S3 bucket output folder.

screenshot-4
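
The job status can also be polled from the notebook. The following sketch continues from the preceding snippet, reusing translate_client and the response returned by start_text_translation_job; the terminal status values come from the DescribeTextTranslationJob API.

import time

job_id = response["JobId"]   # returned by start_text_translation_job

while True:
    job = translate_client.describe_text_translation_job(JobId=job_id)
    status = job["TextTranslationJobProperties"]["JobStatus"]
    print("D2L_job:", status)
    if status in ("COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"):
        break
    time.sleep(60)   # batch translation jobs typically take several minutes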

Evaluate the translation quality

To demonstrate the effectiveness of the ACT feature in Amazon Translate, we also applied the traditional method of Amazon Translate real-time translation without parallel data to process the same documents, and compared the output with the batch translation output with ACT. We used the BLEU (BiLingual Evaluation Understudy) score to benchmark the translation quality between the two methods. The only way to accurately measure the quality of machine translation output is to have an expert review and grade the quality. However, BLEU provides an estimate of relative quality improvement between two outputs. A BLEU score is typically a number between 0 and 1; it calculates the similarity of the machine translation to the reference human translation. A higher score represents better quality in natural language understanding (NLU).
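
As an illustration of the metric (not the exact scripts we used), the following sketch computes a corpus-level BLEU score with NLTK. Whitespace tokenization is only a placeholder; Chinese text would need a character- or word-level tokenizer in practice.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference translations per segment, and one hypothesis per segment (tokenized)
references = [["the quick brown fox jumps over the lazy dog".split()]]
hypotheses = ["a quick brown fox jumps over the lazy dog".split()]

score = corpus_bleu(references, hypotheses, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")   # between 0 and 1; higher means closer to the reference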

We have tested a set of documents in four pipelines: English into Chinese (en to zh), Chinese into English (zh to en), English into Spanish (en to es), and Spanish into English (es to en). The following figure shows that the translation with ACT produced a higher average BLEU score in all the translation pipelines.

chart-1

We also observed that the more granular the parallel data pairs are, the better the translation performance. For example, we use the following parallel data input file with pairs of paragraphs, which contains 10 entries.

screenshot-5

For the same content, we use the following parallel data input file with pairs of sentences and 16 entries.

screenshot-6

We used both parallel data input files to construct two parallel data entities in Amazon Translate, then created two batch translation jobs with the same source document. The following figure compares the output translations. It shows that the output using parallel data with pairs of sentences outperformed the one using parallel data with pairs of paragraphs, for both English to Chinese translation and Chinese to English translation.

chart-2

If you are interested in learning more about these benchmark analyses, refer to Auto Machine Translation and Synchronization for “Dive into Deep Learning”.

Clean up

To avoid recurring costs in the future, we recommend you clean up the resources you created:

  1. On the Amazon Translate console, select the parallel data you created and choose Delete. Alternatively, you can use the DeleteParallelData API or the AWS Command Line Interface (AWS CLI) delete-parallel-data command to delete the parallel data, as shown in the sketch after this list.
  2. Delete the S3 bucket used to host the source and reference documents, translated documents, and parallel data input files.
  3. Delete the IAM role and policy. For instructions, refer to Deleting roles or instance profiles and Deleting IAM policies.
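
The following is a minimal sketch of the programmatic cleanup path from steps 1 and 2, using the DeleteParallelData API and Boto3 for Amazon S3. The names are the placeholders used earlier in this post; only run this against resources you created for this walkthrough.

import boto3

# Delete the parallel data created earlier
translate_client = boto3.client("translate")
translate_client.delete_parallel_data(Name="pd-d2l-short_test_sentence_enzh_all")

# Empty and delete the S3 bucket that hosted the source, reference, parallel data, and output files
s3 = boto3.resource("s3")
bucket = s3.Bucket("YOUR-S3_BUCKET-NAME")
bucket.objects.all().delete()
bucket.delete()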

Conclusion

With this solution, we aim to reduce the workload of human translators by 80%, while maintaining the translation quality and supporting multiple languages. You can use this solution to improve your translation quality and efficiency. We are working on further improving the solution architecture and translation quality for other languages.

Your feedback is always welcome; please leave your thoughts and questions in the comments section.


About the authors

Yunfei Bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Rachel Hu is an applied scientist at AWS Machine Learning University (MLU). She has been leading a few course designs, including ML Operations (MLOps) and Accelerator Computer Vision. Rachel is an AWS senior speaker and has spoken at top conferences including AWS re:Invent, NVIDIA GTC, KDD, and MLOps Summit. Before joining AWS, Rachel worked as a machine learning engineer building natural language processing models. Outside of work, she enjoys yoga, ultimate frisbee, reading, and traveling.

Watson Srivathsan is the Principal Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends, you will find him exploring the outdoors in the Pacific Northwest.

Read More

Bring SageMaker Autopilot into your MLOps processes using a custom SageMaker Project

Every organization has its own set of standards and practices that provide security and governance for their AWS environment. Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. SageMaker provides a set of templates for organizations that want to quickly get started with ML workflows and DevOps continuous integration and continuous delivery (CI/CD) pipelines.

The majority of enterprise customers already have a well-established MLOps practice with a standardized environment in place—for example, a standardized repository, infrastructure, and security guardrails—and want to extend their MLOps process to no-code and low-code AutoML tools as well. They also have a lot of processes that need to be adhered to before promoting a model to production. They’re looking for a quick and easy way to graduate from the initial phase to a repeatable, reliable, and eventually scalable operating phase, as outlined in the following diagram. For more information, refer to MLOps foundation roadmap for enterprises with Amazon SageMaker.

Although these companies have robust data science and MLOps teams to help them build reliable and scalable pipelines, they want to have their low-code AutoML tool users produce code and model artifacts in a manner that can be integrated with their standardized practices, adhering to their code repo structure and with appropriate validations, tests, steps, and approvals.

They are looking for a mechanism for the low-code tools to generate all the source code for each step of the AutoML tasks (preprocessing, training, and postprocessing) in a standardized repository structure that can provide their expert data scientists with the capability to view, validate, and modify the workflow per their needs and then generate a custom pipeline template that can be integrated into a standardized environment (where they have defined their code repository, code build tools, and processes).

This post showcases how to have a repeatable process with low-code tools like Amazon SageMaker Autopilot such that it can be seamlessly integrated into your environment, so you don’t have to orchestrate this end-to-end workflow on your own. We demonstrate how to use CI/CD to integrate the code produced by low-code/no-code tools into your MLOps environment, while adhering to MLOps best practices.

Solution overview

To demonstrate the orchestrated workflow, we use the publicly available UCI Adult 1994 Census Income dataset to predict whether a person has an annual income greater than $50,000. This is a binary classification problem; the income target variable is either greater than $50,000 or less than or equal to $50,000.

The following table summarizes the key components of the dataset.

Data Set Characteristics: Multivariate
Number of Instances: 48842
Area: Social
Attribute Characteristics: Categorical, Integer
Number of Attributes: 14
Date Donated: 1996-05-01
Associated Tasks: Classification
Missing Values: Yes
Number of Web Hits: 2749715

The following table summarizes the attribute information.

Column Name Description
Age Continuous
Workclass Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt Continuous
education Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num Continuous
marital-status Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex Female, Male
capital-gain Continuous
capital-loss Continuous
hours-per-week Continuous
native-country United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
class Income class, either <=50K or >50K

In this post, we showcase how to use Amazon SageMaker Projects, a tool that helps organizations set up and standardize environments for MLOps with low-code AutoML tools like Autopilot and Amazon SageMaker Data Wrangler.

Autopilot eliminates the heavy lifting of building ML models. You simply provide a tabular dataset and select the target column to predict, and Autopilot will automatically explore different solutions to find the best model. You then can directly deploy the model to production with just one click or iterate on the recommended solutions to further improve the model quality.

Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data preparation flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize workflows. We use Data Wrangler to perform preprocessing on the dataset before submitting the data to Autopilot.

SageMaker Projects helps organizations set up and standardize environments for automating different steps involved in an ML lifecycle. Although notebooks are helpful for model building and experimentation, a team of data scientists and ML engineers sharing code need a more scalable way to maintain code consistency and strict version control.

To help you get started with common model building and deployment paradigms, SageMaker Projects offers a set of first-party templates (1P templates). The 1P templates generally focus on creating resources for model building and model training. The templates include projects that use AWS-native services for CI/CD, such as AWS CodeBuild and AWS CodePipeline. SageMaker Projects can support custom template offerings, where organizations use an AWS CloudFormation template to run a Terraform stack and create the resources needed for an ML workflow.

Organizations may want to extend the 1P templates to support use cases beyond simply training and deploying models. Custom project templates are a way for you to create a standard workflow for ML projects. You can create several templates and use AWS Identity and Access Management (IAM) policies to manage access to those templates on Amazon SageMaker Studio, ensuring that each of your users are accessing projects dedicated for their use cases.

To learn more about SageMaker Projects and creating custom project templates aligned with best practices, refer to Build Custom SageMaker Project Templates – Best Practices.

These custom templates are created as AWS Service Catalog products and provisioned as organization templates on the Studio UI, where data scientists can choose a template and have their ML workflow bootstrapped and preconfigured. Organizations use these project templates to provision projects for each of their teams.

In this post, we showcase how to build a custom project template to have an end-to-end MLOps workflow using SageMaker projects, AWS Service Catalog, and Amazon SageMaker Pipelines integrating Data Wrangler and Autopilot with humans in the loop in order to facilitate the steps of model training and deployment. The humans in the loop are the different personas involved in an MLOps practice working collaboratively for a successful ML build and deploy workflow.

The following diagram illustrates the end-to-end low-code/no-code automation workflow.

The workflow includes the following steps:

  1. The Ops team or the Platform team launches the CloudFormation template to set up the prerequisites required to provision the custom SageMaker template.
  2. When the template is available in SageMaker, the Data Science Lead uses the template to create a SageMaker project.
  3. The SageMaker project creation launches an AWS Service Catalog product that adds two seed code repositories to AWS CodeCommit:
    • The seed code for the model building pipeline includes a pipeline that preprocesses the UCI Machine Learning Adult dataset using Data Wrangler, automatically creates an ML model with full visibility using Autopilot, evaluates the performance of a model using a processing step, and registers the model into a model registry based on the model performance.
    • The seed code for model deployment includes a CodeBuild step to find the latest model that has been approved in the model registry and create configuration files to deploy the CloudFormation templates as part of the CI/CD pipelines using CodePipeline. The CloudFormation template deploys the model to staging and production environments.
  4. The first seed code commit starts a CI/CD pipeline using CodePipeline that triggers a SageMaker pipeline, which is a series of interconnected steps encoded using a directed acyclic graph (DAG). In this case, the steps involved are data processing using a Data Wrangler flow, training the model using Autopilot, creating the model, evaluating the model, and if the evaluation is passed, registering the model.

For more details on creating SageMaker pipelines using Autopilot, refer to Launch Amazon SageMaker Autopilot experiments directly from within Amazon SageMaker Pipelines to easily automate MLOps workflows.

  5. After the model is registered, the model approver can either approve or reject the model in Studio.
  6. When the model is approved, a CodePipeline deployment pipeline integrated with the second seed code is triggered.
  7. This pipeline creates a SageMaker serverless scalable endpoint for the staging environment.
  8. The deployment pipeline includes an automated test step that runs against the staging endpoint.
  9. The test results are stored in Amazon Simple Storage Service (Amazon S3). The pipeline will stop for a production deployment approver, who can review all the artifacts before approving.
  10. Once approved, the model is deployed to production in the form of a scalable serverless endpoint. Production applications can now consume the endpoint for inference.

The deployment steps consist of the following:

  1. Create the custom SageMaker project template for Autopilot and other resources using AWS CloudFormation. This is a one-time setup task.
  2. Create the SageMaker project using the custom template.

In the following sections, we proceed with each of these steps in more detail and explore the project details page.

Prerequisites

This walkthrough includes the following prerequisites:

Create solution resources with AWS CloudFormation

You can download and launch the CloudFormation template via the AWS CloudFormation console, the AWS Command Line Interface (AWS CLI), the SDK, or by simply choosing Launch Stack.

The CloudFormation template is also available in the AWS Samples GitHub Code repository. The repository contains the following:

  • A CloudFormation template to set up the custom SageMaker project template for Autopilot
  • Seed code with the ML code to set up SageMaker pipelines to automate the data processing and training steps
  • A project folder for the CloudFormation template used by AWS Service Catalog mapped to the custom SageMaker project template that will be created

The CloudFormation template takes several parameters as input.

The following are the AWS Service Catalog product information parameters:

  • Product Name – The name of the AWS Service Catalog product that the SageMaker project custom MLOps template will be associated with
  • Product Description – The description for the AWS Service Catalog product
  • Product Owner – The owner of the Service Catalog product
  • Product Distributor – The distributor of the Service Catalog product

The following are the AWS Service Catalog product support information parameters:

  • Product Support Description – A support description for this product
  • Product Support Email – An email address of the team supporting the AWS Service Catalog product
  • Product Support URL – A support URL for the AWS Service Catalog product

The following are the source code repository configuration parameters:

  • URL to the zipped version of your GitHub repository – Use the defaults if you’re not forking the AWS Samples repository.
  • Name and branch of your GitHub repository – These should match the root folder of the zip. Use the defaults if you’re not forking the AWS Samples repository.
  • StudioUserExecutionRole – Provide the ARN of the Studio user execution IAM role.

After you launch the CloudFormation stack from this template, you can monitor its status on the AWS CloudFormation console.

When the stack is complete, copy the value of the CodeStagingBucketName key on the Outputs tab of the CloudFormation stack and save it in a text editor to use later.

Create the SageMaker project using the new custom template

To create your SageMaker project, complete the following steps:

  1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain.
  2. In the Studio sidebar, choose the home icon.
  3. Choose Deployments from the menu, then choose Projects.
  4. Choose Create project.
  5. Choose Organization templates to view the new custom MLOps template.
  6. Choose Select project template.

  7. For Project details, enter a name and description for your project.
  8. For MLOpsS3Bucket, enter the name of the S3 bucket you saved earlier.

  9. Choose Create project.

A message appears indicating that SageMaker is provisioning and configuring the resources.

When the project is complete, you receive a success message, and your project is now listed on the Projects list.
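
If you prefer to automate this step, the project can also be created with the CreateProject API instead of the Studio UI. The following is a minimal sketch; the Service Catalog product ID and provisioning artifact ID are placeholders you would look up from the product created by the CloudFormation stack, and the project name is illustrative.

import boto3

sm = boto3.client("sagemaker")

sm.create_project(
    ProjectName="autopilot-mlops-demo",                 # illustrative project name
    ProjectDescription="Custom Autopilot MLOps project",
    ServiceCatalogProvisioningDetails={
        "ProductId": "prod-xxxxxxxxxxxxx",              # Service Catalog product ID (placeholder)
        "ProvisioningArtifactId": "pa-xxxxxxxxxxxxx",   # product version ID (placeholder)
        "ProvisioningParameters": [
            {"Key": "MLOpsS3Bucket", "Value": "YOUR-CODE-STAGING-BUCKET"}   # bucket from the stack outputs
        ],
    },
)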

Explore the project details

On the project details page, you can view various tabs associated with the project. Let’s dive deep into each of these tabs in detail.

Repositories

This tab lists the code repositories associated with this project. You can choose clone repo under Local path to clone the two seed code repositories created in CodeCommit by the SageMaker project. This option provides you with Git access to the code repositories from the SageMaker project itself.

When the clone of the repository is complete, the local path appears in the Local path column. You can choose the path to open the local folder that contains the repository code in Studio.

The folder will be accessible in the navigation pane. You can use the file browser icon to hide or show the folder list. You can make the code changes here or choose the Git icon to stage, commit, and push the change.

Pipelines

This tab lists the SageMaker ML pipelines that define steps to prepare data, train models, and deploy models. For information about SageMaker ML pipelines, see Create and Manage SageMaker Pipelines.

You can choose the pipeline that is currently running to see its latest status. In the following example, the DataProcessing step is performed by using a Data Wrangler data flow.

You can access the data flow from the local path of the code repository that we cloned earlier. Choose the file browser icon to show the path, which is listed in the pipelines folder of the model build repository.

In the pipelines folder, open the autopilot folder.

In the autopilot folder, open the preprocess.flow file.

It will take a moment to open the Data Wrangler flow.

In this example, three data transformations are performed between the source and destination. You can choose each transformation to see more details.

For instructions on how to include or remove transformations in Data Wrangler, refer to Transform Data.

For more information, refer to Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot – Part 1.

When you’re done reviewing, choose the power icon and stop the Data Wrangler resources under Running Apps and Kernel Sessions.

Experiments

This tab lists the Autopilot experiments associated with the project. For more information about Autopilot, see Automate model development with Amazon SageMaker Autopilot.

Model groups

This tab lists groups of model versions that were created by pipeline runs in the project. When the pipeline run is complete, the model created from the last step of the pipeline will be accessible here.

You can choose the model group to access the latest version of the model.

The status of the model version in the following example is Pending. You can choose the model version and choose Update status to update the status.

Choose Approved and choose Update status to approve the model.

After the model status is approved, the model deploy CI/CD pipeline within CodePipeline will start.
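
The same approval can be scripted with the SageMaker API, which is useful when it is gated by an external review process. The following is a minimal sketch that finds the latest model package in the project's model group (the group name here is a placeholder) and marks it as approved, which triggers the deployment pipeline.

import boto3

sm = boto3.client("sagemaker")
model_package_group = "YOUR-PROJECT-MODEL-GROUP-NAME"   # placeholder for the project's model group

# Look up the most recently registered model version
latest = sm.list_model_packages(
    ModelPackageGroupName=model_package_group,
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ModelPackageSummaryList"][0]

# Approving the version starts the CodePipeline deployment
sm.update_model_package(
    ModelPackageArn=latest["ModelPackageArn"],
    ModelApprovalStatus="Approved",
    ApprovalDescription="Approved after manual review",
)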

You can open the deployed pipeline to see the different stages in the repo.

As shown in the preceding screenshot, this pipeline has four stages:

  • Source – In this stage, CodePipeline pulls the code from the CodeCommit repo and stores it as an artifact in the S3 bucket.
  • Build – In this stage, CloudFormation templates are prepared for the deployment of the model code.
  • DeployStaging – This stage consists of three sub-stages:
    • DeployResourcesStaging – In the first sub-stage, the CloudFormation stack is deployed to create a serverless SageMaker endpoint in the staging environment.
    • TestStaging – In the second sub-stage, automated testing is performed using CodeBuild on the endpoint to check if the inference is happening as expected. The test results will be available in the S3 bucket with the name sagemaker-project-<project ID of the SageMaker project>.

You can get the SageMaker project ID on the Settings tab of the SageMaker project. Within the S3 bucket, choose the project name folder (for example, sagemaker-MLOp-AutoP) and within that, open the TestArtifa/ folder. Choose the object file in this folder to see the test results.

You can access the testing script from the local path of the code repository that we cloned earlier. Choose the file browser icon to view the path. Note that this is the deploy repository. In that repo, open the test folder and choose the test.py Python code file.

You can make changes to this testing code as per your use case.

  • ApproveDeployment – In the third sub-stage, there is an additional approval process before the last stage of deploying to production. You can choose Review and approve it to proceed.

  • DeployProd – In this stage, the CloudFormation stack is deployed to create a serverless SageMaker endpoint for the production environment.

Endpoints

This tab lists the SageMaker endpoints that host deployed models for inference. When all the stages in the model deployment pipeline are complete, models are deployed to SageMaker endpoints and are accessible within the SageMaker project.

Settings

This is the last tab on the project page and lists settings for the project. This includes the name and description of the project, information about the project template and SourceModelPackageGroupName, and metadata about the project.

Clean up

To avoid additional infrastructure costs associated with the example in this post, be sure to delete CloudFormation stacks. Also, ensure that you delete the SageMaker endpoints, any running notebooks, and S3 buckets that were created during the setup.

Conclusion

This post described an easy-to-use ML pipeline approach to automate and standardize the training and deployment of ML models using SageMaker Projects, Data Wrangler, Autopilot, Pipelines, and Studio. This solution can help you perform AutoML tasks (preprocessing, training, and postprocessing) in a standardized repository structure that can provide your expert data scientists with the capability to view, validate, and modify the workflow as per their needs and then generate a custom pipeline template that can be integrated to a SageMaker project.

You can modify the pipelines with your preprocessing and pipeline steps for your use case and deploy our end-to-end workflow. Let us know in the comments how the custom template worked for your respective use case.


About the authors

 Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Machine Learning, DevOps, and Containers. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Shikhar Kwatra is an AI/ML specialist solutions architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Janisha Anand is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Canvas and SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.

Read More

How Forethought saves over 66% in costs for generative AI models using Amazon SageMaker

This post is co-written with Jad Chamoun, Director of Engineering at Forethought Technologies, Inc. and Salina Wu, Senior ML Engineer at Forethought Technologies, Inc.

Forethought is a leading generative AI suite for customer service. At the core of its suite is the innovative SupportGPT™ technology which uses machine learning to transform the customer support lifecycle—increasing deflection, improving CSAT, and boosting agent productivity. SupportGPT™ leverages state-of-the-art Information Retrieval (IR) systems and large language models (LLMs) to power over 30 million customer interactions annually.

SupportGPT’s primary use case is enhancing the quality and efficiency of customer support interactions and operations. By using state-of-the-art IR systems powered by embeddings and ranking models, SupportGPT can quickly search for relevant information, delivering accurate and concise answers to customer queries. Forethought uses per-customer fine-tuned models to detect customer intents in order to solve customer interactions. The integration of large language models helps humanize the interaction with automated agents, creating a more engaging and satisfying support experience.

SupportGPT also assists customer support agents by offering autocomplete suggestions and crafting appropriate responses to customer tickets that align with the company’s tone, based on previous replies. By using advanced language models, agents can address customers’ concerns faster and more accurately, resulting in higher customer satisfaction.

Additionally, SupportGPT’s architecture enables detecting gaps in support knowledge bases, which helps agents provide more accurate information to customers. Once these gaps are identified, SupportGPT can automatically generate articles and other content to fill these knowledge voids, ensuring the support knowledge base remains customer-centric and up to date.

In this post, we share how Forethought uses Amazon SageMaker multi-model endpoints in generative AI use cases to save over 66% in cost.

Infrastructure challenges

To help bring these capabilities to market, Forethought efficiently scales its ML workloads and provides hyper-personalized solutions tailored to each customer’s specific use case. This hyper-personalization is achieved through fine-tuning embedding models and classifiers on customer data, ensuring accurate information retrieval results and domain knowledge that caters to each client’s unique needs. The customized autocomplete models are also fine-tuned on customer data to further enhance the accuracy and relevance of the responses generated.

One of the significant challenges in AI processing is the efficient utilization of hardware resources such as GPUs. To tackle this challenge, Forethought uses SageMaker multi-model endpoints (MMEs) to run multiple AI models on a single inference endpoint and scale. Because the hyper-personalization of models requires unique models to be trained and deployed, the number of models scales linearly with the number of clients, which can become costly.

To achieve the right balance of performance for real-time inference and cost, Forethought chose to use SageMaker MMEs, which support GPU acceleration. SageMaker MMEs enable Forethought to deliver high-performance, scalable, and cost-effective solutions with subsecond latency, addressing multiple customer support scenarios at scale.

SageMaker and Forethought

SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker MMEs provide a scalable and cost-effective solution for deploying a large number of models for real-time inference. MMEs use a shared serving container and a fleet of resources that can use accelerated instances such as GPUs to host all of your models. This reduces hosting costs by maximizing endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading and unloading models in memory and scaling them based on the endpoint’s traffic patterns. In addition, all SageMaker real-time endpoints benefit from built-in capabilities to manage and monitor models, such as shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).

As Forethought grew to host hundreds of models that also required GPU resources, we saw an opportunity to create a more cost-effective, reliable, and manageable architecture through SageMaker MMEs. Prior to migrating to SageMaker MMEs, our models were deployed on Kubernetes on Amazon Elastic Kubernetes Service (Amazon EKS). Although Amazon EKS provided management capabilities, it was immediately apparent that we were managing infrastructure that wasn’t specifically tailored for inference. Forethought had to manage model inference on Amazon EKS ourselves, which was a burden on engineering efficiency. For example, in order to share expensive GPU resources between multiple models, we were responsible for allocating rigid memory fractions to models that were specified during deployment. We wanted to address the following key problems with our existing infrastructure:

  • High cost – To ensure that each model had enough resources, we would be very conservative in how many models to fit per instance. This resulted in much higher costs for model hosting than necessary.
  • Low reliability – Despite being conservative in our memory allocation, not all models have the same requirements, and occasionally some models would throw out of memory (OOM) errors.
  • Inefficient management – We had to manage different deployment manifests for each type of model (such as classifiers, embeddings, and autocomplete), which was time-consuming and error-prone. We also had to maintain the logic to determine the memory allocation for different model types.

Ultimately, we needed an inference platform to take on the heavy lifting of managing our models at runtime to improve the cost, reliability, and the management of serving our models. SageMaker MMEs allowed us to address these needs.

Through its smart and dynamic model loading and unloading, and its scaling capabilities, SageMaker MMEs provided a significantly less expensive and more reliable solution for hosting our models. We are now able to fit many more models per instance and don’t have to worry about OOM errors because SageMaker MMEs handle loading and unloading models dynamically. In addition, deployments are now as simple as calling Boto3 SageMaker APIs and attaching the proper auto scaling policies.

The following diagram illustrates our legacy architecture.

To begin our migration to SageMaker MMEs, we identified the best use cases for MMEs and which of our models would benefit the most from this change. MMEs are best used for the following:

  • Models that are expected to have low latency but can withstand a cold start time (when it’s first loaded in)
  • Models that are called often and consistently
  • Models that need partial GPU resources
  • Models that share common requirements and inference logic

We identified our embeddings models and autocomplete language models as the best candidates for our migration. To organize these models under MMEs, we would create one MME per model type, or task: one for our embeddings models and another for our autocomplete language models.

We already had an API layer on top of our models for model management and inference. Our task at hand was to rework how this API was deploying and handling inference on models under the hood with SageMaker, with minimal changes to how clients and product teams interacted with the API. We also needed to package our models and custom inference logic to be compatible with NVIDIA Triton Inference Server using SageMaker MMEs.

The following diagram illustrates our new architecture.

Custom inference logic

Before migrating to SageMaker, Forethought’s custom inference code (preprocessing and postprocessing) ran in the API layer when a model was invoked. The objective was to transfer this functionality to the model itself to clarify the separation of responsibilities, modularize and simplify their code, and reduce the load on the API.

Embeddings

Forethought’s embedding models consist of two PyTorch model artifacts, and the inference request determines which model to call. Each model requires preprocessed text as input. The main challenges were integrating a preprocessing step and accommodating two model artifacts per model definition. To address the need for multiple steps in the inference logic, Forethought developed a Triton ensemble model with two steps: a Python backend preprocessing process and a PyTorch backend model call. Ensemble models allow for defining and ordering steps in the inference logic, with each step represented by a Triton model of any backend type. To ensure compatibility with the Triton PyTorch backend, the existing model artifacts were converted to TorchScript format. Separate Triton models were created for each model definition, and Forethought’s API layer was responsible for determining the appropriate TargetModel to invoke based on the incoming request.

Autocomplete

The autocomplete models (sequence to sequence) presented a distinct set of requirements. Specifically, we needed to enable the capability to loop through multiple model calls and cache substantial inputs for each call, all while maintaining low latency. Additionally, these models necessitated both preprocessing and postprocessing steps. To address these requirements and achieve the desired flexibility, Forethought developed autocomplete MME models utilizing the Triton Python backend, which offers the advantage of writing the model as Python code.

Benchmarking

After the Triton model shapes were determined, we deployed models to staging endpoints and conducted resource and performance benchmarking. Our main goal was to determine the latency for cold start vs in-memory models, and how latency was affected by request size and concurrency. We also wanted to know how many models could fit on each instance, how many models would cause the instances to scale up with our auto scaling policy, and how quickly the scale-up would happen. In keeping with the instance types we were already using, we did our benchmarking with ml.g4dn.xlarge and ml.g4dn.2xlarge instances.

Results

The following table summarizes our results.

Request Size Cold Start Latency Cached Inference Latency Concurrent Latency (5 requests)
Small (30 tokens) 12.7 seconds 0.03 seconds 0.12 seconds
Medium (250 tokens) 12.7 seconds 0.05 seconds 0.12 seconds
Large (550 tokens) 12.7 seconds 0.13 seconds 0.12 seconds

Noticeably, the latency for cold start requests is significantly higher than the latency for cached inference requests. This is because the model needs to be loaded from disk or Amazon Simple Storage Service (Amazon S3) when a cold start request is made. The latency for concurrent requests is also higher than the latency for single requests. This is because the model needs to be shared between concurrent requests, which can lead to contention.

The following table compares the latency of the legacy models and the SageMaker models.

Request Size Legacy Models SageMaker Models
Small (30 tokens) 0.74 seconds 0.24 seconds
Medium (250 tokens) 0.74 seconds 0.24 seconds
Large (550 tokens) 0.80 seconds 0.32 seconds

Overall, the SageMaker models are a better choice for hosting autocomplete models than the legacy models. They offer lower latency, scalability, reliability, and security.

Resource usage

In our quest to determine the optimal number of models that could fit on each instance, we conducted a series of tests. Our experiment involved loading models into our endpoints using an ml.g4dn.xlarge instance type, without any auto scaling policy.

These particular instances offer 15.5 GB of memory, and we aimed to achieve approximately 80% GPU memory usage per instance. Considering the size of each encoder model artifact, we managed to find the optimal number of Triton encoders to load on an instance to reach our targeted GPU memory usage. Furthermore, given that each of our embeddings models corresponds to two Triton encoder models, we were able to house a set number of embeddings models per instance. As a result, we calculated the total number of instances required to serve all our embeddings models. This experimentation has been crucial in optimizing our resource usage and enhancing the efficiency of our models.

We conducted similar benchmarking for our autocomplete models. These models were around 292.0 MB each. As we tested how many models would fit on a single ml.g4dn.xlarge instance, we noticed that we were only able to fit four models before our instance started unloading models, despite the models having a small size. Our main concerns were:

  • Cause for CPU memory utilization spiking
  • Cause for models getting unloaded when we tried to load in one more model instead of just the least recently used (LRU) model

We were able to pinpoint the root cause of the memory utilization spike coming from initializing our CUDA runtime environment in our Python model, which was necessary to move our models and data on and off the GPU device. CUDA loads many external dependencies into CPU memory when the runtime is initialized. Because the Triton PyTorch backend handles and abstracts away moving data on and off the GPU device, we didn’t run into this issue for our embedding models. To address this, we tried using ml.g4dn.2xlarge instances, which had the same amount of GPU memory but twice as much CPU memory. In addition, we added several minor optimizations in our Python backend code, including deleting tensors after use, emptying the cache, disabling gradients, and garbage collecting. With the larger instance type, we were able to fit 10 models per instance, and the CPU and GPU memory utilization became much more aligned.

The following diagram illustrates this architecture.

Auto scaling

We attached auto scaling policies to both our embeddings and autocomplete MMEs. Our policy for our embeddings endpoint targeted 80% average GPU memory utilization using custom metrics. Our autocomplete models saw a pattern of high traffic during business hours and minimal traffic overnight. Because of this, we created an auto scaling policy based on InvocationsPerInstance so that we could scale according to the traffic patterns, saving on cost without sacrificing reliability. Based on our resource usage benchmarking, we configured our scaling policies with a target of 225 InvocationsPerInstance.
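
As a concrete illustration, the following sketch registers a target tracking policy on the predefined SageMakerVariantInvocationsPerInstance metric with the 225-invocation target described above. The endpoint and variant names and the capacity bounds are illustrative.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/autocomplete-mme/variant/AllTraffic"   # endpoint/variant names are placeholders

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="autocomplete-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 225.0,   # InvocationsPerInstance target from our benchmarking
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)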

Deploy logic and pipeline

Creating an MME on SageMaker is straightforward and similar to creating any other endpoint on SageMaker. After the endpoint is created, adding additional models to the endpoint is as simple as moving the model artifact to the S3 path that the endpoint targets; at this point, we can make inference requests to our new model.
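
The following is a minimal sketch of that endpoint creation flow with Boto3, assuming a SageMaker Triton container image and an S3 prefix that holds the packaged Triton model artifacts. The image URI, role, instance type, and names are placeholders.

import boto3

sm = boto3.client("sagemaker")

# A model in MultiModel mode points at an S3 prefix rather than a single artifact
sm.create_model(
    ModelName="embeddings-mme",
    ExecutionRoleArn="arn:aws:iam::YOUR-AWS-ACCOUNT-ID:role/SageMakerExecutionRole",
    PrimaryContainer={
        "Image": "<sagemaker-triton-inference-server-image-uri>",
        "ModelDataUrl": "s3://your-bucket/embeddings-models/",   # prefix holding the model .tar.gz artifacts
        "Mode": "MultiModel",
    },
)

sm.create_endpoint_config(
    EndpointConfigName="embeddings-mme-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "embeddings-mme",
        "InstanceType": "ml.g4dn.xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="embeddings-mme",
    EndpointConfigName="embeddings-mme-config",
)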

We defined logic that would take in model metadata, format the endpoint name deterministically based on the metadata, and check whether the endpoint existed. If it didn’t, we created the endpoint and added the Triton model artifact to the S3 path for the endpoint (also deterministically formatted). For example, if the model metadata indicated an autocomplete model, the logic would create an endpoint for autocomplete models and an associated S3 path for autocomplete model artifacts. If the endpoint existed, we would copy the model artifact to the S3 path.

Now that we had our model shapes for our MME models and the functionality for deploying our models to MME, we needed a way to automate the deployment. Our users must specify which model they want to deploy; we handle packaging and deployment of the model. The custom inference code packaged with the model is versioned and pushed to Amazon S3; in the packaging step, we pull the inference code according to the version specified (or the latest version) and use YAML files that indicate the file structures of the Triton models.

One requirement for us was that all of our MME models would be loaded into memory to avoid any cold start latency during production inference requests to load in models. To achieve this, we provision enough resources to fit all our models (according to the preceding benchmarking) and call every model in our MME at an hourly cadence.
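
A minimal sketch of that warm-up job follows: it lists every artifact under the MME's S3 prefix and invokes each one so it stays resident in memory. The bucket, prefix, endpoint name, content type, and payload are placeholders.

import boto3

s3 = boto3.client("s3")
runtime = boto3.client("sagemaker-runtime")

bucket, prefix = "your-bucket", "autocomplete-models/"
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        target_model = obj["Key"][len(prefix):]       # e.g. "customer-123.tar.gz"
        runtime.invoke_endpoint(
            EndpointName="autocomplete-mme",
            TargetModel=target_model,                 # model inside the MME to keep warm
            ContentType="application/octet-stream",
            Body=b"{}",                               # minimal placeholder payload
        )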

The following diagram illustrates the model deployment pipeline.

The following diagram illustrates the model warm-up pipeline.

Model invocation

Our existing API layer provides an abstraction for callers to make inference on all of our ML models. This meant we only had to add functionality to the API layer to call the SageMaker MME with the correct target model depending on the inference request, without any changes to the calling code. The SageMaker inference code takes the inference request, formats the Triton inputs defined in our Triton models, and invokes the MMEs using Boto3.
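
The following sketch shows roughly what that call looks like: a KServe v2-style JSON payload for the Triton model, invoked against the MME with the TargetModel chosen from the inference request. The endpoint, artifact, and tensor names are illustrative and depend on each Triton model's configuration.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": [{
        "name": "TEXT",                    # input tensor name from the Triton model config
        "shape": [1, 1],
        "datatype": "BYTES",
        "data": ["How do I reset my password?"],
    }]
}

response = runtime.invoke_endpoint(
    EndpointName="embeddings-mme",
    TargetModel="customer-123-embeddings.tar.gz",   # per-customer model artifact inside the MME
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result["outputs"][0]["data"][:5])   # first few embedding values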

Cost benefits

Forethought made significant strides in reducing model hosting costs and mitigating model OOM errors, thanks to the migration to SageMaker MMEs. Before this change, our models ran on ml.g4dn.xlarge instances in Amazon EKS. With the transition to MMEs, we found that each instance could house 12 embeddings models while achieving 80% GPU memory utilization. This led to a significant decline in our monthly expenses. To put it in perspective, we realized a cost saving of up to 80%. Moreover, to manage higher traffic, we considered scaling up the replicas. Assuming a scenario where we employ three replicas, we found that our cost savings would still be substantial even under these conditions, hovering around 43%.

The journey with SageMaker MMEs has proven financially beneficial, reducing our expenses while ensuring optimal model performance. Previously, our autocomplete language models were deployed in Amazon EKS, necessitating a varying number of ml.g4dn.xlarge instances based on the memory allocation per model. This resulted in a considerable monthly cost. However, with our recent migration to SageMaker MMEs, we’ve been able to reduce these costs substantially. We now host all our models on ml.g4dn.2xlarge instances, giving us the ability to pack models more efficiently. This has significantly trimmed our monthly expenses, and we’ve now realized cost savings in the 66–74% range. This move has demonstrated how efficient resource utilization can lead to significant financial savings using SageMaker MMEs.

Conclusion

In this post, we reviewed how Forethought uses SageMaker multi-model endpoints to decrease cost for real-time inference. SageMaker takes on the undifferentiated heavy lifting, so Forethought can increase engineering efficiency. It also allows Forethought to dramatically lower the cost for real-time inference while maintaining the performance needed for the business-critical operations. By doing so, Forethought is able to provide a differentiated offering for their customers using hyper-personalized models. Use SageMaker MME to host your models at scale and reduce hosting costs by improving endpoint utilization. It also reduces deployment overhead because Amazon SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint. You can find code samples on hosting multiple models using SageMaker MME on GitHub.


About the Authors

Jad Chamoun is a Director of Core Engineering at Forethought. His team focuses on platform engineering covering Data Engineering, Machine Learning Infrastructure, and Cloud Infrastructure.  You can find him on LinkedIn.

Salina Wu is a Sr. Machine Learning Infrastructure engineer at Forethought.ai. She works closely with the Machine Learning team to build and maintain their end-to-end training, serving, and data infrastructures. She is particularly motivated by introducing new ways to improve efficiency and reduce cost across the ML space. When not at work, Salina enjoys surfing, pottery, and being in nature.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Sunil Padmanabhan is a Startup Solutions Architect at AWS. As a former startup founder and CTO, he is passionate about machine learning and focuses on helping startups leverage AI/ML for their business outcomes and design and deploy ML/AI solutions at scale.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Read More