Announcing new Jupyter contributions by AWS to democratize generative AI and scale ML workloads

Project Jupyter is a multi-stakeholder, open-source project that builds applications, open standards, and tools for data science, machine learning (ML), and computational science. The Jupyter Notebook, first released in 2011, has become a de facto standard tool used by millions of users worldwide across every possible academic, research, and industry sector. Jupyter enables users to work with code and data interactively, and to build and share computational narratives that provide a full and reproducible record of their work.

Given the importance of Jupyter to data scientists and ML developers, AWS is an active sponsor and contributor to Project Jupyter. Our goal is to work in the open-source community to help Jupyter to be the best possible notebook platform for data science and ML. AWS is a platinum sponsor of Project Jupyter through the NumFOCUS Foundation, and I am proud and honored to lead a dedicated team of AWS engineers who contribute to Jupyter’s software and participate in Jupyter’s community and governance. Our open-source contributions to Jupyter include JupyterLab, Jupyter Server, and the Jupyter Notebook subprojects. We are also members of the Jupyter working groups for Security, and Diversity, Equity, and Inclusion (DEI). In parallel to these open-source contributions, we have AWS product teams who are working to integrate Jupyter with products such as Amazon SageMaker.

Today at JupyterCon, we are excited to announce several new tools for Jupyter users to improve their experience and boost development productivity. All of these tools are open-source and can be used anywhere you are running Jupyter.

Introducing two generative AI extensions for Jupyter

Generative AI can significantly boost the productivity of data scientists and developers as they write code. Today, we are announcing two Jupyter extensions that bring generative AI to Jupyter users through a chat UI, IPython magic commands, and autocompletion. These extensions enable you to perform a wide range of development tasks using generative AI models in JupyterLab and Jupyter notebooks.

Jupyter AI, an open-source project to bring generative AI to Jupyter notebooks

Using the power of large language models like ChatGPT, AI21’s Jurassic-2, and (coming soon) Amazon Titan, Jupyter AI is an open-source project that brings generative AI features to Jupyter notebooks. For example, using a large language model, Jupyter AI can help a programmer generate, debug, and explain their source code. Jupyter AI can also answer questions about local files and generate entire notebooks from a simple natural language prompt. Jupyter AI offers both magic commands that work in any notebook or IPython shell, and a friendly chat UI in JupyterLab. Both of these experiences work with dozens of models from a wide range of model providers. JupyterLab users can select any text or notebook cells, enter a natural language prompt to perform a task with the selection, and then insert the AI-generated response wherever they choose. Jupyter AI is integrated with Jupyter’s MIME type system, which lets you work with inputs and outputs of any type that Jupyter supports (text, images, etc.). Jupyter AI also provides integration points that allow third parties to configure their own models. Jupyter AI is an official open-source project of Project Jupyter.
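
As a rough sketch of the magic-command workflow, the following cells load the Jupyter AI magics and ask a model to produce code; the model alias (chatgpt) and the -f code output format are assumptions that depend on your Jupyter AI version and on which model providers you have configured:

%load_ext jupyter_ai_magics

%%ai chatgpt -f code
Write a Python function that reads a CSV file with pandas and prints summary statistics.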

Amazon CodeWhisperer Jupyter extension

Autocompletion is foundational for developers, and generative AI can significantly enhance the code suggestion experience. That is why we announced the general availability of Amazon CodeWhisperer earlier in 2023. CodeWhisperer is an AI coding companion that uses foundation models under the hood to radically improve developer productivity. It works by generating code suggestions in real time, based on developers’ comments in natural language and prior code in their integrated development environment (IDE).

Today, we are excited to announce that JupyterLab users can install and use the CodeWhisperer extension for free to generate real-time, single-line, or full-function code suggestions for Python notebooks in JupyterLab and Amazon SageMaker Studio. With CodeWhisperer, you can write a comment in natural language that outlines a specific task in English, such as “Create a pandas dataframe using a CSV file.” Based on this information, CodeWhisperer recommends one or more code snippets directly in the notebook that can accomplish the task. You can quickly and easily accept the top suggestion, view more suggestions, or continue writing your own code.
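
For illustration only, a comment like the one above might produce a suggestion along these lines; the exact code CodeWhisperer generates depends on your notebook context, and the file name here is a placeholder:

# Create a pandas dataframe using a CSV file
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder file name
print(df.head())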

During its preview, CodeWhisperer proved it is excellent at generating code to accelerate coding tasks, helping developers complete tasks an average of 57% faster. Additionally, developers who used CodeWhisperer were 27% more likely to complete a coding task successfully than those who did not. This is a giant leap forward in developer productivity. CodeWhisperer also includes a built-in reference tracker that detects whether a code suggestion might resemble open-source training data and can flag such suggestions.

Introducing new Jupyter extensions to build, train, and deploy ML at scale

Our mission at AWS is to democratize access to ML across industries. To achieve this goal, in 2017 we launched the Amazon SageMaker notebook instance—a fully managed compute instance running Jupyter that includes all the popular data science and ML packages. In 2019, we made a significant leap forward with the launch of SageMaker Studio, an IDE for ML built on top of JupyterLab that enables you to build, train, tune, debug, deploy, and monitor models from a single application. Tens of thousands of customers are using Studio to empower data science teams of all sizes. In 2021, we further extended the benefits of SageMaker to the community of millions of Jupyter users by launching Amazon SageMaker Studio Lab—a free notebook service, again based on JupyterLab, that includes free compute and persistent storage.

Today, we are excited to announce three new capabilities to help you scale ML development faster.

Notebooks scheduling

In 2022, we released a new capability to enable our customers to run notebooks as scheduled jobs in SageMaker Studio and Studio Lab. Thanks to this capability, many of our customers have saved time by not having to manually set up complex cloud infrastructure to scale their ML workflows.

We are excited to announce that the notebooks scheduling tool is now an open-source Jupyter extension that allows JupyterLab users to run and schedule notebooks on SageMaker anywhere JupyterLab runs. Users can select a notebook and automate it as a job that runs in a production environment via a simple yet powerful user interface. After a notebook is selected, the tool takes a snapshot of the entire notebook, packages its dependencies in a container, builds the infrastructure, runs the notebook as an automated job on a schedule set by the user, and deprovisions the infrastructure upon job completion. This reduces the time it takes to move a notebook to production from weeks to hours.

SageMaker open-source distribution

Data scientists and developers want to begin developing ML applications quickly, and it can be complex to install the mutually compatible versions of all the necessary packages. To remove the manual work and improve productivity, we are excited to announce a new open-source distribution that includes the most popular packages for ML, data science, and data visualization. This distribution includes deep learning frameworks like PyTorch, TensorFlow, and Keras; popular Python packages like NumPy, scikit-learn, and pandas; and IDEs like JupyterLab and the Jupyter Notebook. The distribution is versioned using SemVer and will be released on a regular basis moving forward. The container is available via Amazon ECR Public Gallery, and its source code is available on GitHub. This provides enterprises transparency into the packages and build process, thereby making it easier for them to reproduce, customize, or re-certify the distribution. The base image comes with pip and Conda/Mamba, so that data scientists can quickly install additional packages to meet their specific needs.
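
Because the base image ships with pip and Conda/Mamba, adding packages on top of the distribution takes a single notebook cell or terminal command; the package names below are arbitrary examples rather than part of the distribution:

# Install additional packages on top of the SageMaker open-source distribution
!pip install xgboost
!conda install -y -c conda-forge lightgbm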

Amazon CodeGuru Jupyter extension

Amazon CodeGuru Security now supports security and code quality scans in JupyterLab and SageMaker Studio. This new capability assists notebook users in detecting security vulnerabilities such as injection flaws, data leaks, weak cryptography, or missing encryption within the notebook cells. You can also detect many common issues that affect the readability, reproducibility, and correctness of computational notebooks, such as misuse of ML library APIs, invalid run order, and nondeterminism. When vulnerabilities or quality issues are identified in the notebook, CodeGuru generates recommendations that enable you to remediate those issues based on AWS security best practices.

Conclusion

We are excited to see how the Jupyter community will use these tools to scale development, increase productivity, and take advantage of generative AI to transform their industries. Check out the following resources to learn more about Jupyter on AWS and how to install and get started with these new tools:


About the Author

Brian Granger is a co-founder of Project Jupyter, a leader of the IPython project, and an active contributor to a number of other open-source projects focused on data science in Python. In 2016, he co-created the Altair package for statistical visualization in Python. He is an advisory board member of the NumFOCUS Foundation, a faculty fellow of the Cal Poly Center for Innovation and Entrepreneurship, and a Senior Principal Technologist at AWS.

Read More

Schedule your notebooks from any JupyterLab environment using the Amazon SageMaker JupyterLab extension

Jupyter notebooks are highly favored by data scientists for their ability to interactively process data, build ML models, and test these models by making inferences on data. However, there are scenarios in which data scientists may prefer to transition from interactive development on notebooks to batch jobs. Examples of such use cases include scaling up a feature engineering job that was previously tested on a small sample dataset on a small notebook instance, running nightly reports to gain insights into business metrics, and retraining ML models on a schedule as new data becomes available.

Migrating from interactive development on notebooks to batch jobs required you to copy code snippets from the notebook into a script, package the script with all its dependencies into a container, and schedule the container to run. To run this job repeatedly on a schedule, you had to set up, configure, and oversee cloud infrastructure to automate deployments, resulting in a diversion of valuable time away from core data science development activities.

To help simplify the process of moving from interactive notebooks to batch jobs, in December 2022, Amazon SageMaker Studio and Studio Lab introduced the capability to run notebooks as scheduled jobs, using notebook-based workflows. You can now use the same capability to run your Jupyter notebooks from any JupyterLab environment such as Amazon SageMaker notebook instances and JupyterLab running on your local machine. SageMaker provides an open-source extension that can be installed on any JupyterLab environment and be used to run notebooks as ephemeral jobs and on a schedule.

In this post, we show you how to run your notebooks from your local JupyterLab environment as scheduled notebook jobs on SageMaker.

Solution overview

The solution architecture for scheduling notebook jobs from any JupyterLab environment is shown in the following diagram. The SageMaker extension expects the JupyterLab environment to have valid AWS credentials and permissions to schedule notebook jobs. We discuss the steps for setting up credentials and AWS Identity and Access Management (IAM) permissions later in this post. In addition to the IAM user and assumed role session scheduling the job, you also need to provide a role for the notebook job instance to assume for access to your data in Amazon Simple Storage Service (Amazon S3) or to connect to Amazon EMR clusters as needed.

In the following sections, we show how to set up the architecture and install the open-source extension, run a notebook with the default configurations, and also use the advanced parameters to run a notebook with custom settings.

Prerequisites

For this post, we assume a locally hosted JupyterLab environment. You can follow the same installation steps for an environment hosted in the cloud as well.

The following steps assume that you already have a valid Python 3 and JupyterLab environment (this extension works with JupyterLab v3.0 or higher).

Install the AWS Command Line Interface (AWS CLI) if you don’t already have it installed. See Installing or updating the latest version of the AWS CLI for instructions.

Set up IAM credentials

You need an IAM user or an active IAM role session to submit SageMaker notebook jobs. To set up your IAM credentials, you can configure the AWS CLI with your AWS credentials for your IAM user, or assume an IAM role. For instructions on setting up your credentials, see Configuring the AWS CLI. The IAM principal (user or assumed role) needs the following permissions to schedule notebook jobs. To add the policy to your principal, refer to Adding IAM identity permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EventBridgeSchedule",
            "Effect": "Allow",
            "Action": [
                "events:TagResource",
                "events:DeleteRule",
                "events:PutTargets",
                "events:DescribeRule",
                "events:EnableRule",
                "events:PutRule",
                "events:RemoveTargets",
                "events:DisableRule"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
                }
            }
        },
        {
            "Sid": "IAMPassRoleToNotebookJob",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/SagemakerJupyterScheduler*",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": [
                        "sagemaker.amazonaws.com",
                        "events.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "IAMListRoles",
            "Effect": "Allow",
            "Action": "iam:ListRoles",
            "Resource": "*" 
        },
        {
            "Sid": "S3ArtifactsAccess",
            "Effect": "Allow",
            "Action": [
                "s3:PutEncryptionConfiguration",
                "s3:CreateBucket",
                "s3:PutBucketVersioning",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetEncryptionConfiguration",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-automated-execution-*"
            ]
        },
        {
            "Sid": "S3DriverAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::sagemakerheadlessexecution-*"
            ]
        },
        {
            "Sid": "SagemakerJobs",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeTrainingJob",
                "sagemaker:StopTrainingJob",
                "sagemaker:DescribePipeline",
                "sagemaker:CreateTrainingJob",
                "sagemaker:DeletePipeline",
                "sagemaker:CreatePipeline"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
                }
            }
        },
         {
            "Sid": "AllowSearch",
            "Effect": "Allow",
            "Action": "sagemaker:Search",
            "Resource": "*"
        },
         {
            "Sid": "SagemakerTags",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListTags",
                "sagemaker:AddTags"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:space/*",
                "arn:aws:sagemaker:*:*:training-job/*",
                "arn:aws:sagemaker:*:*:user-profile/*"
            ]
        },
        {
            "Sid": "ECRImage",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        }
   ]
}

If your notebook jobs need to be encrypted with customer managed AWS Key Management Service (AWS KMS) keys, add the policy statement allowing AWS KMS access as well. For a sample policy, see Install policies and permissions for local Jupyter environments.

Set up an IAM role for the notebook job instance

SageMaker requires an IAM role to run jobs on the user’s behalf, such as running the notebook job. This role should have access to the resources required for the notebook to complete the job, such as access to data in Amazon S3.

The scheduler extension automatically looks for IAM roles in the AWS account, with the prefix SagemakerJupyterScheduler to run the notebook jobs.

To create an IAM role, create an execution role for Amazon SageMaker with the AmazonSageMakerFullAccess policy. Name the role SagemakerJupyterSchedulerDemo, or provide a name with the expected prefix.

After the role is created, on the Trust relationships tab, choose Edit trust policy. Replace the existing trust policy with the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "events.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

The AmazonSageMakerFullAccess policy is fairly permissive and is generally preferred for experimentation and getting started with SageMaker. We strongly encourage you to create a minimum scoped policy for any future workloads in accordance with security best practices in IAM. For the minimum set of permissions required for the notebook job, see Install policies and permissions for local Jupyter environments.
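
If you prefer to script this setup rather than use the IAM console, the following Boto3 sketch creates the role with the trust policy above and attaches the AmazonSageMakerFullAccess managed policy; the role name is an example and must keep the SagemakerJupyterScheduler prefix, and you should swap in a scoped-down policy for production workloads:

import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com", "events.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Create the notebook job role with the expected name prefix
iam.create_role(
    RoleName="SagemakerJupyterSchedulerDemo",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the broad managed policy for experimentation; scope this down for real workloads
iam.attach_role_policy(
    RoleName="SagemakerJupyterSchedulerDemo",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)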

Install the extension

Open a terminal on your local machine and install the extension by running the following command:

pip install amazon-sagemaker-jupyter-scheduler

After this command runs, you can start JupyterLab by running jupyter lab.

If you’re installing the extension from within the JupyterLab terminal, restart the Jupyter server to load the extension. You can restart the Jupyter server by choosing Shut Down on the File menu from your JupyterLab, and starting JupyterLab from your command line by running jupyter lab.

Submit a notebook job

After the extension is installed on your environment, you can run any self-contained notebook as an ephemeral job. Let’s submit a simple “Hello world” notebook to run as a scheduled job.

  1. On the File menu, choose New and Notebook.
  2. Enter the following contents:
    # install packages
    !pip install pandas
    !pip install boto3
    
    # import block
    import boto3
    import pandas as pd
    
    # download a sample dataset
    s3 = boto3.client("s3")
    # Load the dataset
    file_name = "abalone.csv"
    s3.download_file(
        "sagemaker-sample-files", f"datasets/tabular/uci_abalone/abalone.csv", file_name
    )
    
    # display the dataset
    df = pd.read_csv(file_name)
    df.head()

After the extension is successfully installed, you’ll see the notebook scheduling icon on the notebook.

  3. Choose the icon to create a notebook job.

Alternatively, you can right-click on the notebook in your file explorer and choose Create notebook job.

  4. Provide the job name, input file, compute type, and additional parameters.
  5. Leave the remaining settings at the default and choose Create.

After the job is scheduled, you’re redirected to the Notebook Jobs tab, where you can view the list of notebook jobs and their status, and view the notebook output and logs after the job is complete. You can also access this notebook jobs window from the Launcher, as shown in the following screenshot.

Advanced configurations

When submitted from your local environment, notebooks automatically run on the SageMaker Base Python image, which is the official Python 3.8 image from Docker Hub with Boto3 and the AWS CLI included. In real-world cases, data scientists need to install specific packages or frameworks for their notebooks. There are three ways to achieve a reproducible environment:

  • The simplest option is to install the packages and frameworks directly in the first cell of your notebook.
  • You can also provide an initialization script in the Additional options section, pointing to a bash script on your local storage that is run by the notebook job when the notebook starts up. In the following section, we show an example of using initialization scripts to install packages.
  • Finally, if you want maximum flexibility in configuring your run environment, you can build your own custom image with a Python3 kernel, push the image to Amazon Elastic Container Registry (Amazon ECR), and provide the ECR image URI to your notebook job under Additional options. The ECR image should follow the requirements for SageMaker images, as listed in Custom SageMaker image specifications.

In addition, your enterprise might set up guardrails like running jobs in internet-free mode within an Amazon VPC, using a custom least-privilege role for the job, and enforcing encryption. You can specify such configurations for your notebook jobs in the Additional options section as well. For a detailed list of advanced configurations, see Additional options.

Add an initialization script

To showcase the initialization script, we now run the sample notebook for Studio notebook jobs available on GitHub. To run this notebook, you need to install the required packages through an initialization script. Complete the following steps:

  1. From your JupyterLab terminal, run the following command to download the file:
    curl https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/main/sagemaker-notebook-jobs/studio-scheduling/scheduled-example.ipynb > scheduled-example.ipynb

  2. On the File menu, choose New and Text file.
  3. Enter the following contents to your file, and save the file under the name init-script.sh:
    echo "Installing required packages"
    
    pip install --upgrade sagemaker
    pip install pandas numpy matplotlib scikit-learn

  4. Choose scheduled-example.ipynb from your file explorer to open the notebook.
  5. Choose the notebook job icon to schedule the notebook, and expand the Additional options section.
  6. For Initialization script location, enter the full path of your script.

You can also optionally customize the input and output S3 folders for your notebook job. SageMaker creates an input folder in a specified S3 location to store the input files, and creates an output S3 folder where the notebook outputs are stored. You can specify encryption, IAM role, and VPC configurations here. See Constraints and considerations for custom image and VPC specifications.

  7. For now, simply update the initialization script, choose Run now for the schedule, and choose Create.

When the job is complete, you can view the notebook with outputs and the output log under Output files, as shown in the following screenshot. In the output log, you should be able to see the initialization script being run before running the notebook.

To further customize your notebook job environment, you can use your own image by specifying the ECR URI of your custom image. If you’re bringing your own image, ensure you install a Python3 kernel when building your image. For a sample Dockerfile that can run a notebook using TensorFlow, see the following code:

FROM tensorflow/tensorflow:latest
RUN pip install ipykernel && \
        python -m ipykernel install --sys-prefix

Conclusion

In this post, we showed you how to run your notebooks from any JupyterLab environment hosted locally as SageMaker training jobs, using the SageMaker Jupyter scheduler extension. Being able to run notebooks in a headless manner, on a schedule, greatly reduces undifferentiated heavy lifting for the data scientists, such as refactoring notebooks to Python scripts, setting up Amazon EventBridge event triggers, and creating AWS Lambda functions or SageMaker pipelines to start the training jobs. SageMaker notebook jobs are run on demand, so you only pay for the time that the notebook is run, and you can use the notebook jobs extension to view the notebook outputs anytime from your JupyterLab environment. We encourage you to try scheduled notebook jobs, and connect with the Machine Learning & AI community on re:Post for feedback!


About the authors

Bhadrinath Pani is a Software Development Engineer at Amazon Web Services, working on Amazon SageMaker interactive ML products, with over 12 years of experience in software development across domains like automotive, IoT, AR/VR, and computer vision. Currently, his main focus is on developing machine learning tools aimed at simplifying the experience for data scientists. In his free time, he enjoys spending time with his family and exploring the beauty of the Pacific Northwest.

Durga Sury is an ML Solutions Architect on the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and long walks with her 5-year-old husky.

Read More

Announcing provisioned concurrency for Amazon SageMaker Serverless Inference

Amazon SageMaker Serverless Inference allows you to serve model inference requests in real time without having to explicitly provision compute instances or configure scaling policies to handle traffic variations. You can let AWS handle the undifferentiated heavy lifting of managing the underlying infrastructure and save costs in the process. A Serverless Inference endpoint spins up the relevant infrastructure, including the compute, storage, and network, to stage your container and model for on-demand inference. You can simply select the amount of memory to allocate and the number of max concurrent invocations to have a production-ready endpoint to service inference requests.

With on-demand serverless endpoints, if your endpoint doesn’t receive traffic for a while and then suddenly receives new requests, it can take some time for your endpoint to spin up the compute resources to process the requests. This is called a cold start. A cold start can also occur if your concurrent requests exceed the current concurrent request usage. With provisioned concurrency on Serverless Inference, you can mitigate cold starts and get predictable performance characteristics for your workloads. You can add provisioned concurrency to your serverless endpoints, and for the predefined amount of provisioned concurrency, Amazon SageMaker will keep the endpoints warm and ready to respond to requests instantaneously. In addition, you can now use Application Auto Scaling with provisioned concurrency to address inference traffic dynamically based on target metrics or a schedule.

In this post, we discuss what provisioned concurrency and Application Auto Scaling are, how to use them, and some best practices and guidance for your inference workloads.

Provisioned concurrency with Application Auto Scaling

With provisioned concurrency on Serverless Inference endpoints, SageMaker manages the infrastructure that can serve multiple concurrent requests without incurring cold starts. SageMaker uses the ProvisionedConcurrency value specified in your endpoint configuration when you create or update an endpoint. With provisioned concurrency enabled on the serverless endpoint, you can expect SageMaker to serve the number of requests you have set without a cold start. See the following code:

endpoint_config_response_pc = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name_pc,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
                #Provisioned Concurrency value setting example 
                "ProvisionedConcurrency": 1
            },
        },
    ],
)

By understanding your workloads and knowing how many cold starts you want to mitigate, you can set this to a preferred value.

Serverless Inference with provisioned concurrency also supports Application Auto Scaling, which allows you to optimize costs based on your traffic profile or schedule to dynamically set the amount of provisioned concurrency. This can be set in a scaling policy, which can be applied to an endpoint.

To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. Define the scaling policy as a JSON block in a text file. You can then use that text file when invoking the AWS Command Line Interface (AWS CLI) or the Application Auto Scaling API. To define a target-tracking scaling policy for a serverless endpoint, use the SageMakerVariantProvisionedConcurrencyUtilization predefined metric:

{
    "TargetValue": 0.5,
    "PredefinedMetricSpecification": 
    {
        "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
    },
    "ScaleOutCooldown": 1,
    "ScaleInCooldown": 1
}
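
To attach a policy like this programmatically, register the endpoint variant as a scalable target and then apply the target-tracking configuration. The following Boto3 sketch uses the same predefined metric; the endpoint and variant names and the capacity bounds are placeholders:

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/MyEndpoint/variant/MyVariant"  # placeholder endpoint/variant

# Register the variant's provisioned concurrency as a scalable target
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
    MinCapacity=1,
    MaxCapacity=10,
)

# Apply the target-tracking policy shown above
aas.put_scaling_policy(
    PolicyName="ProvisionedConcurrencyTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 0.5,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
        },
        "ScaleOutCooldown": 1,
        "ScaleInCooldown": 1,
    },
)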

To specify a scaling policy based on a schedule (for example, every day at 12:15 PM UTC), you can modify the scaling policy as well. If the current capacity is below the value specified for MinCapacity, Application Auto Scaling scales out to the value specified by MinCapacity. The following code is an example of how to set this via the AWS CLI:

aws application-autoscaling put-scheduled-action \
  --service-namespace sagemaker --schedule 'cron(15 12 * * ? *)' \
  --scheduled-action-name 'ScheduledScalingTest' \
  --resource-id endpoint/MyEndpoint/variant/MyVariant \
  --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
  --scalable-target-action 'MinCapacity=10'

With Application Auto Scaling, you can ensure that your workloads can mitigate cold starts, meet business objectives, and optimize cost in the process.

You can monitor your endpoints and provisioned concurrency specific metrics using Amazon CloudWatch. There are four metrics to focus on that are specific to provisioned concurrency:

  • ServerlessProvisionedConcurrencyExecutions – The number of concurrent runs handled by the endpoint
  • ServerlessProvisionedConcurrencyUtilization – The number of concurrent runs divided by the allocated provisioned concurrency
  • ServerlessProvisionedConcurrencyInvocations – The number of InvokeEndpoint requests handled by the provisioned concurrency
  • ServerlessProvisionedConcurrencySpilloverInvocations – The number of InvokeEndpoint requests not handled by provisioned concurrency, which are instead handled by on-demand Serverless Inference

By monitoring and making decisions based on these metrics, you can tune your configuration with cost and performance in mind and optimize your SageMaker Serverless Inference endpoint.
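
For example, you can retrieve recent utilization data with Boto3 and use it to decide whether to adjust your provisioned concurrency setting. This sketch assumes the metrics are published in the AWS/SageMaker namespace with EndpointName and VariantName dimensions, and the endpoint and variant names are placeholders:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average provisioned concurrency utilization over the past hour, in 5-minute buckets
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ServerlessProvisionedConcurrencyUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-serverless-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "byoVariant"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
print(response["Datapoints"])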

For SageMaker Serverless Inference, you can choose either a SageMaker-provided container or bring your own. SageMaker provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning (ML) frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a list of available SageMaker images, see Available Deep Learning Containers Images. If you’re bringing your own container, you must modify it to work with SageMaker. For more information about bringing your own container, see Adapting Your Own Inference Container.

Notebook example

Creating a serverless endpoint with provisioned concurrency is a very similar process to creating an on-demand serverless endpoint. For this example, we use a model trained with the SageMaker built-in XGBoost algorithm. We work with the Boto3 Python SDK to create three SageMaker inference entities:

  • SageMaker model – Create a SageMaker model that packages your model artifacts for deployment on SageMaker using the CreateModel API. You can also complete this step via AWS CloudFormation using the AWS::SageMaker::Model resource.
  • SageMaker endpoint configuration – Create an endpoint configuration using the CreateEndpointConfig API and the new ServerlessConfig options, or by selecting the serverless option on the SageMaker console. You can also complete this step via AWS CloudFormation using the AWS::SageMaker::EndpointConfig resource. You must specify the memory size, which, at a minimum, should be as big as your runtime model object, and the maximum concurrency, which represents the max concurrent invocations for a single endpoint. For our endpoint with provisioned concurrency enabled, we specify that parameter in the endpoint configuration step, taking into account that the value must be greater than 0 and less than or equal to max concurrency.
  • SageMaker endpoint – Finally, using the endpoint configuration that you created in the previous step, create your endpoint using either the SageMaker console or programmatically using the CreateEndpoint API. You can also complete this step via AWS CloudFormation using the AWS::SageMaker::Endpoint resource.

In this post, we don’t cover the training and SageMaker model creation; you can find all these steps in the complete notebook. We focus primarily on how you can specify provisioned concurrency in the endpoint configuration and compare performance metrics for an on-demand serverless endpoint with a provisioned concurrency enabled serverless endpoint.
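
For reference, the model creation step comes down to a single CreateModel call with the same SageMaker Boto3 client (client) used in the rest of this example; the image URI, model artifact location, and execution role below are placeholders you would replace with your own values:

# Placeholders: replace with your own ECR image URI, S3 model artifact, and IAM role ARN
image_uri = "<xgboost-inference-image-uri>"
model_url = "<s3-uri-of-model.tar.gz>"
role = "<sagemaker-execution-role-arn>"
model_name = "xgboost-serverless-model"

create_model_response = client.create_model(
    ModelName=model_name,
    PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_url},
    ExecutionRoleArn=role,
)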

Configure a SageMaker endpoint

In the endpoint configuration, you can specify the serverless configuration options. For Serverless Inference, there are two inputs required, and they can be configured to meet your use case:

  • MaxConcurrency – This can be set from 1–200
  • MemorySizeInMB – This can be set to one of the following values: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB

For this example, we create two endpoint configurations: one on-demand serverless endpoint and one provisioned concurrency enabled serverless endpoint. You can see an example of both configurations in the following code:

xgboost_epc_name_pc = "xgboost-serverless-epc-pc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
xgboost_epc_name_on_demand = "xgboost-serverless-epc-on-demand" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response_pc = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name_pc,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
                # Providing Provisioned Concurrency in EPC
                "ProvisionedConcurrency": 1
            },
        },
    ],
)

endpoint_config_response_on_demand = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name_on_demand,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
            },
        },
    ],
)

print("Endpoint Configuration Arn Provisioned Concurrency: " + endpoint_config_response_pc["EndpointConfigArn"])
print("Endpoint Configuration Arn On Demand Serverless: " + endpoint_config_response_on_demand["EndpointConfigArn"])

With SageMaker Serverless Inference with a provisioned concurrency endpoint, you also need to set the following, which is reflected in the preceding code:

  • ProvisionedConcurrency – This value can be set from 1 to the value of your MaxConcurrency

Create SageMaker on-demand and provisioned concurrency endpoints

We use our two different endpoint configurations to create two endpoints: an on-demand serverless endpoint with no provisioned concurrency enabled and a serverless endpoint with provisioned concurrency enabled. See the following code:

endpoint_name_pc = "xgboost-serverless-ep-pc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name_pc,
    EndpointConfigName=xgboost_epc_name_pc,
)

print("Endpoint Arn Provisioned Concurrency: " + create_endpoint_response["EndpointArn"])

endpoint_name_on_demand = "xgboost-serverless-ep-on-demand" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name_on_demand,
    EndpointConfigName=xgboost_epc_name_on_demand,
)

print("Endpoint Arn Provisioned Concurrency: " + create_endpoint_response["EndpointArn"])

Compare invocation and performance

Next, we can invoke both endpoints with the same payload:

%%time

#On Demand Serverless Endpoint Test
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name_on_demand,
    Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
    ContentType="text/csv",
)

print(response["Body"].read())

%%time

#Provisioned Endpoint Test
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name_pc,
    Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
    ContentType="text/csv",
)

print(response["Body"].read())

When timing both cells for the first request, we immediately notice a drastic improvement in end-to-end latency with the provisioned concurrency enabled serverless endpoint. To validate this, we can send five requests to each endpoint with 10-minute intervals between each request. The 10-minute gap ensures that the on-demand endpoint is cold, so we can fairly compare cold start performance between the on-demand and provisioned concurrency serverless endpoints. See the following code:

import time
import numpy as np
print("Testing cold start for serverless inference with PC vs no PC")

pc_times = []
non_pc_times = []

# ~50 minutes
for i in range(5):
    time.sleep(600)
    start_pc = time.time()
    pc_response = runtime.invoke_endpoint(
        EndpointName=endpoint_name_pc,
        Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
        ContentType="text/csv",
    )
    end_pc = time.time() - start_pc
    pc_times.append(end_pc)

    start_no_pc = time.time()
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name_on_demand,
        Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
        ContentType="text/csv",
    )
    end_no_pc = time.time() - start_no_pc
    non_pc_times.append(end_no_pc)

pc_cold_start = np.mean(pc_times)
non_pc_cold_start = np.mean(non_pc_times)

print("Provisioned Concurrency Serverless Inference Average Cold Start: {}".format(pc_cold_start))
print("On Demand Serverless Inference Average Cold Start: {}".format(non_pc_cold_start))

We can then plot these average end-to-end latency values across five requests and see that the average cold start for provisioned concurrency was approximately 200 milliseconds end to end as opposed to nearly 6 seconds with the on-demand serverless endpoint.
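
A quick way to see the difference is to plot the latencies recorded in the loop above; a minimal matplotlib sketch:

import matplotlib.pyplot as plt

request_numbers = range(1, len(pc_times) + 1)
plt.plot(request_numbers, pc_times, marker="o", label="Provisioned concurrency")
plt.plot(request_numbers, non_pc_times, marker="o", label="On-demand serverless")
plt.xlabel("Request number")
plt.ylabel("End-to-end latency (seconds)")
plt.title("Cold start latency comparison")
plt.legend()
plt.show()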

When to use Serverless Inference with provisioned concurrency

Provisioned concurrency is a cost-effective solution for low throughput and spiky workloads requiring low latency guarantees. Provisioned concurrency is suitable when throughput is low and you want to reduce costs compared with instance-based hosting while still getting predictable performance, or for workloads with predictable traffic bursts and low latency requirements. For example, a chatbot application run by a tax filing software company typically sees high demand during the last week of March from 10:00 AM to 5:00 PM because it’s close to the tax filing deadline. You can choose on-demand Serverless Inference for the remaining part of the year to serve requests from end-users, but for the last week of March, you can add provisioned concurrency to handle the spike in demand. As a result, you can reduce costs during idle time while still meeting your performance goals.

On the other hand, if your inference workload is steady, has high throughput (enough traffic to keep the instances saturated and busy), has a predictable traffic pattern, and requires ultra-low latency, or it includes large or complex models that require GPUs, Serverless Inference isn’t the right option for you, and you should deploy on real-time inference. Synchronous use cases with burst behavior that don’t require performance guarantees are more suitable for using on-demand Serverless Inference. The traffic patterns and the right hosting option (serverless or real-time inference) are depicted in the following figures:

  • Real-time inference endpoint – Traffic is mostly steady with predictable peaks. The high throughput is enough to keep the instances behind the auto scaling group busy and saturated. This will allow you to efficiently use the existing compute and be cost-effective along with providing ultra-low latency guarantees. For the predictable peaks, you can choose to use the scheduled auto scaling policy in SageMaker for real-time inference endpoints. Read more about the best practices for selecting the right auto scaling policy at Optimize your machine learning deployments with auto scaling on Amazon SageMaker.

  • On-demand Serverless Inference – This option is suitable for traffic with unpredictable peaks, but the ML application is tolerant to cold start latencies. To help determine whether a serverless endpoint is the right deployment option from a cost and performance perspective, use the SageMaker Serverless Inference benchmarking toolkit, which tests different endpoint configurations and compares the most optimal one against a comparable real-time hosting instance.

  • Serverless Inference with provisioned concurrency – This option is suitable for the traffic pattern with predictable peaks but is otherwise low or intermittent. This option provides you additional low latency guarantees for ML applications that can’t tolerate cold start latencies.

Use the following factors to determine which hosting option (real time over on-demand Serverless Inference over Serverless Inference with provisioned concurrency) is right for your ML workloads:

  • Throughput – This represents requests per second or any other metrics that represent the rate of incoming requests to the inference endpoint. We define the high throughput in the following diagram as any throughput that is enough to keep the instances behind the auto scaling group busy and saturated to get the most out of your compute.
  • Traffic pattern – This represents the type of traffic, including traffic with predictable or unpredictable spikes. If the spikes are unpredictable but the ML application needs low-latency guarantees, Serverless Inference with provisioned concurrency might be cost-effective if it’s a low throughput application.
  • Response time – If the ML application needs low-latency guarantees, use Serverless Inference with provisioned concurrency for low throughput applications with unpredictable traffic spikes. If the application can tolerate cold start latencies and has low throughput with unpredictable traffic spikes, use on-demand Serverless Inference.
  • Cost – Consider the total cost of ownership, including infrastructure costs (compute, storage, networking), operational costs (operating, managing, and maintaining the infrastructure), and security and compliance costs.

The following figure illustrates this decision tree.

Best practices

With Serverless Inference with provisioned concurrency, you should still adhere to the same best practices as for workloads that don’t use provisioned concurrency:

  • Avoid installing packages and other operations during container startup and ensure containers are already in their desired state to minimize cold start time when being provisioned and invoked while staying under the 10 GB maximum supported container size. To monitor how long your cold start time is, you can use the CloudWatch metric OverheadLatency to monitor your serverless endpoint. This metric tracks the time it takes to launch new compute resources for your endpoint.
  • Set the MemorySizeInMB value to be large enough to meet your needs as well as increase performance. Larger values will also devote more compute resources. At some point, a larger value will have diminishing returns.
  • Set the MaxConcurrency to accommodate the peaks of traffic while considering the resulting cost.
  • We recommend creating only one worker in the container and only loading one copy of the model. This is unlike real-time endpoints, where some SageMaker containers may create a worker for each vCPU to process inference requests and load the model in each worker.
  • Use Application Auto Scaling to automate your provisioned concurrency setting based on target metrics or schedule. By doing so, you can have finer-grained, automated control of the amount of the provisioned concurrency used with your SageMaker serverless endpoint.

In addition, with the ability to configure ProvisionedConcurrency, you should set this value to the integer representing how many cold starts you would like to avoid when requests come in a short time frame after a period of inactivity. Using the metrics in CloudWatch can help you tune this value to be optimal based on preferences.

Pricing

As with on-demand Serverless Inference, when provisioned concurrency is enabled, you pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. You also pay for provisioned concurrency usage based on the memory configured, duration provisioned, and amount of concurrency enabled.

Pricing can be broken down into two components: provisioned concurrency charges and inference duration charges. For more details, refer to Amazon SageMaker Pricing.

Conclusion

SageMaker Serverless Inference with provisioned concurrency provides a very powerful capability for workloads when cold starts need to be mitigated and managed. With this capability, you can better balance cost and performance characteristics while providing a better experience to your end-users. We encourage you to consider whether provisioned concurrency with Application Auto Scaling is a good fit for your workloads, and we look forward to your feedback in the comments!

Stay tuned for follow-up posts where we will provide more insight into the benefits, best practices, and cost comparisons using Serverless Inference with provisioned concurrency.


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures and new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Rishabh Ray Chaudhury is a Senior Product Manager with Amazon SageMaker, focusing on Machine Learning inference. He is passionate about innovating and building new experiences for Machine Learning customers on AWS to help scale their workloads. In his spare time, he enjoys traveling and cooking. You can find him on LinkedIn.

Shruti Sharma is a Sr. Software Development Engineer in AWS SageMaker team. Her current work focuses on helping developers efficiently host machine learning models on Amazon SageMaker. In her spare time she enjoys traveling, skiing and playing chess. You can find her on LinkedIn.

Hao Zhu is a Software Development Engineer with Amazon Web Services. In his spare time, he loves to hit the slopes and ski. He also enjoys exploring new places, trying different foods, experiencing different cultures, and is always up for a new adventure.

Read More

Accelerate protein structure prediction with the ESMFold language model on Amazon SageMaker

Proteins drive many biological processes, such as enzyme activity, molecular transport, and cellular support. The three-dimensional structure of a protein provides insight into its function and how it interacts with other biomolecules. Experimental methods to determine protein structure, such as X-ray crystallography and NMR spectroscopy, are expensive and time-consuming.

In contrast, recently-developed computational methods can rapidly and accurately predict the structure of a protein from its amino acid sequence. These methods are critical for proteins that are difficult to study experimentally, such as membrane proteins, the targets of many drugs. One well-known example of this is AlphaFold, a deep learning-based algorithm celebrated for its accurate predictions.

ESMFold is another highly-accurate, deep learning-based method developed to predict protein structure from its amino acid sequence. ESMFold uses a large protein language model (pLM) as a backbone and operates end to end. Unlike AlphaFold2, it doesn’t need a lookup or Multiple Sequence Alignment (MSA) step, nor does it rely on external databases to generate predictions. Instead, the development team trained the model on millions of protein sequences from UniRef. During training, the model developed attention patterns that elegantly represent the evolutionary interactions between amino acids in the sequence. This use of a pLM instead of an MSA enables up to 60 times faster prediction times than other state-of-the-art models.

In this post, we use the pre-trained ESMFold model from Hugging Face with Amazon SageMaker to predict the heavy chain structure of trastuzumab, a monoclonal antibody first developed by Genentech for the treatment of HER2-positive breast cancer. Quickly predicting the structure of this protein could be useful if researchers wanted to test the effect of sequence modifications. This could potentially lead to improved patient survival or fewer side effects.

This post provides an example Jupyter notebook and related scripts in the following GitHub repository.

Prerequisites

We recommend running this example in an Amazon SageMaker Studio notebook running the PyTorch 1.13 Python 3.9 CPU-optimized image on an ml.r5.xlarge instance type.

Visualize the experimental structure of trastuzumab

To begin, we use the biopython library and a helper script to download the trastuzumab structure from the RCSB Protein Data Bank:

from Bio.PDB import PDBList, MMCIFParser
from prothelpers.structure import atoms_to_pdb

target_id = "1N8Z"
pdbl = PDBList()
filename = pdbl.retrieve_pdb_file(target_id, pdir="data")
parser = MMCIFParser()
structure = parser.get_structure(target_id, filename)
pdb_string = atoms_to_pdb(structure)

Next, we use the py3Dmol library to visualize the structure as an interactive 3D visualization:

import py3Dmol

view = py3Dmol.view()
view.addModel(pdb_string)
view.setStyle({'chain':'A'},{"cartoon": {'color': 'orange'}})
view.setStyle({'chain':'B'},{"cartoon": {'color': 'blue'}})
view.setStyle({'chain':'C'},{"cartoon": {'color': 'green'}})
view.show()

The following figure represents the 3D protein structure 1N8Z from the Protein Data Bank (PDB). In this image, the trastuzumab light chain is displayed in orange, the heavy chain is blue (with the variable region in light blue), and the HER2 antigen is green.

We’ll first use ESMFold to predict the structure of the heavy chain (Chain B) from its amino acid sequence. Then, we will compare the prediction to the experimentally determined structure shown above.

Predict the trastuzumab heavy chain structure from its sequence using ESMFold

Let’s use the ESMFold model to predict the structure of the heavy chain and compare it to the experimental result. To start, we’ll use a pre-built notebook environment in Studio that comes with several important libraries, like PyTorch, pre-installed. Although we could use an accelerated instance type to improve the performance of our notebook analysis, we’ll instead use a non-accelerated instance and run the ESMFold prediction on a CPU.

First, we load the pre-trained ESMFold model and tokenizer from Hugging Face Hub:

from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)

Next, we copy the model to our device (CPU in this case) and set some model parameters:

import torch

device = torch.device("cpu")
model.esm = model.esm.float()
model = model.to(device)
model.trunk.set_chunk_size(64)
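
The rest of this example uses an experimental_sequence variable holding the amino acid sequence of the trastuzumab heavy chain (chain B). One way to build it from the structure we parsed earlier, sketched here with Biopython (assuming chain B of the first model and skipping waters and heteroatoms), is:

from Bio.SeqUtils import seq1

# Build the one-letter amino acid sequence of the heavy chain (chain B)
heavy_chain = structure[0]["B"]
experimental_sequence = "".join(
    seq1(residue.get_resname())
    for residue in heavy_chain.get_residues()
    if residue.id[0] == " "  # keep standard residues only
)
print(experimental_sequence[:60])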

To prepare the protein sequence for analysis, we need to tokenize it. This translates the amino acid symbols (EVQLV…) into a numerical format that the ESMFold model can understand (6,19,5,10,19,…):

tokenized_input = tokenizer([experimental_sequence], return_tensors="pt", add_special_tokens=False)["input_ids"]
tokenized_input = tokenized_input.to(device)

Next, we copy the tokenized input to the model, make a prediction, and save the result to a file:

with torch.no_grad():
    notebook_prediction = model.infer_pdb(experimental_sequence)

with open("data/prediction.pdb", "w") as f:
    f.write(notebook_prediction)

This takes about 3 minutes on a non-accelerated instance type, like an r5.

We can check the accuracy of the ESMFold prediction by comparing it to the experimental structure. We do this using the US-Align tool developed by the Zhang Lab at the University of Michigan:

from prothelpers.usalign import tmscore

tmscore("data/prediction.pdb", "data/experimental.pdb", pymol="data/superimposed")
PDBchain1 PDBchain2 TM-Score
data/prediction.pdb:A data/experimental.pdb:B 0.802

The template modeling score (TM-score) is a metric for assessing the similarity of protein structures. A score of 1.0 indicates a perfect match. Scores above 0.7 indicate that proteins share the same backbone structure. Scores above 0.9 indicate that the proteins are functionally interchangeable for downstream use. With a TM-score of 0.802, the ESMFold prediction would likely be appropriate for applications like structure scoring or ligand binding experiments, but may not be suitable for use cases like molecular replacement that require extremely high accuracy.

We can validate this result by visualizing the aligned structures. The two structures show a high, but not perfect, degree of overlap. Protein structure prediction is a rapidly evolving field, and many research teams are developing ever more accurate algorithms!

Deploy ESMFold as a SageMaker inference endpoint

Running model inference in a notebook is fine for experimentation, but what if you need to integrate your model with an application? Or an MLOps pipeline? In this case, a better option is to deploy your model as an inference endpoint. In the following example, we’ll deploy ESMFold as a SageMaker real-time inference endpoint on an accelerated instance. SageMaker real-time endpoints provide a scalable, cost-effective, and secure way to deploy and host machine learning (ML) models. With automatic scaling, you can adjust the number of instances running the endpoint to meet the demands of your application, optimizing costs and ensuring high availability.

The pre-built SageMaker container for Hugging Face makes it easy to deploy deep learning models for common tasks. However, for novel use cases like protein structure prediction, we need to define a custom inference.py script to load the model, run the prediction, and format the output. This script includes much of the same code we used in our notebook. We also create a requirements.txt file to define some Python dependencies for our endpoint to use. You can see the files we created in the GitHub repository.
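The following is a minimal sketch of what such an inference.py script could look like, using the model_fn and predict_fn hooks of the SageMaker Hugging Face inference toolkit. It reuses the prediction logic from the notebook, but the payload handling is simplified and may differ from the actual script in the repository.

# inference.py (sketch) - load ESMFold once, return a PDB string per request
import torch
from transformers import EsmForProteinFolding

def model_fn(model_dir):
    # Called once when the endpoint container starts; model_dir holds the staged artifact
    model = EsmForProteinFolding.from_pretrained(model_dir)
    model = model.to("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()
    return model

def predict_fn(data, model):
    # data is the deserialized JSON payload: an amino acid sequence string
    sequence = data["inputs"] if isinstance(data, dict) else data
    with torch.no_grad():
        pdb_string = model.infer_pdb(sequence)
    return [pdb_string]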

In the following figure, the experimental (blue) and predicted (red) structures of the trastuzumab heavy chain are very similar, but not identical.

After we’ve created the necessary files in the code directory, we deploy our model using the SageMaker HuggingFaceModel class. This uses a pre-built container to simplify the process of deploying Hugging Face models to SageMaker. Note that it may take 10 minutes or more to create the endpoint, depending on the availability of ml.g4dn instance types in our Region.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from datetime import datetime

huggingface_model = HuggingFaceModel(
    model_data=model_artifact_s3_uri,  # Previously staged in S3
    name="esmfold-v1-model-" + datetime.now().strftime("%Y%m%d%s"),
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    role=role,
    source_dir="code",
    entry_point="inference.py",
)

rt_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.2xlarge",
    endpoint_name="my-esmfold-endpoint",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

When the endpoint deployment is complete, we can resubmit the protein sequence and display the first few rows of the prediction:

endpoint_prediction = rt_predictor.predict(experimental_sequence)[0]
print(endpoint_prediction[:900])

Because we deployed our endpoint to an accelerated instance, the prediction should only take a few seconds. Each row in the result corresponds to a single atom and includes the amino acid identity, three spatial coordinates, and a pLDDT score representing the prediction confidence at that location.

PDB_GROUP ID ATOM_LABEL RES_ID CHAIN_ID SEQ_ID CARTN_X CARTN_Y CARTN_Z OCCUPANCY PLDDT ATOM_ID
ATOM 1 N GLU A 1 14.578 -19.953 1.47 1 0.83 N
ATOM 2 CA GLU A 1 13.166 -19.595 1.577 1 0.84 C
ATOM 3 CA GLU A 1 12.737 -18.693 0.423 1 0.86 C
ATOM 4 CB GLU A 1 12.886 -18.906 2.915 1 0.8 C
ATOM 5 O GLU A 1 13.417 -17.715 0.106 1 0.83 O
ATOM 6 cg GLU A 1 11.407 -18.694 3.2 1 0.71 C
ATOM 7 cd GLU A 1 11.141 -18.042 4.548 1 0.68 C
ATOM 8 OE1 GLU A 1 12.108 -17.805 5.307 1 0.68 O
ATOM 9 OE2 GLU A 1 9.958 -17.767 4.847 1 0.61 O
ATOM 10 N VAL A 2 11.678 -19.063 -0.258 1 0.87 N
ATOM 11 CA VAL A 2 11.207 -18.309 -1.415 1 0.87 C

Using the same method as before, we see that the notebook and endpoint predictions are identical.

PDBchain1 PDBchain2 TM-Score
data/endpoint_prediction.pdb:A data/prediction.pdb:A 1.0

As observed in the following figure, the ESMFold predictions generated in-notebook (red) and by the endpoint (blue) show perfect alignment.

Clean up

To avoid further charges, we delete our inference endpoint and test data:

rt_predictor.delete_endpoint()
bucket = boto_session.resource("s3").Bucket(bucket)
bucket.objects.filter(Prefix=prefix).delete()
os.system("rm -rf data obsolete code")

Summary

Computational protein structure prediction is a critical tool for understanding the function of proteins. In addition to basic research, algorithms like AlphaFold and ESMFold have many applications in medicine and biotechnology. The structural insights generated by these models help us better understand how biomolecules interact. This can then lead to better diagnostic tools and therapies for patients.

In this post, we show how to deploy the ESMFold protein language model from Hugging Face Hub as a scalable inference endpoint using SageMaker. For more information about deploying Hugging Face models on SageMaker, refer to Use Hugging Face with Amazon SageMaker. You can also find more protein science examples in the Awesome Protein Analysis on AWS GitHub repo. Please leave us a comment if there are any other examples you’d like to see!


About the Authors

Brian Loyal is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 17 years’ experience in biotechnology and machine learning, and is passionate about helping customers solve genomic and proteomic challenges. In his spare time, he enjoys cooking and eating with his friends and family.

Shamika Ariyawansa is an AI/ML Specialist Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He passionately works with customers to accelerate their AI and ML adoption by providing technical guidance and helping them innovate and build secure cloud solutions on AWS. Outside of work, he loves skiing and off-roading.

Yanjun Qi is a Senior Applied Science Manager at the AWS Machine Learning Solution Lab. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.

Read More

Transform, analyze, and discover insights from unstructured healthcare data using Amazon HealthLake

Transform, analyze, and discover insights from unstructured healthcare data using Amazon HealthLake

Healthcare data is complex and siloed, and exists in various formats. An estimated 80% of data within organizations is considered to be unstructured or “dark” data that is locked inside text, emails, PDFs, and scanned documents. This data is difficult to interpret or analyze programmatically and limits how organizations can derive insights from it and serve their customers more effectively. The rapid rate of data generation means that organizations that aren’t investing in document automation risk getting stuck with legacy processes that are manual, slow, error prone, and difficult to scale.

In this post, we propose a solution that automates ingestion and transformation of previously untapped PDFs and handwritten clinical notes and data. We explain how to extract information from customer clinical data charts using Amazon Textract, then use the raw extracted text to identify discrete data elements using Amazon Comprehend Medical. We store the final output in Fast Healthcare Interoperability Resources (FHIR) compatible format in Amazon HealthLake, making it available for downstream analytics.

Solution overview

AWS provides a variety of services and solutions for healthcare providers to unlock the value of their data. For our solution, we process a small sample of documents through Amazon Textract and load that extracted data as appropriate FHIR resources in Amazon HealthLake. We create a custom process for FHIR conversion and test it end to end.

The data is first loaded into DocumentReference. Amazon HealthLake then creates system-generated resources after processing this unstructured text in DocumentReference and loads it into Condition, MedicationStatement, and Observation resources. We identify a few data fields within FHIR resources like patient ID, date of service, provider type, and name of medical facility.

A MedicationStatement is a record of a medication that is being consumed by a patient. It may indicate that the patient is taking the medication now, has taken the medication in the past, or will be taking the medication in the future. A common scenario where this information is captured is during the history-taking process in the course of a patient visit or stay. The source of medication information could be the patient’s memory, a prescription bottle, or from a list of medications the patient, clinician, or other party maintains.

Observations are a central element in healthcare, used to support diagnosis, monitor progress, determine baselines and patterns, and even capture demographic characteristics. Most observations are simple name/value pair assertions with some metadata, but some observations group other observations together logically, or could even be multi-component observations.

The Condition resource is used to record detailed information about a condition, problem, diagnosis, or other event, situation, issue, or clinical concept that has risen to a level of concern. The condition could be a point-in-time diagnosis in the context of an encounter, an item on the practitioner’s problem list, or a concern that doesn’t exist on the practitioner’s problem list.

The following diagram shows the workflow to migrate unstructured data into FHIR for AI and machine learning (ML) analysis in Amazon HealthLake.

The workflow steps are as follows:

  1. A document is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The document upload in Amazon S3 triggers an AWS Lambda function.
  3. The Lambda function sends the image to Amazon Textract (a minimal sketch of this step is shown after this list).
  4. Amazon Textract extracts text from the image and stores the output in a separate Amazon Textract output S3 bucket.
  5. The final result is stored as specific FHIR resources (the extracted text is loaded in DocumentReference as base64 encoded text) in Amazon HealthLake to extract meaning from the unstructured data with integrated Amazon Comprehend Medical for easy search and querying.
  6. Users can create meaningful analyses and run interactive analytics using Amazon Athena.
  7. Users can build visualizations, perform ad hoc analysis, and quickly get business insights using Amazon QuickSight.
  8. Users can make predictions with health data using Amazon SageMaker ML models.
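The following is a minimal sketch of the Lambda function behind steps 2, 3, and 4. The bucket name and prefix are assumptions for illustration; the actual function is deployed by the AWS CDK stack in the GitHub repository.

# Sketch of the S3-triggered Lambda handler (bucket name and prefix are assumptions)
import urllib.parse
import boto3

textract = boto3.client("textract")

TEXTRACT_OUTPUT_BUCKET = "textract-output-bucket"  # assumed name of the output bucket

def lambda_handler(event, context):
    # Triggered by the document upload event in Amazon S3
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Start an asynchronous Textract job; its output lands in the Textract output bucket
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        OutputConfig={"S3Bucket": TEXTRACT_OUTPUT_BUCKET, "S3Prefix": "textract-output"},
    )
    return {"JobId": response["JobId"]}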

Prerequisites

This post assumes familiarity with the following services:

By default, the integrated Amazon Comprehend Medical natural language processing (NLP) capability within Amazon HealthLake is disabled in your AWS account. To enable it, submit a support case with your account ID, AWS Region, and Amazon HealthLake data store ARN. For more information, refer to How do I turn on HealthLake’s integrated natural language processing feature.

Refer to the GitHub repo for more deployment details.

Deploy the solution architecture

To set up the solution, complete the following steps:

  1. Clone the GitHub repo, run cdk deploy PdfMapperToFhirWorkflow from your command prompt or terminal and follow the README file. Deployment will complete in approximately 30 minutes.
  2.  On the Amazon S3 console, navigate to the bucket starting with pdfmappertofhirworkflow-, which was created as part of cdk deploy.
  3.  Inside the bucket, create a folder called uploads and upload the sample PDF (SampleMedicalRecord.pdf).

As soon as the document upload is successful, it will trigger the pipeline, and you can start seeing data in Amazon HealthLake, which you can query using several AWS tools.

Query the data

To explore your data, complete the following steps:

  1. On the CloudWatch console, search for the HealthlakeTextract log group.
  2. In the log group details, note down the unique ID of the document you processed.
  3. On the Amazon HealthLake console, choose Data Stores in the navigation pane.
  4. Select your data store and choose Run query.
  5. For Query type, choose Search with GET.
  6. For Resource type, choose DocumentReference.
  7. For Search parameters, enter the parameter as relates to and the value as DocumentReference/Unique ID.
  8. Choose Run query.
  9. In the Response body section, minimize the resource sections to just view the six resources that were created for the six-page PDF document.
  10. The following screenshot shows the integrated analysis with Amazon Comprehend Medical and NLP enabled. The screenshot on the left is the source PDF; the screenshot on the right is the NLP result from Amazon HealthLake.
  11. You can also run a query with Query type set as Read and Resource type set as Condition using the appropriate resource ID.

    The following screenshot shows the query results.
  12. On the Athena console, run the following query:
    SELECT * FROM "healthlakestore"."documentreference";

Similarly, you can query MedicationStatement, Condition, and Observation resources.

Clean up

After you’re done using this solution, run cdk destroy PdfMapperToFhirWorkflow to ensure you don’t incur additional charges. For more information, refer to AWS CDK Toolkit (cdk command).

Conclusion

AWS AI services and Amazon HealthLake can help store, transform, query, and analyze insights from unstructured healthcare data. Although this post only covered a PDF clinical chart, you could extend the solution to other types of healthcare PDFs, images, and handwritten notes. After the data is extracted into text form, parsed into discrete data elements using Amazon Comprehend Medical, and stored in Amazon HealthLake, it could be further enriched by downstream systems to drive meaningful and actionable healthcare information and ultimately improve patient health outcomes.

The proposed solution doesn’t require the deployment and maintenance of server infrastructure. All services are either managed by AWS or serverless. With AWS’s pay-as-you-go billing model and its depth and breadth of services, the cost and effort of initial setup and experimentation is significantly lower than traditional on-premises alternatives.

Additional resources

For more information about Amazon HealthLake, refer to the following:


About the Authors

Shravan Vurputoor is a Senior Solutions Architect at AWS. As a trusted customer advocate, he helps organizations understand best practices around advanced cloud-based architectures, and provides advice on strategies to help drive successful business outcomes across a broad set of enterprise customers through his passion for educating, training, designing, and building cloud solutions. In his spare time, he enjoys reading, spending time with his family, and cooking.

Rafael M. Koike is a Principal Solutions Architect at AWS supporting Enterprise customers in the South East, and is part of the Storage and Security Technical Field Community. Rafael has a passion to build, and his expertise in security, storage, networking, and application development has been instrumental in helping customers move to the cloud securely and fast.

Randheer Gehlot is a Principal Customer Solutions Manager at AWS. Randheer is passionate about AI/ML and its application within HCLS industry. As an AWS builder, he works with large enterprises to design and rapidly implement strategic migrations to the cloud and build modern, cloud-native solutions.

Read More

Host ML models on Amazon SageMaker using Triton: Python backend

Host ML models on Amazon SageMaker using Triton: Python backend

Amazon SageMaker provides a number of options for users who are looking for a solution to host their machine learning (ML) models. Of these options, one of the key features that SageMaker provides is real-time inference. Real-time inference workloads can have varying levels of requirements and service level agreements (SLAs) in terms of latency and throughput. Regardless of the use case, SageMaker offers a number of options that allow you to find the right balance of cost and performance to meet your business objectives.

There are many factors to consider when choosing the right real-time inference option for your business. For example, your business may have a model that must meet the strictest SLAs for latency and throughput with very predictable performance. For that use case, SageMaker provides SageMaker single model endpoints (SMEs), which allow you to deploy a single ML model against a logical endpoint. For other use cases, you can choose to manage cost and performance using SageMaker multi-model endpoints (MMEs), which allow you to specify multiple models to host behind a logical endpoint. Regardless of the option you may choose, SageMaker endpoints provide a scalable mechanism for even the most demanding enterprise users while providing value in a plethora of features, including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, see CloudWatch Metrics for Multi-Model Endpoint Deployments).

One option supported by SageMaker single and multi-model endpoints is NVIDIA Triton Inference Server. Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the Python backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and achieve great results.

SageMaker provides Triton via SMEs and MMEs

The Python backend is available through SageMaker, which enables you to deploy both single and multi-model endpoints with NVIDIA Triton Inference Server. Triton supports instance types that support GPUs, CPUs, and AWS Inferentia chips, which allow you to maximize the performance for your workloads. The following diagram illustrates the NVIDIA Triton Inference Server architecture.

Triton Architecture

Inference requests arrive at the server via HTTP/REST or through the C API, and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis and can help tune performance. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inference using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is an object-based repository of the models, powered by Amazon Simple Storage Service (Amazon S3), that Triton makes available for inferencing.

For MMEs, SageMaker takes care of traffic shaping to the endpoint and maintains optimal model copies on GPU instances for the best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high memory utilization, SageMaker unloads the least popular models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU at the same time with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon S3 to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPUs, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
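As a sketch, the following code attaches a target tracking policy on SageMakerVariantInvocationsPerInstance with Application Auto Scaling. The endpoint name, variant name, capacity limits, and target value are assumptions you would tune for your own workload.

import boto3

autoscaling = boto3.client("application-autoscaling")

# Assumes a variant named "AllTraffic", as in the endpoint configuration later in this post
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="triton-mme-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Illustrative target of 200 invocations per instance per minute
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)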

Python backend runtime architecture

As the name suggests, the Python backend is for running models that are written and run in the Python language. Various use cases fall into this category, such as preprocessing or postprocessing steps composing a model ensemble. In other cases, the Python backend may be used as a wrapper to call a Python-based model or framework. Later in this post, we show an example of how you can use the Python backend to call a PyTorch T5 model. This may not always be the most performant option, but it showcases the flexibility that the Python backend provides.

The Python backend creates a runtime environment that creates Python processes using the host’s CPU and memory. You can still attain GPU acceleration if it’s exposed by a Python front end of the framework running the inference. No additional GPU acceleration occurs by using the Python backend itself, but there should be no compatibility errors for any Python process.

On SageMaker, the Triton Python backend allocates 16 MB of shared memory by default and grows it in increments of only 1 MB. However, you can change this by setting the SageMaker environment variables SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE and SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE. These variables are important because the Python backend exchanges tensors through shared memory.
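For example, you could raise both values in the container definition used to create the SageMaker model; the byte sizes below are illustrative only.

# Container definition with larger shared memory settings (values are illustrative)
container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "67108864",  # 64 MB
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "16777216",   # 16 MB
    },
}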

The following diagram shows the ensemble scheduler runtime architecture so that you can fine-tune the memory areas, including CPU addressable shared memory, that are used for inter-process communication between C++ and the Python process for exchanging tensors (input/output).

Architecture Diagram

You can monitor resource utilization using CloudWatch, which has native integration with SageMaker.

To get started with the Python backend, you need to create a Python file that has a structure similar to the following code, which dictates the structure as well as how to interact with parameters and return values. Take note of the point in the lifecycle that the methods are called.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when the model is being
        loaded, and lets the model inspect or modify its own configuration.

        Parameters
        ----------
        auto_complete_model_config : pb_utils.ModelConfig
          An object containing the existing model configuration. You can build
          upon the configuration given by this object when setting the
          properties for this model.

        Returns
        -------
        pb_utils.ModelConfig
          An object containing the auto-completed model configuration
        """
        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` is optional and allows you to
        do any initialization before execution. This function allows
        the model to initialize any state associated with the model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device
            ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """

By utilizing the model.py methods, you can take on the responsibility to load models on a specific device (CPU or GPU) by writing the code in the model.py file explicitly. Although other backends that Triton provides allow you to specify a KIND attribute in the config.pbtxt file to determine if the backend runs on CPU or GPU, it’s not applicable for the Python backend because the model is loaded in the respective device depending on the code written in model.py, like .to(device) in torch. It’s important to note that if you explicitly load artifacts into memory or create temporary files, you reclaim your resources by cleaning up, which usually occurs in the finalize method. Otherwise, you may experience unwanted situations such as memory leaks.
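For example, a model.py could place its model on the device that Triton assigns to the instance and release it in finalize. The following is a sketch only; the TorchScript artifact name is an assumption for illustration.

# Sketch: explicit device placement in initialize and cleanup in finalize
import json
import torch

class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])
        # The Python backend ignores the KIND attribute, so we pick the device ourselves
        if args["model_instance_kind"] == "GPU":
            self.device = f"cuda:{args['model_instance_device_id']}"
        else:
            self.device = "cpu"
        self.model = torch.jit.load("model.pt").to(self.device)  # hypothetical artifact

    def finalize(self):
        # Drop the reference so memory can be reclaimed when the model is unloaded
        self.model = None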

SageMaker notebook walkthrough

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s Python backend, which allows you to write your model logic in Python. For example, you can use this backend to run preprocessing and postprocessing code written in Python, or run a PyTorch Python script directly (instead of first converting it to TorchScript and then using the PyTorch backend). The python_backend GitHub repo contains the documentation and source for the backend.

In this section, we walk you through the example notebook, which demonstrates how to use NVIDIA Triton Inference Server on an Amazon SageMaker MME with the GPU feature to deploy a T5 NLP model for translation.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the prebuilt Triton ECR image:

import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
import os
 
os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
s3_client = boto3.client("s3")
bucket = sagemaker.Session().default_bucket()
prefix = "nlp-mme-gpu"
 
# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}
 
region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ValueError("UNSUPPORTED REGION")
 
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

Generate model artifacts

In this example, we host a pre-trained T5-small Hugging Face PyTorch model using Triton’s Python backend. Here we have the Python script model.py, which implements all the logic to initialize the T5 model and run inference for the translation task. There are three main functions in the script:

  • initialize – The initialize function is called one time when the model is being loaded. Implementing initialize is optional. initialize allows you to do any necessary initializations before running the model. This function allows the model to initialize any state associated with this model.
  • execute – The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function, you’re given a list of InferenceRequest objects. There are two modes of implementing this function: default and decoupled mode. The default mode is the most generic way to implement your model and requires the execute function to return exactly one response per request. The decoupled mode allows you to send multiple responses for a request or not send any responses for a request. The mode you choose should depend on your use case—that is, whether or not you want to return decoupled responses from this model. In this example notebook, we use the default mode (a sketch of this function for the T5 task follows this list).
  • finalize – Implementing finalize is optional. This function allows you to do any cleanup necessary before the model is unloaded from Triton Inference Server.
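The following is a sketch of what execute in default mode could look like for the T5 translation task. The tensor names match the model configuration shown later; the exact logic in the repository’s model.py may differ.

# Sketch of execute in default mode: exactly one InferenceResponse per InferenceRequest.
# This is a method of TritonPythonModel; initialize is assumed to set self.model and self.device.
import numpy as np
import torch
import triton_python_backend_utils as pb_utils

def execute(self, requests):
    responses = []
    for request in requests:
        input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids").as_numpy()
        attention_mask = pb_utils.get_input_tensor_by_name(request, "attention_mask").as_numpy()

        output_ids = self.model.generate(
            input_ids=torch.as_tensor(input_ids).to(self.device),
            attention_mask=torch.as_tensor(attention_mask).to(self.device),
        )

        output_tensor = pb_utils.Tensor("output", output_ids.cpu().numpy().astype(np.int32))
        responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
    return responses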

Build the model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the T5 model is as follows:

Directory structure

Note that Triton has specific requirements for the model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, that is 1, representing version 1 of our T5 PyTorch model. Each model is run by a specific backend, so each version subdirectory must contain the model artifact required by that backend. Here, we are using the Python backend, and it requires the Python file that is used for serving (model.py). If we were using a PyTorch backend, a model.pt file would be required. For more details on naming conventions for model files, refer to Model Files.
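Concretely, the repository for our T5 model looks roughly like the following (hf_env.tar.gz is the packaged Conda environment created in a later step):

model_repository/
    t5_pytorch/
        config.pbtxt
        hf_env.tar.gz        # Conda environment, added in a later step
        1/
            model.py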

Every Python Triton model must provide a config.pbtxt file describing the model configuration. To use this backend, you must set the backend field of your model config.pbtxt file to python. The following code shows how to define the config file for the T5 PyTorch model being served through Triton’s Python backend:

name: "t5_pytorch"
backend: "python"
max_batch_size: 8
input: [
    {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "attention_mask"
        data_type: TYPE_INT32
        dims: [ -1 ]
    }
]
output [
  {
    name: "output"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
}
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}

In this configuration, we define the parameters section to provide an environment path. We need this because serving the Hugging Face T5 PyTorch model with Triton’s Python backend requires PyTorch and the Hugging Face transformers library as dependencies, so we create a custom run environment for the Python backend that includes them. The alternative is to install Python and all the dependencies in the local environment; a custom run environment is only needed when you want portability across systems that might not already have the right Python environment to run the inference, which is the case for the prebuilt SageMaker Triton container. Currently, the Python backend only supports conda-pack for this purpose. conda-pack ensures that your Conda environment is portable. We follow the instructions from the Triton documentation for packaging dependencies to be used in the Python backend as the Conda environment TAR file. Running the bash script create_hf_env.sh creates the Conda environment containing PyTorch and Hugging Face transformers and packages it as a TAR file, and then we move it into the t5_pytorch model directory:

!bash workspace/create_hf_env.sh
!mv hf_env.tar.gz model_repository/t5_pytorch/

After we create the TAR file from the Conda environment, we place it in the model folder. The following code in the model config.pbtxt file tells the Python backend to use this custom environment for your model:

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}

Here, $$TRITON_MODEL_DIRECTORY helps provide the environment path relative to the model folder in the model repository, and is resolved to $pwd/model_repository/t5_pytorch. Finally, hf_env.tar.gz is the name we gave to our Conda environment file.

Next, we package our model as *.tar.gz files for uploading to Amazon S3:

!tar -C model_repository/ -czf t5_pytorch.tar.gz t5_pytorch
model_uri_t5_pytorch = sagemaker_session.upload_data(path="t5_pytorch.tar.gz", key_prefix=prefix)

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker multi-model endpoint. To create a SageMaker endpoint, we need to first create the SageMaker model object and endpoint configuration.

First, we need to define the serving container. In the container definition, set ModelDataUrl to the S3 directory that contains all the models the SageMaker multi-model endpoint will load and serve, and set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. See the following code:

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we are deploying to a g5.2xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.2xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

We use this configuration to create a new SageMaker endpoint and wait for the deployment to finish:

endpoint_name = f"{prefix}-ep-{ts}-2xl"
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

The status will change to InService after the deployment is successful.

Invoke your model hosted on the SageMaker endpoint

After the endpoint is running, we can use some sample raw data to perform inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send inference requests to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. See the following code:

texts_to_translate = ["translate English to German: The house is wonderful."]
batch_size = len(texts_to_translate)

t5_payload = get_text_payload("t5-small", texts_to_translate)
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(t5_payload),
    TargetModel="t5_pytorch.tar.gz",
)
response_body = json.loads(response["Body"].read().decode("utf8"))
output_ids = np.array(response_body["outputs"][0]["data"]).reshape(batch_size, -1)
t5_tokenizer = get_tokenizer("t5-small")
decoded_outputs = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for text in decoded_outputs:
    print(text, "\n")

The notebook can be found in the GitHub repository.

Best practices

When using the Python backend, it can sometimes be complicated to optimize the workload for throughput and latency. You should consider the options available through the SageMaker and Triton environment variables that we discussed previously in regards to batch sizes, max delay, and other factors. In addition, you should be aware of the Python backend-specific configuration and the configuration of the underlying framework. The following are some best practices:

  • If using a PyTorch (or any other deep learning framework) module in the Python backend, consider experimenting with different values for the intra-op and inter-op thread pool sizes (see the sketch after this list). Because each Python backend model instance runs in a separate process, limiting the number of threads per process prevents over-subscribing the system resources when scaling up the instance count.
  • Even though the Python backend is highly flexible, it performs some extra data copies that can impact inference performance. For the best performance on GPU, consider using Triton’s TensorRT backend when possible.
  • When using Python backend models in an ensemble, refer to Interoperability and GPU Support for a possible zero-copy transfer of Python backend tensors to other frameworks.
  • You can also increase the count value in the instance_group section of the config.pbtxt file to add worker processes and increase throughput. Be aware that increasing this value will increase resource consumption, including CPU and GPU utilization.
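As a minimal illustration of the first point, a PyTorch-based model.py could cap the per-process thread pools early in initialize; the values below are placeholders to tune.

import torch

# Limit per-process PyTorch thread pools (placeholder values; tune for your instance type)
torch.set_num_threads(2)          # intra-op parallelism
torch.set_num_interop_threads(2)  # inter-op parallelism; set before any inference work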

You can explore these options and parameters to get the desired performance characteristics you seek. As always, be aware that resources such as processor or memory consumption can change and should be monitored so you can fine-tune and optimize inference performance.

Conclusion

In this post, we dove deep into the Python backend that Triton Inference Server supports on SageMaker. This backend provides for both CPU and GPU acceleration of your models that are written and run in the Python language. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to use single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Read More

Securing MLflow in AWS: Fine-grained access control with AWS native services

Securing MLflow in AWS: Fine-grained access control with AWS native services

With Amazon SageMaker, you can manage the whole end-to-end machine learning (ML) lifecycle. It offers many native capabilities to help manage aspects of ML workflows, such as experiment tracking and model governance via the model registry. This post provides a solution tailored to customers that are already using MLflow, an open-source platform for managing ML workflows.

In a previous post, we discussed MLflow and how it can run on AWS and be integrated with SageMaker—in particular, when tracking training jobs as experiments and deploying a model registered in MLflow to the SageMaker managed infrastructure. However, the open-source version of MLflow doesn’t provide native user access control mechanisms for multiple tenants on the tracking server. This means any user with access to the server has admin rights and can modify experiments, model versions, and stages. This can be a challenge for enterprises in regulated industries that need to keep strong model governance for audit purposes.

In this post, we address these limitations by implementing the access control outside of the MLflow server and offloading authentication and authorization tasks to Amazon API Gateway, where we implement fine-grained access control mechanisms at the resource level using Identity and Access Management (IAM). By doing so, we can achieve robust and secure access to the MLflow server from both SageMaker managed infrastructure and Amazon SageMaker Studio, without having to worry about credentials and all the complexity behind credential management. The modular design proposed in this architecture makes modifying access control logic straightforward without impacting the MLflow server itself. Lastly, thanks to SageMaker Studio extensibility, we further improve the data scientist experience by making MLflow accessible within Studio, as shown in the following screenshot.

MLflow in Studio

MLflow has merged the feature that enables request signing using AWS credentials into the upstream repository for its Python SDK, improving the integration with SageMaker. The changes to the MLflow Python SDK have been available to everyone since MLflow version 1.30.0.

At a high level, this post demonstrates the following:

  • How to deploy an MLflow server on a serverless architecture running on a private subnet not accessible directly from the outside. For this task, we build on top of the following GitHub repo: Manage your machine learning lifecycle with MLflow and Amazon SageMaker.
  • How to expose the MLflow server via private integrations to an API Gateway, and implement a secure access control for programmatic access via the SDK and browser access via the MLflow UI.
  • How to log experiments and runs, and register models to an MLflow server from SageMaker using the associated SageMaker execution roles to authenticate and authorize requests, and how to authenticate via Amazon Cognito to the MLflow UI. We provide examples demonstrating experiment tracking and using the model registry with MLflow from SageMaker training jobs and Studio, respectively, in the provided notebook.
  • How to use MLflow as a centralized repository in a multi-account setup.
  • How to extend Studio to enhance the user experience by rendering MLflow within Studio. For this task, we show how to take advantage of Studio extensibility by installing a JupyterLab extension.

Now let’s dive deeper into the details.

Solution overview

You can think about MLflow as three different core components working side by side:

  • A REST API for the backend MLflow tracking server
  • SDKs for you to programmatically interact with the MLflow tracking server APIs from your model training code
  • A React front end for the MLflow UI to visualize your experiments, runs, and artifacts

At a high level, the architecture we have envisioned and implemented is shown in the following figure.

Architecture

Prerequisites

Before deploying the solution, make sure you have access to an AWS account with admin permissions.

Deploy the solution infrastructure

To deploy the solution described in this post, follow the detailed instructions in the GitHub repository README. To automate the infrastructure deployment, we use the AWS Cloud Development Kit (AWS CDK). The AWS CDK is an open-source software development framework to create AWS CloudFormation stacks through automatic CloudFormation template generation. A stack is a collection of AWS resources that can be programmatically updated, moved, or deleted. AWS CDK constructs are the building blocks of AWS CDK applications, representing the blueprint to define cloud architectures.

We combine four stacks:

  • The MLFlowVPCStack stack performs the following actions:
    • Deploys the MLflow tracking server on AWS Fargate in a dedicated VPC with private subnets, using Aurora Serverless as the backend store and Amazon S3 as the artifact store.
  • The RestApiGatewayStack stack performs the following actions:
    • Exposes the MLflow server via AWS PrivateLink to a REST API Gateway.
    • Deploys an Amazon Cognito user pool to manage the users accessing the UI (still empty after the deployment).
    • Deploys an AWS Lambda authorizer to verify the JWT token with the Amazon Cognito user pool ID keys and returns IAM policies to allow or deny a request. This authorization strategy is applied to <MLFlow-Tracking-Server-URI>/*.
    • Adds an IAM authorizer. This will be applied to the <MLFlow-Tracking-Server-URI>/api/* resources, which will take precedence over the previous one.
  • The AmplifyMLFlowStack stack performs the following action:
    • Creates an app linked to the patched MLflow repository in AWS CodeCommit to build and deploy the MLflow UI.
  • The SageMakerStudioUserStack stack performs the following actions:
    • Deploys a Studio domain (if one doesn’t exist yet).
    • Adds three users, each one with a different SageMaker execution role implementing a different access level:
      • mlflow-admin – Has admin-like permission to any MLflow resources.
      • mlflow-reader – Has read-only admin permissions to any MLflow resources.
      • mlflow-model-approver – Has the same permissions as mlflow-reader, plus can register new models from existing runs in MLflow and promote existing registered models to new stages.

Deploy the MLflow tracking server on a serverless architecture

Our aim is to have a reliable, highly available, cost-effective, and secure deployment of the MLflow tracking server. Serverless technologies are the perfect candidate to satisfy all these requirements with minimal operational overhead. To achieve that, we build a Docker container image for the MLflow experiment tracking server, and we run it on AWS Fargate on Amazon ECS in its dedicated VPC running on a private subnet. MLflow relies on two storage components: the backend store and the artifact store. For the backend store, we use Aurora Serverless, and for the artifact store, we use Amazon S3. For the high-level architecture, refer to Scenario 4: MLflow with remote Tracking Server, backend and artifact stores. Extensive details on how to do this task can be found in the following GitHub repo: Manage your machine learning lifecycle with MLflow and Amazon SageMaker.

Secure MLflow via API Gateway

At this point, we still don’t have an access control mechanism in place. As a first step, we expose MLflow to the outside world using AWS PrivateLink, which establishes a private connection between the VPC and other AWS services, in our case API Gateway. Incoming requests to MLflow are then proxied via a REST API Gateway, giving us the possibility to implement several mechanisms to authorize incoming requests. For our purposes, we focus on only two:

  • Using IAM authorizers – With IAM authorizers, the requester must have the right IAM policy assigned to access the API Gateway resources. Every request sent via HTTP must carry authentication information in the form of an AWS Signature Version 4 signature.
  • Using Lambda authorizers – This offers the greatest flexibility because it leaves full control over how a request can be authorized. Eventually, the Lambda authorizer must return an IAM policy, which in turn will be evaluated by API Gateway on whether the request should be allowed or denied.

For the full list of supported authentication and authorization mechanisms in API Gateway, refer to Controlling and managing access to a REST API in API Gateway.

MLflow Python SDK authentication (IAM authorizer)

The MLflow experiment tracking server implements a REST API to interact in a programmatic way with the resources and artifacts. The MLflow Python SDK provides a convenient way to log metrics, runs, and artifacts, and it interfaces with the API resources hosted under the namespace <MLflow-Tracking-Server-URI>/api/. We configure API Gateway to use the IAM authorizer for resource access control on this namespace, thereby requiring every request to be signed with AWS Signature Version 4.

To facilitate the request signing process, starting from MLflow 1.30.0, this capability can be seamlessly enabled. Make sure that the requests_auth_aws_sigv4 library is installed in the system and set the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True. More information can be found in the official MLflow documentation.

At this point, the MLflow SDK only needs AWS credentials. Because requests_auth_aws_sigv4 uses Boto3 to retrieve credentials, we know that it can load credentials from the instance metadata when an IAM role is associated with an Amazon Elastic Compute Cloud (Amazon EC2) instance (for other ways to supply credentials to Boto3, see Credentials). This means that it can also load AWS credentials when running from a SageMaker managed instance from the associated execution role, as discussed later in this post.

Configure IAM policies to access MLflow APIs via API Gateway

You can use IAM roles and policies to control who can invoke resources on API Gateway. For more details and IAM policy reference statements, refer to Control access for invoking an API.

The following code shows an example IAM policy that grants the caller permissions to all methods on all resources on the API Gateway shielding MLflow, practically giving admin access to the MLflow server:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/*/*",
      "Effect": "Allow"
    }
  ]
}

If we want a policy that allows a user read-only access to all resources, the IAM policy would look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": [
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/GET/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/runs/search/",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/experiments/search"
      ],
      "Effect": "Allow"
    }
  ]
}

Another example might be a policy to give specific users permissions to register models to the model registry and promote them later to specific stages (staging, production, and so on):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": [
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/GET/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/runs/search/",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/experiments/search",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/model-versions/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/registered-models/*"
      ],
      "Effect": "Allow"
    }
  ]
}

MLflow UI authentication (Lambda authorizer)

Browser access to the MLflow server is handled by the MLflow UI implemented with React. The MLflow UI hasn’t been designed to support authenticated users. Implementing a robust login flow might appear a daunting task, but luckily we can rely on the Amplify UI React components for authentication, which greatly reduces the effort to create a login flow in a React application, using Amazon Cognito for the identities store.

Amazon Cognito allows us to manage our own user base and also supports third-party identity federation, making it feasible to build, for example, ADFS federation (see Building ADFS Federation for your Web App using Amazon Cognito User Pools for more details). Tokens issued by Amazon Cognito must be verified on API Gateway. Simply verifying the token is not enough for fine-grained access control; therefore, the Lambda authorizer gives us the flexibility to implement the logic we need. We can then build our own Lambda authorizer to verify the JWT token and generate the IAM policies that let API Gateway allow or deny the request. The following diagram illustrates the MLflow login flow.

MLflow UI auth steps
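The following sketch shows the general shape of such a Lambda authorizer, assuming a TOKEN authorizer. Token verification against the Amazon Cognito JSON Web Key Set and the group-to-policy mapping are intentionally simplified; the actual implementation is in the GitHub repository.

# Sketch of a Lambda authorizer: verify the JWT, then return an IAM policy
import os

def lambda_handler(event, context):
    token = event.get("authorizationToken", "").replace("Bearer ", "")

    # verify_cognito_jwt is a placeholder for validating the token signature,
    # expiry, and audience against the Amazon Cognito user pool JWKS
    claims = verify_cognito_jwt(token, user_pool_id=os.environ["COGNITO_USER_POOL_ID"])

    # Map the user's Cognito group to the API methods they may call (illustrative)
    if "admins" in claims.get("cognito:groups", []):
        resource = f"{os.environ['API_ARN']}/*/*"
    else:
        resource = f"{os.environ['API_ARN']}/GET/*"

    return {
        "principalId": claims["sub"],
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": "Allow", "Resource": resource}
            ],
        },
    }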

For more information about the actual code changes, refer to the patch file cognito.patch, applicable to MLflow version 2.3.1.

This patch introduces two capabilities:

  • Add the Amplify UI components and configure the Amazon Cognito details via environment variables that implement the login flow
  • Extract the JWT from the session and send it to the MLflow server as a bearer token in the Authorization header

Although maintaining diverging code from the upstream always adds more complexity than relying on the upstream, it’s worth noting that the changes are minimal because we rely on the Amplify React UI components.

With the new login flow in place, let’s create the production build for our updated MLflow UI. AWS Amplify Hosting is an AWS service that provides a git-based workflow for CI/CD and hosting of web apps. The build step in the pipeline is defined by the buildspec.yaml, where we can inject as environment variables details about the Amazon Cognito user pool ID, the Amazon Cognito identity pool ID, and the user pool client ID needed by the Amplify UI React component to configure the authentication flow. The following code is an example of the buildspec.yaml file:

version: "1.0"
applications:
  - frontend:
      phases:
        preBuild:
          commands:
          - fallocate -l 4G /swapfile
          - chmod 600 /swapfile
          - mkswap /swapfile
          - swapon /swapfile
          - swapon -s
          - yarn install
        build:
          commands:
          - echo "REACT_APP_REGION=$REACT_APP_REGION" >> .env
          - echo "REACT_APP_COGNITO_USER_POOL_ID=$REACT_APP_COGNITO_USER_POOL_ID" >> .env
          - echo "REACT_APP_COGNITO_IDENTITY_POOL_ID=$REACT_APP_COGNITO_IDENTITY_POOL_ID" >> .env
          - echo "REACT_APP_COGNITO_USER_POOL_CLIENT_ID=$REACT_APP_COGNITO_USER_POOL_CLIENT_ID" >> .env
          - yarn run build
        artifacts:
          baseDirectory: build
          files:
          - "**/*"

Securely log experiments and runs using the SageMaker execution role

One of the key aspects of the solution discussed here is the secure integration with SageMaker. SageMaker is a managed service, and as such, it performs operations on your behalf. What SageMaker is allowed to do is defined by the IAM policies attached to the execution role that you associate to a SageMaker training job, or that you associate to a user profile working from Studio. For more information on the SageMaker execution role, refer to SageMaker Roles.

By configuring the API Gateway to use IAM authentication on the <MLFlow-Tracking-Server-URI>/api/* resources, we can define a set of IAM policies on the SageMaker execution role that will allow SageMaker to interact with MLflow according to the access level specified.
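
As a sketch of how such a policy could be attached, the following snippet adds a hypothetical inline policy to an execution role that allows invoking every MLflow API path; the role name, policy name, and wildcard resource are illustrative and should be narrowed to the access level you actually want to grant.

import json

import boto3

iam = boto3.client("iam")

# Illustrative policy: allow the execution role to invoke any method on the
# MLflow REST API exposed by API Gateway
mlflow_api_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "execute-api:Invoke",
            "Resource": "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/*/api/2.0/mlflow/*"
        }
    ]
}

iam.put_role_policy(
    RoleName="<SAGEMAKER-EXECUTION-ROLE-NAME>",  # hypothetical role name
    PolicyName="mlflow-api-access",              # hypothetical policy name
    PolicyDocument=json.dumps(mlflow_api_policy),
)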

When setting the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True while working in Studio or in a SageMaker training job, the MLflow Python SDK will automatically sign all requests, which will be validated by the API Gateway:

import os
import mlflow

os.environ['MLFLOW_TRACKING_AWS_SIGV4'] = "True"
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)

Test the SageMaker execution role with the MLflow SDK

If you access the Studio domain that was generated, you will find three users:

  • mlflow-admin – Associated to an execution role with similar permissions as the user in the Amazon Cognito group admins
  • mlflow-reader – Associated to an execution role with similar permissions as the user in the Amazon Cognito group readers
  • mlflow-model-approver – Associated to an execution role with similar permissions as the user in the Amazon Cognito group model-approvers

To test the three different roles, refer to the labs provided as part of this sample for each user profile.

The following diagram illustrates the workflow for Studio user profiles and SageMaker job authentication with MLflow.

SageMaker logging to MLflow

Similarly, when running SageMaker jobs on the SageMaker managed infrastructure, if you set the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True and the SageMaker execution role passed to the jobs has the correct IAM policy to access the API Gateway, you can securely interact with your MLflow tracking server without needing to manage the credentials yourself. When running SageMaker training jobs and initializing an estimator class, you can pass environment variables that SageMaker will inject and make available to the training script, as shown in the following code:

from sagemaker.sklearn.estimator import SKLearn

environment={
  "AWS_DEFAULT_REGION": region,
  "MLFLOW_EXPERIMENT_NAME": experiment_name,
  "MLFLOW_TRACKING_URI": tracking_uri,
  "MLFLOW_AMPLIFY_UI_URI": mlflow_amplify_ui,
  "MLFLOW_TRACKING_AWS_SIGV4": "true",
  "MLFLOW_USER": user
}

estimator = SKLearn(
  entry_point='train.py',
  source_dir='source_dir',
  role=role,
  metric_definitions=metric_definitions,
  hyperparameters=hyperparameters,
  instance_count=1,
  instance_type='ml.m5.large',
  framework_version='1.0-1',
  base_job_name='mlflow',
  environment=environment
)

Visualize runs and experiments from the MLflow UI

After the first deployment is complete, let's populate the Amazon Cognito user pool with three users, each belonging to a different group, to test the permissions we have implemented. You can use the add_users_and_groups.py script to seed the user pool. After running the script, if you check the Amazon Cognito user pool on the Amazon Cognito console, you should see the three users created.
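
The following is a minimal sketch of what such a seeding script might do, assuming the user pool and the three groups already exist; the user names, temporary password, and pool ID placeholder are illustrative only.

import boto3

cognito = boto3.client("cognito-idp")
user_pool_id = "<COGNITO_USER_POOL_ID>"  # placeholder for your user pool ID

# Illustrative users and their group assignments
users = {
    "mlflow-admin": "admins",
    "mlflow-reader": "readers",
    "mlflow-model-approver": "model-approvers",
}

for username, group in users.items():
    # Create the user with a temporary password (to be changed at first login)
    cognito.admin_create_user(
        UserPoolId=user_pool_id,
        Username=username,
        TemporaryPassword="ChangeMe123!",
    )
    # Add the user to the Amazon Cognito group that drives authorization
    cognito.admin_add_user_to_group(
        UserPoolId=user_pool_id,
        Username=username,
        GroupName=group,
    )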

Cognito users

On the REST API Gateway side, the Lambda authorizer will first verify the signature of the token using the Amazon Cognito user pool key and verify the claims. Only after that will it extract the Amazon Cognito group the user belongs to from the claim in the JWT token (cognito:groups) and apply different permissions based on the group that we have programmed.

For our specific case, we have three groups:

  • admins – Can see and can edit everything
  • readers – Can only see everything
  • model-approvers – The same as readers, plus can register models, create versions, and promote model versions to the next stage

Depending on the group, the Lambda authorizer generates different IAM policies. This is just one example of how authorization can be achieved; with a Lambda authorizer, you can implement any logic you need. We have opted to build the IAM policy at run time in the Lambda function itself; alternatively, you can pregenerate appropriate IAM policies, store them in Amazon DynamoDB, and retrieve them at run time according to your own business logic. Keep in mind that if you want to restrict only a subset of actions, you need to be aware of the MLflow REST API definition.

You can explore the code for the Lambda authorizer on the GitHub repo.
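
As a rough illustration of that logic, the following sketch shows how a Lambda authorizer might map the cognito:groups claim to an IAM policy once the token has been verified. The verify_jwt helper, the TOKEN-type authorizer event shape, and the exact resource patterns are assumptions for illustration, not the implementation in the repository.

def lambda_handler(event, context):
    # Hypothetical sketch: verify_jwt is assumed to validate the token signature
    # and claims against the Amazon Cognito user pool keys (e.g. with python-jose)
    token = event["authorizationToken"].replace("Bearer ", "")
    claims = verify_jwt(token)

    groups = claims.get("cognito:groups", [])
    # methodArn looks like arn:aws:execute-api:<region>:<account>:<api-id>/<stage>/<METHOD>/<path>
    api_arn = event["methodArn"].split("/")[0]

    read_only = [
        f"{api_arn}/*/GET/*",
        f"{api_arn}/*/POST/api/2.0/mlflow/runs/search/",
        f"{api_arn}/*/POST/api/2.0/mlflow/experiments/search",
    ]
    registry_write = [
        f"{api_arn}/*/POST/api/2.0/mlflow/model-versions/*",
        f"{api_arn}/*/POST/api/2.0/mlflow/registered-models/*",
    ]

    if "admins" in groups:
        resources = [f"{api_arn}/*/*/api/2.0/mlflow/*"]
    elif "model-approvers" in groups:
        resources = read_only + registry_write
    elif "readers" in groups:
        resources = read_only
    else:
        resources = []

    # Return an IAM policy that API Gateway evaluates for this request
    return {
        "principalId": claims["sub"],
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow" if resources else "Deny",
                    "Resource": resources or [event["methodArn"]],
                }
            ],
        },
    }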

Multi-account considerations

Data science workflows have to pass through multiple stages as they progress from experimentation to production. A common approach involves separate accounts dedicated to different phases of the AI/ML workflow (experimentation, development, and production). However, sometimes it's desirable to have a dedicated account that acts as a central repository for models. Although our architecture and sample refer to a single account, it can be easily extended to implement this last scenario, thanks to the IAM capability to switch roles even across accounts.

The following diagram illustrates an architecture using MLflow as a central repository in an isolated AWS account.

MLflow sagemaker multi account

For this use case, we have two accounts: one for the MLflow server, and one for experimentation, accessible by the data science team. To enable cross-account access from a SageMaker training job running in the data science account, we need the following elements:

  • A SageMaker execution role in the data science AWS account with an IAM policy attached that allows assuming a different role in the MLflow account:
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "<ARN-ROLE-IN-MLFLOW-ACCOUNT>"
  }
}
  • An IAM role in the MLflow account with the right IAM policy attached that grants access to the MLflow tracking server, and allows the SageMaker execution role in the data science account to assume it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<ARN-SAGEMAKER-EXECUTION-ROLE-IN-DATASCIENCE-ACCOUNT>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Within the training script running in the data science account, you can use this example before initializing the MLflow client. You need to assume the role in the MLflow account and store the temporary credentials as environment variables, because this new set of credentials will be picked up by a new Boto3 session initialized within the MLflow client.

import os

import boto3
import mlflow

# Session using the SageMaker Execution Role in the Data Science Account
session = boto3.Session()
sts = session.client("sts")

response = sts.assume_role(
  RoleArn="<ARN-ROLE-IN-MLFLOW-ACCOUNT>",
  RoleSessionName="AssumedMLflowAdmin"
)

credentials = response['Credentials']
os.environ['AWS_ACCESS_KEY_ID'] = credentials['AccessKeyId']
os.environ['AWS_SECRET_ACCESS_KEY'] = credentials['SecretAccessKey']
os.environ['AWS_SESSION_TOKEN'] = credentials['SessionToken']

# set remote mlflow server and initialize a new boto3 session in the context
# of the assumed role
mlflow.set_tracking_uri(tracking_uri)
experiment = mlflow.set_experiment(experiment_name)

In this example, RoleArn is the ARN of the role you want to assume, and RoleSessionName is a name that you choose for the assumed session. The sts.assume_role method returns temporary security credentials, which the MLflow client picks up when it initializes a new Boto3 session in the context of the assumed role. The MLflow client then sends signed requests to API Gateway in the context of the assumed role.

Render MLflow within SageMaker Studio

SageMaker Studio is based on JupyterLab, and just as in JupyterLab, you can install extensions to boost your productivity. Thanks to this flexibility, data scientists working with MLflow and SageMaker can further improve their integration by accessing the MLflow UI from the Studio environment and immediately visualizing the experiments and runs logged. The following screenshot shows an example of MLflow rendered in Studio.

MLflow Iframe in Studio

For information about installing JupyterLab extensions in Studio, refer to Amazon SageMaker Studio and SageMaker Notebook Instance now come with JupyterLab 3 notebooks to boost developer productivity. For details on adding automation via lifecycle configurations, refer to Customize Amazon SageMaker Studio using Lifecycle Configurations.

In the sample repository supporting this post, we provide instructions on how to install the jupyterlab-iframe extension. After the extension has been installed, you can access the MLflow UI without leaving Studio using the same set of credentials you have stored in the Amazon Cognito user pool.

Next steps

There are several options for expanding upon this work. One idea is to consolidate the identity store for both SageMaker Studio and the MLflow UI. Another is to use a third-party identity federation service with Amazon Cognito, and then use AWS IAM Identity Center (successor to AWS Single Sign-On) to grant access to Studio with the same third-party identity. A further option is to introduce full automation, using Amazon SageMaker Pipelines for the CI/CD part of model building and MLflow as a centralized experiment tracking server and model registry with strong governance capabilities, together with automation to deploy approved models to a SageMaker hosting endpoint.

Conclusion

The aim of this post was to provide enterprise-level access control for MLflow. To achieve this, we separated the authentication and authorization processes from the MLflow server and moved them to API Gateway. We utilized two authorization methods offered by API Gateway, IAM authorizers and Lambda authorizers, to cater to the requirements of both the MLflow Python SDK and the MLflow UI. It's important to understand that users are external to MLflow, so consistent governance requires maintaining the IAM policies, especially in the case of very granular permissions. Finally, we demonstrated how to enhance the experience of data scientists by integrating MLflow into Studio through simple extensions.

Try out the solution on your own by accessing the GitHub repo and let us know if you have any questions in the comments!

About the Authors

Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunication Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Chris Fregly is a Principal Specialist Solution Architect for AI and machine learning at Amazon Web Services (AWS) based in San Francisco, California. He is co-author of the O’Reilly Book, “Data Science on AWS.” Chris is also the Founder of many global meetups focused on Apache Spark, TensorFlow, Ray, and KubeFlow. He regularly speaks at AI and machine learning conferences across the world including O’Reilly AI, Open Data Science Conference, and Big Data Spain.

Irshad Buchh is a Principal Solutions Architect at Amazon Web Services (AWS). Irshad works with large AWS Global ISV and SI partners and helps them build their cloud strategy and drive broad adoption of Amazon's cloud computing platform. Irshad interacts with CIOs, CTOs, and their architects, and helps them and their end customers implement their cloud vision. Irshad owns the strategic and technical engagements and ultimate success of specific implementation projects, and develops deep expertise in Amazon Web Services technologies as well as broad know-how of how applications and services are built on the Amazon Web Services platform.


Host ML models on Amazon SageMaker using Triton: TensorRT models


Sometimes it can be very beneficial to use tools such as compilers that can modify and compile your models for optimal inference performance. In this post, we explore TensorRT and how to use it with Amazon SageMaker inference using NVIDIA Triton Inference Server. We explore how TensorRT works and how to host and optimize these models for performance and cost efficiency on SageMaker. SageMaker provides single model endpoints (SMEs), which allow you to deploy a single ML model, or multi-model endpoints (MMEs), which allow you to specify multiple models to host behind a logical endpoint for higher resource utilization.

To serve models, Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the TensorRT backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and get great results.

Deep dive into the TensorRT backend

TensorRT enables you to optimize inference using techniques such as quantization, layer and tensor fusion, kernel tuning, and others on NVIDIA GPUs. By adopting and compiling models to use TensorRT, you can optimize performance and utilization for your inference workloads. In some cases there are trade-offs, as is typical with techniques such as quantization, but the results can be dramatic, improving performance by reducing latency and increasing the number of transactions that can be processed.

The TensorRT backend is used to run TensorRT models. TensorRT is an SDK developed by NVIDIA that provides a high-performance deep learning inference library. It’s optimized for NVIDIA GPUs and provides a way to accelerate deep learning inference in production environments. TensorRT supports major deep learning frameworks and includes a high-performance deep learning inference optimizer and runtime that delivers low latency, high-throughput inference for AI applications.

TensorRT accelerates inference by using a technique called graph optimization to optimize the computation graph generated by a deep learning model. It optimizes the graph to minimize the memory footprint by freeing unnecessary memory and efficiently reusing it. TensorRT compilation fuses the sparse operations inside the model graph to form a larger kernel to avoid the overhead of multiple small kernel launches. With kernel auto-tuning, the engine selects the best algorithm for the target GPU, maximizing hardware utilization. Additionally, TensorRT employs CUDA streams to enable parallel processing of models, further improving GPU utilization and performance. Finally, through quantization, TensorRT can use mixed-precision acceleration on Tensor Cores, enabling the model to run in FP32, TF32, FP16, and INT8 precision for the best inference performance. However, although reduced precision generally improves latency, it might come with possible instability and degradation in model accuracy. Overall, TensorRT's combination of techniques results in faster inference and lower latency compared to other inference engines.

The TensorRT backend for Triton Inference Server is designed to take advantage of the powerful inference capabilities of NVIDIA GPUs. To use TensorRT as a backend for Triton Inference Server, you need to create a TensorRT engine from your trained model using the TensorRT API. This engine is then loaded into Triton Inference Server and used to perform inference on incoming requests. The following are the basic steps to use TensorRT as a backend for Triton Inference Server:

  1. Convert your trained model to the ONNX format. Triton Inference Server supports ONNX as a model format. ONNX is a standard for representing deep learning models, enabling them to be transferred between frameworks. If your model isn’t already in the ONNX format, you need to convert it using the appropriate framework-specific tool. For example, in PyTorch, this can be done using the torch.onnx.export method.
  2. Import the ONNX model into TensorRT and generate the TensorRT engine. There are several ways to build a TensorRT engine from your ONNX model. For this post, we use the trtexec CLI tool, which lets you quickly use TensorRT without having to develop your own application. The trtexec tool has three main purposes:
    1. Benchmarking networks on random or user-provided input data.
    2. Generating serialized engines from models.
    3. Generating a serialized timing cache from the builder.
  3. Load the TensorRT engine in Triton Inference Server. After the TensorRT engine is generated, it can be loaded into Triton Inference Server by creating a model configuration file. The model configuration (config.pbtxt) file should include the path to the TensorRT engine file and the input and output shapes of the model.

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. There are several key points to note in this configuration file:

  • name – This field defines the model’s name and must be unique within the model repository.
  • platform – This field defines the type of the model: TensorRT engine, PyTorch, or something else.
  • max_batch_size – This specifies the maximum batch size that can be passed to this model. If the model’s batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case, max_batch_size should be set to a value greater than or equal to 1, which indicates the maximum batch size that Triton should use with the model. For models that don’t support batching, or don’t support batching in the specific ways we’ve described, max_batch_size must be set to 0.
  • Input and output – These fields are required because NVIDIA Triton needs metadata about the model. Essentially, it requires the names of your network’s input and output layers and the shape of said inputs and outputs.
  • instance_group – This determines how many instances of this model will be created and whether they will use the GPU or CPU.
  • dynamic_batching – Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. The preferred_batch_size property indicates the batch sizes that the dynamic batcher should attempt to create. For most models, preferred_batch_size should not be specified, as described in Recommended Configuration Process. An exception is TensorRT models that specify multiple optimization profiles for different batch sizes. In this case, because some optimization profiles may give significant performance improvement compared to others, it may make sense to use preferred_batch_size for the batch sizes supported by those higher-performance optimization profiles. You can also reference the batch size that was previously used when running trtexec. You can also configure the delay time to allow requests to be delayed for a limited time in the scheduler to allow other requests to join the dynamic batch.

The TensorRT backend has been improved to deliver significantly better performance. Improvements include reducing thread contention, using pinned memory for faster transfers between CPU and GPU, and increasing compute and memory copy overlap on GPUs. It also reduces memory usage of TensorRT models in many cases by sharing weights across multiple model instances. Overall, the TensorRT backend for Triton Inference Server provides a powerful and flexible way to serve deep learning models with optimized TensorRT inference. By adjusting the configuration options, you can optimize performance and control behavior to suit your specific use case.

SageMaker provides Triton via SMEs and MMEs

SageMaker enables you to deploy both single and multi-model endpoints with Triton Inference Server. Triton supports a heterogeneous cluster with both GPUs and CPUs, which helps standardize inference across platforms and dynamically scales out to any CPU or GPU to handle peak loads. The following diagram illustrates the Triton Inference Server architecture. Inference requests arrive at the server via either HTTP/REST or by the C API, and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is a file system-based repository of the models that Triton will make available for inferencing.

Triton architecture

SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU, at the same time, with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon Simple Storage Service (Amazon S3) to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU-accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPU, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling groups. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. For single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you might want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
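
A minimal sketch of such an auto scaling setup using Application Auto Scaling follows; the capacity limits and target value are illustrative and should be tuned to your latency and cost requirements.

import boto3

autoscaling = boto3.client("application-autoscaling")

# endpoint_name refers to an existing SageMaker endpoint; values are illustrative
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on the average number of invocations per instance per minute
autoscaling.put_scaling_policy(
    PolicyName="triton-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)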

Solution overview

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s TensorRT backend, which allows you to deploy TensorRT models. The TensorRT_backend repo contains the documentation and source for the backend. In the following sections, we walk you through the example notebook that demonstrates how to use NVIDIA Triton Inference Server on SageMaker MMEs with the GPU feature to deploy a BERT natural language processing (NLP) model.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the pre-built Triton ECR image:

import transformers
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
bucket = sagemaker_session.default_bucket()
print(bucket)

account_id_map = {
"us-east-1": "785573368785",
"us-east-2": "007439368137",
"us-west-1": "710691900526",
"us-west-2": "301217895009",
"eu-west-1": "802834080501",
"eu-west-2": "205493899709",
"eu-west-3": "254080097072",
"eu-north-1": "601324751636",
"eu-south-1": "966458181534",
"eu-central-1": "746233611703",
"ap-east-1": "110948597952",
"ap-south-1": "763008648453",
"ap-northeast-1": "941853720454",
"ap-northeast-2": "151534178276",
"ap-southeast-1": "324986816169",
"ap-southeast-2": "355873309152",
"cn-northwest-1": "474822919863",
"cn-north-1": "472730292857",
"sa-east-1": "756306329178",
"ca-central-1": "464438896020",
"me-south-1": "836785723513",
"af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

Add utility methods for preparing the request payload

We create the functions to transform the sample text we’re using for inference into the payload that can be sent for inference to Triton Inference Server. The tritonclient package, which was installed at the beginning, provides utility methods to generate the payload without having to know the details of the specification. We use the created methods to convert our inference request into a binary format, which provides lower latencies for inference. These functions are used during the inference step.
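
The following is a sketch of such a helper, assuming a recent version of the tritonclient package that exposes InferenceServerClient.generate_request_body; the function name is illustrative, while the input and output names and shapes match the BERT model configuration used later in this post.

import numpy as np
import tritonclient.http as httpclient

def get_text_payload_binary(input_ids, attention_mask):
    # Wrap the tokenized text in Triton InferInput objects using binary encoding
    inputs = [
        httpclient.InferInput("token_ids", [1, 128], "INT32"),
        httpclient.InferInput("attn_mask", [1, 128], "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.array(input_ids, dtype=np.int32).reshape(1, 128), binary_data=True)
    inputs[1].set_data_from_numpy(np.array(attention_mask, dtype=np.int32).reshape(1, 128), binary_data=True)

    # Request both model outputs as binary tensors
    outputs = [
        httpclient.InferRequestedOutput("output", binary_data=True),
        httpclient.InferRequestedOutput("pooled_output", binary_data=True),
    ]

    # Serialize the request; header_length is the size of the JSON header, which
    # the server typically needs (for example via a content type such as
    # application/vnd.sagemaker-triton.binary+json;json-header-size=<header_length>)
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length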

Prepare the TensorRT model

In this step, we load the pre-trained BERT model and convert it to an ONNX representation using the torch ONNX exporter and the onnx_exporter.py script. After the ONNX model is created, we use the TensorRT trtexec command to create the model plan to be hosted with Triton. This is run as part of the generate_model.sh script from the following cell. Note that the cell takes around 30 minutes to complete.

!docker run --gpus=all --rm -it \
    -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 \
    /bin/bash generate_models.sh

While waiting for the command to finish running, you can check the scripts used in this step. In the onnx_exporter.py script, we use the torch.onnx.export function for ONNX model creation:


    torch.onnx.export(
        model,
        dummy_inputs,
        args.save,
        export_params=True,
        opset_version=10,
        input_names=["token_ids", "attn_mask"],
        output_names=["output","pooled_output"],
        dynamic_axes={"token_ids": [0, 1], "attn_mask": [0, 1], "output": [0]},
    )

The command line in the generate_model.sh file creates the TensorRT model plan. For more information, refer to the trtexec command-line tool.

trtexec --onnx=model.onnx --saveEngine=model_bs16.plan --minShapes=token_ids:1x128,attn_mask:1x128 --optShapes=token_ids:16x128,attn_mask:16x128 --maxShapes=token_ids:128x128,attn_mask:128x128 --fp16 --verbose --workspace=14000 | tee conversion_bs16_dy.txt

Build a TensorRT NLP BERT model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the BERT model is as follows:

Folder structure for model

Note that Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, the folder 1 represents version 1 of the BERT model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifacts required by that backend. Here, we are using the TensorRT backend, which requires the TensorRT plan file that is used for serving (for this example, model.plan). If we were using a PyTorch backend, a model.pt file would be required. For more details on naming conventions for model files, refer to Model Files.
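
As an illustration, the following snippet lays out that structure on disk, assuming the TensorRT plan produced by trtexec is available in the local workspace directory; the paths are illustrative.

import os
import shutil

# Illustrative layout: model_repo_0/bert/config.pbtxt and model_repo_0/bert/1/model.plan
model_dir = "model_repo_0/bert"
version_dir = os.path.join(model_dir, "1")
os.makedirs(version_dir, exist_ok=True)

# Copy the TensorRT engine produced earlier into the version subdirectory
shutil.copy("workspace/model_bs16.plan", os.path.join(version_dir, "model.plan"))

# The config.pbtxt described next sits alongside the version directory:
# model_repo_0/bert/config.pbtxt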

Every TensorRT model must provide a config.pbtxt file describing the model configuration. To use this backend, set the platform field of your model's config.pbtxt file to tensorrt_plan. The following section of code shows an example of how to define the configuration file for the BERT model being served through Triton's TensorRT backend:

name: "bert"
platform: "tensorrt_plan"
max_batch_size: 128
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attn_mask"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 768]
  },
  {
    name: "pooled_output"
    data_type: TYPE_FP32
    dims: [768]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 16
}

SageMaker expects a .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. To simulate several similar models being hosted, you might think all it takes is to tar the model repository we have already built and then copy it with different file names. However, Triton requires unique model names. Therefore, we first copy the model repo N times, changing the model directory names and their corresponding config.pbtxt files. You can increase N to create more copies of the model that can be dynamically loaded to the hosting endpoint, simulating the model load/unload actions managed by SageMaker. See the following code:

import os
import shutil

N = 5
prefix = 'bert-mme'
model_repo_base = 'model_repo'

# Get model names from model_repo_0
model_names = [name for name in os.listdir(f'{model_repo_base}_0') if os.path.isdir(f'{model_repo_base}_0/{name}')]

for i in range(N):
    # Make a copy of the base model repo (_0), naming it with an incremented id
    shutil.copytree(f'{model_repo_base}_0', f'{model_repo_base}_{i+1}')
    time.sleep(5)
    for name in model_names:
        model_dirs_path = f'{model_repo_base}_{i+1}/{name}'

        # Open each model's config file and update the model name with the new id
        with open(f'{model_dirs_path}/config.pbtxt', "rt") as fin:
            data = fin.read()
        data = data.replace(name, name[:-1] + str(i+1))
        with open(f'{model_dirs_path}/config.pbtxt', "wt") as fout:
            fout.write(data)
    
        # Change model directory name to match new config
        os.rename(model_dirs_path,model_dirs_path[:-1]+str(i+1))
        time.sleep(2)
        
    if i == 0:
        tar_file_name = f'bert-{i}.tar.gz'
        model_repo_target = f'{model_repo_base}_{i}/'
        !tar -C $model_repo_target -czf $tar_file_name .
        sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)

    tar_file_name = f'bert-{i+1}.tar.gz'
    model_repo_target = f'{model_repo_base}_{i+1}/'
    !tar -C $model_repo_target -czf $tar_file_name .
    sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)
    !sudo rm -r "$tar_file_name" "$model_repo_target"

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create the SageMaker model object, endpoint configuration, and endpoint.

First, we need to define the serving container. In the container definition, set ModelDataUrl to the S3 directory that contains all the models that the SageMaker multi-model endpoint will use to load and serve predictions, and set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. See the following code:

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_data_uri,
    "Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we are deploying to an ml.g5.xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

With this endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

endpoint_name = "triton-nlp-bert-trt-mme-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
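
Because create_endpoint is asynchronous, a simple polling loop like the following sketch can be used to wait for the endpoint to reach the InService status (alternatively, the boto3 SageMaker client exposes an endpoint_in_service waiter):

import time

# Poll the endpoint status until it is ready to serve traffic
while True:
    status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    print(f"Endpoint status: {status}")
    if status in ("InService", "Failed"):
        break
    time.sleep(30)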

Invoke your model hosted on the SageMaker endpoint

When the endpoint is running, we can use some sample raw data to perform inference using either JSON or binary+JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send the inference request to the multi-model endpoint using the invoke_endpoint API, specifying the TargetModel in the invocation call and passing in the payload for each model type. Here we invoke the endpoint in a for loop, which prompts SageMaker to dynamically load or unload models based on the requests:

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

for i in range(N):
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel=f"bert-{i}.tar.gz",
    )

    print(json.loads(response["Body"].read().decode("utf8")))

You can monitor the model loading and unloading status using Amazon CloudWatch metrics and logs. SageMaker multi-model endpoints provide instance-level metrics to monitor; for more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch. The LoadedModelCount metric shows the number of models loaded in the containers. The ModelCacheHit metric shows the number of invocations to models that are already loaded in the container, giving you model invocation-level insights. To check whether models are unloaded from memory, you can look for the successful unloaded log entries in the endpoint's CloudWatch logs.
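
For example, a query along the following lines retrieves the average ModelCacheHit for the endpoint over the last hour; the AWS/SageMaker namespace and the EndpointName/VariantName dimensions are assumptions based on the standard multi-model endpoint metrics.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelCacheHit",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
print(response["Datapoints"])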

The notebook can be found in the GitHub repository.

Best practices

Before starting any optimization effort with TensorRT, it’s essential to determine what should be measured. Without measurements, it’s impossible to make reliable progress or measure whether success has been achieved. Here are some best practices to consider when using the TensorRT backend for Triton Inference Server:

  • Optimize your TensorRT model – Before deploying a model on Triton with the TensorRT backend, make sure to optimize the model following the TensorRT best practices guide. This will help you achieve better performance by reducing inference time and memory consumption.
  • Use TensorRT instead of other Triton backends when possible – TensorRT is designed to optimize deep learning models for deployment on NVIDIA GPUs, so using it can significantly improve inference performance compared to using other supported Triton backends.
  • Use the right precision – TensorRT supports multiple precisions (FP32, FP16, INT8), and selecting the right precision for your model can have a significant impact on performance. Consider using lower precision when possible.
  • Use batch sizes that fit your hardware – Make sure to choose batch sizes that fit your GPU’s memory and compute capabilities. Using batch sizes that are too large or too small can negatively impact performance.

Conclusion

In this post, we dove deep into the TensorRT backend that Triton Inference Server supports on SageMaker. This backend provides GPU acceleration for your TensorRT models. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to take advantage of this capability using single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.


About the Authors

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.
