Transform, analyze, and discover insights from unstructured healthcare data using Amazon HealthLake

Healthcare data is complex and siloed, and exists in various formats. An estimated 80% of data within organizations is considered to be unstructured or “dark” data that is locked inside text, emails, PDFs, and scanned documents. Because this data is difficult to interpret or analyze programmatically, organizations struggle to derive insights from it and use it to serve their customers more effectively. The rapid rate of data generation means that organizations that aren’t investing in document automation risk getting stuck with legacy processes that are manual, slow, error-prone, and difficult to scale.

In this post, we propose a solution that automates ingestion and transformation of previously untapped PDFs and handwritten clinical notes and data. We explain how to extract information from customer clinical data charts using Amazon Textract, then use the raw extracted text to identify discrete data elements using Amazon Comprehend Medical. We store the final output in Fast Healthcare Interoperability Resources (FHIR) compatible format in Amazon HealthLake, making it available for downstream analytics.

Solution overview

AWS provides a variety of services and solutions for healthcare providers to unlock the value of their data. For our solution, we process a small sample of documents through Amazon Textract and load that extracted data as appropriate FHIR resources in Amazon HealthLake. We create a custom process for FHIR conversion and test it end to end.

The data is first loaded into DocumentReference. Amazon HealthLake then creates system-generated resources after processing this unstructured text in DocumentReference and loads it into Condition, MedicationStatement, and Observation resources. We identify a few data fields within FHIR resources like patient ID, date of service, provider type, and name of medical facility.

A MedicationStatement is a record of a medication that is being consumed by a patient. It may indicate that the patient is taking the medication now, has taken the medication in the past, or will be taking the medication in the future. A common scenario where this information is captured is during the history-taking process in the course of a patient visit or stay. The source of medication information could be the patient’s memory, a prescription bottle, or from a list of medications the patient, clinician, or other party maintains.

Observations are a central element in healthcare, used to support diagnosis, monitor progress, determine baselines and patterns, and even capture demographic characteristics. Most observations are simple name/value pair assertions with some metadata, but some observations group other observations together logically, or could even be multi-component observations.

The Condition resource is used to record detailed information about a condition, problem, diagnosis, or other event, situation, issue, or clinical concept that has risen to a level of concern. The condition could be a point-in-time diagnosis in the context of an encounter, an item on the practitioner’s problem list, or a concern that doesn’t exist on the practitioner’s problem list.

The following diagram shows the workflow to migrate unstructured data into FHIR for AI and machine learning (ML) analysis in Amazon HealthLake.

The workflow steps are as follows:

  1. A document is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The document upload in Amazon S3 triggers an AWS Lambda function.
  3. The Lambda function sends the image to Amazon Textract.
  4. Amazon Textract extracts text from the image and stores the output in a separate Amazon Textract output S3 bucket.
  5. The final result is stored in Amazon HealthLake as specific FHIR resources (the extracted text is loaded into DocumentReference as base64-encoded text), where the integrated Amazon Comprehend Medical capability extracts meaning from the unstructured data for easy search and querying (see the sketch after this list).
  6. Users can create meaningful analyses and run interactive analytics using Amazon Athena.
  7. Users can build visualizations, perform ad hoc analysis, and quickly get business insights using Amazon QuickSight.
  8. Users can make predictions with health data using Amazon SageMaker ML models.
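
To make steps 3 through 5 concrete, the following is a minimal sketch (not the exact Lambda code from the repository) of extracting text with Amazon Textract and loading it into Amazon HealthLake as a base64-encoded DocumentReference. The HealthLake FHIR REST API requires SigV4-signed requests; the data store endpoint, bucket, and patient reference shown here are placeholders.

import base64
import json

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

textract = boto3.client("textract")
session = boto3.Session()

# Placeholder FHIR endpoint of your HealthLake data store
healthlake_endpoint = "https://healthlake.us-east-1.amazonaws.com/datastore/<datastore-id>/r4/"

def extract_text(bucket, key):
    # Synchronous call for brevity; multi-page PDFs use the asynchronous
    # start_document_text_detection / get_document_text_detection APIs.
    result = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return "\n".join(b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE")

def load_document_reference(text, patient_id="example-patient-id"):
    resource = {
        "resourceType": "DocumentReference",
        "status": "current",
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [{
            "attachment": {
                "contentType": "text/plain",
                "data": base64.b64encode(text.encode("utf-8")).decode("utf-8"),
            }
        }],
    }
    # Sign the request with SigV4 for the healthlake service before posting it
    url = healthlake_endpoint + "DocumentReference"
    aws_request = AWSRequest(method="POST", url=url, data=json.dumps(resource),
                             headers={"Content-Type": "application/json"})
    SigV4Auth(session.get_credentials(), "healthlake", session.region_name).add_auth(aws_request)
    return requests.post(url, data=aws_request.data, headers=dict(aws_request.headers))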

Prerequisites

This post assumes familiarity with the following services: Amazon Textract, Amazon Comprehend Medical, Amazon HealthLake, AWS Lambda, Amazon S3, Amazon Athena, Amazon QuickSight, and the AWS Cloud Development Kit (AWS CDK).

By default, the integrated Amazon Comprehend Medical natural language processing (NLP) capability within Amazon HealthLake is disabled in your AWS account. To enable it, submit a support case with your account ID, AWS Region, and Amazon HealthLake data store ARN. For more information, refer to How do I turn on HealthLake’s integrated natural language processing feature.

Refer to the GitHub repo for more deployment details.

Deploy the solution architecture

To set up the solution, complete the following steps:

  1. Clone the GitHub repo, run cdk deploy PdfMapperToFhirWorkflow from your command prompt or terminal, and follow the README file. Deployment will complete in approximately 30 minutes.
  2. On the Amazon S3 console, navigate to the bucket starting with pdfmappertofhirworkflow-, which was created as part of cdk deploy.
  3. Inside the bucket, create a folder called uploads and upload the sample PDF (SampleMedicalRecord.pdf).

As soon as the document upload is successful, it will trigger the pipeline, and you can start seeing data in Amazon HealthLake, which you can query using several AWS tools.

Query the data

To explore your data, complete the following steps:

  1. On the CloudWatch console, search for the HealthlakeTextract log group.
  2. In the log group details, note down the unique ID of the document you processed.
  3. On the Amazon HealthLake console, choose Data Stores in the navigation pane.
  4. Select your data store and choose Run query.
  5. For Query type, choose Search with GET.
  6. For Resource type, choose DocumentReference.
  7. For Search parameters, enter the parameter as relates to and the value as DocumentReference/Unique ID.
  8. Choose Run query.
  9. In the Response body section, minimize the resource sections to just view the six resources that were created for the six-page PDF document.
  10. The following screenshot shows the integrated analysis with Amazon Comprehend Medical and NLP enabled. The screenshot on the left is the source PDF; the screenshot on the right is the NLP result from Amazon HealthLake.
  11. You can also run a query with Query type set as Read and Resource type set as Condition using the appropriate resource ID.

    The following screenshot shows the query results.
  12. On the Athena console, run the following query:
    SELECT * FROM "healthlakestore"."documentreference";

Similarly, you can query MedicationStatement, Condition, and Observation resources.
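
To run these queries programmatically rather than from the Athena console, you could use the Athena API through boto3. The following is a minimal sketch; the S3 output location is a placeholder you need to replace with a bucket you own.

import time

import boto3

athena = boto3.client("athena")

def run_athena_query(query, output_s3="s3://<your-results-bucket>/athena-results/"):
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "healthlakestore"},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    return athena.get_query_results(QueryExecutionId=query_id)

results = run_athena_query('SELECT * FROM "healthlakestore"."medicationstatement" LIMIT 10;')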

Clean up

After you’re done using this solution, run cdk destroy PdfMapperToFhirWorkflow to ensure you don’t incur additional charges. For more information, refer to AWS CDK Toolkit (cdk command).

Conclusion

AWS AI services and Amazon HealthLake can help store, transform, query, and analyze insights from unstructured healthcare data. Although this post only covered a PDF clinical chart, you could extend the solution to other types of healthcare PDFs, images, and handwritten notes. After the data is extracted into text form, parsed into discrete data elements using Amazon Comprehend Medical, and stored in Amazon HealthLake, it could be further enriched by downstream systems to drive meaningful and actionable healthcare information and ultimately improve patient health outcomes.

The proposed solution doesn’t require the deployment and maintenance of server infrastructure. All services are either managed by AWS or serverless. With AWS’s pay-as-you-go billing model and its depth and breadth of services, the cost and effort of initial setup and experimentation are significantly lower than with traditional on-premises alternatives.

Additional resources

For more information about Amazon HealthLake, refer to the following:


About the Authors

Shravan Vurputoor is a Senior Solutions Architect at AWS. As a trusted customer advocate, he helps organizations understand best practices around advanced cloud-based architectures, and provides advice on strategies to help drive successful business outcomes across a broad set of enterprise customers through his passion for educating, training, designing, and building cloud solutions. In his spare time, he enjoys reading, spending time with his family, and cooking.

Rafael M. Koike is a Principal Solutions Architect at AWS supporting Enterprise customers in the South East, and is part of the Storage and Security Technical Field Community. Rafael has a passion to build, and his expertise in security, storage, networking, and application development has been instrumental in helping customers move to the cloud securely and fast.

Randheer Gehlot is a Principal Customer Solutions Manager at AWS. Randheer is passionate about AI/ML and its application within HCLS industry. As an AWS builder, he works with large enterprises to design and rapidly implement strategic migrations to the cloud and build modern, cloud-native solutions.

Host ML models on Amazon SageMaker using Triton: Python backend

Amazon SageMaker provides a number of options for users who are looking for a solution to host their machine learning (ML) models. Of these options, one of the key features that SageMaker provides is real-time inference. Real-time inference workloads can have varying levels of requirements and service level agreements (SLAs) in terms of latency and throughput. Regardless of the use case, SageMaker offers a number of options that allow you to find the right balance of cost and performance to meet your business objectives.

There are many factors to consider when choosing the right real-time inference option for your business. For example, your business may have a model that must meet the strictest SLAs for latency and throughput with very predictable performance. For that use case, SageMaker provides SageMaker single model endpoints (SMEs), which allow you to deploy a single ML model against a logical endpoint. For other use cases, you can choose to manage cost and performance using SageMaker multi-model endpoints (MMEs), which allow you to specify multiple models to host behind a logical endpoint. Regardless of the option you may choose, SageMaker endpoints provide a scalable mechanism for even the most demanding enterprise users while providing value in a plethora of features, including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, see CloudWatch Metrics for Multi-Model Endpoint Deployments).

One option supported by SageMaker single and multi-model endpoints is NVIDIA Triton Inference Server. Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the Python backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and achieve great results.

SageMaker provides Triton via SMEs and MMEs

The Python backend is available through SageMaker, which enables you to deploy both single and multi-model endpoints with NVIDIA Triton Inference Server. Triton supports instance types that support GPUs, CPUs, and AWS Inferentia chips, which allow you to maximize the performance for your workloads. The following diagram illustrates the NVIDIA Triton Inference Server architecture.

Triton Architecture

Inference requests arrive at the server via either HTTP/REST or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis and can help tune performance. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inference using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is an object-based repository of the models, powered by Amazon Simple Storage Service (Amazon S3), that Triton makes available for inferencing.

For MMEs, SageMaker takes care of traffic shaping to the endpoint and maintains optimal model copies on GPU instances for the best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high memory utilization, SageMaker unloads the least popular models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU at the same time with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon S3 to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPUs, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
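
As an illustration, the following is a minimal sketch (not part of the example notebook) of registering a target-tracking policy on SageMakerVariantInvocationsPerInstance with the Application Auto Scaling API; the endpoint name, capacity limits, and target value are placeholders to tune for your own SLAs.

import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "my-triton-endpoint"  # placeholder endpoint name
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target (1 to 4 instances here)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy on invocations per instance
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # example value only
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)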

Python backend runtime architecture

As the name suggests, the Python backend is for running models that are written and run in the Python language. Various use cases fall into this category, such as preprocessing or postprocessing steps composing a model ensemble. In other cases, the Python backend may be used as a wrapper to call a Python-based model or framework. Later in this post, we show an example of how you can use the Python backend to call a PyTorch T5 model. This may not always be the most performant option, but it showcases the flexibility that the Python backend provides.

The Python backend creates a runtime environment that creates Python processes using the host’s CPU and memory. You can still attain GPU acceleration if the framework running the inference exposes it through its Python front end. The Python backend itself adds no additional GPU acceleration, but it also introduces no compatibility constraints for the Python process.

On SageMaker, the default Triton Python backend allocates 16 MB of shared memory and grows it in increments of only 1 MB. However, you can change this by setting the SageMaker environment variables SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE and SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE. These variables are important because the Python backend exchanges tensors through shared memory.
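
These variables are set on the container definition when you create the SageMaker model, as shown later in this post. A minimal sketch could look like the following; the byte sizes are example values only and should be sized for your models’ tensors.

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "67108864",  # 64 MB of shared memory at startup
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "16777216",   # grow in 16 MB increments
    },
}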

The following diagram shows the ensemble scheduler runtime architecture so that you can fine-tune the memory areas, including CPU addressable shared memory, that are used for inter-process communication between C++ and the Python process for exchanging tensors (input/output).

Architecture Diagram

You can monitor resource utilization using CloudWatch, which has native integration with SageMaker.
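
For example, a minimal sketch for pulling the GPU utilization of an endpoint with boto3 could look like the following; the endpoint and variant names are placeholders.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Average GPU utilization for a SageMaker endpoint over the last hour, at 1-minute granularity
response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="GPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-triton-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))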

To get started with the Python backend, you need to create a Python file with a structure similar to the following code, which dictates the structure as well as how to interact with parameters and return values. Take note of the point in the model lifecycle at which each method is called.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model.

        Parameters
        ----------
        auto_complete_model_config : pb_utils.ModelConfig
          An object containing the existing model configuration. You can build
          upon the configuration given by this object when setting the
          properties for this model.

        Returns
        -------
        pb_utils.ModelConfig
          An object containing the auto-completed model configuration
        """
        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` is optional and allows you to do any
        initialization before execution. This function allows the model to
        initialize any state associated with the model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device
            ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        receives a list of pb_utils.InferenceRequest as the only argument.
        This function is called when an inference is requested for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` is optional. This function allows the model
        to perform any necessary cleanup before exit.
        """

By using the model.py methods, you take on the responsibility of loading models on a specific device (CPU or GPU) by writing that code explicitly in model.py. Although other Triton backends let you specify a KIND attribute in the config.pbtxt file to determine whether the backend runs on CPU or GPU, that attribute isn’t applicable to the Python backend, because the model is loaded on the device dictated by the code written in model.py, such as .to(device) in PyTorch. It’s important to note that if you explicitly load artifacts into memory or create temporary files, you should reclaim those resources by cleaning up, which usually occurs in the finalize method. Otherwise, you may experience unwanted situations such as memory leaks.

SageMaker notebook walkthrough

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s Python backend, which allows you to write your model logic in Python. For example, you can use this backend to run preprocessing and postprocessing code written in Python, or run a PyTorch Python script directly (instead of first converting it to TorchScript and then using the PyTorch backend). The python_backend GitHub repo contains the documentation and source for the backend.

In this section, we walk you through the example notebook, which demonstrates how to use NVIDIA Triton Inference Server on an Amazon SageMaker MME with the GPU feature to deploy a T5 NLP model for translation.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the prebuilt Triton ECR image:

import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
import os
 
os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
s3_client = boto3.client("s3")
bucket = sagemaker.Session().default_bucket()
prefix = "nlp-mme-gpu"
 
# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}
 
region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ValueError("UNSUPPORTED REGION")
 
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

Generate model artifacts

In this example, we host a pre-trained T5-small Hugging Face PyTorch model using Triton’s Python backend. The Python script model.py implements all the logic to initialize the T5 model and run inference for the translation task. There are three main functions in the script (a simplified sketch follows the list):

  • initialize – The initialize function is called one time when the model is being loaded. Implementing initialize is optional. initialize allows you to do any necessary initializations before running the model. This function allows the model to initialize any state associated with this model.
  • execute – The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function, you’re given a list of InferenceRequest objects. There are two modes of implementing this function: default and decoupled mode. The default mode is the most generic way you would like to implement your model and requires the execute function to return exactly one response per request. The decoupled mode allows you to send multiple responses for a request or not send any responses for a request. The mode you choose should depend on your use case—that is, whether or not you want to return decoupled responses from this model. In this example notebook, we use the default mode.
  • finalize – Implementing finalize is optional. This function allows you to do any cleanup necessary before the model is unloaded from Triton Inference Server.
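
The following is a condensed, illustrative sketch of what such a model.py could look like for the T5 translation model; the actual script in the repository is more complete, and this sketch assumes the Conda environment described in the next section provides torch and transformers.

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import T5ForConditionalGeneration


class TritonPythonModel:
    def initialize(self, args):
        # Load the model once when Triton loads this model instance,
        # placing it on GPU or CPU depending on the instance kind.
        self.device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        self.model = T5ForConditionalGeneration.from_pretrained("t5-small").to(self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids").as_numpy()
            attention_mask = pb_utils.get_input_tensor_by_name(request, "attention_mask").as_numpy()
            with torch.no_grad():
                output_ids = self.model.generate(
                    input_ids=torch.as_tensor(input_ids, dtype=torch.long).to(self.device),
                    attention_mask=torch.as_tensor(attention_mask, dtype=torch.long).to(self.device),
                )
            output_tensor = pb_utils.Tensor("output", output_ids.cpu().numpy().astype(np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses

    def finalize(self):
        # Release the model reference so resources are reclaimed before unload
        self.model = None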

Build the model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the T5 model is as follows:

Directory structure

Note that Triton has specific requirements for the model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, that is 1, representing version 1 of our T5 PyTorch model. Each model is run by a specific backend, so each version subdirectory must contain the model artifact required by that backend. Here, we are using the Python backend, and it requires the Python file that is used for serving (model.py). If we were using a PyTorch backend, a model.pt file would be required. For more details on naming conventions for model files, refer to Model Files.
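
Based on the artifacts created in this walkthrough, the resulting layout looks like the following (the Conda environment TAR file is added in a later step):

model_repository/
└── t5_pytorch/
    ├── config.pbtxt
    ├── hf_env.tar.gz
    └── 1/
        └── model.py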

Every Python Triton model must provide a config.pbtxt file describing the model configuration. To use this backend, you must set the backend field of your model config.pbtxt file to python. The following code shows how to define the config file for the T5 PyTorch model being served through Triton’s Python backend:

name: "t5_pytorch"
backend: "python"
max_batch_size: 8
input: [
    {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "attention_mask"
        data_type: TYPE_INT32
        dims: [ -1 ]
    }
]
output [
  {
    name: "output"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
}
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}

In this configuration, we have defined the parameters section to provide an environment path. This is needed because serving the Hugging Face T5 PyTorch model using Triton’s Python backend requires PyTorch and Hugging Face transformers as dependencies. You need to create a custom run environment in the Python backend to include all the dependencies in this example. The alternative is to install Python and all the dependencies in the local environment; the custom run environment is only needed when you want portability across systems that might not already have the Python environment required to run the inference, which is why we use one here for SageMaker. Currently, the Python backend only supports conda-pack for this purpose. conda-pack ensures that your Conda environment is portable. We follow the instructions from the Triton documentation for packaging dependencies to be used in the Python backend as the Conda environment TAR file. Running the bash script create_hf_env.sh creates the Conda environment containing PyTorch and Hugging Face transformers and packages it as a TAR file, and then we move it into the t5_pytorch model directory:

!bash workspace/create_hf_env.sh
!mv hf_env.tar.gz model_repository/t5_pytorch/

After we create the TAR file from the Conda environment, we place it in the model folder. The following code in the model config.pbtxt file tells the Python backend to use this custom environment for your model:

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}

Here, $$TRITON_MODEL_DIRECTORY helps provide the environment path relative to the model folder in the model repository, and is resolved to $pwd/model_repository/t5_pytorch. Finally, hf_env.tar.gz is the name we gave to our Conda environment file.

Next, we package our model as *.tar.gz files for uploading to Amazon S3:

!tar -C model_repository/ -czf t5_pytorch.tar.gz t5_pytorch
model_uri_t5_pytorch = sagemaker_session.upload_data(path="t5_pytorch.tar.gz", key_prefix=prefix)

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker multi-model endpoint. To create a SageMaker endpoint, we need to first create the SageMaker model object and endpoint configuration.

First, we need to define the serving container. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker multi-model endpoint will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. See the following code:

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we are deploying to an ml.g5.2xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.2xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

We use this configuration to create a new SageMaker endpoint and wait for the deployment to finish:

endpoint_name = f"{prefix}-ep-{ts}-2xl"
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

The status will change to InService after the deployment is successful.

Invoke your model hosted on the SageMaker endpoint

After the endpoint is running, we can use some sample raw data to perform inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send inference requests to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. See the following code:

texts_to_translate = ["translate English to German: The house is wonderful."]
batch_size = len(texts_to_translate)

t5_payload = get_text_payload("t5-small", texts_to_translate)
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(t5_payload),
    TargetModel="t5_pytorch.tar.gz",
)
response_body = json.loads(response["Body"].read().decode("utf8"))
output_ids = np.array(response_body["outputs"][0]["data"]).reshape(batch_size, -1)
t5_tokenizer = get_tokenizer("t5-small")
decoded_outputs = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for text in decoded_outputs:
    print(text, "\n")

The notebook can be found in the GitHub repository.
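
The get_text_payload and get_tokenizer helpers come from the example notebook; conceptually, the payload follows Triton’s KFServing v2 inference request format. A simplified sketch of such helpers, written against the Hugging Face tokenizer, could look like the following (field names follow the protocol; the exact notebook implementation may differ).

from transformers import T5Tokenizer


def get_tokenizer(model_name):
    return T5Tokenizer.from_pretrained(model_name)


def get_text_payload(model_name, texts):
    # Tokenize the input text and wrap it in the KFServing v2 inference request format
    encoding = get_tokenizer(model_name)(texts, padding=True, return_tensors="np")
    return {
        "inputs": [
            {
                "name": "input_ids",
                "shape": list(encoding["input_ids"].shape),
                "datatype": "INT32",
                "data": encoding["input_ids"].astype("int32").flatten().tolist(),
            },
            {
                "name": "attention_mask",
                "shape": list(encoding["attention_mask"].shape),
                "datatype": "INT32",
                "data": encoding["attention_mask"].astype("int32").flatten().tolist(),
            },
        ]
    }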

Best practices

When using the Python backend, it can sometimes be complicated to optimize the workload for throughput and latency. You should consider the options available through the SageMaker and Triton environment variables that we discussed previously with regard to batch sizes, max delay, and other factors. In addition, you should be aware of the Python backend-specific configuration and the configuration of the underlying framework. The following are some best practices:

  • If using PyTorch (or any other deep learning framework) module in the Python backend, consider experimenting with different values of intra/inter op thread pool size. Because each Python backend model instance runs in a separate process, limiting the number of threads per process prevents over-subscribing the system resources when scaling up the instance count.
  • Even though the Python backend is highly flexible, it performs some extra data copies that can impact inference performance. For the best performance on GPU, consider using Triton’s TensorRT backend when possible.
  • When using Python backend models in an ensemble, refer to Interoperability and GPU Support for a possible zero-copy transfer of Python backend tensors to other frameworks.
  • You can also increase the count field in the instance_group section of the config.pbtxt file to add worker processes and increase throughput. Be aware that increasing this value will increase resource consumption, including CPU and GPU utilization.

You can explore these options and parameters to achieve the performance characteristics you seek. As always, be aware that resource consumption, such as processor or memory utilization, can change and should be monitored so you can fine-tune and optimize inference performance.

Conclusion

In this post, we dove deep into the Python backend that Triton Inference Server supports on SageMaker. This backend provides for both CPU and GPU acceleration of your models that are written and run in the Python language. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to use single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Explore the Hidden Temple of Itzamná This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows.

3D artist Milan Dey finds inspiration in games, movies, comics and pop culture. He drew from all of the above when creating a stunning 3D scene of Mayan ruins, The Hidden Temple of Itzamná, this week In the NVIDIA Studio.

“One evening, I was playing an adventure game and wanted to replicate the scene,” Milan said. “But I wanted my version to have a heavy Mayan influence.”

Milan sought vast, detailed architecture and carved rocks that look like they’ve stood with pride for centuries, similar to what can be seen in the Indiana Jones movies. The artist’s goals for his scene were to portray mother nature giving humanity a reminder that she is the greatest, to kick off with a grand introduction shot with light falling directly on the camera lens to create negative spaces in the frame, and to evoke that wild, wet smell of greens.

Below, Milan outlines his creative workflow, which combines tenacity with technical ability.

And for more inspiration, check out the NVIDIA Studio #GameArtChallenge reel, which includes highlights from our video-game-themed #GameArtChallenge entries.

It Belongs in a Museum

First things first, Milan gathers reference material. For this scene, the artist spent an afternoon capturing hundreds of screenshots and walkthrough videos of the game. He spent the next day on Artstation and Adobe Behance gathering visuals and sorting out projects of ruins.

Next, Milan browsed the Epic Games marketplace, which offers an extensive collection of assets for Unreal Engine creators.

“It crossed my mind that Aztec and Inca cultures are a great choice for a ruins environment,” said Milan. “Tropical settings have a variety of vegetation, whereas caves are deep enough to create their own biology and ecosystem.” With the assets in place, Milan organized them by level to create a 3D palette.

He then began with the initial blockout to prototype, test and adjust the foundational scene elements in Unreal Engine. The artist tested scene basics, replacing blocks with polished assets and applying lighting. He didn’t add anything fancy yet — just a single source of light to mimic normal daylight.

Blocking out stone walls.

Milan then searched for the best possible cave rocks and rock walls, with Quixel Megascans delivering the goods. Milan revisited the blocking process with the temple courtyard, placing cameras in multiple positions after initial asset placements. Next came the heavy task of adding vegetation and greens to the stone walls.

Getting the stone details just right.

“I put big patches of moss decals all around the walls, which gives a realistic look and feel,” Milan said. “Placing large- and medium-sized trees filled in a substantial part of the environment without using many resources.”

Vegetation is applied in painstaking detail.

As they say, the devil is in the details, Milan said.

“It’s very easy to get carried away with foliage painting and get lost in the depths of the cave,” the artist added. It took him another three days to fill in the smaller vegetation: shrubs, vines, plants, grass and even more moss.

 

The scene was starting to become staggeringly large, Milan said, but his ASUS ROG Strix Scar 15 NVIDIA Studio laptop was up to the task. His GeForce RTX 3080 GPU enabled RTX-accelerated rendering for high-fidelity, interactive visualization of his large 3D environment.

Simply stunning.

NVIDIA DLSS technology increased interactivity of the viewport by using AI to upscale frames rendered at lower resolution while retaining photorealistic detail.

“It’s simple: NVIDIA nailed ray tracing.” Milan said. “And Unreal Engine works best with NVIDIA and GeForce RTX graphics cards.”

 

A famed professor of archaeology explores the Mayan ruins.

Milan lit his scene with the HDRI digital image format to enhance the visuals and save file space, adding select directional lighting with exponential height fog. This created more density in low places of the map and less density in high places, adding further realism and depth.

Height fog adds realism to the 3D scene.

“It’s wild what you can do with a GeForce RTX GPU — using ray tracing or Lumen, the global illumination calculation is instant, when it used to take hours. What a time to be alive!” — Milan Dey

The artist doesn’t take these leaps in technology for granted, he said. “I’m from an era where we were required to do manual bouncing,” Dey said. “It’s obsolete now and Lumen is incredible.”

Lumen is Unreal Engine 5’s fully dynamic global illumination and reflections system that brings realistic lighting to scenes.

Milan reviewed each camera angle and made custom lighting adjustments, sometimes removing or replacing vegetation to make them pop with the lighting. He also added free assets from Sketchfab and special water effects to give the fountain an “eternity” vibe, he said.

 

With the scene complete, Milan quickly exported final renders thanks to his RTX GPU. “Art is the expression of human beings,” he stressed. “It demands understanding and attention.”

To his past self or someone at the beginning of their creative journey, Milan would advise, “Keep an open mind and be teachable.”

Environment artist Milan Dey.

Check out Milan’s portfolio on Instagram.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Securing MLflow in AWS: Fine-grained access control with AWS native services

With Amazon SageMaker, you can manage the whole end-to-end machine learning (ML) lifecycle. It offers many native capabilities to help manage aspects of ML workflows, such as experiment tracking and model governance via the model registry. This post provides a solution tailored to customers that are already using MLflow, an open-source platform for managing ML workflows.

In a previous post, we discussed MLflow and how it can run on AWS and be integrated with SageMaker—in particular, when tracking training jobs as experiments and deploying a model registered in MLflow to the SageMaker managed infrastructure. However, the open-source version of MLflow doesn’t provide native user access control mechanisms for multiple tenants on the tracking server. This means any user with access to the server has admin rights and can modify experiments, model versions, and stages. This can be a challenge for enterprises in regulated industries that need to keep strong model governance for audit purposes.

In this post, we address these limitations by implementing the access control outside of the MLflow server and offloading authentication and authorization tasks to Amazon API Gateway, where we implement fine-grained access control mechanisms at the resource level using Identity and Access Management (IAM). By doing so, we can achieve robust and secure access to the MLflow server from both SageMaker managed infrastructure and Amazon SageMaker Studio, without having to worry about credentials and all the complexity behind credential management. The modular design proposed in this architecture makes modifying access control logic straightforward without impacting the MLflow server itself. Lastly, thanks to SageMaker Studio extensibility, we further improve the data scientist experience by making MLflow accessible within Studio, as shown in the following screenshot.

MLflow in Studio

MLflow has integrated the feature that enables request signing using AWS credentials into the upstream repository for its Python SDK, improving the integration with SageMaker. The changes to the MLflow Python SDK are available for everyone since MLflow version 1.30.0.

At a high level, this post demonstrates the following:

  • How to deploy an MLflow server on a serverless architecture running on a private subnet not accessible directly from the outside. For this task, we build on top of the following GitHub repo: Manage your machine learning lifecycle with MLflow and Amazon SageMaker.
  • How to expose the MLflow server via private integrations to an API Gateway, and implement a secure access control for programmatic access via the SDK and browser access via the MLflow UI.
  • How to log experiments and runs, and register models to an MLflow server from SageMaker using the associated SageMaker execution roles to authenticate and authorize requests, and how to authenticate via Amazon Cognito to the MLflow UI. We provide examples demonstrating experiment tracking and using the model registry with MLflow from SageMaker training jobs and Studio, respectively, in the provided notebook.
  • How to use MLflow as a centralized repository in a multi-account setup.
  • How to extend Studio to enhance the user experience by rendering MLflow within Studio. For this task, we show how to take advantage of Studio extensibility by installing a JupyterLab extension.

Now let’s dive deeper into the details.

Solution overview

You can think about MLflow as three different core components working side by side:

  • A REST API for the backend MLflow tracking server
  • SDKs for you to programmatically interact with the MLflow tracking server APIs from your model training code
  • A React front end for the MLflow UI to visualize your experiments, runs, and artifacts

At a high level, the architecture we have envisioned and implemented is shown in the following figure.

Architecture

Prerequisites

Before deploying the solution, make sure you have access to an AWS account with admin permissions.

Deploy the solution infrastructure

To deploy the solution described in this post, follow the detailed instructions in the GitHub repository README. To automate the infrastructure deployment, we use the AWS Cloud Development Kit (AWS CDK). The AWS CDK is an open-source software development framework to create AWS CloudFormation stacks through automatic CloudFormation template generation. A stack is a collection of AWS resources that can be programmatically updated, moved, or deleted. AWS CDK constructs are the building blocks of AWS CDK applications, representing the blueprint to define cloud architectures.

We combine four stacks:

  • The MLFlowVPCStack stack performs the following actions:
    • Deploys the MLflow tracking server on AWS Fargate for Amazon ECS in a dedicated VPC with private subnets, with Amazon Aurora Serverless as the backend store and Amazon S3 as the artifact store (as described later in this post).
  • The RestApiGatewayStack stack performs the following actions:
    • Exposes the MLflow server via AWS PrivateLink to a REST API Gateway.
    • Deploys an Amazon Cognito user pool to manage the users accessing the UI (still empty after the deployment).
    • Deploys an AWS Lambda authorizer to verify the JWT token with the Amazon Cognito user pool ID keys and returns IAM policies to allow or deny a request. This authorization strategy is applied to <MLFlow-Tracking-Server-URI>/*.
    • Adds an IAM authorizer. This will be applied to the <MLFlow-Tracking-Server-URI>/api/* resources and takes precedence over the previous one.
  • The AmplifyMLFlowStack stack performs the following action:
    • Creates an app linked to the patched MLflow repository in AWS CodeCommit to build and deploy the MLflow UI.
  • The SageMakerStudioUserStack stack performs the following actions:
    • Deploys a Studio domain (if one doesn’t exist yet).
    • Adds three users, each one with a different SageMaker execution role implementing a different access level:
      • mlflow-admin – Has admin-like permission to any MLflow resources.
      • mlflow-reader – Has read-only admin permissions to any MLflow resources.
      • mlflow-model-approver – Has the same permissions as mlflow-reader, plus can register new models from existing runs in MLflow and promote existing registered models to new stages.

Deploy the MLflow tracking server on a serverless architecture

Our aim is to have a reliable, highly available, cost-effective, and secure deployment of the MLflow tracking server. Serverless technologies are the perfect candidate to satisfy all these requirements with minimal operational overhead. To achieve that, we build a Docker container image for the MLflow experiment tracking server, and we run it on AWS Fargate on Amazon ECS in its dedicated VPC running on a private subnet. MLflow relies on two storage components: the backend store and the artifact store. For the backend store, we use Aurora Serverless, and for the artifact store, we use Amazon S3. For the high-level architecture, refer to Scenario 4: MLflow with remote Tracking Server, backend and artifact stores. Extensive details on how to do this task can be found in the following GitHub repo: Manage your machine learning lifecycle with MLflow and Amazon SageMaker.

Secure MLflow via API Gateway

At this point, we still don’t have an access control mechanism in place. As a first step, we expose MLflow to the outside world using AWS PrivateLink, which establishes a private connection between the VPC and other AWS services, in our case API Gateway. Incoming requests to MLflow are then proxied via a REST API Gateway, giving us the possibility to implement several mechanisms to authorize incoming requests. For our purposes, we focus on only two:

  • Using IAM authorizers – With IAM authorizers, the requester must have the right IAM policy assigned to access the API Gateway resources. Every request sent via HTTP must include authentication information in the form of an AWS Signature Version 4 signature.
  • Using Lambda authorizers – This offers the greatest flexibility because it leaves full control over how a request can be authorized. Eventually, the Lambda authorizer must return an IAM policy, which in turn will be evaluated by API Gateway on whether the request should be allowed or denied.

For the full list of supported authentication and authorization mechanisms in API Gateway, refer to Controlling and managing access to a REST API in API Gateway.
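
To illustrate the Lambda authorizer approach, the following is a minimal sketch of a handler that inspects a Cognito-issued JWT and returns an allow or deny policy. The verify_jwt helper and the group-to-permission mapping are simplified assumptions for illustration, not the exact code in the repository.

import os


def lambda_handler(event, context):
    # Token sent by the MLflow UI as "Bearer <JWT>" in the Authorization header
    token = event.get("authorizationToken", "").replace("Bearer ", "")

    # verify_jwt is assumed to validate the token signature against the Amazon
    # Cognito user pool JWKS and return the decoded claims (for example, using python-jose)
    claims = verify_jwt(token, user_pool_id=os.environ["COGNITO_USER_POOL_ID"])

    # The API Gateway ARN prefix, for example arn:aws:execute-api:<region>:<account>:<api-id>
    api_arn = event["methodArn"].split("/")[0]

    groups = claims.get("cognito:groups", [])
    if "admins" in groups:
        effect, resource = "Allow", f"{api_arn}/*"          # full access
    elif "readers" in groups:
        effect, resource = "Allow", f"{api_arn}/*/GET/*"    # read-only access
    else:
        effect, resource = "Deny", event["methodArn"]

    return {
        "principalId": claims.get("sub", "user"),
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resource}
            ],
        },
    }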

MLflow Python SDK authentication (IAM authorizer)

The MLflow experiment tracking server implements a REST API to interact in a programmatic way with the resources and artifacts. The MLflow Python SDK provides a convenient way to log metrics, runs, and artifacts, and it interfaces with the API resources hosted under the namespace <MLflow-Tracking-Server-URI>/api/. We configure API Gateway to use the IAM authorizer for resource access control on this namespace, thereby requiring every request to be signed with AWS Signature Version 4.

To facilitate the request signing process, starting from MLflow 1.30.0, this capability can be seamlessly enabled. Make sure that the requests_auth_aws_sigv4 library is installed in the system and set the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True. More information can be found in the official MLflow documentation.

At this point, the MLflow SDK only needs AWS credentials. Because requests_auth_aws_sigv4 uses Boto3 to retrieve credentials, we know that it can load credentials from the instance metadata when an IAM role is associated with an Amazon Elastic Compute Cloud (Amazon EC2) instance (for other ways to supply credentials to Boto3, see Credentials). This means that it can also load AWS credentials from the associated execution role when running on a SageMaker managed instance, as discussed later in this post.

Configure IAM policies to access MLflow APIs via API Gateway

You can use IAM roles and policies to control who can invoke resources on API Gateway. For more details and IAM policy reference statements, refer to Control access for invoking an API.

The following code shows an example IAM policy that grants the caller permissions to all methods on all resources on the API Gateway shielding MLflow, practically giving admin access to the MLflow server:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/*/*",
      "Effect": "Allow"
    }
  ]
}

If we want a policy that allows a user read-only access to all resources, the IAM policy would look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": [
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/GET/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/runs/search/",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/experiments/search",
       ],
       "Effect": "Allow"
     }
  ]
}

Another example might be a policy to give specific users permissions to register models to the model registry and promote them later to specific stages (staging, production, and so on):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "execute-api:Invoke",
      "Resource": [
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/GET/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/runs/search/",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/experiments/search",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/model-versions/*",
        "arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<MLFLOW_API_ID>/<STAGE>/POST/api/2.0/mlflow/registered-models/*"
      ],
      "Effect": "Allow"
    }
  ]
}

MLflow UI authentication (Lambda authorizer)

Browser access to the MLflow server is handled by the MLflow UI implemented with React. The MLflow UI hasn’t been designed to support authenticated users. Implementing a robust login flow might appear a daunting task, but luckily we can rely on the Amplify UI React components for authentication, which greatly reduces the effort to create a login flow in a React application, using Amazon Cognito as the identity store.

Amazon Cognito allows us to manage our own user base and also support third-party identity federation, making it feasible to build, for example, ADFS federation (see Building ADFS Federation for your Web App using Amazon Cognito User Pools for more details). Tokens issued by Amazon Cognito must be verified on API Gateway. Simply verifying the token is not enough for fine-grained access control, so the Lambda authorizer gives us the flexibility to implement the logic we need. We can then build our own Lambda authorizer to verify the JWT token and generate the IAM policies to let API Gateway deny or allow the request. The following diagram illustrates the MLflow login flow.

MLflow UI auth steps

For more information about the actual code changes, refer to the patch file cognito.patch, applicable to MLflow version 2.3.1.

This patch introduces two capabilities:

  • Add the Amplify UI components and configure the Amazon Cognito details via environment variables that implement the login flow
  • Extract the JWT from the session and attach it as a bearer token in the Authorization header of requests sent to the MLflow server

Although maintaining code that diverges from upstream always adds complexity compared to relying on the upstream alone, it’s worth noting that the changes are minimal because we rely on the Amplify React UI components.

With the new login flow in place, let’s create the production build for our updated MLflow UI. AWS Amplify Hosting is an AWS service that provides a git-based workflow for CI/CD and hosting of web apps. The build step in the pipeline is defined by the buildspec.yaml, where we can inject as environment variables details about the Amazon Cognito user pool ID, the Amazon Cognito identity pool ID, and the user pool client ID needed by the Amplify UI React component to configure the authentication flow. The following code is an example of the buildspec.yaml file:

version: "1.0"
applications:
  - frontend:
      phases:
        preBuild:
          commands:
          - fallocate -l 4G /swapfile
          - chmod 600 /swapfile
          - mkswap /swapfile
          - swapon /swapfile
          - swapon -s
          - yarn install
        build:
          commands:
          - echo "REACT_APP_REGION=$REACT_APP_REGION" >> .env
          - echo "REACT_APP_COGNITO_USER_POOL_ID=$REACT_APP_COGNITO_USER_POOL_ID" >> .env
          - echo "REACT_APP_COGNITO_IDENTITY_POOL_ID=$REACT_APP_COGNITO_IDENTITY_POOL_ID" >> .env
          - echo "REACT_APP_COGNITO_USER_POOL_CLIENT_ID=$REACT_APP_COGNITO_USER_POOL_CLIENT_ID" >> .env
          - yarn run build
        artifacts:
          baseDirectory: build
          files:
          - "**/*"

Securely log experiments and runs using the SageMaker execution role

One of the key aspects of the solution discussed here is the secure integration with SageMaker. SageMaker is a managed service, and as such, it performs operations on your behalf. What SageMaker is allowed to do is defined by the IAM policies attached to the execution role that you associate to a SageMaker training job, or that you associate to a user profile working from Studio. For more information on the SageMaker execution role, refer to SageMaker Roles.

By configuring the API Gateway to use IAM authentication on the <MLFlow-Tracking-Server-URI>/api/* resources, we can define a set of IAM policies on the SageMaker execution role that will allow SageMaker to interact with MLflow according to the access level specified.

When setting the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True while working in Studio or in a SageMaker training job, the MLflow Python SDK will automatically sign all requests, which will be validated by the API Gateway:

import os
import mlflow

os.environ['MLFLOW_TRACKING_AWS_SIGV4'] = "True"
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)
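
Once the tracking URI and experiment are set, logging works as usual through the MLflow SDK, and every call results in a SigV4-signed request that API Gateway validates. The following minimal sketch uses illustrative run, parameter, and metric names:

with mlflow.start_run(run_name="sigv4-smoke-test"):
    # Each logging call is signed with the SageMaker execution role credentials
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)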

Test the SageMaker execution role with the MLflow SDK

If you access the Studio domain that was generated, you will find three users:

  • mlflow-admin – Associated to an execution role with similar permissions as the user in the Amazon Cognito group admins
  • mlflow-reader – Associated to an execution role with similar permissions as the user in the Amazon Cognito group readers
  • mlflow-model-approver – Associated to an execution role with similar permissions as the user in the Amazon Cognito group model-approvers

To test the three different roles, refer to the labs provided as part of this sample for each user profile.

The following diagram illustrates the workflow for Studio user profiles and SageMaker job authentication with MLflow.

SageMaker logging to MLflow

Similarly, when running SageMaker jobs on the SageMaker managed infrastructure, if you set the MLFLOW_TRACKING_AWS_SIGV4 environment variable to True and the SageMaker execution role passed to the jobs has the correct IAM policy to access the API Gateway, you can securely interact with your MLflow tracking server without managing credentials yourself. When running SageMaker training jobs and initializing an estimator class, you can pass environment variables that SageMaker injects and makes available to the training script, as shown in the following code:

environment={
  "AWS_DEFAULT_REGION": region,
  "MLFLOW_EXPERIMENT_NAME": experiment_name,
  "MLFLOW_TRACKING_URI": tracking_uri,
  "MLFLOW_AMPLIFY_UI_URI": mlflow_amplify_ui,
  "MLFLOW_TRACKING_AWS_SIGV4": "true",
  "MLFLOW_USER": user
}

estimator = SKLearn(
  entry_point='train.py',
  source_dir='source_dir',
  role=role,
  metric_definitions=metric_definitions,
  hyperparameters=hyperparameters,
  instance_count=1,
  instance_type='ml.m5.large',
  framework_version='1.0-1',
  base_job_name='mlflow',
  environment=environment
)

Visualize runs and experiments from the MLflow UI

After the first deployment is complete, let’s populate the Amazon Cognito user pool with three users, each belonging to a different group, to test the permissions we have implemented. You can use this script add_users_and_groups.py to seed the user pool. After running the script, if you check the Amazon Cognito user pool on the Amazon Cognito console, you should see the three users created.

Cognito users

On the REST API Gateway side, the Lambda authorizer first verifies the signature of the token using the Amazon Cognito user pool key and verifies the claims. Only after that does it extract the Amazon Cognito group the user belongs to from the claim in the JWT token (cognito:groups) and apply the permissions we have defined for that group.

For our specific case, we have three groups:

  • admins – Can see and can edit everything
  • readers – Can only see everything
  • model-approvers – The same as readers, plus can register models, create versions, and promote model versions to the next stage

Depending on the group, the Lambda authorizer will generate different IAM policies. This is just one example of how authorization can be achieved; with a Lambda authorizer, you can implement any logic you need. We have opted to build the IAM policy at run time in the Lambda function itself; however, you can pregenerate appropriate IAM policies, store them in Amazon DynamoDB, and retrieve them at run time according to your own business logic. Keep in mind that if you want to restrict only a subset of actions, you need to be aware of the MLflow REST API definition.

You can explore the code for the Lambda authorizer on the GitHub repo.
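
As a rough sketch of the idea (not the repository’s actual implementation), a Lambda authorizer written in Python could map the cognito:groups claim to an IAM policy along the following lines. The verify_jwt helper, the READ_ONLY_METHODS list, and the API_ARN_PREFIX environment variable are assumptions for illustration, and a TOKEN-type authorizer is assumed:

import os

# Illustrative subset of read-only MLflow API methods (method + path suffix)
READ_ONLY_METHODS = ["GET/*", "POST/api/2.0/mlflow/runs/search/"]

def verify_jwt(token):
    # Placeholder: in practice, validate the token signature against the Cognito
    # user pool JWKS and check the standard claims before trusting it
    raise NotImplementedError("validate the Cognito JWT here")

def build_policy(effect, resources, principal_id):
    # Shape of the IAM policy document that API Gateway expects from an authorizer
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resources}
            ],
        },
    }

def lambda_handler(event, context):
    claims = verify_jwt(event["authorizationToken"])
    groups = claims.get("cognito:groups", [])
    # e.g. arn:aws:execute-api:<REGION>:<ACCOUNT_ID>:<API_ID>/<STAGE>/
    api_arn_prefix = os.environ["API_ARN_PREFIX"]

    if "admins" in groups:
        resources = [api_arn_prefix + "*/*"]
    elif "readers" in groups:
        resources = [api_arn_prefix + method for method in READ_ONLY_METHODS]
    else:
        return build_policy("Deny", [api_arn_prefix + "*/*"], claims.get("sub", "unknown"))

    return build_policy("Allow", resources, claims["sub"])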

Multi-account considerations

Data science workflows have to pass through multiple stages as they progress from experimentation to production. A common approach involves separate accounts dedicated to different phases of the AI/ML workflow (experimentation, development, and production). However, sometimes it’s desirable to have a dedicated account that acts as a central repository for models. Although our architecture and sample refer to a single account, it can be easily extended to implement this last scenario, thanks to the IAM capability to assume roles across accounts.

The following diagram illustrates an architecture using MLflow as a central repository in an isolated AWS account.

MLflow sagemaker multi account

For this use case, we have two accounts: one for the MLflow server, and one for the experimentation accessible by the data science team. To enable cross-account access from a SageMaker training job running in the data science account, we need the following elements:

  • A SageMaker execution role in the data science AWS account with an IAM policy attached that allows assuming a different role in the MLflow account:
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "<ARN-ROLE-IN-MLFLOW-ACCOUNT>"
  }
}
  • An IAM role in the MLflow account with the right IAM policy attached that grants access to the MLflow tracking server, and allows the SageMaker execution role in the data science account to assume it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<ARN-SAGEMAKER-EXECUTION-ROLE-IN-DATASCIENCE-ACCOUNT>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Within the training script running in the data science account, you can use this example before initializing the MLflow client. You need to assume the role in the MLflow account and store the temporary credentials as environment variables, because this new set of credentials will be picked up by a new Boto3 session initialized within the MLflow client.

import os

import boto3
import mlflow

# Session using the SageMaker Execution Role in the Data Science Account
session = boto3.Session()
sts = session.client("sts")

response = sts.assume_role(
  RoleArn="<ARN-ROLE-IN-MLFLOW-ACCOUNT>",
  RoleSessionName="AssumedMLflowAdmin"
)

credentials = response['Credentials']
os.environ['AWS_ACCESS_KEY_ID'] = credentials['AccessKeyId']
os.environ['AWS_SECRET_ACCESS_KEY'] = credentials['SecretAccessKey']
os.environ['AWS_SESSION_TOKEN'] = credentials['SessionToken']

# set remote mlflow server and initialize a new boto3 session in the context
# of the assumed role
mlflow.set_tracking_uri(tracking_uri)
experiment = mlflow.set_experiment(experiment_name)

In this example, RoleArn is the ARN of the role you want to assume, and RoleSessionName is the name you choose for the assumed session. The sts.assume_role method returns temporary security credentials that the MLflow client uses to create a new Boto3 client for the assumed role. The MLflow client will then send signed requests to API Gateway in the context of the assumed role.

Render MLflow within SageMaker Studio

SageMaker Studio is based on JupyterLab, and just as in JupyterLab, you can install extensions to boost your productivity. Thanks to this flexibility, data scientists working with MLflow and SageMaker can further improve their integration by accessing the MLflow UI from the Studio environment and immediately visualizing the experiments and runs logged. The following screenshot shows an example of MLflow rendered in Studio.

MLflow Iframe in Studio

For information about installing JupyterLab extensions in Studio, refer to Amazon SageMaker Studio and SageMaker Notebook Instance now come with JupyterLab 3 notebooks to boost developer productivity. For details on adding automation via lifecycle configurations, refer to Customize Amazon SageMaker Studio using Lifecycle Configurations.

In the sample repository supporting this post, we provide instructions on how to install the jupyterlab-iframe extension. After the extension has been installed, you can access the MLflow UI without leaving Studio using the same set of credentials you have stored in the Amazon Cognito user pool.

Next steps

There are several options for expanding upon this work. One idea is to consolidate the identity store for both SageMaker Studio and the MLflow UI. Another option is to use a third-party identity federation service with Amazon Cognito, and then use AWS IAM Identity Center (successor to AWS Single Sign-On) to grant access to Studio with the same third-party identity. A further option is to introduce full automation using Amazon SageMaker Pipelines for the CI/CD part of model building, using MLflow as a centralized experiment tracking server and model registry with strong governance capabilities, together with automation to deploy approved models to a SageMaker hosting endpoint.

Conclusion

The aim of this post was to provide enterprise-level access control for MLflow. To achieve this, we separated the authentication and authorization processes from the MLflow server and transferred them to API Gateway. We utilized two authorization methods offered by API Gateway, IAM authorizers and Lambda authorizers, to cater to the requirements of both the MLflow Python SDK and the MLflow UI. It’s important to understand that users are external to MLflow, so consistent governance requires maintaining the IAM policies, especially in the case of very granular permissions. Finally, we demonstrated how to enhance the experience of data scientists by integrating MLflow into Studio through simple extensions.

Try out the solution on your own by accessing the GitHub repo and let us know if you have any questions in the comments!

Additional resources

For more information about SageMaker and MLflow, see the following:


About the Authors

Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunication Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Chris Fregly is a Principal Specialist Solution Architect for AI and machine learning at Amazon Web Services (AWS) based in San Francisco, California. He is co-author of the O’Reilly Book, “Data Science on AWS.” Chris is also the Founder of many global meetups focused on Apache Spark, TensorFlow, Ray, and KubeFlow. He regularly speaks at AI and machine learning conferences across the world including O’Reilly AI, Open Data Science Conference, and Big Data Spain.

Irshad Buchh is a Principal Solutions Architect at Amazon Web Services (AWS). Irshad works with large AWS Global ISV and SI partners and helps them build their cloud strategy and drive broad adoption of Amazon’s cloud computing platform. Irshad interacts with CIOs, CTOs, and their architects, and helps them and their end customers implement their cloud vision. Irshad owns the strategic and technical engagements and ultimate success around specific implementation projects, and develops deep expertise in Amazon Web Services technologies as well as broad know-how around how applications and services are constructed using the Amazon Web Services platform.

Read More

Host ML models on Amazon SageMaker using Triton: TensorRT models

Host ML models on Amazon SageMaker using Triton: TensorRT models

Sometimes it can be very beneficial to use tools such as compilers that can modify and compile your models for optimal inference performance. In this post, we explore TensorRT and how to use it with Amazon SageMaker inference using NVIDIA Triton Inference Server. We explore how TensorRT works and how to host and optimize these models for performance and cost efficiency on SageMaker. SageMaker provides single model endpoints (SMEs), which allow you to deploy a single ML model, or multi-model endpoints (MMEs), which allow you to specify multiple models to host behind a logical endpoint for higher resource utilization.

To serve models, Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the TensorRT backend that is supported by Triton on SageMaker so that you can make an informed decision for your workloads and get great results.

Deep dive into the TensorRT backend

TensorRT enables you to optimize inference using techniques such as quantization, layer and tensor fusion, and kernel tuning on NVIDIA GPUs. By adopting and compiling models to use TensorRT, you can optimize performance and utilization for your inference workloads. In some cases there are trade-offs, which is typical of techniques such as quantization, but the results can be dramatic, reducing latency and increasing the number of transactions that can be processed.

The TensorRT backend is used to run TensorRT models. TensorRT is an SDK developed by NVIDIA that provides a high-performance deep learning inference library. It’s optimized for NVIDIA GPUs and provides a way to accelerate deep learning inference in production environments. TensorRT supports major deep learning frameworks and includes a high-performance deep learning inference optimizer and runtime that delivers low latency, high-throughput inference for AI applications.

TensorRT is able to accelerate model performance by using a technique called graph optimization to optimize the computation graph generated by a deep learning model. It optimizes the graph to minimize the memory footprint by freeing unnecessary memory and efficiently reusing it. TensorRT compilation fuses the sparse operations inside the model graph to form a larger kernel to avoid the overhead of multiple small kernel launches. With kernel auto-tuning, the engine selects the best algorithm for the target GPU, maximizing hardware utilization. Additionally, TensorRT employs CUDA streams to enable parallel processing of models, further improving GPU utilization and performance. Finally, through quantization, TensorRT can use mixed-precision acceleration of Tensor cores, enabling the model to run in FP32, TF32, FP16, and INT8 precision for the best inference performance. However, although the reduced precision can generally improve the latency performance, it might come with possible instability and degradation in model accuracy. Overall, TensorRT’s combination of techniques results in faster inference and lower latency compared to other inference engines.

The TensorRT backend for Triton Inference Server is designed to take advantage of the powerful inference capabilities of NVIDIA GPUs. To use TensorRT as a backend for Triton Inference Server, you need to create a TensorRT engine from your trained model using the TensorRT API. This engine is then loaded into Triton Inference Server and used to perform inference on incoming requests. The following are the basic steps to use TensorRT as a backend for Triton Inference Server:

  1. Convert your trained model to the ONNX format. Triton Inference Server supports ONNX as a model format. ONNX is a standard for representing deep learning models, enabling them to be transferred between frameworks. If your model isn’t already in the ONNX format, you need to convert it using the appropriate framework-specific tool. For example, in PyTorch, this can be done using the torch.onnx.export method.
  2. Import the ONNX model into TensorRT and generate the TensorRT engine. For TensorRT, there are several ways to build a TensorRT from your ONNX model. For this post, we use the trtexec CLI tool. trtexec is a tool to quickly utilize TensorRT without having to develop your own application. The trtexec tool has three main purposes:
    1. Benchmarking networks on random or user-provided input data.
    2. Generating serialized engines from models.
    3. Generating a serialized timing cache from the builder.
  3. Load the TensorRT engine in Triton Inference Server. After the TensorRT engine is generated, it can be loaded into Triton Inference Server by creating a model configuration file. The model configuration (config.pbtxt) file should include the path to the TensorRT engine file and the input and output shapes of the model.

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. There are several key points to note in this configuration file:

  • name – This field defines the model’s name and must be unique within the model repository.
  • platform – This field defines the type of the model: TensorRT engine, PyTorch, or something else.
  • max_batch_size – This specifies the maximum batch size that can be passed to this model. If the model’s batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case, max_batch_size should be set to a value greater than or equal to 1, which indicates the maximum batch size that Triton should use with the model. For models that don’t support batching, or don’t support batching in the specific ways we’ve described, max_batch_size must be set to 0.
  • Input and output – These fields are required because NVIDIA Triton needs metadata about the model. Essentially, it requires the names of your network’s input and output layers and the shape of said inputs and outputs.
  • instance_group – This determines how many instances of this model will be created and whether they will use the GPU or CPU.
  • dynamic_batching – Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. The preferred_batch_size property indicates the batch sizes that the dynamic batcher should attempt to create. For most models, preferred_batch_size should not be specified, as described in Recommended Configuration Process. An exception is TensorRT models that specify multiple optimization profiles for different batch sizes. In this case, because some optimization profiles may give significant performance improvement compared to others, it may make sense to use preferred_batch_size for the batch sizes supported by those higher-performance optimization profiles. You can also reference the batch size that was previously used when running trtexec. In addition, you can configure the delay time to allow requests to be delayed for a limited time in the scheduler so that other requests can join the dynamic batch.

The TensorRT backend has been improved to deliver significantly better performance. Improvements include reducing thread contention, using pinned memory for faster transfers between CPU and GPU, and increasing compute and memory copy overlap on GPUs. It also reduces memory usage of TensorRT models in many cases by sharing weights across multiple model instances. Overall, the TensorRT backend for Triton Inference Server provides a powerful and flexible way to serve deep learning models with optimized TensorRT inference. By adjusting the configuration options, you can optimize performance and control behavior to suit your specific use case.

SageMaker provides Triton via SMEs and MMEs

SageMaker enables you to deploy both single and multi-model endpoints with Triton Inference Server. Triton supports a heterogeneous cluster with both GPUs and CPUs, which helps standardize inference across platforms and dynamically scales out to any CPU or GPU to handle peak loads. The following diagram illustrates the Triton Inference Server architecture. Inference requests arrive at the server via either HTTP/REST or by the C API, and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The framework backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs. The outputs are then formatted and returned in the response. The model repository is a file system-based repository of the models that Triton will make available for inferencing.

Triton architecture

SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models. SageMaker MMEs offer capabilities for running multiple deep learning or ML models on the GPU, at the same time, with Triton Inference Server, which has been extended to implement the MME API contract. MMEs enable sharing GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can easily achieve optimal price performance.

When a SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload, it routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint. It dynamically downloads models from Amazon Simple Storage Service (Amazon S3) to the instance’s storage volume if the invoked model isn’t available on the instance storage volume. Then SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU-accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. For more information about SageMaker MMEs on GPU, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

SageMaker MMEs can horizontally scale using an auto scaling policy and provision additional GPU compute instances based on specified metrics. When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criterion to determine the scaling characteristics of your auto scaling groups. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. For single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you might want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
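
As an illustration, the following sketch registers a target tracking scaling policy on invocations per instance; the endpoint name, variant name, capacity bounds, and target value are placeholders to adapt to your workload:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/<ENDPOINT_NAME>/variant/AllTraffic"  # placeholder endpoint and variant names

# Register the variant instance count as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking on invocations per instance; the target value is illustrative
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)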

Solution overview

With the NVIDIA Triton container image on SageMaker, you can now use Triton’s TensorRT backend, which allows you to deploy TensorRT models. The TensorRT_backend repo contains the documentation and source for the backend. In the following sections, we walk you through the example notebook that demonstrates how to use NVIDIA Triton Inference Server on SageMaker MMEs with the GPU feature to deploy a BERT natural language processing (NLP) model.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that gives SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. You can use the following code example to retrieve the pre-built Triton ECR image:

import transformers
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
bucket = sagemaker_session.default_bucket()
print(bucket)

account_id_map = {
"us-east-1": "785573368785",
"us-east-2": "007439368137",
"us-west-1": "710691900526",
"us-west-2": "301217895009",
"eu-west-1": "802834080501",
"eu-west-2": "205493899709",
"eu-west-3": "254080097072",
"eu-north-1": "601324751636",
"eu-south-1": "966458181534",
"eu-central-1": "746233611703",
"ap-east-1": "110948597952",
"ap-south-1": "763008648453",
"ap-northeast-1": "941853720454",
"ap-northeast-2": "151534178276",
"ap-southeast-1": "324986816169",
"ap-southeast-2": "355873309152",
"cn-northwest-1": "474822919863",
"cn-north-1": "472730292857",
"sa-east-1": "756306329178",
"ca-central-1": "464438896020",
"me-south-1": "836785723513",
"af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ValueError("UNSUPPORTED REGION")
    
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
account_id=account_id_map[region], region=region, base=base
)

Add utility methods for preparing the request payload

We create the functions to transform the sample text we’re using for inference into the payload that can be sent for inference to Triton Inference Server. The tritonclient package, which was installed at the beginning, provides utility methods to generate the payload without having to know the details of the specification. We use the created methods to convert our inference request into a binary format, which provides lower latencies for inference. These functions are used during the inference step.
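
As a point of reference, a tokenization helper compatible with the payload used later could look like the following sketch; the tokenizer choice and the tokenize_text name are assumptions based on the model inputs (token_ids and attn_mask with a fixed length of 128):

import numpy as np
from transformers import AutoTokenizer

# Assumed to match the tokenizer used when exporting the BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_text(text, max_length=128):
    # Return token IDs and attention mask padded/truncated to the fixed length expected by the model
    encoded = tokenizer(text, padding="max_length", truncation=True, max_length=max_length)
    return encoded["input_ids"], encoded["attention_mask"]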

Prepare the TensorRT model

In this step, we load the pre-trained BERT model and convert it to an ONNX representation using the torch ONNX exporter and the onnx_exporter.py script. After the ONNX model is created, we use the TensorRT trtexec command to create the model plan to be hosted with Triton. This is run as part of the generate_model.sh script from the following cell. Note that the cell takes around 30 minutes to complete.

!docker run --gpus=all --rm -it \
    -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 \
    /bin/bash generate_models.sh

While waiting for the command to finish running, you can check the scripts used in this step. In the onnx_exporter.py script, we use the torch.onnx.export function for ONNX model creation:


    torch.onnx.export(
        model,
        dummy_inputs,
        args.save,
        export_params=True,
        opset_version=10,
        input_names=["token_ids", "attn_mask"],
        output_names=["output","pooled_output"],
        dynamic_axes={"token_ids": [0, 1], "attn_mask": [0, 1], "output": [0]},
    )

The command line in the generate_model.sh file creates the TensorRT model plan. For more information, refer to the trtexec command-line tool.

trtexec --onnx=model.onnx --saveEngine=model_bs16.plan --minShapes=token_ids:1x128,attn_mask:1x128 --optShapes=token_ids:16x128,attn_mask:16x128 --maxShapes=token_ids:128x128,attn_mask:128x128 --fp16 --verbose --workspace=14000 | tee conversion_bs16_dy.txt

Build a TensorRT NLP BERT model repository

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model, we need to create a model directory consisting of the model artifact and define the config.pbtxt file to specify the model configuration that Triton uses to load and serve the model. To learn more about the config settings, refer to Model Configuration. The model repository structure for the BERT model is as follows:

Folder structure for model

Note that Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. Here, the folder 1 represents version 1 of the BERT model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifacts required by that backend. Here, we are using the TensorRT backend, which requires the TensorRT plan file that is used for serving (for this example, model.plan). If we were using a PyTorch backend, a model.pt file would be required. For more details on naming conventions for model files, refer to Model Files.
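
Concretely, the layout for this example looks roughly like the following. The top-level directory name is illustrative; the model name and version folder match the configuration shown next:

model_repository/
└── bert/
    ├── config.pbtxt
    └── 1/
        └── model.plan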

Every TensorRT model must provide a config.pbtxt file describing the model configuration. In order to use this backend, you must set the platform field of your model’s config.pbtxt file to tensorrt_plan. The following section of code shows an example of how to define the configuration file for the BERT model being served through Triton’s TensorRT backend:

name: "bert"
platform: "tensorrt_plan"
max_batch_size: 128
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attn_mask"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 768]
  },
  {
    name: "pooled_output"
    data_type: TYPE_FP32
    dims: [768]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 16
}

SageMaker expects a .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. To simulate several similar models being hosted, you might think all it takes is to tar the model repository we have already built, and then copy it with different file names. However, Triton requires unique model names. Therefore, we first copy the model repo N times, changing the model directory names and their corresponding config.pbtxt files. You can change the number of N to have more copies of the model that can be dynamically loaded to the hosting endpoint to simulate the model load/unload action managed by SageMaker. See the following code:

import os
import shutil

N = 5
prefix = 'bert-mme'
model_repo_base = 'model_repo'

# Get model names from model_repo_0
model_names = [name for name in os.listdir(f'{model_repo_base}_0') if os.path.isdir(f'{model_repo_base}_0/{name}')]

for i in range(N):
    # Make copy of previous model repo, increment # id
    shutil.copytree(f'{model_repo_base}_0', f'{model_repo_base}_{i+1}')
    time.sleep(5)
    for name in model_names:
        model_dirs_path = f'{model_repo_base}_{i+1}/{name}'

        # Open each model's config file to increment model # id there 
        fin = open(f'{model_dirs_path}/config.pbtxt', "rt")
        data = fin.read()
        data = data.replace(name, name[:-1] + str(i+1))
        fin.close()
        fin = open(f'{model_dirs_path}/config.pbtxt', "wt")
        fin.write(data)
        fin.close()
    
        # Change model directory name to match new config
        os.rename(model_dirs_path,model_dirs_path[:-1]+str(i+1))
        time.sleep(2)
        
    if i == 0:
        tar_file_name = f'bert-{i}.tar.gz'
        model_repo_target = f'{model_repo_base}_{i}/'
        !tar -C $model_repo_target -czf $tar_file_name .
        sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)

    tar_file_name = f'bert-{i+1}.tar.gz'
    model_repo_target = f'{model_repo_base}_{i+1}/'
    !tar -C $model_repo_target -czf $tar_file_name .
    sagemaker_session.upload_data(path=tar_file_name, key_prefix=prefix)
    !sudo rm -r "$tar_file_name" "$model_repo_target"

Create a SageMaker endpoint

Now that we have uploaded the model artifacts to Amazon S3, we can create the SageMaker model object, endpoint configuration, and endpoint.

First, we need to define the serving container. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker multi-model endpoint will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. See the following code:

container = {
"Image": triton_image_uri,
"ModelDataUrl": model_data_uri,
"Mode": "MultiModel",
}

Then we create the SageMaker model object using the create_model boto3 API by specifying the ModelName and container definition:

create_model_response = sm.create_model(
ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use this model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we are deploying to a g5.xlarge NVIDIA GPU instance:

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

With this endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

endpoint_name = "triton-nlp-bert-trt-mme-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
create_endpoint_response = sm.create_endpoint(
EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
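
To block until the endpoint reaches InService, you can use the boto3 waiter on the SageMaker client created earlier (a minimal sketch):

# Wait for the endpoint to finish deploying, then confirm its status
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])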

Invoke your model hosted on the SageMaker endpoint

When the endpoint is running, we can use some sample raw data to perform inference using either JSON or binary+JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. We can send the inference request to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. Here we invoke the endpoint in a for loop to request the endpoint to dynamically load or unload models based on the requests:

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

for i in range(N):
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel=f"bert-{i}.tar.gz",
    )

    print(json.loads(response["Body"].read().decode("utf8")))

You can monitor the model loading and unloading status using Amazon CloudWatch metrics and logs. SageMaker multi-model endpoints provide instance-level metrics to monitor; for more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch. The LoadedModelCount metric shows the number of models loaded in the containers. The ModelCacheHit metric shows the number of invocations to models that are already loaded onto the container, to help you get model invocation-level insights. To check if models are unloaded from memory, you can look for the successful unloaded log entries in the endpoint’s CloudWatch logs.
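
For example, you can pull the ModelCacheHit metric programmatically with the endpoint_name defined earlier. The namespace and dimensions below are assumptions based on the documented multi-model endpoint invocation metrics:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Retrieve the ModelCacheHit metric for the last hour, averaged over 5-minute periods
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",                      # assumed namespace for MME invocation metrics
    MetricName="ModelCacheHit",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
print(stats["Datapoints"])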

The notebook can be found in the GitHub repository.

Best practices

Before starting any optimization effort with TensorRT, it’s essential to determine what should be measured. Without measurements, it’s impossible to make reliable progress or measure whether success has been achieved. Here are some best practices to consider when using the TensorRT backend for Triton Inference Server:

  • Optimize your TensorRT model – Before deploying a model on Triton with the TensorRT backend, make sure to optimize the model following the TensorRT best practices guide. This will help you achieve better performance by reducing inference time and memory consumption.
  • Use TensorRT instead of other Triton backends when possible – TensorRT is designed to optimize deep learning models for deployment on NVIDIA GPUs, so using it can significantly improve inference performance compared to using other supported Triton backends.
  • Use the right precision – TensorRT supports multiple precisions (FP32, FP16, INT8), and selecting the right precision for your model can have a significant impact on performance. Consider using lower precision when possible.
  • Use batch sizes that fit your hardware – Make sure to choose batch sizes that fit your GPU’s memory and compute capabilities. Using batch sizes that are too large or too small can negatively impact performance.

Conclusion

In this post, we dove deep into the TensorRT backend that Triton Inference Server supports on SageMaker. This backend accelerates your TensorRT models on NVIDIA GPUs. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to take advantage of this capability using single model endpoints for guaranteed performance and multi-model endpoints to get a better balance of performance and cost savings. To get started with MME support for GPU, see Supported algorithms, frameworks, and instances.

We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.


 About the Authors

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences,  and staying up to date with the latest technology trends.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Read More

Build an image search engine with Amazon Kendra and Amazon Rekognition

Build an image search engine with Amazon Kendra and Amazon Rekognition

In this post, we discuss a machine learning (ML) solution for complex image searches using Amazon Kendra and Amazon Rekognition. Specifically, we use the example of architecture diagrams for complex images due to their incorporation of numerous different visual icons and text.

With the internet, searching and obtaining an image has never been easier. Most of the time, you can accurately locate your desired images, such as when searching for your next holiday getaway destination. Simple searches are often successful because they aren’t associated with many characteristics; beyond the desired image characteristics, the search criteria typically don’t require significant detail to locate the required result. For example, if a user searches for a specific type of blue bottle, results showing many different types of blue bottles are displayed. However, the desired blue bottle may not be easily found due to generic search terms.

Interpreting search context also contributes to simplification of results. When users have a desired image in mind, they try to frame this into a text-based search query. Understanding the nuances between search queries for similar topics is important to provide relevant results and minimize the effort required from the user to manually sort through results. For example, the search query “Dog owner plays fetch” seeks to return image results showing a dog owner playing a game of fetch with a dog. However, the actual results generated may instead focus on a dog fetching an object without displaying an owner’s involvement. Users may have to manually filter out unsuitable image results when dealing with complex searches.

To address the problems associated with complex searches, this post describes in detail how you can achieve a search engine that is capable of searching for complex images by integrating Amazon Kendra and Amazon Rekognition. Amazon Kendra is an intelligent search service powered by ML, and Amazon Rekognition is an ML service that can identify objects, people, text, scenes, and activities from images or videos.

What images can be too complex to be searchable? One example is architecture diagrams, which can be associated with many search criteria depending on the use case complexity and number of technical services required, which results in significant manual search effort for the user. For example, if users want to find an architecture solution for the use case of customer verification, they will typically use a search query similar to “Architecture diagrams for customer verification.” However, generic search queries would span a wide range of services and across different content creation dates. Users would need to manually select suitable architectural candidates based on specific services and consider the relevance of the architecture design choices according to the content creation date and query date.

The following figure shows an example diagram that illustrates an orchestrated extract, transform, and load (ETL) architecture solution.

Users who are not familiar with the service offerings that are provided on the cloud platform may use a variety of generic phrases and descriptions when searching for such a diagram. The following are some examples of how it could be searched:

  • “Orchestrate ETL workflow”
  • “How to automate bulk data processing”
  • “Methods to create a pipeline for transforming data”

Solution overview

We walk you through the following steps to implement the solution:

  1. Train an Amazon Rekognition Custom Labels model to recognize symbols in architecture diagrams.
  2. Incorporate Amazon Rekognition text detection to validate architecture diagram symbols.
  3. Use Amazon Rekognition inside a web crawler to build a repository for searching.
  4. Use Amazon Kendra to search the repository.

To easily provide users with a large repository of relevant results, the solution should provide an automated way of searching through trusted sources. Using architecture diagrams as an example, the solution needs to search through reference links and technical documents for architecture diagrams and identify the services present. Identifying keywords such as use cases and industry verticals in these sources also allows the information to be captured and for more relevant search results to be displayed to the user.

Considering the objective of how relevant diagrams should be searched, the image search solution needs to fulfil three criteria:

  • Enable simple keyword search
  • Interpret search queries based on use cases that users provide
  • Sort and order search results

Keyword search is simply searching for “Amazon Rekognition” and being shown architecture diagrams on how the service is used in different use cases. Alternatively, the search terms can be linked indirectly to the diagram through use cases and industry verticals that may be associated with the architecture. For example, searching for the terms “How to orchestrate ETL pipeline” returns results of architecture diagrams built with AWS Glue and AWS Step Functions. Sorting and ordering of search results based on attributes such as creation date would ensure the architecture diagrams are still relevant in spite of service updates and releases. The following figure shows the architecture diagram to the image search solution.

As illustrated in the preceding diagram and in the solution overview, there are two main aspects of the solution. The first aspect is performed by Amazon Rekognition, which can identify objects, people, text, scenes, and activities from images or videos. It consists of pre-trained models that can be applied to analyze images and videos at scale. With its custom labels feature, Amazon Rekognition allows you to tailor the ML service to your specific business needs by labeling images collected from architecture diagrams sourced from trusted reference links and technical documents. By uploading a small set of training images, Amazon Rekognition automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. Therefore, users without ML expertise can enjoy the benefits of a custom labels model through an API call, because a significant amount of overhead is reduced. The solution applies Amazon Rekognition Custom Labels to detect AWS service logos on architecture diagrams to make the architecture diagrams searchable by service names. After modeling, the detected services of each architecture diagram image and its metadata, like URL origin and image title, are indexed for future searches and stored in Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database designed to run high-performance applications.

The second aspect is supported by Amazon Kendra, an intelligent enterprise search service powered by ML that allows you to search across different content repositories. With Amazon Kendra, you can search for results, such as images or documents, that have been indexed. These results can also be stored across different repositories because the search service employs built-in connectors. Keywords, phrases, and descriptions could be used for searching, which allows you to accurately search for diagrams that are related to a particular use case. Therefore, you can easily build an intelligent search service with minimal development costs.
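
For illustration, after the crawled metadata is indexed, a programmatic query could look like the following sketch (shown in Python with boto3 for brevity; the index ID is a placeholder):

import boto3

kendra = boto3.client("kendra")

# Query the index with a natural-language description of the desired diagram
response = kendra.query(
    IndexId="<KENDRA-INDEX-ID>",                # placeholder for your index ID
    QueryText="How to orchestrate ETL pipeline",
)

for item in response["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))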

With an understanding of the problem and solution, the subsequent sections dive into how to automate data sourcing through the crawling of architecture diagrams from credible sources. Following this, we walk through the process of generating a custom label ML model with a fully managed service. Lastly, we cover the data ingestion by an intelligent search service, powered by ML.

Create an Amazon Rekognition model with custom labels

Before obtaining any architecture diagrams, we need a tool to evaluate if an image can be identified as an architecture diagram. Amazon Rekognition Custom Labels provides a streamlined process to create an image recognition model that identifies objects and scenes in images that are specific to a business need. In this case, we use Amazon Rekognition Custom Labels to identify AWS service icons, then the images are indexed with the services for a more relevant search using Amazon Kendra. This model doesn’t differentiate whether a picture is an architecture diagram or not; it simply identifies service icons, if any. As such, there may be instances where images that aren’t architecture diagrams end up in the search results. However, such results are minimal.

The following figure shows the steps that this solution takes to create an Amazon Rekognition Custom Labels model.

This process involves uploading the datasets, generating a manifest file that references the uploaded datasets, followed by uploading this manifest file into Amazon Rekognition. A Python script is used to aid in the process of uploading the datasets and generating the manifest file. Upon successfully generating the manifest file, it’s then uploaded into Amazon Rekognition to begin the model training process. For details on the Python script and how to run it, refer to the GitHub repo.

To train the model, in the Amazon Rekognition project, choose Train model, select the project you want to train, then add any relevant tags and choose Train model. For instructions on starting an Amazon Rekognition Custom Labels project, refer to the available video tutorials. The model may take up to 8 hours to train with this dataset.

When the training is complete, you can choose the trained model to view the evaluation results. For more details on the different metrics such as precision, recall, and F1, refer to Metrics for evaluating your model. To use the model, navigate to the Use Model tab, leave the number of inference units at 1, and start the model. Then we can use an AWS Lambda function to send images to the model in base64, and the model returns a list of labels and confidence scores.
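
The solution’s Lambda function is written in Node.js and shown later; as a rough equivalent in Python with boto3, the call to the Custom Labels model looks like the following sketch. The project version ARN, confidence threshold, and local image path are placeholders:

import boto3

rekognition = boto3.client("rekognition")

def detect_service_icons(image_bytes, model_arn):
    # Run the Custom Labels model on a single image and return the detected
    # service icons with their confidence scores
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=model_arn,
        Image={"Bytes": image_bytes},
        MinConfidence=60,   # illustrative threshold
    )
    return [(label["Name"], label["Confidence"]) for label in response["CustomLabels"]]

with open("diagram.png", "rb") as f:   # hypothetical local image
    print(detect_service_icons(f.read(), "<PROJECT-VERSION-ARN>"))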

Upon successfully training an Amazon Rekognition model with Amazon Rekognition Custom Labels, we can use it to identify service icons in the architecture diagrams that have been crawled. To increase the accuracy of identifying services in the architecture diagram, we use another Amazon Rekognition feature called text detection. To use this feature, we pass in the same picture in base64, and Amazon Rekognition returns the list of text identified in the picture. In the following figures, we compare the original image and what it looks like after the services in the image are identified. The first figure shows the original image.

The following figure shows the original image with detected services.

To ensure scalability, we use a Lambda function, which will be exposed through an API endpoint created using Amazon API Gateway. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Using a Lambda function eliminates a common concern about scaling up when large volumes of requests are made to the API endpoint. Lambda automatically runs the function for the specific API call, which stops when the invocation is complete, thereby reducing cost incurred to the user. Because the request would be directed to the Amazon Rekognition endpoint, having only the Lambda function being scalable is not sufficient. In order for the Amazon Rekognition endpoint to be scalable, you can increase the inference unit of the endpoint. For more details on configuring the inference unit, refer to Inference units.
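
Again in Python for brevity, increasing the inference units is done when starting the Custom Labels model; the ARN and unit count below are placeholders:

import boto3

rekognition = boto3.client("rekognition")

# Start the Custom Labels model with two inference units to handle more parallel requests
rekognition.start_project_version(
    ProjectVersionArn="<PROJECT-VERSION-ARN>",
    MinInferenceUnits=2,
)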

The following is a code snippet of the Lambda function for the image recognition process:

const AWS = require("aws-sdk");
const axios = require("axios");

// API to retrieve information about individual services
const SERVICE_API = process.env.SERVICE_API;
// ARN of Amazon Rekognition model
const MODEL_ARN = process.env.MODEL_ARN;

const rekognition = new AWS.Rekognition();

exports.handler = async (event) => {
  const body = JSON.parse(event["body"]);
  let base64Binary = "";

  // Checks if the payload contains a url to the image or the image in base64
  if (body.url) {
    const base64Res = await new Promise((resolve) => {
      axios
        .get(body.url, {
          responseType: "arraybuffer",
        })
        .then((response) => {
          resolve(Buffer.from(response.data, "binary").toString("base64"));
        });
    });

    base64Binary = Buffer.from(base64Res, "base64");
  } else if (body.byte) {
    const base64Cleaned = body.byte.split("base64,")[1];
    base64Binary = Buffer.from(base64Cleaned, "base64");
  }

  // Pass the contents through the trained Custom Labels model and text detection
  const [labels, text] = await Promise.all([
    detectLabels(rekognition, base64Binary, MODEL_ARN),
    detectText(rekognition, base64Binary),
  ]);
  const texts = text.TextDetections.map((text) => ({
    DetectedText: text.DetectedText,
    ParentId: text.ParentId,
  }));

  // Compare between overlapping labels and retain the label with the highest confidence
  let filteredLabels = removeOverlappingLabels(labels);

  // Sort all the labels from most to least confident
  filteredLabels = sortByConfidence(filteredLabels);

  // Remove duplicate services in the list
  const services = retrieveUniqueServices(filteredLabels, texts);

  // Pass each service into the reference document API to retrieve the URL to the documentation
  const refLinks = await getReferenceLinks(services);

  var responseBody = {
    labels: filteredLabels,
    text: texts,
    ref_links: refLinks,
  };

  console.log("Response: ", response_body);

  const response = {
    statusCode: 200,
    headers: {
      "Access-Control-Allow-Origin": "*", // Required for CORS to work
    },
    body: JSON.stringify(responseBody),
  };
  return response;
};

// Code removed to truncate section

After creating the Lambda function, we can proceed to expose it as an API using API Gateway. For instructions on creating an API with Lambda proxy integration, refer to Tutorial: Build a Hello World REST API with Lambda proxy integration.

Crawl the architecture diagrams

In order for the search feature to work, we need a repository of architecture diagrams. However, these diagrams must originate from credible sources such as AWS Blog and AWS Prescriptive Guidance. Establishing the credibility of data sources ensures the underlying implementation and purpose of the use cases are accurate and well vetted. The next step is to set up a crawler that can help gather many architecture diagrams to feed into our repository. We created a web crawler to extract architecture diagrams and information, such as a description of the implementation, from the relevant sources. There are multiple ways you could build such a mechanism; for this example, we use a program that runs on Amazon Elastic Compute Cloud (Amazon EC2). The program first obtains links to blog posts from an AWS Blog API. The response returned from the API contains information about the post, such as the title, URL, date, and links to images found in the post.

The following is a code snippet of the JavaScript function for the web crawling process:

import axios from "axios";
import puppeteer from "puppeteer";
import {
  putItemDDB,
  identifyImageHighConfidence,
  getReferenceList,
} from "./utils.js";

/** Global variables */
const blogPostsApi = process.env.BLOG_POSTS_API;
const IMAGE_URL_PATTERN =
  "<pattern in the url that identified as link to image>";
const DDB_Table = process.env.DDB_Table;

// Function that retrieves URLs of records from a public API
function getURLs(blogPostsApi) {
  // Return a list of URLs
  return axios
    .get(blogPostsApi)
    .then((response) => {
      var data = response.data.items;
      console.log("RESPONSE:");
      const blogLists = data.map((blog) => [
        blog.item.additionalFields.link,
        blog.item.dateUpdated,
      ]);
      return blogLists;
    })
    .catch((error) => console.error(error));
}

// Function that crawls content of individual URLs
async function crawlFromUrl(urls) {
  const browser = await puppeteer.launch({
    executablePath: "/usr/bin/chromium-browser",
  });
  // const browser = await puppeteer.launch();

  const page = await browser.newPage();

  let numOfValidArchUrls = 0;

  for (let index = 0; index < urls.length; index++) {
    console.log("index: ", index);
    let blogURL = urls[index][0];
    let dateUpdated = urls[index][1];

    await page.goto(blogURL);
    console.log("blogUrl:", blogURL);
    console.log("date:", dateUpdated);

    // Identify and get image from post based on URL pattern
    const images = await page.evaluate(() =>
      Array.from(document.images, (e) => e.src)
    );
    const filter1 = images.filter((img) => img.includes(IMAGE_URL_PATTERN));
    console.log("all images:", filter1);

    // Validate if image is an architecture diagram
    for (let index_1 = 0; index_1 < filter1.length; index_1++) {
      const imageUrl = filter1[index_1];

      const rekog = await identifyImageHighConfidence(imageUrl);

      // Keep the first image with at least two recognized service labels
      if (rekog && rekog.labels.size >= 2) {
        console.log("Rekog.labels.size = ", rekog.labels.size);
        console.log("Selected image url = ", imageUrl);

        let articleSection = [];
        let metadata = await page.$$('span[property="articleSection"]');

        for (let i = 0; i < metadata.length; i++) {
          const element = metadata[i];
          const value = await element.evaluate((el) => el.textContent);
          console.log("value: ", value);
          articleSection.push(value);
        }

        const title = await page.title();
        const allRefLinks = await getReferenceList(
          rekog.labels,
          rekog.textServices
        );

        numOfValidArchUrls = numOfValidArchUrls + 1;

        putItemDDB(
          blogURL,
          dateUpdated,
          imageUrl,
          articleSection.toString(),
          rekog,
          { L: allRefLinks },
          title,
          DDB_Table
        );

        console.log("numOfValidArchUrls = ", numOfValidArchUrls);
        break;
      }
    }
  }
  console.log("valid arch : ", numOfValidArchUrls);
  await browser.close();
}

async function startCrawl() {
  // Get a list of URLs
  // Extract architecture image from those URLs
  const urls = await getURLs(blogPostsApi);

  if (urls) {
    console.log("Retrieved blog post URLs");
  } else {
    console.log("Unable to retrieve blog post URLs");
    return;
  }
  await crawlFromUrl(urls);
}

startCrawl();

With this mechanism, we can crawl hundreds or even thousands of images from different blog posts. However, we need a filter that accepts only images containing the elements of an architecture diagram (in our case, AWS service icons) and rejects everything else.

This is the purpose of our Amazon Rekognition model. The diagrams go through an image recognition process that identifies service icons and determines whether an image can be considered a valid architecture diagram.

The following is a code snippet of the function that sends images to the Amazon Rekognition model:

import axios from "axios";
import AWS from "aws-sdk";

// Configuration
AWS.config.update({ region: process.env.REGION });

/** Global variables */
// API to identify images
const LABEL_API = process.env.LABEL_API;
// API to get relevant documentations of individual services
const DOCUMENTATION_API = process.env.DOCUMENTATION_API;
// Create the DynamoDB service object
const dynamoDB = new AWS.DynamoDB({ apiVersion: "2012-08-10" });

// Function to identify image using an API that calls Amazon Rekognition model
function identifyImageHighConfidence(image_url) {
  return axios
    .post(LABEL_API, {
      url: image_url,
    })
    .then((res) => {
      let data = res.data;
      let rekogLabels = new Set();
      let rekogTextServices = new Set();
      let rekogTextMetadata = new Set();

      data.labels.forEach((element) => {
        if (element.Confidence >= 40) rekogLabels.add(element.Name);
      });

      data.text.forEach((element) => {
        if (
          element.DetectedText.includes("AWS") ||
          element.DetectedText.includes("Amazon")
        ) {
          rekogTextServices.add(element.DetectedText);
        } else {
          rekogTextMetadata.add(element.DetectedText);
        }
      });
      rekogTextServices.delete("AWS");
      rekogTextServices.delete("Amazon");
      return {
        labels: rekogLabels,
        textServices: rekogTextServices,
        textMetadata: Array.from(rekogTextMetadata).join(", "),
      };
    })
    .catch((error) => console.error(error));
}

After an image passes the image recognition check, the results returned from the Amazon Rekognition model and the information relevant to it are bundled into the diagram's metadata. The metadata is then stored in a DynamoDB table, from which the record is later ingested into Amazon Kendra.

The following is a code snippet of the function that stores the metadata of the diagram in DynamoDB:

// Code removed to truncate section

// Function that PUTS item into Amazon DynamoDB table
function putItemDDB(
  originUrl,
  publishDate,
  imageUrl,
  crawlerData,
  rekogData,
  referenceLinks,
  title,
  tableName
) {
  console.log("WRITE TO DDB");
  console.log("originUrl :   ", originUrl);
  console.log("publishDate:  ", publishDate);
  console.log("imageUrl: ", imageUrl);
  let write_params = {
    TableName: tableName,
    Item: {
      OriginURL: { S: originUrl },
      PublishDate: { S: formatDate(publishDate) },
      ArchitectureURL: {
        S: imageUrl,
      },
      Metadata: {
        M: {
          crawler: {
            S: crawlerData,
          },
          Rekognition: {
            M: {
              labels: {
                S: Array.from(rekogData.labels).join(", "),
              },
              textServices: {
                S: Array.from(rekogData.textServices).join(", "),
              },
              textMetadata: {
                S: rekogData.textMetadata,
              },
            },
          },
        },
      },
      Reference: referenceLinks,
      Title: {
        S: title,
      },
    },
  };

  dynamoDB.putItem(write_params, function (err, data) {
    if (err) {
      console.log("*** DDB Error", err);
    } else {
      console.log("Successfuly inserted in DDB", data);
    }
  });
}

Ingest metadata into Amazon Kendra

After the architecture diagrams go through the image recognition process and the metadata is stored in DynamoDB, we need a way for the diagrams to be searchable while referencing the content in the metadata. Our approach is to use a search engine that can be integrated with the application and can handle a large volume of search queries. Therefore, we use Amazon Kendra, an intelligent enterprise search service.

We use Amazon Kendra as the interactive component of the solution because of its powerful search capabilities, particularly with natural language. This adds an additional layer of simplicity when users are searching for diagrams that are closest to what they're looking for. Amazon Kendra offers a number of data source connectors for ingesting and connecting content. This solution uses a custom connector to ingest architecture diagram information from DynamoDB. To configure a data source for an Amazon Kendra index, you can use an existing index or create a new one, as sketched in the following example.
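The following is a minimal sketch, using the AWS SDK for JavaScript, of how an index and a custom data source could be created programmatically; the index name, data source name, and IAM role ARN are placeholder values, and you can equally perform these steps on the Amazon Kendra console:

import AWS from "aws-sdk";

const kendra = new AWS.Kendra({ region: process.env.REGION });

async function createIndexAndCustomDataSource() {
  // Create a new index (the name and role ARN are placeholders)
  const index = await kendra
    .createIndex({
      Name: "architecture-diagram-index",
      RoleArn: process.env.KENDRA_ROLE_ARN,
      Edition: "DEVELOPER_EDITION",
    })
    .promise();

  // Index creation is asynchronous; in practice, wait until the index
  // status is ACTIVE before creating the data source

  // Register a custom data source so that documents can be pushed
  // to the index with BatchPutDocument
  const dataSource = await kendra
    .createDataSource({
      IndexId: index.Id,
      Name: "architecture-diagram-custom-source",
      Type: "CUSTOM",
    })
    .promise();

  console.log("Index ID:", index.Id, "Data source ID:", dataSource.Id);
}

createIndexAndCustomDataSource().catch((err) => console.error(err));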

The crawled diagrams then have to be ingested into the Amazon Kendra index that has been created. The following figure shows the flow of how the diagrams are indexed.

First, each diagram record inserted into DynamoDB generates a put event via Amazon DynamoDB Streams. The event triggers the Lambda function that acts as a custom data source for Amazon Kendra and loads the diagram into the index. For instructions on creating a DynamoDB Streams trigger for a Lambda function, refer to Tutorial: Using AWS Lambda with Amazon DynamoDB Streams.
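As a rough sketch of that trigger (assuming the AWS SDK for JavaScript, with the table's stream ARN and the function name supplied through placeholder environment variables), the trigger is an event source mapping that connects the table's stream to the ingestion function:

import AWS from "aws-sdk";

const lambda = new AWS.Lambda({ region: process.env.REGION });

// Connect the DynamoDB stream to the ingestion Lambda function.
// STREAM_ARN and FUNCTION_NAME are placeholders for your own resources.
lambda
  .createEventSourceMapping({
    EventSourceArn: process.env.STREAM_ARN,
    FunctionName: process.env.FUNCTION_NAME,
    StartingPosition: "LATEST",
    BatchSize: 100,
  })
  .promise()
  .then((mapping) => console.log("Created event source mapping:", mapping.UUID))
  .catch((err) => console.error(err));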

After we integrate the Lambda function with DynamoDB, we need to ingest the diagram records sent to the function into the Amazon Kendra index. The index accepts data from various types of sources, and ingesting items from the Lambda function means using the custom data source configuration. For instructions on creating a custom data source for your index, refer to Custom data source connector.

The following is a code snippet of the Lambda function for how a diagram could be indexed in a custom manner:

import json
import os
import boto3

KENDRA = boto3.client("kendra")
INDEX_ID = os.environ["INDEX_ID"]
DS_ID = os.environ["DS_ID"]


def lambda_handler(event, context):
    dbRecords = event["Records"]

    # Loop through items from Amazon DynamoDB
    for row in dbRecords:
        rowData = row["dynamodb"]["NewImage"]
        originUrl = rowData["OriginURL"]["S"]
        publishedDate = rowData["PublishDate"]["S"]
        architectureUrl = rowData["ArchitectureURL"]["S"]
        title = rowData["Title"]["S"]

        metadata = rowData["Metadata"]["M"]
        crawlerMetadata = metadata["crawler"]["S"]
        rekognitionMetadata = metadata["Rekognition"]["M"]
        rekognitionLabels = rekognitionMetadata["labels"]["S"]
        rekognitionServices = rekognitionMetadata["textServices"]["S"]

        concatenatedText = (
            f"{crawlerMetadata} {rekognitionLabels} {rekognitionServices}"
        )

        add_document(
            dsId=DS_ID,
            indexId=INDEX_ID,
            originUrl=originUrl,
            architectureUrl=architectureUrl,
            title=title,
            publishedDate=publishedDate,
            text=concatenatedText,
        )

    return


# Function to add the diagram into Kendra index
def add_document(dsId, indexId, originUrl, architectureUrl, title, publishedDate, text):
    document = get_document(
        dsId, originUrl, architectureUrl, title, publishedDate, text
    )
    documents = [document]
    result = KENDRA.batch_put_document(IndexId=indexId, Documents=documents)
    print("result:" + json.dumps(result))
    return True


# Frame the diagram into a document that Kendra accepts
def get_document(dsId, originUrl, architectureUrl, title, publishedDate, text):
    document = {
        "Id": originUrl,
        "Title": title,
        "Attributes": [
            {"Key": "_data_source_id", "Value": {"StringValue": dsId}},
            {"Key": "_source_uri", "Value": {"StringValue": architectureUrl}},
            {"Key": "_created_at", "Value": {"DateValue": publishedDate}},
            {"Key": "publish_date", "Value": {"DateValue": publishedDate}},
        ],
        "Blob": text,
    }

    return document

The important factor that enables diagrams to be searchable is the Blob key in a document. This is what Amazon Kendra searches against when users provide their search input. In this example code, the Blob key contains a summarized version of the diagram's use case concatenated with the information detected from the image recognition process. This allows users to search for architecture diagrams by use case, such as "Fraud Detection," or by service name, such as "Amazon Kendra."

To illustrate what the Blob key looks like, the following snippet references the ETL diagram that we introduced earlier in this post. It contains a description of the diagram that was obtained when it was crawled, as well as the services that were identified by the Amazon Rekognition model.

{
    ...,
    "Blob": "Build and orchestrate ETL pipelines using Amazon Athena and AWS Step Functions Amazon Athena, AWS Step Functions, Amazon S3, AWS Glue Data Catalog "
}

Search with Amazon Kendra

After we put all the components together, the results of an example search of “real time analytics” look like the following screenshot.

Searching for this use case returns several different architecture diagrams, giving users multiple approaches to the specific workload they're trying to implement.
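Behind the search page, the application issues a query against the index. The following is a minimal sketch of such a query using the AWS SDK for JavaScript; the index ID is assumed to be available as an environment variable, and the mapping of result fields is illustrative:

import AWS from "aws-sdk";

const kendra = new AWS.Kendra({ region: process.env.REGION });

// Query the index the same way the search page would
async function searchDiagrams(queryText) {
  const result = await kendra
    .query({
      IndexId: process.env.INDEX_ID,
      QueryText: queryText,
    })
    .promise();

  // Each result item carries the document title and the source URI,
  // which in our case points to the architecture diagram image
  return result.ResultItems.map((item) => ({
    title: item.DocumentTitle ? item.DocumentTitle.Text : "",
    diagramUrl: item.DocumentURI,
  }));
}

searchDiagrams("real time analytics").then((hits) => console.log(hits));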

Clean up

Complete the steps in this section to clean up the resources you created as part of this post:

  1. Delete the API:
    1. On the API Gateway console, select the API to be deleted.
    2. On the Actions menu, choose Delete.
    3. Choose Delete to confirm.
  2. Delete the DynamoDB table:
    1. On the DynamoDB console, choose Tables in the navigation pane.
    2. Select the table you created and choose Delete.
    3. Enter delete when prompted for confirmation.
    4. Choose Delete table to confirm.
  3. Delete the Amazon Kendra index:
    1. On the Amazon Kendra console, choose Indexes in the navigation pane.
    2. Select the index you created and choose Delete.
    3. Enter a reason when prompted for confirmation.
    4. Choose Delete to confirm.
  4. Delete the Amazon Rekognition project:
    1. On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane, then choose Projects.
    2. Select the project you created and choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete associated datasets and models to confirm.
  5. Delete the Lambda function:
    1. On the Lambda console, select the function to be deleted.
    2. On the Actions menu, choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete to confirm.

Summary

In this post, we showed an example of how you can intelligently search information from images. This includes training an Amazon Rekognition ML model that acts as a filter for images, automating image crawling to ensure credibility and efficiency, and querying for diagrams by attaching a custom data source, which enables a more flexible way to index items. To dive deeper into the implementation, refer to the GitHub repo.

Now that you understand how to deliver the backbone of a centralized search repository for complex searches, try creating your own image search engine. For more information on the core features, refer to Getting started with Amazon Rekognition Custom Labels, Moderating content, and the Amazon Kendra Developer Guide. If you’re new to Amazon Rekognition Custom Labels, try it out using our Free Tier, which lasts 3 months and includes 10 free training hours per month and 4 free inference hours per month.


About the Authors

Ryan See is a Solutions Architect at AWS. Based in Singapore, he works with customers to build solutions to solve their business problems as well as tailor a technical vision to help run more scalable and efficient workloads in the cloud.

James Ong Jia Xiang is a Customer Solutions Manager at AWS. He specializes in the Migration Acceleration Program (MAP) where he helps customers and partners successfully implement large-scale migration programs to AWS. Based in Singapore, he also focuses on driving modernization and enterprise transformation initiatives across APJ through scalable mechanisms. For leisure, he enjoys nature activities like trekking and surfing.

Hang Duong is a Solutions Architect at AWS. Based in Hanoi, Vietnam, she focuses on driving cloud adoption across her country by providing highly available, secure, and scalable cloud solutions for her customers. Additionally, she enjoys building and is involved in various prototyping projects. She is also passionate about the field of machine learning.

Trinh Vo is a Solutions Architect at AWS, based in Ho Chi Minh City, Vietnam. She focuses on working with customers across different industries and partners in Vietnam to craft architectures and demonstrations of the AWS platform that work backward from the customer’s business needs and accelerate the adoption of appropriate AWS technology. She enjoys caving and trekking for leisure.

Wai Kin Tham is a Cloud Architect at AWS. Based in Singapore, his day job involves helping customers migrate to the cloud and modernize their technology stack in the cloud. In his free time, he attends Muay Thai and Brazilian Jiu Jitsu classes.
